Reading code this way is probably not very efficient, but I still want to go through the complete project carefully, line by line (partly as memorization/recitation), and then try to reproduce a complete version of my own. I'm only doing it this way because the codebase is fairly small.
- Parser: fine-tuning arguments (a minimal argparse sketch follows after this list)
parser.add_argument(
- dataset_name, dataset_config
- train, test, validation file (csv, json)
- max_source_length (max input sequence length after tokenization)
- source_prefix (for T5)
- number of preprocessing processes
- max_target_length, max_validation_target_length
- num_beams (beam search width for generation)
- model, config, tokenizer name (do these refer to a pretrained config?)
- text_column, summary_column (names, as str, of the dataset columns holding the full text and the summary)
- slow tokenizer option (don't use the Hugging Face fast tokenizer)
- batch size, learning rate, weight decay, #epochs (how does this differ from max_train_steps?)
- gradient_accumulation_steps (# of update steps to accumulate before performing a backward/update pass)
- output_dir (to store the trained model), seed, model_type (for training from scratch)
- max_length, pad_to_max_length (how does this differ from max_source_length?), ignore_pad_token_for_loss
- logging, save steps, checkpoint saving
- max_grad_norm
)
args = parser.parse_args()
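A minimal sketch of what this parser might look like (argument names follow the usual Hugging Face example scripts; the actual script's set, names, and defaults may differ):

```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Finetune a seq2seq model")
    # Data
    parser.add_argument("--dataset_name", type=str, default=None, help="Dataset name on the Hugging Face Hub.")
    parser.add_argument("--train_file", type=str, default=None, help="Local csv/json file with the training data.")
    parser.add_argument("--validation_file", type=str, default=None, help="Local csv/json file with the validation data.")
    parser.add_argument("--max_source_length", type=int, default=1024, help="Max input length after tokenization.")
    parser.add_argument("--max_target_length", type=int, default=128, help="Max target (label) length.")
    parser.add_argument("--source_prefix", type=str, default=None, help='Prefix for T5-style models, e.g. "summarize: ".')
    # Model
    parser.add_argument("--model_name_or_path", type=str, required=True)
    parser.add_argument("--use_slow_tokenizer", action="store_true")
    # Optimization
    parser.add_argument("--per_device_train_batch_size", type=int, default=8)
    parser.add_argument("--learning_rate", type=float, default=5e-5)
    parser.add_argument("--weight_decay", type=float, default=0.0)
    parser.add_argument("--num_train_epochs", type=int, default=3)
    parser.add_argument("--gradient_accumulation_steps", type=int, default=1)
    parser.add_argument("--max_grad_norm", type=float, default=1.0)
    parser.add_argument("--output_dir", type=str, default=None)
    parser.add_argument("--seed", type=int, default=None)
    return parser.parse_args()
```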
- Main
Use wandb to initialize Weights & Biases (WandB) for tracking experiment metrics
Get dataset
1. load_dataset(args.dataset_name, args.dataset_config_name)
   when dataset_name points to a public dataset on the Hugging Face Hub
2. load_dataset(extension, data_files=data_files) otherwise, for local files
   data_files = {"train": ..., "validation": ...}
   extension = csv ^ json
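A rough sketch of the two loading paths, assuming `args` holds the parsed arguments:

```python
from datasets import load_dataset

if args.dataset_name is not None:
    # Path 1: a public dataset on the Hugging Face Hub
    raw_datasets = load_dataset(args.dataset_name, args.dataset_config_name)
else:
    # Path 2: local csv/json files; the file extension selects the loading script
    data_files = {}
    if args.train_file is not None:
        data_files["train"] = args.train_file
    if args.validation_file is not None:
        data_files["validation"] = args.validation_file
    extension = args.train_file.split(".")[-1]  # "csv" or "json"
    raw_datasets = load_dataset(extension, data_files=data_files)
```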
Load config
config =
AutoConfig.from_pretrained(config_name ^ model_name_or_path)
^ CONFIG_MAPPING[model_type]() (create a new config from scratch)
Load tokenizer
tokenizer =
AutoTokenizer.from_pretrained(tokenizer_name ^ model_name_or_path)
Load model
model =
AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path, config=config)
^ AutoModelForSeq2SeqLM.from_config(config)
Add padding token "PAD" to the tokenizer →
model.resize_token_embeddings(len(tokenizer))
Set up model.config.decoder_start_token_id
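Putting the config / tokenizer / model loads together in one sketch (the `args.*` names are assumptions based on the argument list above):

```python
from transformers import AutoConfig, AutoTokenizer, AutoModelForSeq2SeqLM, CONFIG_MAPPING

# Config: explicit config name, the checkpoint's config, or a fresh one by model type
if args.config_name:
    config = AutoConfig.from_pretrained(args.config_name)
elif args.model_name_or_path:
    config = AutoConfig.from_pretrained(args.model_name_or_path)
else:
    config = CONFIG_MAPPING[args.model_type]()

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
    use_fast=not args.use_slow_tokenizer,
)

# Model: pretrained weights if a checkpoint is given, otherwise train from scratch
if args.model_name_or_path:
    model = AutoModelForSeq2SeqLM.from_pretrained(args.model_name_or_path, config=config)
else:
    model = AutoModelForSeq2SeqLM.from_config(config)

# If a PAD token had to be added, resize the embedding matrix to the new vocab size
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "PAD"})
model.resize_token_embeddings(len(tokenizer))
```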
Dataset mapping method
inputs.append(context + (knowledgeBase))
model_inputs = tokenizer(inputs, max_source_length, padding, truncation=True)
labels = tokenizer(responses, max_target_length, padding, truncation=True)
model_inputs["labels"] = labels["input_ids"]
If padding to max_length and ignore_pad_token_for_loss is set, replace pad token ids in the labels with -100 so padding is ignored in the loss
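A sketch of the mapping function; the column names (`context`, `knowledge`, `response`) are placeholders for whatever columns the script actually uses, and `tokenizer`, `args`, and `padding` ("max_length" or False) are assumed to be in scope:

```python
def preprocess_function(examples):
    # Build the encoder input: context, optionally concatenated with a knowledge-base snippet
    inputs = []
    for context, knowledge in zip(examples["context"], examples["knowledge"]):
        inputs.append(context + (" " + knowledge if knowledge else ""))

    model_inputs = tokenizer(
        inputs, max_length=args.max_source_length, padding=padding, truncation=True
    )
    # Tokenize the targets (responses / summaries)
    labels = tokenizer(
        examples["response"], max_length=args.max_target_length, padding=padding, truncation=True
    )

    # When padding to max_length, replace pad token ids with -100 so the loss ignores them
    if padding == "max_length" and args.ignore_pad_token_for_loss:
        labels["input_ids"] = [
            [(tok if tok != tokenizer.pad_token_id else -100) for tok in label]
            for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```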
Convert raw dataset and initialize data loader
# uses multiprocessing (num_proc)
lm_datasets = raw_datasets.map(mapping_func, remove_columns=..., num_proc=...)
train_dataset = lm_datasets["train"]…
label_pad_token_id = -100 if args.ignore_pad_token_for_loss else tokenizer.pad_token_id
Build Collator for data loader
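Roughly, the conversion and loader setup; I'm assuming the collator is `DataCollatorForSeq2Seq`, as in the standard Hugging Face seq2seq examples, and the `args.*` names are placeholders:

```python
from torch.utils.data import DataLoader
from transformers import DataCollatorForSeq2Seq

lm_datasets = raw_datasets.map(
    preprocess_function,
    batched=True,
    num_proc=args.preprocessing_num_workers,            # multiprocessing
    remove_columns=raw_datasets["train"].column_names,  # keep only tokenized fields
)
train_dataset = lm_datasets["train"]
eval_dataset = lm_datasets["validation"]

# -100 tells the cross-entropy loss to ignore padded label positions
label_pad_token_id = -100 if args.ignore_pad_token_for_loss else tokenizer.pad_token_id
data_collator = DataCollatorForSeq2Seq(
    tokenizer, model=model, label_pad_token_id=label_pad_token_id
)

train_dataloader = DataLoader(
    train_dataset, shuffle=True, collate_fn=data_collator,
    batch_size=args.per_device_train_batch_size,
)
eval_dataloader = DataLoader(
    eval_dataset, collate_fn=data_collator,
    batch_size=args.per_device_eval_batch_size,
)
```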
Optimizer
Scheduling
num_update_steps_per_epoch = ceil(len(train_dataloader) / gradient_accumulation_steps)  # len(dataloader) = number of batches per epoch
max_train_steps (a scheduler parameter) = num_epochs * num_update_steps_per_epoch
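The step arithmetic written out as code (a sketch; `lr_scheduler_type`, `num_warmup_steps`, and the weight-decay grouping are assumptions, not necessarily what the original script does):

```python
import math
from torch.optim import AdamW
from transformers import get_scheduler

# Exclude bias / LayerNorm weights from weight decay (common practice)
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {"params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
     "weight_decay": args.weight_decay},
    {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
     "weight_decay": 0.0},
]
optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate)

# One "update" = gradient_accumulation_steps batches
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
max_train_steps = args.num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    name=args.lr_scheduler_type,          # e.g. "linear"
    optimizer=optimizer,
    num_warmup_steps=args.num_warmup_steps,
    num_training_steps=max_train_steps,
)
```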
Training
for epoch in range(num_epochs):
    for step, batch in enumerate(train_dataloader):
        global_steps += 1
        outputs = model(**batch)
        loss = outputs.loss / args.gradient_accumulation_steps
        accelerator.backward(loss)
        if step % args.gradient_accumulation_steps == 0 or step == len(train_dataloader) - 1:
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            completed_steps += 1
        # Total update count
        if completed_steps >= max_train_steps:
            break
        # Save the model
        if output_dir and global_steps % steps_to_save == 0:
            accelerator.wait_for_everyone()
            unwrapped_model = accelerator.unwrap_model(model)
            unwrapped_model.save_pretrained(output_dir)
            tokenizer.save_pretrained(output_dir)
            torch.save(args, os.path.join(output_dir, "training_args.bin"))
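For reference, the loop above uses `accelerator` without showing where it comes from; the usual Accelerate setup looks roughly like this:

```python
from accelerate import Accelerator

accelerator = Accelerator()

# prepare() wraps the model, optimizer, dataloaders, and scheduler
# for the current device / distributed configuration
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
```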
- Author:ran2323
- URL:https://www.blueif.me//article/12b71a79-6e22-8005-a5ad-fa21cde4703d
- Copyright:All articles in this blog, except for special statements, adopt BY-NC-SA agreement. Please indicate the source!