Reading code this way is probably inefficient, but I still want to go through the complete project carefully, line by line (partly as an exercise in memorization), and then try to replicate it and build a complete version of my own. I'm only doing this because the codebase is fairly small.

-Parser: fine-tuning arguments

parser.add_argument(
  1. dataset_name, dataset_config
  1. train, test, validation file (csv, json)
  1. max_source_length (max input sequence length after tokenization)
  1. source_prefix (T5)
  1. number of preprocessing processes
  1. max_target_length, max_validation_target
  1. num_beams
  1. model, config, tokenizer name (does this mean the pretrained config?)
  1. text_column, summary_column (the columns in the dataset that contain the full text and the summary, as strings)
  1. slow tokenizer option (do not use the HuggingFace fast tokenizer)
  1. batch size, learning rate, weight decay, #epochs (how does this differ from max_train_steps?)
  1. gradient_accumulation_steps (# of update steps to accumulate before performing a backward/update pass)
  1. output_dir (for storing the trained model), seed, model type (for training from scratch)
  1. max_length, pad_to_max_length (how does this differ from max_source_length?), ignore_pad_token_for_loss
  1. logging steps, save steps, checkpoint_save
  1. max_grad_norm
)
args = parser.parse_args()
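Roughly how a few of these might be declared (a minimal sketch with an abbreviated argument set; names and defaults follow the usual HuggingFace no_trainer scripts, so they are assumptions, not copied from this project):

import argparse

parser = argparse.ArgumentParser(description="Fine-tune a seq2seq model (sketch)")
parser.add_argument("--dataset_name", type=str, default=None)
parser.add_argument("--dataset_config_name", type=str, default=None)
parser.add_argument("--train_file", type=str, default=None, help="csv or json file")
parser.add_argument("--validation_file", type=str, default=None)
parser.add_argument("--max_source_length", type=int, default=1024)
parser.add_argument("--max_target_length", type=int, default=128)
parser.add_argument("--source_prefix", type=str, default=None, help="e.g. 'summarize: ' for T5")
parser.add_argument("--per_device_train_batch_size", type=int, default=8)
parser.add_argument("--learning_rate", type=float, default=5e-5)
parser.add_argument("--weight_decay", type=float, default=0.0)
parser.add_argument("--num_train_epochs", type=int, default=3)
parser.add_argument("--gradient_accumulation_steps", type=int, default=1)
parser.add_argument("--output_dir", type=str, default=None)
parser.add_argument("--seed", type=int, default=None)
args = parser.parse_args()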
 
 

-Main

 
Use wandb to initialize Weights & Biases (WandB) for tracking experiment metrics.
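A minimal init sketch (the project name is a placeholder; the real script may log differently, e.g. through Accelerate's trackers):

import wandb

# "seq2seq-finetune" is a hypothetical project name; logging args as the run config
wandb.init(project="seq2seq-finetune", config=vars(args))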
 

Get dataset

1. load_dataset(args.dataset_name, args.dataset_config_name)
with dataset_name pointing to a public dataset on the Hugging Face Hub
 
  1. load_dataset(extension, data_files=data_files)
data_files = dict with "train", "validation", … keys
extension = csv ^ json
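Both cases together, roughly (a sketch; the argument names assume the standard script):

from datasets import load_dataset

if args.dataset_name is not None:
    # public dataset from the Hugging Face Hub
    raw_datasets = load_dataset(args.dataset_name, args.dataset_config_name)
else:
    # local csv / json files
    data_files = {}
    if args.train_file is not None:
        data_files["train"] = args.train_file
    if args.validation_file is not None:
        data_files["validation"] = args.validation_file
    extension = args.train_file.split(".")[-1]  # "csv" or "json"
    raw_datasets = load_dataset(extension, data_files=data_files)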
 

Load config

config =
AutoConfig.from_pretrained(config_name ^ from model_name_path)
^ CONFIG_MAPPING[model_type]() (create a new config, e.g. for training from scratch)
 

Load tokenizer

tokenizer =
AutoTokenizer.from_pretrained(tokenizer_name^ from model_name_path)
 

Load model

model =
AutoModelForSeq2SeqLM.from_pretrained(model_name_path, config)
^ AutoModelForSeq2SeqLM.from_config(config)
 
Add Padding token “PAD” to tokenizer →
model.resize_token_embeddings(len(tokenizer))
 
Set up model.config.decoder_start_token_id
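The three loading steps plus the padding-token fix, pulled together in one sketch (argument names are assumed to match the standard no_trainer script):

from transformers import CONFIG_MAPPING, AutoConfig, AutoModelForSeq2SeqLM, AutoTokenizer

if args.config_name:
    config = AutoConfig.from_pretrained(args.config_name)
elif args.model_name_or_path:
    config = AutoConfig.from_pretrained(args.model_name_or_path)
else:
    config = CONFIG_MAPPING[args.model_type]()  # brand-new config, training from scratch

tokenizer = AutoTokenizer.from_pretrained(
    args.tokenizer_name or args.model_name_or_path,
    use_fast=not args.use_slow_tokenizer,
)

if args.model_name_or_path:
    model = AutoModelForSeq2SeqLM.from_pretrained(args.model_name_or_path, config=config)
else:
    model = AutoModelForSeq2SeqLM.from_config(config)

# make sure a pad token exists, then resize the embedding matrix to the new vocab size
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})
model.resize_token_embeddings(len(tokenizer))

if model.config.decoder_start_token_id is None:
    raise ValueError("model.config.decoder_start_token_id must be defined for seq2seq generation")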
 

Dataset mapping method

inputs.append(context + (knowledgeBase))
model_input = tokenizer(inputs, max_length, padding, truncation = True)
 
labels = tokenizer(responses, max_target_length, padding, truncation = True)
 
model_inputs["labels"] = labels["labels"]
 
If padding to max_length, replace the pad tokens in the labels so they are ignored in the loss.
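The whole mapping function, roughly (a sketch: text_column / summary_column / prefix come from the argument list above, the context + knowledge-base concatenation is simplified, and the direct tokenization of the targets is an assumption about how this script builds labels):

# padding mode depends on the pad_to_max_length flag
padding = "max_length" if args.pad_to_max_length else False

def preprocess_function(examples):
    # input = context (+ optional knowledge base), target = response/summary
    inputs = [prefix + ctx for ctx in examples[text_column]]
    model_inputs = tokenizer(
        inputs, max_length=args.max_source_length, padding=padding, truncation=True
    )
    labels = tokenizer(
        examples[summary_column], max_length=args.max_target_length, padding=padding, truncation=True
    )
    # with max_length padding, replace pad token ids by -100 so the loss ignores them
    if padding == "max_length" and args.ignore_pad_token_for_loss:
        labels["input_ids"] = [
            [(tok if tok != tokenizer.pad_token_id else -100) for tok in seq]
            for seq in labels["input_ids"]
        ]
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs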
 

Convert raw dataset and initialize data loader

 
# multiprocessing is used here (num_proc)
lm_datasets = raw_dataset.map(mapping_func, remove_columns=…, num_proc=…)
train_dataset = lm_datasets["train"]…
 
label_pad_token_id = -100 if args.ignore_pad_token_for_loss else tokenizer.pad_token_id
 
Build Collator for data loader
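A sketch of the collator and loaders: DataCollatorForSeq2Seq pads each batch dynamically and uses label_pad_token_id for the labels (batch-size argument names are assumptions):

from torch.utils.data import DataLoader
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
)

train_dataloader = DataLoader(
    train_dataset,
    shuffle=True,
    collate_fn=data_collator,
    batch_size=args.per_device_train_batch_size,
)
eval_dataloader = DataLoader(
    eval_dataset,
    collate_fn=data_collator,
    batch_size=args.per_device_eval_batch_size,
)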
 

Optimizer
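The usual no_trainer recipe here (an assumption; I have not verified this script does exactly the same) is AdamW with weight decay disabled for biases and LayerNorm weights:

import torch

no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": args.weight_decay,
    },
    {
        "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]
optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=args.learning_rate)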

 

Scheduling

 
num_update_steps_per_epoch = ceil(len(dataloader) (# of batches per epoch) / gradient_accumulation_steps)
max_train_steps (passed to the scheduler) = num_epochs * num_update_steps_per_epoch
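In code, roughly (scheduler type and warmup steps are argument names assumed from the standard script):

import math
from transformers import get_scheduler

num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
max_train_steps = args.num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    name=args.lr_scheduler_type,        # e.g. "linear"
    optimizer=optimizer,
    num_warmup_steps=args.num_warmup_steps,
    num_training_steps=max_train_steps,
)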
 

Training

 
for epoch in range(num_epochs):
    for step, batch in enumerate(train_dataloader):
        global_steps += 1
        outputs = model(**batch)

        loss = outputs.loss / args.gradient_accumulation_steps

        accelerator.backward(loss)

        # only update every gradient_accumulation_steps batches (or on the last batch of the epoch)
        if step % args.gradient_accumulation_steps == 0 or step == len(train_dataloader) - 1:
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            completed_steps += 1

        # stop once the total number of update steps is reached
        if completed_steps >= max_train_steps:
            break

        # save the model
        if args.output_dir and global_steps % steps_to_save == 0:
            accelerator.wait_for_everyone()

            unwrapped_model = accelerator.unwrap_model(model)
            unwrapped_model.save_pretrained(args.output_dir)

            tokenizer.save_pretrained(args.output_dir)
            torch.save(args, os.path.join(args.output_dir, "training_args.bin"))
 
 