Mirror of https://github.com/tatsu-lab/stanford_alpaca.git, synced 2024-10-01 05:35:37 -04:00

Merge pull request #216 from tatsu-lab/hf-migrate

let training code run with huggingface transformers main

Commit e408b27bfd
README.md · 70 changed lines
@@ -15,6 +15,7 @@ This is the repo for the Stanford Alpaca project, which aims to build and share
 - The [52K data](#data-release) used for fine-tuning the model.
 - The code for [generating the data](#data-generation-process).
 - The code for [fine-tuning the model](#fine-tuning).
+- The code for [recovering Alpaca-7B weights from our released weight diff](#recovering-alpaca-weights).
 
 Note: We thank the community for feedback on Stanford-Alpaca and supporting our research. Our live demo is suspended until further notice.
 
@@ -115,10 +116,7 @@ We fine-tune LLaMA-7B and LLaMA-13B with the following hyperparameters:
 | Max length | 512 | 512 |
 | Weight decay | 0 | 0 |
 
-We have also fine-tuned larger variants of LLaMA and are in the process of evaluating those models.
+We have also fine-tuned larger variants of LLaMA and performed subsequent RLHF and are in the process of evaluating those models.
 
-Given Hugging Face hasn't officially supported the LLaMA models, we fine-tuned LLaMA with Hugging Face's transformers library by installing it from a particular fork (i.e. this [PR](https://github.com/huggingface/transformers/pull/21955) to be merged).
-The hash of the specific commit we installed was `68d640f7c368bcaaaecfc678f11908ebbd3d6176`.
-
 To reproduce our fine-tuning runs for LLaMA, first install the requirements
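For context, the install step named in that closing context line is the standard one; a minimal sketch, assuming you are at the repo root:

```bash
pip install -r requirements.txt
```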
@@ -153,20 +151,10 @@ torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
     --lr_scheduler_type "cosine" \
     --logging_steps 1 \
     --fsdp "full_shard auto_wrap" \
-    --fsdp_transformer_layer_cls_to_wrap 'LLaMADecoderLayer' \
+    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
     --tf32 True
 ```
 
-### Warning
-
-`fsdp_transformer_layer_cls_to_wrap` must be set to the name of the specific decoder layer.
-The LLaMA Hugging Face PR is not stable.
-Earlier commits used the name `LLaMADecoderLayer` for their decoder layer (the commit hash our code is based on).
-More recent commits use `LlamaDecoderLayer` (note the difference in capitalization).
-Not setting `fsdp_transformer_layer_cls_to_wrap` to the correct name will lead to drastic slowdowns in training.
-
-### Side notes
-
 The same script also works for OPT fine-tuning. Here's an example for fine-tuning OPT-6.7B
 
 ```bash
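Since the right wrap-class name depends on the transformers commit you have installed, it can help to print the decoder-layer class directly before launching; a minimal sketch (the `model.model.layers` attribute path assumes a LLaMA-style model, and the checkpoint path is a placeholder):

```python
import transformers

# Load the converted checkpoint and inspect the transformer block class names;
# pass the printed name to --fsdp_transformer_layer_cls_to_wrap.
model = transformers.AutoModelForCausalLM.from_pretrained(
    "<your_path_to_hf_converted_llama_ckpt_and_tokenizer>"
)
print({type(block).__name__ for block in model.model.layers})
# e.g. {'LlamaDecoderLayer'} on recent transformers commits
```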
@@ -196,6 +184,58 @@ torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
 Note the given training script is meant to be simple and easy to use, and is not particularly optimized.
 To run on more gpus, you may prefer to turn down `gradient_accumulation_steps` to keep a global batch size of 128. Global batch size has not been tested for optimality.
 
+### Addressing OOM
+
+Naively, fine-tuning a 7B model requires about 7 x 4 x 4 = 112 GB of VRAM. Commands given above enable parameter sharding, so no redundant model copy is stored on any GPU.
+If you'd like to further reduce the memory footprint, here are some options:
+
+- Turn on CPU offload for FSDP with `--fsdp "full_shard auto_wrap offload"`. This saves VRAM at the cost of longer runtime.
+- In our experience, DeepSpeed stage-3 (with offload) can at times be more memory efficient than FSDP. Here's an example to use DeepSpeed stage-3 with 4 GPUs with both parameter and optimizer offload:
+
+```bash
+pip install deepspeed
+torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
+    --model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
+    --data_path ./alpaca_data.json \
+    --bf16 True \
+    --output_dir <your_output_dir> \
+    --num_train_epochs 3 \
+    --per_device_train_batch_size 4 \
+    --per_device_eval_batch_size 4 \
+    --gradient_accumulation_steps 8 \
+    --evaluation_strategy "no" \
+    --save_strategy "steps" \
+    --save_steps 2000 \
+    --save_total_limit 1 \
+    --learning_rate 2e-5 \
+    --weight_decay 0. \
+    --warmup_ratio 0.03 \
+    --deepspeed "./configs/default_offload_opt_param.json" \
+    --tf32 True
+```
+
+- The DeepSpeed library also provides some [helpful functions](https://deepspeed.readthedocs.io/en/latest/memory.html) to estimate memory usage.
+- [LoRA](https://arxiv.org/abs/2106.09685) fine-tunes low-rank slices of the query, key, and value embeddings. This can reduce the total memory footprint from 112GB to about 7x4=28GB. We may release our re-implementation of this in the future, but for now the [peft](https://github.com/huggingface/peft) codebase can be a useful resource.
+
+## Recovering Alpaca Weights
+
+The weight diff between Alpaca-7B and LLaMA-7B is located [here](https://huggingface.co/tatsu-lab/alpaca-7b-wdiff/tree/main).
+To recover the original Alpaca-7B weights, follow these steps:
+
+```text
+1. Convert Meta's released weights into huggingface format. Follow this guide:
+    https://huggingface.co/docs/transformers/main/model_doc/llama
+2. Make sure you cloned the released weight diff into your local machine. The weight diff is located at:
+    https://huggingface.co/tatsu-lab/alpaca-7b/tree/main
+3. Run this function with the correct paths. E.g.,
+    python weight_diff.py recover --path_raw <path_to_step_1_dir> --path_diff <path_to_step_2_dir> --path_tuned <path_to_store_recovered_weights>
+```
+
+Once step 3 completes, you should have a directory with the recovered weights, from which you can load the model like the following
+
+```python
+import transformers
+alpaca_model = transformers.AutoModelForCausalLM.from_pretrained("<path_to_store_recovered_weights>")
+alpaca_tokenizer = transformers.AutoTokenizer.from_pretrained("<path_to_store_recovered_weights>")
+```
+
 ### Authors
 All grad students below contributed equally and the order is determined by random draw.
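On the memory-estimation bullet above: the linked DeepSpeed docs expose live estimators for ZeRO stage-3 model-state memory. A minimal sketch, assuming the 4-GPU setup from the command above and a placeholder checkpoint path:

```python
import transformers
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

# Load the model on CPU just to count parameters, then print per-GPU and
# per-node memory estimates for ZeRO-3 with and without offload.
model = transformers.AutoModelForCausalLM.from_pretrained(
    "<your_path_to_hf_converted_llama_ckpt_and_tokenizer>"
)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=4, num_nodes=1)
```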
configs/default_offload_opt_param.json · 49 lines · new file
@@ -0,0 +1,49 @@
+{
+    "bf16": {
+        "enabled": "auto"
+    },
+    "optimizer": {
+        "type": "AdamW",
+        "params": {
+            "lr": "auto",
+            "betas": "auto",
+            "eps": "auto",
+            "weight_decay": "auto"
+        }
+    },
+    "scheduler": {
+        "type": "WarmupDecayLR",
+        "params": {
+            "total_num_steps": "auto",
+            "warmup_min_lr": "auto",
+            "warmup_max_lr": "auto",
+            "warmup_num_steps": "auto"
+        }
+    },
+    "zero_optimization": {
+        "stage": 3,
+        "offload_optimizer": {
+            "device": "cpu",
+            "pin_memory": true
+        },
+        "offload_param": {
+            "device": "cpu",
+            "pin_memory": true
+        },
+        "overlap_comm": true,
+        "contiguous_gradients": true,
+        "sub_group_size": 1e9,
+        "reduce_bucket_size": "auto",
+        "stage3_prefetch_bucket_size": "auto",
+        "stage3_param_persistence_threshold": "auto",
+        "stage3_max_live_parameters": 1e9,
+        "stage3_max_reuse_distance": 1e9,
+        "stage3_gather_16bit_weights_on_model_save": false
+    },
+    "gradient_accumulation_steps": "auto",
+    "gradient_clipping": "auto",
+    "steps_per_print": 5,
+    "train_batch_size": "auto",
+    "train_micro_batch_size_per_gpu": "auto",
+    "wall_clock_breakdown": false
+}
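The `"auto"` entries in this new config are resolved by the Hugging Face Trainer from the matching command-line arguments (learning rate, bf16, batch sizes, and so on), so each value is specified only once. A minimal sketch of the same wiring done programmatically, with placeholder values assumed:

```python
import transformers

# The Trainer's DeepSpeed integration fills every "auto" field in the JSON
# config from these arguments at launch time.
training_args = transformers.TrainingArguments(
    output_dir="<your_output_dir>",
    bf16=True,
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    deepspeed="./configs/default_offload_opt_param.json",
)
```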
requirements.txt
@@ -2,8 +2,8 @@ numpy
 rouge_score
 fire
 openai
-transformers>=4.26.1
+transformers>=4.28.1
 torch
 sentencepiece
-tokenizers==0.12.1
+tokenizers>=0.13.3
 wandb
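The bumped floors line up with LLaMA support landing in mainline transformers in the 4.28 series; a quick sanity check of an existing environment, as a sketch:

```python
from packaging import version

import tokenizers
import transformers

# Confirm the installed versions satisfy the updated requirements pins.
assert version.parse(transformers.__version__) >= version.parse("4.28.1")
assert version.parse(tokenizers.__version__) >= version.parse("0.13.3")
```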
train.py · 47 changed lines
@@ -15,20 +15,19 @@
 import copy
 import logging
 from dataclasses import dataclass, field
-from typing import Optional, Dict, Sequence
+from typing import Dict, Optional, Sequence
 
 import torch
 import transformers
+import utils
 from torch.utils.data import Dataset
 from transformers import Trainer
 
-import utils
-
 IGNORE_INDEX = -100
 DEFAULT_PAD_TOKEN = "[PAD]"
 DEFAULT_EOS_TOKEN = "</s>"
-DEFAULT_BOS_TOKEN = "</s>"
-DEFAULT_UNK_TOKEN = "</s>"
+DEFAULT_BOS_TOKEN = "<s>"
+DEFAULT_UNK_TOKEN = "<unk>"
 PROMPT_DICT = {
     "prompt_input": (
         "Below is an instruction that describes a task, paired with an input that provides further context. "
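The constant fix brings the defaults in line with the sentencepiece specials of the converted LLaMA tokenizer (`<s>`, `</s>`, `<unk>`); a quick way to confirm against your own checkpoint, as a sketch with a placeholder path:

```python
import transformers

# Inspect the special tokens the converted checkpoint actually ships with.
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "<your_path_to_hf_converted_llama_ckpt_and_tokenizer>", use_fast=False
)
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.unk_token)
# Expected for LLaMA: <s> </s> <unk>  (pad_token is typically None)
```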
@@ -63,15 +62,6 @@ class TrainingArguments(transformers.TrainingArguments):
     )
 
 
-def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output_dir: str):
-    """Collects the state dict and dump to disk."""
-    state_dict = trainer.model.state_dict()
-    if trainer.args.should_save:
-        cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
-        del state_dict
-        trainer._save(output_dir, state_dict=cpu_state_dict)  # noqa
-
-
 def smart_tokenizer_and_embedding_resize(
     special_tokens_dict: Dict,
     tokenizer: transformers.PreTrainedTokenizer,
@@ -205,26 +195,27 @@ def train():
         padding_side="right",
         use_fast=False,
     )
-    if tokenizer.pad_token is None:
-        smart_tokenizer_and_embedding_resize(
-            special_tokens_dict=dict(pad_token=DEFAULT_PAD_TOKEN),
-            tokenizer=tokenizer,
-            model=model,
-        )
-    if "llama" in model_args.model_name_or_path:
-        tokenizer.add_special_tokens(
-            {
-                "eos_token": DEFAULT_EOS_TOKEN,
-                "bos_token": DEFAULT_BOS_TOKEN,
-                "unk_token": DEFAULT_UNK_TOKEN,
-            }
-        )
+    special_tokens_dict = dict()
+    if tokenizer.pad_token is None:
+        special_tokens_dict["pad_token"] = DEFAULT_PAD_TOKEN
+    if tokenizer.eos_token is None:
+        special_tokens_dict["eos_token"] = DEFAULT_EOS_TOKEN
+    if tokenizer.bos_token is None:
+        special_tokens_dict["bos_token"] = DEFAULT_BOS_TOKEN
+    if tokenizer.unk_token is None:
+        special_tokens_dict["unk_token"] = DEFAULT_UNK_TOKEN
+
+    smart_tokenizer_and_embedding_resize(
+        special_tokens_dict=special_tokens_dict,
+        tokenizer=tokenizer,
+        model=model,
+    )
 
     data_module = make_supervised_data_module(tokenizer=tokenizer, data_args=data_args)
     trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
     trainer.train()
     trainer.save_state()
-    safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir)
+    trainer.save_model(output_dir=training_args.output_dir)
 
 
 if __name__ == "__main__":
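For readers skimming the hunk, the `smart_tokenizer_and_embedding_resize` helper called on the new side grows the embedding matrices to cover any added special tokens; a sketch of its core logic, assuming the repo's mean-initialization approach for new rows:

```python
def resize_with_mean_init(special_tokens_dict, tokenizer, model):
    # Sketch: register the new special tokens, resize both embedding
    # matrices, and initialize the new rows to the mean of existing rows.
    num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
    model.resize_token_embeddings(len(tokenizer))
    if num_new_tokens > 0:
        input_embeddings = model.get_input_embeddings().weight.data
        output_embeddings = model.get_output_embeddings().weight.data
        input_embeddings[-num_new_tokens:] = input_embeddings[:-num_new_tokens].mean(
            dim=0, keepdim=True
        )
        output_embeddings[-num_new_tokens:] = output_embeddings[:-num_new_tokens].mean(
            dim=0, keepdim=True
        )
```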
weight_diff.py
@@ -133,6 +133,7 @@ def recover(
 
     if path_tuned is not None:
         model_recovered.save_pretrained(path_tuned)
+        tokenizer_recovered.save_pretrained(path_tuned)
 
     if test_inference:
         input_text = (
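With the tokenizer now saved alongside the model, the recovered directory can be smoke-tested end to end; a minimal sketch (the instruction text and generation settings are illustrative, not from the repo):

```python
import transformers

path = "<path_to_store_recovered_weights>"
model = transformers.AutoModelForCausalLM.from_pretrained(path)
tokenizer = transformers.AutoTokenizer.from_pretrained(path)

# Alpaca-style instruction prompt (no-input variant).
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nList three primary colors.\n\n### Response:"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```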