
CodeT5+

Official research release for the CodeT5+ models (220M, 770M, 2B, 6B, 16B) for a wide range of Code Understanding and Generation tasks. Find out more via our blog post.

Title: CodeT5+: Open Code Large Language Models for Code Understanding and Generation

Authors: Yue Wang*, Hung Le*, Akhilesh Deepak Gotmare, Nghi D.Q. Bui, Junnan Li, Steven C.H. Hoi (* indicates equal contribution)

What is this about?

CodeT5+ is a new family of open code large language models with an encoder-decoder architecture that can flexibly operate in different modes (i.e. encoder-only, decoder-only, and encoder-decoder) to support a wide range of code understanding and generation tasks.

To train CodeT5+, we introduce a diverse set of pretraining tasks including span denoising, causal language modeling, contrastive learning, and text-code matching to learn rich representations from both unimodal code data and bimodal code-text data. Additionally, to efficiently scale up the model, we propose a simple yet effective compute-efficient pretraining method to initialize our model with frozen off-the-shelf LLMs such as CodeGen. Furthermore, we explore instruction tuning to align the model with natural language instructions following Code Alpaca. See the below overview of CodeT5+.

CodeT5+ overview
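
As a toy illustration of the span-denoising objective among the pretraining tasks above, the sketch below applies T5-style span corruption: selected code spans are replaced with sentinel tokens in the encoder input, and the decoder target reconstructs the masked spans. The helper and the span positions are illustrative only, not the actual pretraining pipeline.

def corrupt_spans(tokens, spans):
    """Toy T5-style span corruption: replace each (start, length) span with a
    sentinel token and build the decoder target that reconstructs the spans.
    Illustrative only; not CodeT5+'s exact masking configuration."""
    source, target = [], []
    i, sid = 0, 0
    for start, length in sorted(spans):
        source.extend(tokens[i:start])
        source.append(f"<extra_id_{sid}>")
        target.append(f"<extra_id_{sid}>")
        target.extend(tokens[start:start + length])
        i = start + length
        sid += 1
    source.extend(tokens[i:])
    target.append(f"<extra_id_{sid}>")  # closing sentinel, T5 convention
    return source, target

tokens = "def add ( a , b ) : return a + b".split()
src, tgt = corrupt_spans(tokens, spans=[(1, 1), (8, 2)])
print("encoder input :", " ".join(src))   # def <extra_id_0> ( a , b ) : <extra_id_1> + b
print("decoder target:", " ".join(tgt))   # <extra_id_0> add <extra_id_1> return a <extra_id_2>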

Table of Contents

  1. Released Models
  2. How to Use?
  3. Instruction Tuning to Align with Natural Language Instructions
  4. How to Finetune Using Your Own Data?
  5. Reproduce the Results
  6. Citation

Released Models

We implemented a family of CodeT5+ models, with model sizes ranging from 220M to 16B. Note that CodeT5+ 220M and 770M employ the same architectures as CodeT5-base and CodeT5-large respectively and are pretrained from scratch, while CodeT5+ 2B, 6B, and 16B employ a "shallow encoder and deep decoder" architecture with the shallow encoder initialized from CodeGen-mono 350M and the deep decoder initialized from CodeGen-mono 2B, 6B, and 16B, respectively. InstructCodeT5+ 16B is our instruction-tuned model from CodeT5+ 16B. Note that as this model utilizes instruction tuning data curated using the OpenAI API, the InstructCodeT5+ 16B checkpoint is licensed for research and non-commercial use only.

We release the following CodeT5+ models at Huggingface:

  • CodeT5+ 220M: Salesforce/codet5p-220m
  • CodeT5+ 770M: Salesforce/codet5p-770m
  • CodeT5+ 2B: Salesforce/codet5p-2b
  • CodeT5+ 6B: Salesforce/codet5p-6b
  • CodeT5+ 16B: Salesforce/codet5p-16b
  • InstructCodeT5+ 16B: Salesforce/instructcodet5p-16b

CodeT5+ architecture

How to Use?

All CodeT5+ models and tokenizers can be easily loaded using the AutoModelForSeq2SeqLM and AutoTokenizer functionality. For tokenizers, CodeT5+ 220M and 770M employ the same tokenizer as the original CodeT5 while CodeT5+ 2B, 6B, 16B employ the same tokenizer as CodeGen.

For CodeT5+ 2B, 6B, 16B, and InstructCodeT5+ 16B, please set trust_remote_code=True when loading the models, as the model class is defined in the Huggingface model repo. In addition, these models benefit from passing an additional prompt to the decoder via decoder_input_ids, which improves generation quality.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

checkpoint = "Salesforce/instructcodet5p-16b"
device = "cuda" # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint,
                                              torch_dtype=torch.float16,
                                              low_cpu_mem_usage=True,
                                              trust_remote_code=True).to(device)

encoding = tokenizer("def print_hello_world():", return_tensors="pt").to(device)
encoding['decoder_input_ids'] = encoding['input_ids'].clone()
outputs = model.generate(**encoding, max_length=15)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
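
The smaller CodeT5+ 220M and 770M checkpoints follow the original CodeT5 (T5) architecture, so they load without trust_remote_code. A minimal sketch, using the Salesforce/codet5p-220m checkpoint for illustration:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "Salesforce/codet5p-220m"
device = "cuda"  # or "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint).to(device)

inputs = tokenizer("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_length=15)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))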

Instruction Tuning to Align with Natural Language Instructions

We explore instruction tuning to align CodeT5+ with natural language instructions following Code Alpaca. First, download the instruction data code_alpaca_20k.json from here. Then run the following command to finetune CodeT5+ 16B on the instruction data.

MODEL=Salesforce/codet5p-16b
SAVE_DIR=saved_models/instructcodet5p-16b

deepspeed instruct_tune_codet5p.py \
  --load $MODEL --save-dir $SAVE_DIR --instruct-data-path code_alpaca_20k.json \
  --fp16 --deepspeed deepspeed_config.json

How to Finetune Using Your Own Data?

We provide an example finetuning script tune_codet5p_seq2seq.py for CodeT5+ models on the Seq2Seq LM task. After installing the transformers and datasets libraries, you can run python tune_codet5p_seq2seq.py to finetune CodeT5+ models on any Seq2Seq LM task, such as the Python code summarization illustrated in the script. To finetune on your own data, you just need to prepare your customized data in the datasets format and pass its path to --cache-data.

Besides, you can specify --load to select the specific CodeT5+ model (e.g., Salesforce/codet5p-220m) to finetune from. To find the hyperparameter settings that suit your task best, you can customize other finetuning arguments such as --epochs, --lr, --lr-warmup-steps, --max-source-len, --max-target-len, --batch-size-per-replica, --grad-acc-steps, etc. The script naturally supports both single-GPU and multi-GPU training. If you have limited GPU memory or want to improve training throughput, consider specifying --fp16 to enable mixed-precision training, and use DeepSpeed for further optimization by passing a DeepSpeed config file to --deepspeed (see here for an example config file).
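
As a hedged sketch of preparing your own data in the datasets format (the column names and tokenization below are assumptions for illustration; check the preprocessing in tune_codet5p_seq2seq.py for the exact fields it expects), you can pre-tokenize source/target pairs into a Hugging Face Dataset and save it to the path passed to --cache-data:

# Hypothetical preprocessing sketch: build a tokenized dataset and save it
# to disk so its path can be passed to --cache-data. Verify the expected
# columns (e.g. input_ids / attention_mask / labels) against the script.
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5p-220m")

pairs = [
    {"source": "def add(a, b):\n    return a + b", "target": "Add two numbers."},
    # ... your own (input, output) pairs ...
]

def tokenize(example, max_source_len=320, max_target_len=128):
    model_inputs = tokenizer(example["source"], truncation=True,
                             max_length=max_source_len, padding="max_length")
    labels = tokenizer(example["target"], truncation=True,
                       max_length=max_target_len, padding="max_length")
    # Mask out padding in the labels so it is ignored by the loss
    model_inputs["labels"] = [
        t if t != tokenizer.pad_token_id else -100 for t in labels["input_ids"]
    ]
    return model_inputs

dataset = Dataset.from_list(pairs).map(tokenize, remove_columns=["source", "target"])
dataset.save_to_disk("my_seq2seq_data")  # then pass: --cache-data my_seq2seq_data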

Reproduce the Results

HumanEval

Our CodeT5+ models achieve very strong results on the HumanEval benchmark in the zero-shot setting. We follow common practice and employ nucleus sampling with a different temperature T for each Pass@k (T=0.2, 0.6, 0.8 for k=1, 10, 100, respectively).

Model               | Pass@1 | Pass@10 | Pass@100
LLaMA 7B            | 10.5   | -       | 36.5
LaMDA 137B          | 14.0   | -       | 47.3
InCoder 6B          | 15.2   | 27.8    | 47.0
GPT-NeoX 20B        | 15.4   | 25.6    | 41.2
CodeT5+ 770M        | 15.5   | 27.2    | 42.7
LLaMA 13B           | 15.8   | -       | 52.5
PaLM 62B            | 15.9   | -       | 46.3
AlphaCode 1.1B      | 17.1   | 28.2    | 45.3
LLaMA 33B           | 21.7   | -       | 70.7
Replit 3B           | 21.9   | -       | -
CodeGeeX 13B        | 22.9   | 39.6    | 60.9
LLaMA 65B           | 23.7   | -       | 79.3
PaLM 540B           | 26.2   | -       | 76.2
CodeGen-mono 16B    | 29.3   | 49.9    | 75.0
CodeT5+ 16B         | 30.9   | 51.6    | 76.7
code-cushman-001    | 33.5   | 54.3    | 77.4
StarCoder 15B       | 33.6   | -       | -
InstructCodeT5+ 16B | 36.1   | 57.1    | 80.7
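
For reference, Pass@k in the table is computed with the unbiased estimator from the HumanEval paper, using n generated samples per problem of which c pass the unit tests. A minimal sketch of the estimator (the numbers in the usage line are illustrative, not taken from the table):

import numpy as np

def pass_at_k(n, c, k):
    # Unbiased Pass@k estimator: 1 - C(n - c, k) / C(n, k),
    # with n samples per problem and c of them passing the tests.
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 200 samples for a problem, 70 of them correct -> Pass@1 estimate of 0.35
print(pass_at_k(n=200, c=70, k=1))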

Please follow the instructions below to reproduce the results.


Installation

  • Install the official HumanEval evaluation tool released by OpenAI following the instructions in this repo.
  • Install PyTorch (version 1.13.1) and the transformers library (version 4.21.3).

Generating programs from CodeT5+ models

cd humaneval then run the inference via bash run_generate.sh. You can select the model to generate from by changing the model variable in the script. Following the original setting in the HumanEval paper, we generate 200 programs (pred_num=200) for each problem and employ nucleus sampling with a different temperature T for each Pass@k (T=0.2, 0.6, 0.8 for k=1, 10, 100, respectively). The generated programs will be saved in preds/${model}_T${T}_N${pred_num}.

model=instructcodet5p-16b
temp=0.2
max_len=800
pred_num=200
num_seqs_per_iter=2 # 25 for 350M and 770M, 10 for 2B, 8 for 6B, 2 for 16B on A100-40G

output_path=preds/${model}_T${temp}_N${pred_num}

mkdir -p ${output_path}
echo 'Output path: '$output_path
echo 'Model to eval: '$model

# 164 problems, 21 per GPU if GPU=8
index=0
gpu_num=8
for ((i = 0; i < $gpu_num; i++)); do
  start_index=$((i * 21))
  end_index=$(((i + 1) * 21))

  gpu=$((i))
  echo 'Running process #' ${i} 'from' $start_index 'to' $end_index 'on GPU' ${gpu}
  ((index++))
  (
    CUDA_VISIBLE_DEVICES=$gpu python generate_codet5p.py --model Salesforce/${model} \
      --start_index ${start_index} --end_index ${end_index} --temperature ${temp} \
      --num_seqs_per_iter ${num_seqs_per_iter} --N ${pred_num} --max_len ${max_len} --output_path ${output_path}
  ) &
  if (($index % $gpu_num == 0)); then wait; fi
done

Evaluating Pass@k

cd humaneval then run the evaluation via bash run_eval.sh.

output_path=preds/instructcodet5p-16b_T0.2_N200

echo 'Output path: '$output_path
python process_preds.py --path ${output_path} --out_path ${output_path}.jsonl

evaluate_functional_correctness ${output_path}.jsonl

Note that the reproduced results might differ slightly from the reported ones due to the randomness of the sampling process. We also release the model predictions of our InstructCodeT5+ 16B at humaneval/instructcodet5p-16b_T0.2_N200.jsonl for reference. The following command reproduces its 36.1 Pass@1 result.

evaluate_functional_correctness humaneval/instructcodet5p-16b_T0.2_N200.jsonl

Citation

@article{wang2023codet5plus,
  title={CodeT5+: Open Code Large Language Models for Code Understanding and Generation},
  author={Wang, Yue and Le, Hung and Gotmare, Akhilesh Deepak and Bui, Nghi D.Q. and Li, Junnan and Hoi, Steven C. H.},
  journal={arXiv preprint},
  year={2023}
}