add humaneval evaluation

WANG Yue 2023-05-20 18:27:46 +08:00
parent aaeb477b89
commit ce36447d85
7 changed files with 283 additions and 10 deletions

View File

@@ -1,11 +1,13 @@
# CodeT5+
Official research release for the **CodeT5+** models (`220M`, `770M`, `2B`, `6B`, `16B`) for a wide range of **Code Understanding and Generation** tasks.
Find out more via our [blog post](https://blog.salesforceairesearch.com/codet5-open-code-large-language-models/).
*Title*: [CodeT5+: Open Code Large Language Models for Code Understanding and Generation](https://arxiv.org/pdf/2305.07922.pdf)
*Authors*: [Yue Wang](https://yuewang-cuhk.github.io/)\*, [Hung Le](https://sites.google.com/view/henryle2018/home?pli=1)\*, [Akhilesh Deepak Gotmare](https://akhileshgotmare.github.io/), [Nghi D.Q. Bui](https://bdqnghi.github.io/), [Junnan Li](https://sites.google.com/site/junnanlics), [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home) (* indicates equal contribution)
# What is this about?
CodeT5+ is a new family of open code large language models with an encoder-decoder architecture that can flexibly operate in different modes (i.e. _encoder-only_, _decoder-only_, and _encoder-decoder_) to support a wide range of code understanding and generation tasks.
@@ -32,7 +34,8 @@ We release the following CodeT5+ models at Huggingface:
# How to Use?
All CodeT5+ models and tokenizers can be easily loaded using the `AutoModelForSeq2SeqLM` and `AutoTokenizer` functionality.
For tokenizers, CodeT5+ `220M` and `770M` employ the same tokenizer as the original [CodeT5](https://github.com/salesforce/CodeT5), while CodeT5+ `2B`, `6B`, and `16B` employ the same tokenizer as [CodeGen](https://github.com/salesforce/CodeGen).
To load CodeT5+ `2B`, `6B`, `16B`, and InstructCodeT5+ `16B`, please set `trust_remote_code=True` as the [model class](https://huggingface.co/Salesforce/codet5p-16b/blob/main/modeling_codet5p.py) is defined in the Huggingface repo.
In addition, these models benefit from passing additional prompts to the decoder via `decoder_input_ids` to achieve better generation performance.
```python
@@ -48,17 +51,28 @@ model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint,
low_cpu_mem_usage=True,
trust_remote_code=True).to(device)
encoding = tokenizer("def print_hello_world():", return_tensors="pt").to(device)
encoding['decoder_input_ids'] = encoding['input_ids'].clone()
outputs = model.generate(**encoding, max_length=15)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
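Note that the `220M` and `770M` models follow the standard T5 architecture, so they load without `trust_remote_code` and without decoder prompts. A minimal sketch, shown here with the `Salesforce/codet5p-220m` checkpoint:
```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "Salesforce/codet5p-220m"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint).to(device)

# the small T5-style checkpoints do not need decoder_input_ids
inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs, max_length=15)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```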
# Reproduce the Results
## HumanEval
### Installation
* Install the official HumanEval evaluation tool released by OpenAI following the instructions in this [repo](https://github.com/openai/human-eval).
* Install PyTorch (version `1.13.1`) and transformers (version `4.21.3`); a quick sanity check is sketched below.
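After installation, a minimal sanity check (a sketch, not part of the release scripts) can confirm the environment is ready:
```python
# sanity check: the pinned libraries and the HumanEval problems are importable
import torch
import transformers
from human_eval.data import read_problems

print(torch.__version__)         # expect 1.13.1
print(transformers.__version__)  # expect 4.21.3
print(len(read_problems()))      # 164 HumanEval problems
```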
### Generating programs from CodeT5+ models
`cd humaneval`, then run the inference via `bash run_generate.sh`.
You can select the model to generate from by changing the `model` variable in the script.
Following the original setting in the HumanEval paper, we generate 200 programs (`pred_num=200`) for each problem and employ nucleus sampling with different temperatures `T` for computing `pass@k` (`T=0.2,0.6,0.8` for `k=1,10,100`, respectively).
The generated programs will be saved in `preds/${model}_T${temp}_N${pred_num}`.
### Evaluating pass@k
`cd humaneval`, then run the evaluation via `bash run_eval.sh`.
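Under the hood, `evaluate_functional_correctness` (from the OpenAI tool) reports the unbiased `pass@k` estimator from the HumanEval paper: `1 - C(n-c, k) / C(n, k)` averaged over problems, where `n` is the number of samples per problem and `c` the number that pass all unit tests. A minimal sketch of that estimator (the `pass_at_k` name is illustrative):
```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), computed in a numerically stable way."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. with n=200 samples and c=37 passing, pass@1 reduces to c/n = 0.185
print(pass_at_k(n=200, c=37, k=1))
```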
## Citation

View File: CodeT5+/humaneval/generate_codet5p.py

@@ -0,0 +1,162 @@
import argparse
import pprint
import os
import re
from tqdm import tqdm
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from human_eval.data import write_jsonl, read_problems, stream_jsonl
def extract_text(prompt, remove_lines=True):
    token = '\"\"\"'
    start = token
    end = '>>>'

    start_idx = prompt.find(start) + len(start)
    end_idx = prompt.find(end)
    output = prompt[start_idx: end_idx]

    if remove_lines:
        output = output.replace('\n', ' ')
    output = re.sub(r"\s+", " ", output).strip()

    return output
INSTRUCTION = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Create a Python script for this problem:
{}
### Response:"""
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model', type=str, default='Salesforce/instructcodet5p-16b', help="")
    parser.add_argument('--output_path', type=str, help="")
    parser.add_argument('--start_index', type=int, default=0, help="")
    parser.add_argument('--end_index', type=int, default=164, help="")
    parser.add_argument('--temperature', type=float, default=0.8, help="")
    parser.add_argument('--N', type=int, default=200, help="")
    parser.add_argument('--max_len', type=int, default=600, help="")
    parser.add_argument('--decoding_style', type=str, default='sampling', help="")
    parser.add_argument('--num_seqs_per_iter', type=int, default=50, help='')
    parser.add_argument('--overwrite', action='store_true', help='')

    args = parser.parse_args()
    argsdict = vars(args)
    print(pprint.pformat(argsdict))

    # completions are truncated at the first top-level statement after the function body
    STOP_SEQS = ['\nclass', '\ndef', '\n#', '\nif', '\nprint']

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    problems = read_problems()
    task_ids = sorted(problems.keys())[args.start_index: args.end_index]
    prompts = [problems[task_id]['prompt'] for task_id in task_ids]
    num_samples = len(prompts)
    print("Number of samples: {}".format(num_samples))

    tokenizer = AutoTokenizer.from_pretrained(args.model)
    model = AutoModelForSeq2SeqLM.from_pretrained(args.model,
                                                  trust_remote_code=True,  # False for 220m and 770m models
                                                  torch_dtype=torch.float16,
                                                  low_cpu_mem_usage=True)
    model.eval()
    model.to(device)

    # for larger LLMs such as 2B, 6B, and 16B, we need to pass the text prompt to the decoder
    prompt_to_decoder = True if any([size in args.model for size in ['2b', '6b', '16b']]) else False

    print(f"Loaded {args.model}.")
    for i in tqdm(range(num_samples), ncols=0, total=num_samples):
        output_file = args.output_path + '/{}.jsonl'.format(args.start_index + i)
        if os.path.exists(output_file) and not args.overwrite:
            print(f'Skip {output_file} as it already exists')
            continue

        # represent 4-space indentation as tabs during generation; converted back below
        prompt = prompts[i].replace('    ', '\t')
        if args.model == 'Salesforce/instructcodet5p-16b':
            prompt_batch = [INSTRUCTION.format(extract_text(prompt))]
            prompt_batch_decoder = [INSTRUCTION.format(extract_text(prompt)) + prompt]
        else:
            prompt_batch = [prompt]
            prompt_batch_decoder = [prompt]

        ids_batch = [task_ids[i]]
        completion_seqs = []

        encoding = tokenizer(prompt_batch, return_tensors="pt", truncation=True, max_length=args.max_len).to(device)
        encoding_decoder = tokenizer(prompt_batch_decoder, return_tensors="pt", truncation=True,
                                     max_length=args.max_len).to(device)

        if args.decoding_style == 'sampling':
            loops = int(args.N / args.num_seqs_per_iter)
        else:
            loops = 1

        for _ in tqdm(range(loops), total=loops, leave=False, ncols=0):
            gen_tokens = None
            with torch.no_grad():
                if args.decoding_style == 'sampling':
                    if prompt_to_decoder:
                        gen_tokens = model.generate(**encoding,
                                                    decoder_input_ids=encoding_decoder['input_ids'],
                                                    do_sample=True,
                                                    temperature=args.temperature,
                                                    max_length=args.max_len,
                                                    num_return_sequences=args.num_seqs_per_iter,
                                                    decoder_start_token_id=tokenizer.pad_token_id,
                                                    eos_token_id=tokenizer.eos_token_id,
                                                    top_p=0.95)
                    else:
                        gen_tokens = model.generate(**encoding,
                                                    do_sample=True,
                                                    temperature=args.temperature,
                                                    max_length=args.max_len,
                                                    num_return_sequences=args.num_seqs_per_iter,
                                                    eos_token_id=tokenizer.eos_token_id,
                                                    top_p=0.95)

            if gen_tokens is not None:
                if prompt_to_decoder:
                    # drop the echoed decoder prompt, keeping only newly generated tokens
                    gen_tokens = gen_tokens[:, encoding_decoder['input_ids'].shape[-1]:]
                gen_seqs = tokenizer.batch_decode(gen_tokens, skip_special_tokens=True)
            else:
                gen_seqs = None

            if gen_seqs is not None:
                assert len(ids_batch) == 1
                task_id = ids_batch[0]
                for seq_idx, gen_seq in enumerate(gen_seqs):
                    completion_seq = gen_seq
                    for stop_seq in STOP_SEQS:
                        index = completion_seq.find(stop_seq)
                        if index != -1:
                            completion_seq = completion_seq[:index]
                    completion_seq = completion_seq.replace('\t', '    ')
                    all_code = prompt.replace('\t', '    ') + completion_seq
                    completion_seqs.append(
                        {'task_id': task_id,
                         'completion': completion_seq,
                         'all_code': all_code  # final code for evaluation with unit tests
                         }
                    )

        print("Saving results to {}".format(output_file))
        write_jsonl(output_file, completion_seqs)


if __name__ == '__main__':
    main()

View File: CodeT5+/humaneval/process_preds.py

@@ -0,0 +1,47 @@
from human_eval.data import read_problems, write_jsonl, stream_jsonl
import glob
from tqdm import tqdm
import argparse
parser = argparse.ArgumentParser()

# Inputs
parser.add_argument(
    '--path',
    type=str,
    help="")
parser.add_argument(
    '--out_path',
    type=str,
    help="")
parser.add_argument(
    '--add_prompt',
    action='store_true',
    help='')

args = parser.parse_args()

files = sorted(glob.glob(args.path + '/*.jsonl'))
print("{} files in {}".format(len(files), args.path))

problems = read_problems('data/HumanEval.jsonl.gz')

output = []
for code_file in tqdm(files, total=len(files)):
    codes = [c for c in stream_jsonl(code_file)]
    if args.add_prompt:
        for code in codes:
            task_id = code['task_id']
            prompt = problems[task_id]['prompt']
            if 'def' in code['completion']:
                # strip everything up to and including the generated function
                # signature, so only the body is appended to the original prompt
                def_line = code['completion'].index('def')
                completion = code['completion'][def_line:]
                next_line = completion.index('\n')
                completion = code['completion'][def_line + next_line + 1:]
                code['all_code'] = prompt + completion
    output += codes

print("save to {}".format(args.out_path))
write_jsonl(args.out_path, output)

View File: CodeT5+/humaneval/run_generate.sh

@@ -0,0 +1,29 @@
model=instructcodet5p-16b
temp=0.2
max_len=800
pred_num=200
num_seqs_per_iter=2 # 25 for 350M and 770M, 10 for 2B, 8 for 6B, 2 for 16B on A100-40G
output_path=preds/${model}_T${temp}_N${pred_num}
mkdir -p ${output_path}
echo 'Output path: '$output_path
echo 'Model to eval: '$model
# 164 problems, 21 per GPU if GPU=8
index=0
gpu_num=8
for ((i = 0; i < $gpu_num; i++)); do
  start_index=$((i * 21))
  end_index=$(((i + 1) * 21))

  gpu=$((i))
  echo 'Running process #' ${i} 'from' $start_index 'to' $end_index 'on GPU' ${gpu}
  ((index++))
  (
    CUDA_VISIBLE_DEVICES=$gpu python generate_codet5p.py --model Salesforce/${model} \
      --start_index ${start_index} --end_index ${end_index} --temperature ${temp} \
      --num_seqs_per_iter ${num_seqs_per_iter} --N ${pred_num} --max_len ${max_len} --output_path ${output_path}
  ) &
  if (($index % $gpu_num == 0)); then wait; fi
done

View File: CodeT5+/humaneval/run_eval.sh

@@ -0,0 +1,6 @@
output_path=preds/instructcodet5p-16b_T0.2_N200
echo 'Output path: '$output_path
python process_preds.py --path ${output_path} --out_path ${output_path}.jsonl
evaluate_functional_correctness ${output_path}.jsonl

View File

@@ -197,7 +197,23 @@ Note that we employ one A100 GPU for all fine-tuning experiments.
### How to fine-tune on your own task and dataset?
If you want to fine-tune on your own dataset, you can add your own task and sub_task in `configs.py` ([here](https://github.com/salesforce/CodeT5/blob/d27512d23ba6130e089e571d8c3e399760db1c31/configs.py#L11)) and add your data path and the function to read the data in `utils.py` ([here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L103) and [here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L149)). The read function can be implemented in `_utils.py`, similar to [this one](https://github.com/salesforce/CodeT5/blob/aaf9c4a920c4986abfd54a74f5456b056b6409e0/_utils.py#L213). If the task you add is a generation task, you can simply reuse or customize `run_gen.py`; for understanding tasks, please refer to `run_defect.py` and `run_clone.py`.
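As a rough illustration, a read function for a custom generation task might look like the sketch below. This is hypothetical: the jsonl schema, the function name, and the `Example` container (mirroring the simple idx/source/target objects produced by the existing read functions) are assumptions, not the repo's actual API.
```python
import json

class Example:
    """Minimal container with the fields the read functions produce (assumed)."""
    def __init__(self, idx, source, target):
        self.idx = idx
        self.source = source
        self.target = target

def read_mytask_examples(filename, data_num=-1):
    """Read a jsonl file of {"source": ..., "target": ...} records (hypothetical schema)."""
    examples = []
    with open(filename, encoding='utf-8') as f:
        for idx, line in enumerate(f):
            x = json.loads(line)
            examples.append(Example(idx=idx, source=x['source'].strip(), target=x['target'].strip()))
            if idx + 1 == data_num:
                break
    return examples
```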
## Get Involved
Please create a GitHub issue if you have any questions, suggestions, requests or bug-reports. We welcome PRs!
## Citation
```bibtex
@inproceedings{
wang2021codet5,
title={CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation},
author={Yue Wang and Weishi Wang and Shafiq Joty and Steven C.H. Hoi},
booktitle={EMNLP},
year={2021},
}
@inproceedings{
le2022coderl,
title={CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
author={Le, Hung and Wang, Yue and Gotmare, Akhilesh Deepak and Savarese, Silvio and Hoi, Steven C. H.},
booktitle={NeurIPS},
year={2022}
}
```

View File

@@ -26,7 +26,7 @@ At Salesforce, we build an AI coding assistant demo using CodeT5 as a VS Code plugin
**May 2023**
**CodeT5+** paper and models are released🔥 <br>
[paper](https://arxiv.org/pdf/2305.07922.pdf) | [code](https://github.com/salesforce/CodeT5/tree/main/CodeT5+) | [model](https://huggingface.co/models?sort=downloads&search=codet5p) | [blog](https://blog.salesforceairesearch.com/codet5-open-code-large-language-models/)
**Sep 2022**
@@ -56,7 +56,6 @@ multilingual code summarization.
## Citation
If you find this code to be useful for your research, please consider citing:
@@ -74,7 +73,7 @@ If you find this code to be useful for your research, please consider citing:
le2022coderl,
title={CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
author={Le, Hung and Wang, Yue and Gotmare, Akhilesh Deepak and Savarese, Silvio and Hoi, Steven C. H.},
booktitle={NeurIPS},
year={2022}
}