add humaneval evaluation

WANG Yue 2023-05-20 18:27:46 +08:00
parent aaeb477b89
commit ce36447d85
7 changed files with 283 additions and 10 deletions

View File

@@ -1,11 +1,13 @@
# CodeT5+
Official research release for the **CodeT5+** models (`220M`, `770M`, `2B`, `6B`, `16B`) for a wide range of **Code Understanding and Generation** tasks.
Find out more via our [blog post](https://blog.salesforceairesearch.com/codet5-open-code-large-language-models/).
*Title*: [CodeT5+: Open Code Large Language Models for Code Understanding and Generation](https://arxiv.org/pdf/2305.07922.pdf)
*Authors*: [Yue Wang](https://yuewang-cuhk.github.io/)\*, [Hung Le](https://sites.google.com/view/henryle2018/home?pli=1)\*, [Akhilesh Deepak Gotmare](https://akhileshgotmare.github.io/), [Nghi D.Q. Bui](https://bdqnghi.github.io/), [Junnan Li](https://sites.google.com/site/junnanlics), [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home) (* indicates equal contribution)
# What is this about?
CodeT5+ is a new family of open code large language models with an encoder-decoder architecture that can flexibly operate in different modes (i.e. _encoder-only_, _decoder-only_, and _encoder-decoder_) to support a wide range of code understanding and generation tasks.
@@ -32,7 +34,8 @@ We release the following CodeT5+ models at Huggingface:
# How to Use?
All CodeT5+ models and tokenizers can be easily loaded using the `AutoModelForSeq2SeqLM` and `AutoTokenizer` functionality.
For tokenizers, CodeT5+ `220M` and `770M` employ the same tokenizer as the original [CodeT5](https://github.com/salesforce/CodeT5), while CodeT5+ `2B`, `6B`, and `16B` employ the same tokenizer as [CodeGen](https://github.com/salesforce/CodeGen).
To load CodeT5+ `2B`, `6B`, `16B`, and InstructCodeT5+ `16B`, please set `trust_remote_code=True` as the [model class](https://huggingface.co/Salesforce/codet5p-16b/blob/main/modeling_codet5p.py) is defined in the Huggingface repo.
In addition, these models benefit from passing additional prompts to the decoder via `decoder_input_ids` to achieve better generation performance.
```python
@@ -48,17 +51,28 @@ model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint,
low_cpu_mem_usage=True,
trust_remote_code=True).to(device)
encoding = tokenizer("def print_hello_world():", return_tensors="pt").to(device)
encoding['decoder_input_ids'] = encoding['input_ids'].clone()
outputs = model.generate(**encoding, max_length=15)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
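Note that the `220M` and `770M` models follow the standard T5 architecture, so they load without `trust_remote_code` and without decoder prompts. A minimal sketch, shown here with the `Salesforce/codet5p-220m` checkpoint:
```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "Salesforce/codet5p-220m"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint).to(device)

# the small T5-style checkpoints do not need decoder_input_ids
inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs, max_length=15)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```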
# Reproduce the Results
## HumanEval
### Installation
* Install the official HumanEval evaluation tool released by OpenAI following the instructions in this [repo](https://github.com/openai/human-eval).
* Install PyTorch (version `1.13.1`) and transformers (version `4.21.3`); a quick sanity check is sketched below.
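After installation, a minimal sanity check (a sketch, not part of the release scripts) can confirm the environment is ready:
```python
# sanity check: the pinned libraries and the HumanEval problems are importable
import torch
import transformers
from human_eval.data import read_problems

print(torch.__version__)         # expect 1.13.1
print(transformers.__version__)  # expect 4.21.3
print(len(read_problems()))      # 164 HumanEval problems
```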
### Generating programs from CodeT5+ models
`cd humaneval`, then run the inference via `bash run_generate.sh`.
You can select the model to generate from by changing the `model` variable in the script.
Following the original setting in the HumanEval paper, we generate 200 programs (`pred_num=200`) for each problem and employ nucleus sampling with different temperatures `T` for computing `pass@k` (`T=0.2,0.6,0.8` for `k=1,10,100`, respectively).
The generated programs will be saved in `preds/${model}_T${temp}_N${pred_num}`.
### Evaluating pass@k
`cd humaneval`, then run the evaluation via `bash run_eval.sh`.
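Under the hood, `evaluate_functional_correctness` (from the OpenAI tool) reports the unbiased `pass@k` estimator from the HumanEval paper: `1 - C(n-c, k) / C(n, k)` averaged over problems, where `n` is the number of samples per problem and `c` the number that pass all unit tests. A minimal sketch of that estimator (the `pass_at_k` name is illustrative):
```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), computed in a numerically stable way."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. with n=200 samples and c=37 passing, pass@1 reduces to c/n = 0.185
print(pass_at_k(n=200, c=37, k=1))
```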
## Citation

View File: CodeT5+/humaneval/generate_codet5p.py

@@ -0,0 +1,162 @@
import argparse
import pprint
import os
import re
from tqdm import tqdm
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from human_eval.data import write_jsonl, read_problems, stream_jsonl
def extract_text(prompt, remove_lines=True):
    token = '\"\"\"'
    start = token
    end = '>>>'

    start_idx = prompt.find(start) + len(start)
    end_idx = prompt.find(end)
    output = prompt[start_idx: end_idx]

    if remove_lines:
        output = output.replace('\n', ' ')
    output = re.sub(r"\s+", " ", output).strip()

    return output
INSTRUCTION = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Create a Python script for this problem:
{}
### Response:"""
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model', type=str, default='Salesforce/instructcodet5p-16b', help="")
    parser.add_argument('--output_path', type=str, help="")
    parser.add_argument('--start_index', type=int, default=0, help="")
    parser.add_argument('--end_index', type=int, default=164, help="")
    parser.add_argument('--temperature', type=float, default=0.8, help="")
    parser.add_argument('--N', type=int, default=200, help="")
    parser.add_argument('--max_len', type=int, default=600, help="")
    parser.add_argument('--decoding_style', type=str, default='sampling', help="")
    parser.add_argument('--num_seqs_per_iter', type=int, default=50, help='')
    parser.add_argument('--overwrite', action='store_true', help='')

    args = parser.parse_args()
    argsdict = vars(args)
    print(pprint.pformat(argsdict))

    # completions are truncated at the first top-level statement after the function body
    STOP_SEQS = ['\nclass', '\ndef', '\n#', '\nif', '\nprint']

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    problems = read_problems()
    task_ids = sorted(problems.keys())[args.start_index: args.end_index]
    prompts = [problems[task_id]['prompt'] for task_id in task_ids]
    num_samples = len(prompts)
    print("Number of samples: {}".format(num_samples))

    tokenizer = AutoTokenizer.from_pretrained(args.model)
    model = AutoModelForSeq2SeqLM.from_pretrained(args.model,
                                                  trust_remote_code=True,  # False for 220m and 770m models
                                                  torch_dtype=torch.float16,
                                                  low_cpu_mem_usage=True)
    model.eval()
    model.to(device)

    # for larger LLMs such as 2B, 6B, and 16B, we need to pass the text prompt to the decoder
    prompt_to_decoder = True if any([size in args.model for size in ['2b', '6b', '16b']]) else False

    print(f"Loaded {args.model}.")
    for i in tqdm(range(num_samples), ncols=0, total=num_samples):
        output_file = args.output_path + '/{}.jsonl'.format(args.start_index + i)
        if os.path.exists(output_file) and not args.overwrite:
            print(f'Skip {output_file} as it already exists')
            continue

        # represent 4-space indentation as tabs during generation; converted back below
        prompt = prompts[i].replace('    ', '\t')
        if args.model == 'Salesforce/instructcodet5p-16b':
            prompt_batch = [INSTRUCTION.format(extract_text(prompt))]
            prompt_batch_decoder = [INSTRUCTION.format(extract_text(prompt)) + prompt]
        else:
            prompt_batch = [prompt]
            prompt_batch_decoder = [prompt]

        ids_batch = [task_ids[i]]
        completion_seqs = []

        encoding = tokenizer(prompt_batch, return_tensors="pt", truncation=True, max_length=args.max_len).to(device)
        encoding_decoder = tokenizer(prompt_batch_decoder, return_tensors="pt", truncation=True,
                                     max_length=args.max_len).to(device)

        if args.decoding_style == 'sampling':
            loops = int(args.N / args.num_seqs_per_iter)
        else:
            loops = 1

        for _ in tqdm(range(loops), total=loops, leave=False, ncols=0):
            gen_tokens = None
            with torch.no_grad():
                if args.decoding_style == 'sampling':
                    if prompt_to_decoder:
                        gen_tokens = model.generate(**encoding,
                                                    decoder_input_ids=encoding_decoder['input_ids'],
                                                    do_sample=True,
                                                    temperature=args.temperature,
                                                    max_length=args.max_len,
                                                    num_return_sequences=args.num_seqs_per_iter,
                                                    decoder_start_token_id=tokenizer.pad_token_id,
                                                    eos_token_id=tokenizer.eos_token_id,
                                                    top_p=0.95)
                    else:
                        gen_tokens = model.generate(**encoding,
                                                    do_sample=True,
                                                    temperature=args.temperature,
                                                    max_length=args.max_len,
                                                    num_return_sequences=args.num_seqs_per_iter,
                                                    eos_token_id=tokenizer.eos_token_id,
                                                    top_p=0.95)

            if gen_tokens is not None:
                if prompt_to_decoder:
                    # drop the echoed decoder prompt, keeping only newly generated tokens
                    gen_tokens = gen_tokens[:, encoding_decoder['input_ids'].shape[-1]:]
                gen_seqs = tokenizer.batch_decode(gen_tokens, skip_special_tokens=True)
            else:
                gen_seqs = None

            if gen_seqs is not None:
                assert len(ids_batch) == 1
                task_id = ids_batch[0]
                for seq_idx, gen_seq in enumerate(gen_seqs):
                    completion_seq = gen_seq
                    for stop_seq in STOP_SEQS:
                        index = completion_seq.find(stop_seq)
                        if index != -1:
                            completion_seq = completion_seq[:index]
                    completion_seq = completion_seq.replace('\t', '    ')
                    all_code = prompt.replace('\t', '    ') + completion_seq
                    completion_seqs.append(
                        {'task_id': task_id,
                         'completion': completion_seq,
                         'all_code': all_code  # final code for evaluation with unit tests
                         }
                    )

        print("Saving results to {}".format(output_file))
        write_jsonl(output_file, completion_seqs)


if __name__ == '__main__':
    main()

View File: CodeT5+/humaneval/process_preds.py

@@ -0,0 +1,47 @@
from human_eval.data import read_problems, write_jsonl, stream_jsonl
import glob
from tqdm import tqdm
import argparse
parser = argparse.ArgumentParser()

# Inputs
parser.add_argument(
    '--path',
    type=str,
    help="")
parser.add_argument(
    '--out_path',
    type=str,
    help="")
parser.add_argument(
    '--add_prompt',
    action='store_true',
    help='')

args = parser.parse_args()

files = sorted(glob.glob(args.path + '/*.jsonl'))
print("{} files in {}".format(len(files), args.path))

problems = read_problems('data/HumanEval.jsonl.gz')

output = []
for code_file in tqdm(files, total=len(files)):
    codes = [c for c in stream_jsonl(code_file)]
    if args.add_prompt:
        for code in codes:
            task_id = code['task_id']
            prompt = problems[task_id]['prompt']
            if 'def' in code['completion']:
                # strip everything up to and including the generated function
                # signature, so only the body is appended to the original prompt
                def_line = code['completion'].index('def')
                completion = code['completion'][def_line:]
                next_line = completion.index('\n')
                completion = code['completion'][def_line + next_line + 1:]
                code['all_code'] = prompt + completion
    output += codes

print("save to {}".format(args.out_path))
write_jsonl(args.out_path, output)

View File: CodeT5+/humaneval/run_generate.sh

@@ -0,0 +1,29 @@
model=instructcodet5p-16b
temp=0.2
max_len=800
pred_num=200
num_seqs_per_iter=2 # 25 for 350M and 770M, 10 for 2B, 8 for 6B, 2 for 16B on A100-40G
output_path=preds/${model}_T${temp}_N${pred_num}
mkdir -p ${output_path}
echo 'Output path: '$output_path
echo 'Model to eval: '$model
# 164 problems, 21 per GPU if GPU=8
index=0
gpu_num=8
for ((i = 0; i < $gpu_num; i++)); do
  start_index=$((i * 21))
  end_index=$(((i + 1) * 21))

  gpu=$((i))
  echo 'Running process #' ${i} 'from' $start_index 'to' $end_index 'on GPU' ${gpu}
  ((index++))
  (
    CUDA_VISIBLE_DEVICES=$gpu python generate_codet5p.py --model Salesforce/${model} \
      --start_index ${start_index} --end_index ${end_index} --temperature ${temp} \
      --num_seqs_per_iter ${num_seqs_per_iter} --N ${pred_num} --max_len ${max_len} --output_path ${output_path}
  ) &
  if (($index % $gpu_num == 0)); then wait; fi
done

View File: CodeT5+/humaneval/run_eval.sh

@@ -0,0 +1,6 @@
output_path=preds/instructcodet5p-16b_T0.2_N200
echo 'Output path: '$output_path
python process_preds.py --path ${output_path} --out_path ${output_path}.jsonl
evaluate_functional_correctness ${output_path}.jsonl

View File

@@ -197,7 +197,23 @@ Note that we employ one A100 GPU for all fine-tuning experiments.
### How to fine-tune on your own task and dataset?
If you want to fine-tune on your own dataset, you can add your own task and sub_task in `configs.py` ([here](https://github.com/salesforce/CodeT5/blob/d27512d23ba6130e089e571d8c3e399760db1c31/configs.py#L11)) and add your data path and the function to read the data in `utils.py` ([here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L103) and [here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L149)). The read function can be implemented in `_utils.py`, similar to [this one](https://github.com/salesforce/CodeT5/blob/aaf9c4a920c4986abfd54a74f5456b056b6409e0/_utils.py#L213). If the task you add is a generation task, you can simply reuse or customize `run_gen.py`; for understanding tasks, please refer to `run_defect.py` and `run_clone.py`.
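As a rough illustration, a read function for a custom generation task might look like the sketch below. This is hypothetical: the jsonl schema, the function name, and the `Example` container (mirroring the simple idx/source/target objects produced by the existing read functions) are assumptions, not the repo's actual API.
```python
import json

class Example:
    """Minimal container with the fields the read functions produce (assumed)."""
    def __init__(self, idx, source, target):
        self.idx = idx
        self.source = source
        self.target = target

def read_mytask_examples(filename, data_num=-1):
    """Read a jsonl file of {"source": ..., "target": ...} records (hypothetical schema)."""
    examples = []
    with open(filename, encoding='utf-8') as f:
        for idx, line in enumerate(f):
            x = json.loads(line)
            examples.append(Example(idx=idx, source=x['source'].strip(), target=x['target'].strip()))
            if idx + 1 == data_num:
                break
    return examples
```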
## Get Involved
Please create a GitHub issue if you have any questions, suggestions, requests or bug-reports. We welcome PRs!
## Citation
```bibtex
@inproceedings{
wang2021codet5,
title={CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation},
author={Yue Wang and Weishi Wang and Shafiq Joty and Steven C.H. Hoi},
booktitle={EMNLP},
year={2021},
}
@inproceedings{
le2022coderl,
title={CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
author={Le, Hung and Wang, Yue and Gotmare, Akhilesh Deepak and Savarese, Silvio and Hoi, Steven C. H.},
booktitle={NeurIPS},
year={2022}
}
```

View File

@@ -26,7 +26,7 @@ At Salesforce, we build an AI coding assistant demo using CodeT5 as a VS Code plugin
**May 2023**
**CodeT5+** paper and models are released🔥 <br>
[paper](https://arxiv.org/pdf/2305.07922.pdf) | [code](https://github.com/salesforce/CodeT5/tree/main/CodeT5+) | [model](https://huggingface.co/models?sort=downloads&search=codet5p) | [blog](https://blog.salesforceairesearch.com/codet5-open-code-large-language-models/)
**Sep 2022**
@@ -56,7 +56,6 @@ multilingual code summarization.
## Citation
If you find this code to be useful for your research, please consider citing:
@@ -74,7 +73,7 @@ If you find this code to be useful for your research, please consider citing:
le2022coderl,
title={CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
author={Le, Hung and Wang, Yue and Gotmare, Akhilesh Deepak and Savarese, Silvio and Hoi, Steven C. H.},
booktitle={NeurIPS},
year={2022}
}