add two codet5-large checkpoints

2024-10-01 06:35:38 -04:00 · 2022-07-08 11:34:25 +08:00 · afcc8efd4a
commit afcc8efd4a
parent 5b37c34f4b
4 changed files with 38 additions and 8 deletions
--- a/README.md
+++ b/README.md
@@ -11,10 +11,19 @@ This is the official PyTorch implementation for the following EMNLP 2021 paper f

 ## Updates

+**July 06, 2022**
+
+We release two large CodeT5 checkpoints on Hugging Face: [Salesforce/codet5-large](https://huggingface.co/Salesforce/codet5-large) and [Salesforce/codet5-large-ntp-py](https://huggingface.co/Salesforce/codet5-large-ntp-py), which were introduced in the paper [CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning](https://arxiv.org/pdf/2207.01780.pdf) by Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven C.H. Hoi.
+
+* CodeT5-large was pretrained with the Masked Span Prediction (MSP) objective on CodeSearchNet and achieves new SOTA results on several CodeXGLUE benchmarks. The finetuned checkpoints are released [here](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models). See Appendix A.1 of the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
+
+* CodeT5-large-ntp-py was first pretrained with the Masked Span Prediction (MSP) objective on CodeSearchNet and GCPY (the Python split of the [Github Code](https://huggingface.co/datasets/codeparrot/github-code) data), followed by another 10 epochs on GCPY with the Next Token Prediction (NTP) objective.
+
+CodeT5-large-ntp-py is optimized specifically for Python code generation tasks and is employed as the foundation model for our [CodeRL](https://github.com/salesforce/CodeRL), yielding new SOTA results on the APPS Python competition-level program synthesis benchmark. See the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
+
 **Oct 29, 2021**

-We
-release [fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models)
+We release [fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models)
 for all the downstream tasks covered in the paper.

 **Oct 25, 2021**
@@ -114,7 +123,7 @@ CodeT5 as a VS Code plugin to provide three capabilities for Apex developers:

 ## Citation

-If you find this code to be useful for your research, please consider citing.
+If you find this code to be useful for your research, please consider citing:

 ```
@inproceedings{
@@ -124,6 +133,13 @@ If you find this code to be useful for your research, please consider citing.
    booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021},
    year={2021},
 }
+
+@article{coderl2022,
+  title={CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
+  author={Le, Hung and Wang, Yue and Gotmare, Akhilesh Deepak and Savarese, Silvio and Hoi, Steven C. H.},
+  journal={arXiv preprint arXiv:2207.01780},
+  year={2022}
+}
 ```

 ## License
@@ -216,6 +232,12 @@ Please refer to the argument flags in [configs.py](https://github.com/salesforce
 available options. The saved training curves in `summary_dir` can be visualized using [tensorboard](https://pypi.org/project/tensorboard/).
 Note that we employ one A100 GPU for all fine-tuning experiments.

+### How to reproduce the results using the released finetuned checkpoints?
+
+* Remove `--do_train --do_eval --do_eval_bleu` and keep only `--do_test` [here](https://github.com/salesforce/CodeT5/blob/5b37c34f4bbbfcfd972c24a9dd1f45716568ecb5/sh/exp_with_args.sh#L84).
+* Set the path of your downloaded finetuned checkpoint [here](https://github.com/salesforce/CodeT5/blob/5b37c34f4bbbfcfd972c24a9dd1f45716568ecb5/run_gen.py#L366), e.g., `file = "CodeT5/finetuned_models/summarize_python_codet5_base.bin"`
+* Run the program: `python run_exp.py --model_tag codet5_base --task summarize --sub_task python`
+
 ### How to fine-tune on your own task and dataset?
 If you want to fine-tune on your own dataset, add your task and sub_task in `configs.py` ([here](https://github.com/salesforce/CodeT5/blob/d27512d23ba6130e089e571d8c3e399760db1c31/configs.py#L11)), then add your data path and the read function in `utils.py` ([here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L103) and [here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L149)). The read function can be implemented in `_utils.py`, similar to [this one](https://github.com/salesforce/CodeT5/blob/aaf9c4a920c4986abfd54a74f5456b056b6409e0/_utils.py#L213). If the task you are adding is a generation task, you can simply reuse or customize `run_gen.py`. For understanding tasks, please refer to `run_defect.py` and `run_clone.py`.
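
As a rough illustration of the kind of read function described above, here is a minimal sketch. It assumes a JSON Lines file with hypothetical `code` and `summary` fields; the `Example` class and the `read_my_task_examples` name are illustrative stand-ins, not the repository's actual definitions, so check `_utils.py` for the real interface before adapting it.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """Stand-in container for one source/target pair (hypothetical, not the repo's class)."""
    idx: int
    source: str
    target: str


def read_my_task_examples(filename: str, data_num: int = -1) -> List[Example]:
    """Read up to `data_num` examples from a JSON Lines file (-1 reads everything).

    Assumes each non-blank line is a JSON object with hypothetical
    `code` and `summary` keys.
    """
    examples: List[Example] = []
    with open(filename, encoding="utf-8") as f:
        for idx, line in enumerate(f):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            examples.append(
                Example(idx=idx, source=record["code"], target=record["summary"])
            )
            if data_num != -1 and len(examples) >= data_num:
                break
    return examples
```

The `data_num` cap is included only because the launcher script passes a `--data_num` flag; how the repository's own readers interpret it may differ.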

--- a/run_clone.py
+++ b/run_clone.py
@@ -21,6 +21,8 @@ using a masked language modeling (MLM) loss.

 from __future__ import absolute_import
 import os
+import pdb
+
 from models import CloneModel
 import logging
 import argparse
@@ -136,6 +138,7 @@ def main():
    config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path)
    model = model_class.from_pretrained(args.model_name_or_path)
    tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name)
+    model.resize_token_embeddings(32000)

    model = CloneModel(model, config, tokenizer, args)
    logger.info("Finish loading model [%s] from %s", get_model_size(model), args.model_name_or_path)
--- a/sh/exp_with_args.sh
+++ b/sh/exp_with_args.sh
@@ -1,4 +1,4 @@
-WORKDIR="path_to_your_dir/CodeT5"
+WORKDIR="your_CodeT5_path/CodeT5"
 export PYTHONPATH=$WORKDIR

 TASK=${1}
@@ -64,6 +64,10 @@ elif [[ $MODEL_TAG == codet5_base ]]; then
  MODEL_TYPE=codet5
  TOKENIZER=Salesforce/codet5-base
  MODEL_PATH=Salesforce/codet5-base
+elif [[ $MODEL_TAG == codet5_large ]]; then
+  MODEL_TYPE=codet5
+  TOKENIZER=Salesforce/codet5-large
+  MODEL_PATH=Salesforce/codet5-large
 fi
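
The tag-to-checkpoint branches in this hunk amount to a lookup table. A minimal Python sketch of the same mapping follows, covering only the two tags visible here; `MODEL_SETTINGS` and `resolve_model` are hypothetical names introduced for illustration and do not exist in the repository.

```python
# Mapping copied from the two branches visible in this hunk; other tags
# handled by the script are not shown here and are therefore omitted.
MODEL_SETTINGS = {
    "codet5_base": {
        "model_type": "codet5",
        "tokenizer": "Salesforce/codet5-base",
        "model_path": "Salesforce/codet5-base",
    },
    "codet5_large": {
        "model_type": "codet5",
        "tokenizer": "Salesforce/codet5-large",
        "model_path": "Salesforce/codet5-large",
    },
}


def resolve_model(model_tag: str) -> dict:
    """Return the settings for a model tag, failing loudly on unknown tags."""
    try:
        return MODEL_SETTINGS[model_tag]
    except KeyError:
        raise ValueError(f"unknown model tag: {model_tag!r}") from None
```

Unlike the shell `if`/`elif` chain, which leaves the variables unset when no branch matches, this sketch raises an error for an unrecognized tag; that is a design choice of the sketch, not behavior of the script.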


@@ -78,10 +82,9 @@ else
  RUN_FN=${WORKDIR}/run_gen.py
 fi

-
 CUDA_VISIBLE_DEVICES=${GPU} \
-  python ${RUN_FN}  \
-  --do_train --do_eval --do_eval_bleu --do_test ${MULTI_TASK_AUG}  \
+  python ${RUN_FN}  ${MULTI_TASK_AUG}   \
+  --do_train --do_eval --do_eval_bleu --do_test  \
  --task ${TASK} --sub_task ${SUB_TASK} --model_type ${MODEL_TYPE} --data_num ${DATA_NUM}  \
  --num_train_epochs ${EPOCH} --warmup_steps ${WARMUP} --learning_rate ${LR}e-5 --patience ${PATIENCE} \
  --tokenizer_name=${TOKENIZER}  --model_name_or_path=${MODEL_PATH} --data_dir ${WORKDIR}/data  \
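
The README change in this commit describes evaluating a released finetuned checkpoint by dropping `--do_train --do_eval --do_eval_bleu` from the flag line above and keeping `--do_test`. In the repository this is a manual edit to the script; the sketch below only illustrates the resulting flag set, and `keep_test_only` is a hypothetical helper, not part of the codebase.

```python
# Flags that trigger training or validation, as named in the README step.
TRAINING_FLAGS = {"--do_train", "--do_eval", "--do_eval_bleu"}


def keep_test_only(args: list) -> list:
    """Drop training/validation flags and make sure --do_test is present."""
    kept = [arg for arg in args if arg not in TRAINING_FLAGS]
    if "--do_test" not in kept:
        kept.append("--do_test")
    return kept


original = ["--do_train", "--do_eval", "--do_eval_bleu", "--do_test",
            "--task", "summarize", "--sub_task", "python"]
print(keep_test_only(original))
```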
--- a/sh/run_exp.py
+++ b/sh/run_exp.py
@@ -76,6 +76,8 @@ def get_args_by_task_model(task, sub_task, model_tag):
            bs = 64
        elif task == 'clone':
            bs = 25
+    elif 'codet5_large' in model_tag:
+        bs = 8
    else:
        bs = 32
        if task == 'translate':
@@ -142,7 +144,7 @@ def get_sub_tasks(task):
 if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_tag", type=str, default='codet5_base',
-                        choices=['roberta', 'codebert', 'bart_base', 'codet5_small', 'codet5_base'])
+                        choices=['roberta', 'codebert', 'bart_base', 'codet5_small', 'codet5_base', 'codet5_large'])
    parser.add_argument("--task", type=str, default='summarize', choices=['summarize', 'concode', 'translate',
                                                                          'refine', 'defect', 'clone', 'multi_task'])
    parser.add_argument("--sub_task", type=str, default='ruby')
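
The batch-size branch added in `get_args_by_task_model` matches on a substring of the model tag. The sketch below isolates that one decision; it deliberately ignores the task-specific overrides in the surrounding function, and `pick_batch_size` is a hypothetical name used only for illustration.

```python
def pick_batch_size(model_tag: str, default: int = 32) -> int:
    """Simplified view of the new branch: large models get a batch size of 8.

    The real function also adjusts the batch size per task; those branches
    are only partly visible in this diff and are left out here.
    """
    if "codet5_large" in model_tag:  # substring match, as in the diff
        return 8
    return default


print(pick_batch_size("codet5_large"))
print(pick_batch_size("codet5_base"))
```

Because the check uses `in` rather than `==`, any tag that contains `codet5_large` takes the same branch. The argument parser in this commit accepts only `codet5_large` itself, so the distinction would matter only if longer tags were added later.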