add two codet5-large checkpoints

WANG Yue 2022-07-08 11:34:25 +08:00
parent 5b37c34f4b
commit afcc8efd4a
4 changed files with 38 additions and 8 deletions


@ -11,10 +11,19 @@ This is the official PyTorch implementation for the following EMNLP 2021 paper f
## Updates
**July 06, 2022**
We release two large-sized CodeT5 checkpoints on Hugging Face: [Salesforce/codet5-large](https://huggingface.co/Salesforce/codet5-large) and [Salesforce/codet5-large-ntp-py](https://huggingface.co/Salesforce/codet5-large-ntp-py), which were introduced in the paper [CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning](https://arxiv.org/pdf/2207.01780.pdf) by Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven C.H. Hoi.
* CodeT5-large was pretrained using the Masked Span Prediction (MSP) objective on CodeSearchNet and achieves new SOTA results on several CodeXGLUE benchmarks. The finetuned checkpoints are released [here](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models). See Appendix A.1 of the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
* CodeT5-large-ntp-py was first pretrained using the Masked Span Prediction (MSP) objective on CodeSearchNet and GCPY (the Python split of the [Github Code](https://huggingface.co/datasets/codeparrot/github-code) data), followed by another 10 epochs on GCPY using the Next Token Prediction (NTP) objective.
CodeT5-large-ntp-py is optimized specifically for Python code generation tasks and serves as the foundation model for our [CodeRL](https://github.com/salesforce/CodeRL), yielding new SOTA results on the APPS Python competition-level program synthesis benchmark. See the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
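As a minimal sketch of how either checkpoint might be loaded, assuming the Hugging Face `transformers` package is installed and that both checkpoints load as T5-style sequence-to-sequence models: the helper name `load_checkpoint` and the `CHECKPOINTS` mapping are illustrative and are not part of this repository.

```python
# Hugging Face repository IDs taken from the release note above.
CHECKPOINTS = {
    "codet5-large": "Salesforce/codet5-large",
    "codet5-large-ntp-py": "Salesforce/codet5-large-ntp-py",
}


def load_checkpoint(name):
    """Download and return (tokenizer, model) for one of the two checkpoints.

    Hypothetical helper: `transformers` is imported lazily so that the
    mapping above can be used without the package installed.
    """
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    repo_id = CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = T5ForConditionalGeneration.from_pretrained(repo_id)
    return tokenizer, model


if __name__ == "__main__":
    # Requires network access and several GB of disk space.
    tokenizer, model = load_checkpoint("codet5-large")
    text = "def print_hello_world():<extra_id_0>"
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    generated = model.generate(input_ids, max_length=8)
    print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

The `<extra_id_0>` sentinel in the usage example reflects the masked span objective; for `codet5-large-ntp-py`, a plain left-to-right prompt is likely the better fit.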
**Oct 29, 2021**
We
release [fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models)
We release [fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models)
for all the downstream tasks covered in the paper.
**Oct 25, 2021**
@ -114,7 +123,7 @@ CodeT5 as a VS Code plugin to provide three capabilities for Apex developers:
## Citation
If you find this code to be useful for your research, please consider citing.
If you find this code to be useful for your research, please consider citing:
```
@inproceedings{
@ -124,6 +133,13 @@ If you find this code to be useful for your research, please consider citing.
booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021},
year={2021},
}
@article{coderl2022,
title={CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
author={Le, Hung and Wang, Yue and Gotmare, Akhilesh Deepak and Savarese, Silvio and Hoi, Steven C. H.},
journal={arXiv preprint arXiv:2207.01780},
year={2022}
}
```
## License
@ -216,6 +232,12 @@ Please refer to the argument flags in [configs.py](https://github.com/salesforce
available options. The saved training curves in `summary_dir` can be visualized using [tensorboard](https://pypi.org/project/tensorboard/).
Note that we employ one A100 GPU for all fine-tuning experiments.
### How to reproduce the results using the released finetuned checkpoints?
* Remove `--do_train --do_eval --do_eval_bleu` and keep only `--do_test` [here](https://github.com/salesforce/CodeT5/blob/5b37c34f4bbbfcfd972c24a9dd1f45716568ecb5/sh/exp_with_args.sh#L84).
* Set the path of your downloaded finetuned checkpoint to load [here](https://github.com/salesforce/CodeT5/blob/5b37c34f4bbbfcfd972c24a9dd1f45716568ecb5/run_gen.py#L366), e.g., `file = "CodeT5/finetuned_models/summarize_python_codet5_base.bin"`.
* Run the program: `python run_exp.py --model_tag codet5_base --task summarize --sub_task python`
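To script the last step for several tasks or model tags, the command can be assembled programmatically. This is only a sketch: `build_run_exp_command` is a hypothetical helper, and it assumes `run_exp.py` is in the current working directory and accepts exactly the three flags shown in the step above.

```python
import shlex
import sys


def build_run_exp_command(model_tag, task, sub_task, python=sys.executable):
    """Return the argv list for a run_exp.py invocation (hypothetical helper)."""
    return [
        python, "run_exp.py",
        "--model_tag", model_tag,
        "--task", task,
        "--sub_task", sub_task,
    ]


if __name__ == "__main__":
    cmd = build_run_exp_command("codet5_base", "summarize", "python")
    # Print the shell-quoted command for inspection before running it.
    print(shlex.join(cmd))
    # To launch it from the repository root, one could use:
    #   import subprocess; subprocess.run(cmd, check=True)
```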
### How to fine-tune on your own task and dataset?
If you want to fine-tune on your own dataset, add your task and sub_task in `configs.py` ([here](https://github.com/salesforce/CodeT5/blob/d27512d23ba6130e089e571d8c3e399760db1c31/configs.py#L11)), then add your data path and the read function in `utils.py` ([here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L103) and [here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L149)). The read function can be implemented in `_utils.py`, similar to [this one](https://github.com/salesforce/CodeT5/blob/aaf9c4a920c4986abfd54a74f5456b056b6409e0/_utils.py#L213). If the task you are adding is a generation task, you can simply reuse or customize `run_gen.py`. For understanding tasks, please refer to `run_defect.py` and `run_clone.py`.
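A read function for a new generation task might look roughly like the sketch below. Everything here is an assumption made for illustration: the `Example` container, the function name `read_my_task_examples`, the JSON Lines layout with `"input"` and `"output"` keys, and the convention that a non-positive `data_num` means "read everything". Check the linked `_utils.py` for the interface the repository actually expects.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """Hypothetical container; the repository's own example class may differ."""
    idx: int
    source: str
    target: str


def read_my_task_examples(filename: str, data_num: int = -1) -> List[Example]:
    """Read one JSON object per line, using assumed "input"/"output" keys."""
    examples = []
    with open(filename, encoding="utf-8") as f:
        for idx, line in enumerate(f):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            examples.append(
                Example(idx=idx, source=record["input"], target=record["output"])
            )
            # Stop early once data_num examples are collected (assumed semantics).
            if data_num > 0 and len(examples) >= data_num:
                break
    return examples
```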


@ -21,6 +21,8 @@ using a masked language modeling (MLM) loss.
from __future__ import absolute_import
import os
import pdb
from models import CloneModel
import logging
import argparse
@ -136,6 +138,7 @@ def main():
config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path)
model = model_class.from_pretrained(args.model_name_or_path)
tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name)
model.resize_token_embeddings(32000)
model = CloneModel(model, config, tokenizer, args)
logger.info("Finish loading model [%s] from %s", get_model_size(model), args.model_name_or_path)


@ -1,4 +1,4 @@
WORKDIR="path_to_your_dir/CodeT5"
WORKDIR="your_CodeT5_path/CodeT5"
export PYTHONPATH=$WORKDIR
TASK=${1}
@ -64,6 +64,10 @@ elif [[ $MODEL_TAG == codet5_base ]]; then
MODEL_TYPE=codet5
TOKENIZER=Salesforce/codet5-base
MODEL_PATH=Salesforce/codet5-base
elif [[ $MODEL_TAG == codet5_large ]]; then
MODEL_TYPE=codet5
TOKENIZER=Salesforce/codet5-large
MODEL_PATH=Salesforce/codet5-large
fi
@ -78,10 +82,9 @@ else
RUN_FN=${WORKDIR}/run_gen.py
fi
CUDA_VISIBLE_DEVICES=${GPU} \
python ${RUN_FN} \
--do_train --do_eval --do_eval_bleu --do_test ${MULTI_TASK_AUG} \
python ${RUN_FN} ${MULTI_TASK_AUG} \
--do_train --do_eval --do_eval_bleu --do_test \
--task ${TASK} --sub_task ${SUB_TASK} --model_type ${MODEL_TYPE} --data_num ${DATA_NUM} \
--num_train_epochs ${EPOCH} --warmup_steps ${WARMUP} --learning_rate ${LR}e-5 --patience ${PATIENCE} \
--tokenizer_name=${TOKENIZER} --model_name_or_path=${MODEL_PATH} --data_dir ${WORKDIR}/data \


@ -76,6 +76,8 @@ def get_args_by_task_model(task, sub_task, model_tag):
bs = 64
elif task == 'clone':
bs = 25
elif 'codet5_large' in model_tag:
bs = 8
else:
bs = 32
if task == 'translate':
@ -142,7 +144,7 @@ def get_sub_tasks(task):
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument("--model_tag", type=str, default='codet5_base',
choices=['roberta', 'codebert', 'bart_base', 'codet5_small', 'codet5_base'])
choices=['roberta', 'codebert', 'bart_base', 'codet5_small', 'codet5_base', 'codet5_large'])
parser.add_argument("--task", type=str, default='summarize', choices=['summarize', 'concode', 'translate',
'refine', 'defect', 'clone', 'multi_task'])
parser.add_argument("--sub_task", type=str, default='ruby')