Mirror of https://github.com/salesforce/CodeT5.git (synced 2024-10-01 06:35:38 -04:00)
add two codet5-large checkpoints
parent 5b37c34f4b
commit afcc8efd4a
README.md (28 changed lines)
@@ -11,10 +11,19 @@ This is the official PyTorch implementation for the following EMNLP 2021 paper f
 
 ## Updates
 
+**July 06, 2022**
+
+We release two large-sized CodeT5 checkpoints at Hugging Face: [Salesforce/codet5-large](https://huggingface.co/Salesforce/codet5-large) and [Salesforce/codet5-large-ntp-py](https://huggingface.co/Salesforce/codet5-large-ntp-py), which are introduced in the paper [CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning](https://arxiv.org/pdf/2207.01780.pdf) by Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven C.H. Hoi.
+
+* CodeT5-large was pretrained using the Masked Span Prediction (MSP) objective on CodeSearchNet and achieves new SOTA results on several CodeXGLUE benchmarks. The finetuned checkpoints are released [here](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models). See Appendix A.1 of the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
+
+* CodeT5-large-ntp-py was first pretrained using the Masked Span Prediction (MSP) objective on CodeSearchNet and GCPY (the Python split of [Github Code](https://huggingface.co/datasets/codeparrot/github-code) data), followed by another 10 epochs on GCPY using the Next Token Prediction (NTP) objective.
+
+CodeT5-large-ntp-py is especially optimized for Python code generation tasks and is employed as the foundation model for our [CodeRL](https://github.com/salesforce/CodeRL), yielding new SOTA results on the APPS Python competition-level program synthesis benchmark. See the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
+
 **Oct 29, 2021**
 
-We
-release [fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models)
+We release [fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models)
 for all the downstream tasks covered in the paper.
 
 **Oct 25, 2021**
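The release notes in the hunk above mention the Masked Span Prediction (MSP) objective. As a rough illustration only, the sketch below builds a T5-style source/target pair by replacing chosen spans with `<extra_id_N>` sentinel tokens. The function name `mask_spans`, the whitespace tokenization, and the hand-picked spans are assumptions made for this example; this is not the authors' preprocessing code, and it omits the closing sentinel that some T5 pipelines append to the target.

```python
from typing import List, Tuple


def mask_spans(tokens: List[str], spans: List[Tuple[int, int]]) -> Tuple[str, str]:
    """Replace each [start, end) span with a sentinel token.

    Returns (source, target) strings in T5-style sentinel format.
    Spans are assumed to be non-overlapping (hypothetical helper).
    """
    source: List[str] = []
    target: List[str] = []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        source.extend(tokens[cursor:start])  # keep the unmasked tokens
        source.append(sentinel)              # the masked span becomes one sentinel
        target.append(sentinel)              # the target names the sentinel...
        target.extend(tokens[start:end])     # ...followed by the hidden tokens
        cursor = end
    source.extend(tokens[cursor:])
    return " ".join(source), " ".join(target)


tokens = "def add ( a , b ) : return a + b".split()
src, tgt = mask_spans(tokens, [(1, 2), (9, 12)])
print(src)  # def <extra_id_0> ( a , b ) : return <extra_id_1>
print(tgt)  # <extra_id_0> add <extra_id_1> a + b
```

The model sees `src` as encoder input and is trained to emit `tgt`, so each sentinel in the target marks which gap the following tokens fill.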
@@ -114,7 +123,7 @@ CodeT5 as a VS Code plugin to provide three capabilities for Apex developers:
 
 ## Citation
 
-If you find this code to be useful for your research, please consider citing.
+If you find this code to be useful for your research, please consider citing:
 
 ```
 @inproceedings{
@@ -124,6 +133,13 @@ If you find this code to be useful for your research, please consider citing.
 booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021},
 year={2021},
 }
+
+@article{coderl2022,
+title={CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
+author={Le, Hung and Wang, Yue and Gotmare, Akhilesh Deepak and Savarese, Silvio and Hoi, Steven C. H.},
+journal={arXiv preprint arXiv:2207.01780},
+year={2022}
+}
 ```
 
 ## License
@@ -216,6 +232,12 @@ Please refer to the argument flags in [configs.py](https://github.com/salesforce
 available options. The saved training curves in `summary_dir` can be visualized using [tensorboard](https://pypi.org/project/tensorboard/).
 Note that we employ one A100 GPU for all fine-tuning experiments.
 
+### How to reproduce the results using the released finetuned checkpoints?
+
+* Remove `--do_train --do_eval --do_eval_bleu` and keep only `--do_test` [here](https://github.com/salesforce/CodeT5/blob/5b37c34f4bbbfcfd972c24a9dd1f45716568ecb5/sh/exp_with_args.sh#L84).
+* Pass the path of your downloaded finetuned checkpoint to load [here](https://github.com/salesforce/CodeT5/blob/5b37c34f4bbbfcfd972c24a9dd1f45716568ecb5/run_gen.py#L366), e.g., `file = "CodeT5/finetuned_models/summarize_python_codet5_base.bin"`.
+* Run the program: `python run_exp.py --model_tag codet5_base --task summarize --sub_task python`
+
 ### How to fine-tune on your own task and dataset?
 If you want to fine-tune on your own dataset, add your task and sub_task in `configs.py` ([here](https://github.com/salesforce/CodeT5/blob/d27512d23ba6130e089e571d8c3e399760db1c31/configs.py#L11)), then add your data path and the read function in `utils.py` ([here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L103) and [here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L149)). The read function can be implemented in `_utils.py`, similar to [this one](https://github.com/salesforce/CodeT5/blob/aaf9c4a920c4986abfd54a74f5456b056b6409e0/_utils.py#L213). If the task you are adding is a generation task, you can simply reuse or customize `run_gen.py`. For understanding tasks, please refer to `run_defect.py` and `run_clone.py`.
 
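The fine-tuning paragraph above asks for a custom read function. A minimal sketch of what such a reader might look like follows, assuming a JSON Lines file with the keys `code` and `docstring`. The `Example` dataclass, the function name `read_my_task_examples`, and the `data_num` convention (`-1` meaning "read everything") are placeholders invented for this illustration; check the linked `_utils.py` for the container and signature the repository actually uses.

```python
import json
import tempfile
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """Hypothetical stand-in for the repository's example container."""
    idx: int
    source: str
    target: str


def read_my_task_examples(filename: str, data_num: int = -1) -> List[Example]:
    """Read one JSON object per line; keys 'code' and 'docstring' are assumed."""
    examples: List[Example] = []
    with open(filename, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            if data_num != -1 and len(examples) >= data_num:
                break     # stop once the requested number of examples is read
            record = json.loads(line)
            examples.append(
                Example(idx=len(examples), source=record["code"], target=record["docstring"])
            )
    return examples


# Small demonstration on a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    tmp.write(json.dumps({"code": "def f(): pass", "docstring": "Do nothing."}) + "\n")
    tmp.write(json.dumps({"code": "def g(): return 1", "docstring": "Return one."}) + "\n")
    demo_path = tmp.name

demo_examples = read_my_task_examples(demo_path)
print(len(demo_examples), demo_examples[0].target)
```

For a generation task the `source` field would feed the encoder and `target` the decoder; an understanding task such as defect detection would instead need a label field.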
@@ -21,6 +21,8 @@ using a masked language modeling (MLM) loss.
 
 from __future__ import absolute_import
 import os
+import pdb
+
 from models import CloneModel
 import logging
 import argparse
@@ -136,6 +138,7 @@ def main():
 config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path)
 model = model_class.from_pretrained(args.model_name_or_path)
 tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name)
+model.resize_token_embeddings(32000)
 
 model = CloneModel(model, config, tokenizer, args)
 logger.info("Finish loading model [%s] from %s", get_model_size(model), args.model_name_or_path)
@@ -1,4 +1,4 @@
-WORKDIR="path_to_your_dir/CodeT5"
+WORKDIR="your_CodeT5_path/CodeT5"
 export PYTHONPATH=$WORKDIR
 
 TASK=${1}
@@ -64,6 +64,10 @@ elif [[ $MODEL_TAG == codet5_base ]]; then
 MODEL_TYPE=codet5
 TOKENIZER=Salesforce/codet5-base
 MODEL_PATH=Salesforce/codet5-base
+elif [[ $MODEL_TAG == codet5_large ]]; then
+MODEL_TYPE=codet5
+TOKENIZER=Salesforce/codet5-large
+MODEL_PATH=Salesforce/codet5-large
 fi
 
 
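The shell branch added above maps a model tag to a model type, tokenizer, and checkpoint path. The same lookup can be sketched in Python as a dictionary. Only the two tags visible in this hunk are included, and the names `CHECKPOINTS` and `resolve_checkpoint` are hypothetical, introduced purely for illustration.

```python
from typing import Dict

# Tag -> Hugging Face checkpoint name, limited to the tags visible in the hunk.
CHECKPOINTS: Dict[str, str] = {
    "codet5_base": "Salesforce/codet5-base",
    "codet5_large": "Salesforce/codet5-large",
}


def resolve_checkpoint(model_tag: str) -> Dict[str, str]:
    """Return the settings the shell script assigns for a given tag."""
    try:
        name = CHECKPOINTS[model_tag]
    except KeyError:
        raise ValueError(f"unknown model tag: {model_tag!r}") from None
    # In the script, TOKENIZER and MODEL_PATH point at the same checkpoint.
    return {"model_type": "codet5", "tokenizer": name, "model_path": name}


print(resolve_checkpoint("codet5_large"))
```

Unlike the shell `if`/`elif` chain, which silently leaves the variables unset for an unrecognized tag, this sketch raises an error, which tends to surface typos earlier.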
@@ -78,10 +82,9 @@ else
 RUN_FN=${WORKDIR}/run_gen.py
 fi
 
-
 CUDA_VISIBLE_DEVICES=${GPU} \
-python ${RUN_FN} \
---do_train --do_eval --do_eval_bleu --do_test ${MULTI_TASK_AUG} \
+python ${RUN_FN} ${MULTI_TASK_AUG} \
+--do_train --do_eval --do_eval_bleu --do_test \
 --task ${TASK} --sub_task ${SUB_TASK} --model_type ${MODEL_TYPE} --data_num ${DATA_NUM} \
 --num_train_epochs ${EPOCH} --warmup_steps ${WARMUP} --learning_rate ${LR}e-5 --patience ${PATIENCE} \
 --tokenizer_name=${TOKENIZER} --model_name_or_path=${MODEL_PATH} --data_dir ${WORKDIR}/data \
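This hunk moves the four mode flags onto their own line, which is the line the README instructions in this commit say to edit for a test-only run. The sketch below expresses that choice as a small function and also assembles the `run_exp.py` command quoted in the README. The helper names `mode_flags` and `build_run_exp_command` are assumptions for this example; in the repository the mode flags are edited by hand in the shell script, not passed to `run_exp.py`.

```python
from typing import List


def mode_flags(test_only: bool = False) -> List[str]:
    """Flags for the run script: the full pipeline, or evaluation on the test set only."""
    if test_only:
        return ["--do_test"]
    return ["--do_train", "--do_eval", "--do_eval_bleu", "--do_test"]


def build_run_exp_command(model_tag: str, task: str, sub_task: str) -> List[str]:
    """Assemble the launcher command from the README as an argument list."""
    return [
        "python", "run_exp.py",
        "--model_tag", model_tag,
        "--task", task,
        "--sub_task", sub_task,
    ]


print(" ".join(mode_flags(test_only=True)))
print(" ".join(build_run_exp_command("codet5_base", "summarize", "python")))
```

Keeping the command as a list rather than a single string makes it safe to hand to `subprocess.run` without shell quoting concerns.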
@@ -76,6 +76,8 @@ def get_args_by_task_model(task, sub_task, model_tag):
 bs = 64
 elif task == 'clone':
 bs = 25
+elif 'codet5_large' in model_tag:
+bs = 8
 else:
 bs = 32
 if task == 'translate':
@@ -142,7 +144,7 @@ def get_sub_tasks(task):
 if __name__ == '__main__':
 parser = argparse.ArgumentParser()
 parser.add_argument("--model_tag", type=str, default='codet5_base',
-choices=['roberta', 'codebert', 'bart_base', 'codet5_small', 'codet5_base'])
+choices=['roberta', 'codebert', 'bart_base', 'codet5_small', 'codet5_base', 'codet5_large'])
 parser.add_argument("--task", type=str, default='summarize', choices=['summarize', 'concode', 'translate',
 'refine', 'defect', 'clone', 'multi_task'])
 parser.add_argument("--sub_task", type=str, default='ruby')
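The change above adds `codet5_large` to the accepted `--model_tag` values. A self-contained sketch of how `argparse` enforces such a `choices` list is shown below. The `build_parser` wrapper and the `MODEL_TAGS` constant are introduced here for illustration, and only the `--model_tag` argument from the hunk is reproduced.

```python
import argparse

# The tag list as it stands after this commit.
MODEL_TAGS = ['roberta', 'codebert', 'bart_base', 'codet5_small', 'codet5_base', 'codet5_large']


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_tag", type=str, default='codet5_base', choices=MODEL_TAGS)
    return parser


# Parse an explicit argument list instead of sys.argv so the example is reproducible.
args = build_parser().parse_args(["--model_tag", "codet5_large"])
print(args.model_tag)
```

A value outside `choices` makes `argparse` print a usage error and exit, which is why the new tag had to be added to the list before the large checkpoint could be selected from the command line.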