AI/CodeT5

mirror of https://github.com/salesforce/CodeT5.git synced 2024-10-01 06:35:38 -04:00

Code for CodeT5: a new code-aware pre-trained encoder-decoder model.

CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

This is the official PyTorch implementation for the following EMNLP 2021 paper from Salesforce Research:

Title: CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

Authors: Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi

Updates

Sep 24, 2021

CodeT5 is now on Hugging Face!

You can load either model (CodeT5-small or CodeT5-base) and run inference directly:

from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')

text = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# simply generate one code span
generated_ids = model.generate(input_ids, max_length=8)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
# this prints "{user.username}"
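
The input above marks the span to be filled in with the sentinel token `<extra_id_0>`. As a minimal sketch of how such inputs could be prepared, the helper below replaces chosen spans of a code string with numbered sentinels. The name `mask_spans` is hypothetical and not part of this repository, and the helper assumes the spans are listed in left-to-right order so that the sentinel numbers increase along the text.

```python
from typing import List


def mask_spans(code: str, spans: List[str]) -> str:
    """Replace the first occurrence of each span with <extra_id_N>.

    Hypothetical helper, not part of the CodeT5 repository. Spans are
    assumed to be given in the order in which they appear in the code.
    """
    masked = code
    for index, span in enumerate(spans):
        if span not in masked:
            raise ValueError(f"span not found in code: {span!r}")
        # Only the first occurrence is replaced, so repeated spans stay intact.
        masked = masked.replace(span, f"<extra_id_{index}>", 1)
    return masked


if __name__ == "__main__":
    original = "def greet(user): print(f'hello {user.username}!')"
    print(mask_spans(original, ["{user.username}"]))
```

The resulting string can be passed to the tokenizer in place of `text` in the example above.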

Introduction

This repo provides the code for reproducing the experiments in CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. CodeT5 is a new encoder-decoder model for programming languages, pre-trained on 8.35M functions in 8 programming languages (Python, Java, JavaScript, PHP, Ruby, Go, C, and C#). In total, it achieves state-of-the-art results on 14 sub-tasks of CodeXGLUE, a code intelligence benchmark.

Paper link: https://arxiv.org/abs/2109.00859

Blog link: https://blog.einstein.ai/codet5/

The code currently includes two pre-trained checkpoints (CodeT5-small and CodeT5-base) and scripts to fine-tune them on 4 generation tasks (code summarization, code generation, translation, and refinement) plus 2 understanding tasks (code defect detection and clone detection) in CodeXGLUE.

In practice, CodeT5 can be deployed as an AI-powered coding assistant to boost the productivity of software developers. At Salesforce, we built an AI coding assistant demo using CodeT5 that provides three capabilities for Apex developers as a VS Code plugin:

Text-to-code generation: generate code from a natural language description.
Code autocompletion: complete a whole function given the target function name.
Code summarization: generate a natural language summary of a function.

Citation
License
Dependency
Download
Fine-tuning
Get Involved

Citation

If you find this code useful for your research, please consider citing:

@article{CodeT5,
      title={CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation}, 
      author={Yue Wang and Weishi Wang and Shafiq Joty and Steven C.H. Hoi},
      year={2021},
      journal={arXiv preprint arXiv:2109.00859},
}

License

The code is released under the BSD-3 License (see LICENSE.txt for details), but we also ask that users respect the following:

This software should not be used to promote or profit from:

violence, hate, and division,

environmental destruction,

abuse of human rights, or

the destruction of people's physical and mental health.

We encourage users of this software to tell us about the applications in which they are putting it to use by emailing codeT5@salesforce.com, and to use appropriate documentation when developing high-stakes applications of this model.

Dependency

PyTorch 1.7.1
tensorboard 2.4.1
transformers 4.6.1
tree-sitter 0.2.2
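
To compare an existing environment against the versions listed above, a small check along these lines may help. It is only a sketch: the distribution names `torch` and `tree_sitter` are assumptions about how the packages are registered with pip, the function name `check_versions` is hypothetical, and `importlib.metadata` requires Python 3.8 or later.

```python
from importlib import metadata

# Versions from the Dependency list. The distribution names used as keys
# ("torch", "tree_sitter") are assumptions, not taken from this README.
PINNED = {
    "torch": "1.7.1",
    "tensorboard": "2.4.1",
    "transformers": "4.6.1",
    "tree_sitter": "0.2.2",
}


def check_versions(pinned=PINNED):
    """Return {name: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            # Drop local build suffixes such as "+cu110" before comparing.
            installed = metadata.version(name).split("+")[0]
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_versions().items():
        found = installed or "not installed"
        print(f"{name}: expected {expected}, found {found}")
```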

Download

Pre-trained checkpoints & Fine-tuning data

Instructions for download:

pip install gsutil

gsutil -m cp -r \
  "gs://sfr-codet5-data-research/data/" \
  "gs://sfr-codet5-data-research/pretrained_models/" \
  .

After the download, the repository structure looks like this:

├── CODE_OF_CONDUCT.md
├── README.md
├── SECURITY.md
├── codet5.gif
├── configs.py
├── models.py
├── run_clone.py
├── run_gen.py
├── utils.py
├── _utils.py
├── LICENSE.txt
├── data
│   ├── clone
│   ├── concode
│   ├── defect
│   ├── refine
│   │   ├── medium
│   │   └── small
│   ├── summarize
│   │   ├── go
│   │   ├── java
│   │   ├── javascript
│   │   ├── php
│   │   ├── python
│   │   └── ruby
│   └── translate
├── evaluator
│   ├── bleu.py
│   ├── smooth_bleu.py
│   └── CodeBLEU
├── pretrained_models
│   ├── codet5_base
│   └── codet5_small
├── sh
│   ├── exp_with_args.sh
│   ├── run_exp.py
│   ├── results
│   ├── saved_models
│   └── tensorboard
└── tokenizer
    └── salesforce
        ├── codet5-merges.txt
        └── codet5-vocab.json
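
Since the download is large, it can be worth confirming that the `data` and `pretrained_models` folders arrived as shown in the tree above before starting a run. The following is a minimal sketch of such a check; `EXPECTED_DIRS` simply mirrors the tree, and the function name `missing_dirs` is hypothetical.

```python
from pathlib import Path

# Directories copied from the tree above (downloaded folders only).
EXPECTED_DIRS = [
    "data/clone",
    "data/concode",
    "data/defect",
    "data/refine/medium",
    "data/refine/small",
    "data/summarize/go",
    "data/summarize/java",
    "data/summarize/javascript",
    "data/summarize/php",
    "data/summarize/python",
    "data/summarize/ruby",
    "data/translate",
    "pretrained_models/codet5_base",
    "pretrained_models/codet5_small",
]


def missing_dirs(root):
    """Return the expected directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs(".")
    if missing:
        print("Missing directories:")
        for d in missing:
            print(f"  {d}")
    else:
        print("All expected directories are present.")
```

Run it from the repository root; an empty result means the layout matches the tree.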

Fine-tuning

Go to the sh folder and set WORKDIR in exp_with_args.sh to the path of your downloaded CodeT5 repository.

You can use run_exp.py to run a broad set of experiments by simply passing the model_tag, task, and sub_task arguments. In total, we support four models (i.e., ['roberta', 'codebert', 'codet5_small', 'codet5_base']) and six tasks (i.e., ['summarize', 'concode', 'translate', 'refine', 'defect', 'clone']). For each task, the sub_task argument specifies which dataset to fine-tune on.

For example, to fine-tune the CodeT5-base model on the code summarization task for Ruby, you can simply run:

python run_exp.py --model_tag codet5_base --task summarize --sub_task ruby
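
To fine-tune on several summarization languages in turn, the command above can be generated in a loop. The sketch below only builds and prints the commands. It assumes that the sub_task names for the other languages match the folder names under data/summarize (only `ruby` is confirmed by the example above), and `build_commands` is a hypothetical helper, not part of the repository.

```python
# Sub-task names assumed to match the folders under data/summarize.
SUMMARIZE_LANGS = ["go", "java", "javascript", "php", "python", "ruby"]


def build_commands(model_tag="codet5_base", task="summarize",
                   sub_tasks=SUMMARIZE_LANGS):
    """Build one run_exp.py command (as an argument list) per sub-task."""
    return [
        ["python", "run_exp.py",
         "--model_tag", model_tag,
         "--task", task,
         "--sub_task", sub_task]
        for sub_task in sub_tasks
    ]


if __name__ == "__main__":
    for cmd in build_commands():
        # Print only; each list could be passed to subprocess.run(cmd, check=True)
        # from inside the sh folder to actually launch the experiment.
        print(" ".join(cmd))
```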

You can also specify:

model_dir: where to save fine-tuning checkpoints
res_dir: where to save the performance results 
summary_dir: where to save the training curves
data_num: how many data instances to use, the default -1 is for using the full data
gpu: the index of the GPU to use in the cluster

You can also directly revise the suggested arguments in the get_args_by_task_model function. Please refer to the argument flags in configs.py for the full available options. The saved training curves in summary_dir can be visualized using tensorboard.
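
As a sketch of how the optional arguments listed above could be combined with the required ones, the helper below validates the model and task names against the supported lists and appends each option as a `--name value` pair. That flag spelling is an assumption based on the argument names in this README; check run_exp.py for the exact form. The function `build_command` and the example paths are hypothetical.

```python
SUPPORTED_MODELS = ["roberta", "codebert", "codet5_small", "codet5_base"]
SUPPORTED_TASKS = ["summarize", "concode", "translate", "refine", "defect", "clone"]
# Optional arguments named in this README; passing them as "--name value"
# is an assumption, not confirmed here.
OPTIONAL_ARGS = {"model_dir", "res_dir", "summary_dir", "data_num", "gpu"}


def build_command(model_tag, task, sub_task, **options):
    """Build a run_exp.py argument list, rejecting unsupported names."""
    if model_tag not in SUPPORTED_MODELS:
        raise ValueError(f"unsupported model_tag: {model_tag}")
    if task not in SUPPORTED_TASKS:
        raise ValueError(f"unsupported task: {task}")
    unknown = set(options) - OPTIONAL_ARGS
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    cmd = ["python", "run_exp.py",
           "--model_tag", model_tag, "--task", task, "--sub_task", sub_task]
    for name, value in options.items():
        cmd += [f"--{name}", str(value)]
    return cmd


if __name__ == "__main__":
    # Quick smoke run on 1000 instances using GPU 0 (illustrative values).
    print(" ".join(build_command("codet5_base", "summarize", "ruby",
                                 data_num=1000, gpu=0)))
```

Limiting data_num to a small value is a convenient way to test the pipeline end to end before committing to a full fine-tuning run.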

Get Involved

Please create a GitHub issue if you have any questions, suggestions, requests, or bug reports. We welcome PRs!