mirror of
https://github.com/salesforce/CodeT5.git
synced 2024-10-01 06:35:38 -04:00
reorganize the repo
This commit is contained in:
parent
d5c9d81af1
commit
71ccd12773
@ -8,42 +8,60 @@ Official research release for the **CodeT5+** models (`220M`, `770M`, `2B`, `6B`
|
||||
|
||||
# What is this about?
|
||||
CodeT5+ is a new family of open code large language models with an encoder-decoder architecture that can flexibly operate in different modes (i.e. _encoder-only_, _decoder-only_, and _encoder-decoder_) to support a wide range of code understanding and generation tasks.
|
||||
See the below overview of CodeT5+.
|
||||
![CodeT5+ overview](codet5p_overview.png)
|
||||
|
||||
To train CodeT5+, we introduce a diverse set of pretraining tasks including _span denoising_, _causal language modeling_, _contrastive learning_, and _text-code matching_ to learn rich representations from both unimodal code data and bimodal code-text data.
|
||||
Additionally, to efficiently scale up the model, we propose a simple yet effective _compute-efficient pretraining_ method to initialize our model with frozen off-the-shelf LLMs such as [CodeGen](https://github.com/salesforce/CodeGen).
|
||||
Furthermore, we explore instruction tuning to align the model with natural language instructions following [Code Alpaca](https://github.com/sahil280114/codealpaca).
|
||||
![CodeT5+ architecture](codet5p_architecture.png)
|
||||
|
||||
|
||||
We implemented a family of CodeT5+ models, with model size ranging from 220M to 16B.
|
||||
Note that CodeT5+ 220M and 770M employ the same architecture of CodeT5-base and large respectively and are pretrained from scratch, while CodeT5+ 2B, 6B, 16B employ a "_shallow encoder and deep decoder_" architecture with the shallow encoder initialized from CodeGen-mono 350M and the deep decoder initialized from CodeGen-mono 2B, 6B, 16B, respectively.
|
||||
Note that CodeT5+ `220M` and `770M` employ the same architecture of CodeT5-base and large respectively and are pretrained from scratch, while CodeT5+ `2B`, `6B`, `16B` employ a "_shallow encoder and deep decoder_" architecture with the shallow encoder initialized from CodeGen-mono 350M and the deep decoder initialized from CodeGen-mono 2B, 6B, 16B, respectively.
|
||||
InstructCodeT5+ 16B is our instruction-tuned model from CodeT5+ 16B.
|
||||
|
||||
![CodeT5+ overview](codet5p_overview.png)
|
||||
|
||||
|
||||
# Released Models
|
||||
We release the following CodeT5+ models:
|
||||
We release the following CodeT5+ models at Huggingface:
|
||||
|
||||
* CodeT5+ `220M` and `770M` at Huggingface [here](https://huggingface.co/Salesforce/codet5p-220m) and [here](https://huggingface.co/Salesforce/codet5p-770m), respectively.
|
||||
* CodeT5+ `220M` and `770M` that are further tuned on Python subset at Huggingface [here](https://huggingface.co/Salesforce/codet5p-220m-py) and [here](https://huggingface.co/Salesforce/codet5p-770m-py), respectively.
|
||||
* CodeT5+ `2B`, `6B`, `16B` will be released soon.
|
||||
* CodeT5+ `220M` and `770M`: [Salesforce/codet5p-220m](https://huggingface.co/Salesforce/codet5p-220m) and [codet5p-770m](https://huggingface.co/Salesforce/codet5p-770m).
|
||||
* CodeT5+ `220M` and `770M` that are further tuned on Python subset: [codet5p-220m-py](https://huggingface.co/Salesforce/codet5p-220m-py) and [codet5p-770m-py](https://huggingface.co/Salesforce/codet5p-770m-py).
|
||||
* CodeT5+ `2B`, `6B`, `16B`: [Salesforce/codet5p-2b](https://huggingface.co/Salesforce/codet5p-2b), [Salesforce/codet5p-6b](https://huggingface.co/Salesforce/codet5p-6b), and [Salesforce/codet5p-16b](https://huggingface.co/Salesforce/codet5p-16b).
|
||||
* InstructCodeT5+ `16B`: [Salesforce/instructcodet5p-16b](https://huggingface.co/Salesforce/instructcodet5p-16b).
|
||||
|
||||
# How to Use?
|
||||
CodeT5+ `220M` and `770M` models can be easily loaded using the `T5ForConditionalGeneration` functionality. They employ the same tokenizer as the original [CodeT5](https://github.com/salesforce/CodeT5).
|
||||
All CodeT5+ models and tokenizers can be easily loaded using the `AutoModelForSeq2SeqLM` and `AutoTokenizer` functionality.
|
||||
For tokenizers, CodeT5+ `220M` and `770M` employ the same tokenizer as the original [CodeT5](https://github.com/salesforce/CodeT5) while CodeT5+ `2B`, `6B`, `16B` employ the same tokenizer as [CodeGen]( https://github.com/salesforce/CodeGen).
|
||||
To load CodeT5+ `2B`, `6B`, `16B`, please set `trust_remote_code=True` as the [model class](https://huggingface.co/Salesforce/codet5p-16b/blob/main/modeling_codet5p.py) is defined in the Huggingface repo.
|
||||
|
||||
|
||||
```python
|
||||
from transformers import T5ForConditionalGeneration, AutoTokenizer
|
||||
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
|
||||
import torch
|
||||
|
||||
checkpoint = "Salesforce/codet5p-770m-py"
|
||||
checkpoint = "Salesforce/instructcodet5p-16b"
|
||||
device = "cuda" # for GPU usage or "cpu" for CPU usage
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
|
||||
model = T5ForConditionalGeneration.from_pretrained(checkpoint).to(device)
|
||||
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint,
|
||||
torch_dtype=torch.float16,
|
||||
low_cpu_mem_usage=True,
|
||||
trust_remote_code=True).to(device)
|
||||
|
||||
inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
|
||||
outputs = model.generate(inputs, max_length=10)
|
||||
inputs = tokenizer.encode("def print_hello():", return_tensors="pt").to(device)
|
||||
outputs = model.generate(inputs, max_length=12)
|
||||
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
||||
# ==> print('Hello World!')
|
||||
|
||||
```
|
||||
|
||||
# Reproduce the Results
|
||||
|
||||
## HumanEval
|
||||
|
||||
TBA
|
||||
|
||||
## Citation
|
||||
|
||||
```bibtex
|
||||
|
BIN
CodeT5+/codet5p_architecture.png
Normal file
BIN
CodeT5+/codet5p_architecture.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 456 KiB |
BIN
CodeT5.png
BIN
CodeT5.png
Binary file not shown.
Before Width: | Height: | Size: 323 KiB |
203
CodeT5/README.md
Normal file
203
CodeT5/README.md
Normal file
@ -0,0 +1,203 @@
|
||||
# CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
|
||||
|
||||
This is the official PyTorch implementation for the following EMNLP 2021 paper from Salesforce Research:
|
||||
|
||||
**Title**: [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf)
|
||||
|
||||
**Authors**: [Yue Wang](https://yuewang-cuhk.github.io/), [Weishi Wang](https://www.linkedin.com/in/weishi-wang/)
|
||||
, [Shafiq Joty](https://raihanjoty.github.io/), and [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home)
|
||||
|
||||
![CodeT5 demo](../codet5.gif)
|
||||
|
||||
## Updates
|
||||
|
||||
**July 06, 2022**
|
||||
|
||||
We release two large-sized CodeT5 checkpoints at Hugging Face: [Salesforce/codet5-large](https://huggingface.co/Salesforce/codet5-large) and [Salesforce/codet5-large-ntp-py](https://huggingface.co/Salesforce/codet5-large-ntp-py), which are introduced by the paper: [CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning](https://arxiv.org/pdf/2207.01780.pdf) by Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, Steven C.H. Hoi.
|
||||
|
||||
* CodeT5-large was pretrained using Masked Span Prediction (MSP) objective on CodeSearchNet and achieve new SOTA results on several CodeXGLUE benchmarks. The finetuned checkpoints are released at [here](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models). See Appendix A.1 of the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
|
||||
|
||||
* CodeT5-large-ntp-py was first pretrained using Masked Span Prediction (MSP) objective on CodeSearchNet and GCPY (the Python split of [Github Code](https://huggingface.co/datasets/codeparrot/github-code) data), followed by another 10 epochs on GCPY using Next Token Prediction (NTP) objective.
|
||||
|
||||
CodeT5-large-ntp-py is especially optimized for Python code generation tasks and employed as the foundation model for our [CodeRL](https://github.com/salesforce/CodeRL), yielding new SOTA results on the APPS Python competition-level program synthesis benchmark. See the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
|
||||
|
||||
**Oct 29, 2021**
|
||||
|
||||
We release [fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models)
|
||||
for all the downstream tasks covered in the paper.
|
||||
|
||||
**Oct 25, 2021**
|
||||
|
||||
We release a CodeT5-base fine-tuned
|
||||
checkpoint ([Salesforce/codet5-base-multi-sum](https://huggingface.co/Salesforce/codet5-base-multi-sum)) for
|
||||
multilingual code summarzation. Below is how to use this model:
|
||||
|
||||
```python
|
||||
from transformers import RobertaTokenizer, T5ForConditionalGeneration
|
||||
|
||||
if __name__ == '__main__':
|
||||
tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
|
||||
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base-multi-sum')
|
||||
|
||||
text = """def svg_to_image(string, size=None):
|
||||
if isinstance(string, unicode):
|
||||
string = string.encode('utf-8')
|
||||
renderer = QtSvg.QSvgRenderer(QtCore.QByteArray(string))
|
||||
if not renderer.isValid():
|
||||
raise ValueError('Invalid SVG data.')
|
||||
if size is None:
|
||||
size = renderer.defaultSize()
|
||||
image = QtGui.QImage(size, QtGui.QImage.Format_ARGB32)
|
||||
painter = QtGui.QPainter(image)
|
||||
renderer.render(painter)
|
||||
return image"""
|
||||
|
||||
input_ids = tokenizer(text, return_tensors="pt").input_ids
|
||||
|
||||
generated_ids = model.generate(input_ids, max_length=20)
|
||||
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
|
||||
# this prints: "Convert a SVG string to a QImage."
|
||||
```
|
||||
|
||||
**Oct 18, 2021**
|
||||
|
||||
We add a [model card](https://github.com/salesforce/CodeT5/blob/main/CodeT5_model_card.pdf) for CodeT5! Please reach out
|
||||
if you have any questions about it.
|
||||
|
||||
**Sep 24, 2021**
|
||||
|
||||
CodeT5 is now in [hugginface](https://huggingface.co/)!
|
||||
|
||||
You can simply load the model ([CodeT5-small](https://huggingface.co/Salesforce/codet5-small)
|
||||
and [CodeT5-base](https://huggingface.co/Salesforce/codet5-base)) and do the inference:
|
||||
|
||||
```python
|
||||
from transformers import RobertaTokenizer, T5ForConditionalGeneration
|
||||
|
||||
tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
|
||||
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')
|
||||
|
||||
text = "def greet(user): print(f'hello <extra_id_0>!')"
|
||||
input_ids = tokenizer(text, return_tensors="pt").input_ids
|
||||
|
||||
# simply generate one code span
|
||||
generated_ids = model.generate(input_ids, max_length=8)
|
||||
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
|
||||
# this prints "{user.username}"
|
||||
```
|
||||
|
||||
## Introduction
|
||||
|
||||
This repo provides the code for reproducing the experiments
|
||||
in [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf)
|
||||
. CodeT5 is a new pre-trained encoder-decoder model for programming languages, which is pre-trained on **8.35M**
|
||||
functions in 8 programming languages (Python, Java, JavaScript, PHP, Ruby, Go, C, and C#). In total, it achieves
|
||||
state-of-the-art results on **14 sub-tasks** in a code intelligence benchmark - [CodeXGLUE](https://github.com/microsoft/CodeXGLUE).
|
||||
|
||||
Paper link: https://arxiv.org/abs/2109.00859
|
||||
|
||||
Blog link: https://blog.salesforceairesearch.com/codet5/
|
||||
|
||||
The code currently includes two pre-trained checkpoints ([CodeT5-small](https://huggingface.co/Salesforce/codet5-small)
|
||||
and [CodeT5-base](https://huggingface.co/Salesforce/codet5-base)) and scripts to fine-tune them on 4 generation tasks (
|
||||
code summarization, code generation, translation, and refinement) plus 2 understanding tasks (code defect detection and
|
||||
clone detection) in CodeXGLUE. We also provide their fine-tuned checkpoints to facilitate the easy replication
|
||||
of our paper.
|
||||
|
||||
In practice, CodeT5 can be deployed as an AI-powered coding assistant to boost the productivity of software developers.
|
||||
At Salesforce, we build an [AI coding assistant demo](https://github.com/salesforce/CodeT5/raw/main/codet5.gif) using
|
||||
CodeT5 as a VS Code plugin to provide three capabilities for Apex developers:
|
||||
|
||||
- **Text-to-code generation**: generate code based on the natural language description.
|
||||
- **Code autocompletion**: complete the whole function of code given the target function name.
|
||||
- **Code summarization**: generate the summary of a function in natural language description.
|
||||
|
||||
## Table of Contents
|
||||
|
||||
1. [Dependency](#dependency)
|
||||
2. [Download](#download)
|
||||
3. [Fine-tuning](#fine-tuning)
|
||||
|
||||
## Dependency
|
||||
|
||||
- Pytorch 1.7.1
|
||||
- tensorboard 2.4.1
|
||||
- transformers 4.6.1
|
||||
- tree-sitter 0.2.2
|
||||
|
||||
## Download
|
||||
|
||||
* [Pre-trained checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/pretrained_models)
|
||||
* [Fine-tuning data](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/data)
|
||||
* [Fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models)
|
||||
|
||||
Instructions to download:
|
||||
|
||||
```
|
||||
# pip install gsutil
|
||||
cd your-cloned-codet5-path
|
||||
|
||||
gsutil -m cp -r "gs://sfr-codet5-data-research/pretrained_models" .
|
||||
gsutil -m cp -r "gs://sfr-codet5-data-research/data" .
|
||||
gsutil -m cp -r "gs://sfr-codet5-data-research/finetuned_models" .
|
||||
```
|
||||
|
||||
## Fine-tuning
|
||||
|
||||
Go to `sh` folder, set the `WORKDIR` in `exp_with_args.sh` to be your cloned CodeT5 repository path.
|
||||
|
||||
You can use `run_exp.py` to run a broad set of experiments by simply passing the `model_tag`, `task`, and `sub_task`
|
||||
arguments. In total, we support five models (i.e., ['roberta', 'codebert', 'bart_base', 'codet5_small', 'codet5_base'])
|
||||
and six tasks (i.e., ['summarize', 'concode', 'translate', 'refine', 'defect', 'clone']). For each task, we use
|
||||
the `sub_task` to specify which specific datasets to fine-tne on. Below is the full list:
|
||||
|
||||
| \--task | \--sub\_task | Description |
|
||||
| --------- | ---------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| summarize | ruby/javascript/go/python/java/php | code summarization task on [CodeSearchNet](https://arxiv.org/abs/1909.09436) data with six PLs |
|
||||
| concode | none | text-to-code generation on [Concode](https://aclanthology.org/D18-1192.pdf) data |
|
||||
| translate | java-cs/cs-java | code-to-code translation between [Java and C#](https://arxiv.org/pdf/2102.04664.pdf) |
|
||||
| refine | small/medium | code refinement on [code repair data](https://arxiv.org/pdf/1812.08693.pdf) with small/medium functions |
|
||||
| defect | none | code defect detection in [C/C++ data](https://proceedings.neurips.cc/paper/2019/file/49265d2447bc3bbfe9e76306ce40a31f-Paper.pdf) |
|
||||
| clone | none | code clone detection in [Java data](https://arxiv.org/pdf/2002.08653.pdf) |
|
||||
|
||||
For example, if you want to run CodeT5-base model on the code summarization task for Python, you can simply run:
|
||||
|
||||
```
|
||||
python run_exp.py --model_tag codet5_base --task summarize --sub_task python
|
||||
```
|
||||
|
||||
For multi-task training, you can type:
|
||||
|
||||
```
|
||||
python run_exp.py --model_tag codet5_base --task multi_task --sub_task none
|
||||
```
|
||||
|
||||
Besides, you can specify:
|
||||
|
||||
```
|
||||
model_dir: where to save fine-tuning checkpoints
|
||||
res_dir: where to save the performance results
|
||||
summary_dir: where to save the training curves
|
||||
data_num: how many data instances to use, the default -1 is for using the full data
|
||||
gpu: the index of the GPU to use in the cluster
|
||||
```
|
||||
|
||||
You can also revise the suggested
|
||||
arguments [here](https://github.com/salesforce/CodeT5/blob/0bf3c0c43e92fcf54d9df68c793ac22f2b60aad4/sh/run_exp.py#L14) or directly customize the [exp_with_args.sh](https://github.com/salesforce/CodeT5/blob/main/sh/exp_with_args.sh) bash file.
|
||||
Please refer to the argument flags in [configs.py](https://github.com/salesforce/CodeT5/blob/main/configs.py) for the full
|
||||
available options. The saved training curves in `summary_dir` can be visualized using [tensorboard](https://pypi.org/project/tensorboard/).
|
||||
Note that we employ one A100 GPU for all fine-tuning experiments.
|
||||
|
||||
### How to reproduce the results using the released finetuned checkpoints?
|
||||
|
||||
* Remove the `--do_train --do_eval --do_eval_bleu` and reserve only `--do_test` at [here](https://github.com/salesforce/CodeT5/blob/5b37c34f4bbbfcfd972c24a9dd1f45716568ecb5/sh/exp_with_args.sh#L84).
|
||||
* Pass the path of your downloaded finetuned checkpoint to load at [here](https://github.com/salesforce/CodeT5/blob/5b37c34f4bbbfcfd972c24a9dd1f45716568ecb5/run_gen.py#L366), e.g., `file = "CodeT5/finetuned_models/summarize_python_codet5_base.bin"`
|
||||
* Run the program: `python run_exp.py --model_tag codet5_base --task summarize --sub_task python`
|
||||
|
||||
### How to fine-tune on your own task and dataset?
|
||||
If you want to fine-tune on your dataset, you can add your own task and sub_task in `configs.py` ([here](https://github.com/salesforce/CodeT5/blob/d27512d23ba6130e089e571d8c3e399760db1c31/configs.py#L11)) and add your data path and the function to read in `utils.py` ([here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L103) and [here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L149)). The read function can be implemented in `_utils.py` similar to [this one](https://github.com/salesforce/CodeT5/blob/aaf9c4a920c4986abfd54a74f5456b056b6409e0/_utils.py#L213). If your task to add is a generation task, you can simply reuse or customize the `run_gen.py`. For understanding tasks, please refer to `run_defect.py` and `run_clone.py`.
|
||||
|
||||
## Get Involved
|
||||
|
||||
Please create a GitHub issue if you have any questions, suggestions, requests or bug-reports. We welcome PRs!
|
||||
|
0
CodeT5/evaluator/CodeBLEU/__init__.py
Normal file
0
CodeT5/evaluator/CodeBLEU/__init__.py
Normal file
@ -15,7 +15,7 @@ from fractions import Fraction
|
||||
import warnings
|
||||
from collections import Counter
|
||||
|
||||
from evaluator.CodeBLEU.utils import ngrams
|
||||
from CodeT5.evaluator.CodeBLEU.utils import ngrams
|
||||
|
||||
|
||||
def sentence_bleu(
|
@ -5,7 +5,7 @@
|
||||
# -*- coding:utf-8 -*-
|
||||
import argparse
|
||||
import os
|
||||
from evaluator.CodeBLEU import bleu, weighted_ngram_match, syntax_match, dataflow_match
|
||||
from CodeT5.evaluator.CodeBLEU import weighted_ngram_match, bleu, dataflow_match, syntax_match
|
||||
|
||||
|
||||
def get_codebleu(refs, hyp, lang, params='0.25,0.25,0.25,0.25'):
|
@ -1,11 +1,10 @@
|
||||
# Copyright (c) Microsoft Corporation.
|
||||
# Licensed under the MIT license.
|
||||
|
||||
from evaluator.CodeBLEU.parser import DFG_python, DFG_java, DFG_ruby, DFG_go, DFG_php, DFG_javascript, DFG_csharp
|
||||
from evaluator.CodeBLEU.parser import (remove_comments_and_docstrings,
|
||||
tree_to_token_index,
|
||||
index_to_code_token,
|
||||
tree_to_variable_index)
|
||||
from CodeT5.evaluator.CodeBLEU.parser import DFG_python, DFG_java, DFG_ruby, DFG_go, DFG_php, DFG_javascript, DFG_csharp
|
||||
from CodeT5.evaluator.CodeBLEU.parser import (remove_comments_and_docstrings,
|
||||
tree_to_token_index,
|
||||
index_to_code_token)
|
||||
from tree_sitter import Language, Parser
|
||||
import os
|
||||
|
@ -1,11 +1,8 @@
|
||||
# Copyright (c) Microsoft Corporation.
|
||||
# Licensed under the MIT license.
|
||||
|
||||
from evaluator.CodeBLEU.parser import DFG_python, DFG_java, DFG_ruby, DFG_go, DFG_php, DFG_javascript, DFG_csharp
|
||||
from evaluator.CodeBLEU.parser import (remove_comments_and_docstrings,
|
||||
tree_to_token_index,
|
||||
index_to_code_token,
|
||||
tree_to_variable_index)
|
||||
from CodeT5.evaluator.CodeBLEU.parser import DFG_python, DFG_java, DFG_ruby, DFG_go, DFG_php, DFG_javascript, DFG_csharp
|
||||
from CodeT5.evaluator.CodeBLEU.parser import (remove_comments_and_docstrings)
|
||||
from tree_sitter import Language, Parser
|
||||
import os
|
||||
|
@ -18,8 +18,7 @@ from fractions import Fraction
|
||||
import warnings
|
||||
from collections import Counter
|
||||
|
||||
from evaluator.CodeBLEU.utils import ngrams
|
||||
import pdb
|
||||
from CodeT5.evaluator.CodeBLEU.utils import ngrams
|
||||
|
||||
|
||||
def sentence_bleu(
|
0
CodeT5/evaluator/__init__.py
Normal file
0
CodeT5/evaluator/__init__.py
Normal file
@ -23,14 +23,11 @@ import os
|
||||
import torch
|
||||
import logging
|
||||
import argparse
|
||||
import math
|
||||
import numpy as np
|
||||
from tqdm import tqdm
|
||||
from itertools import cycle
|
||||
import multiprocessing
|
||||
import time
|
||||
import sys
|
||||
import pdb
|
||||
|
||||
from torch.utils.tensorboard import SummaryWriter
|
||||
from torch.utils.data import DataLoader, SequentialSampler, RandomSampler
|
246
README.md
246
README.md
@ -1,125 +1,54 @@
|
||||
# CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
|
||||
# CodeT5 and CodeT5+
|
||||
|
||||
This is the official PyTorch implementation for the following EMNLP 2021 paper from Salesforce Research:
|
||||
Official research release for **CodeT5** and **CodeT5+** models for a wide range of **Code Understanding and Generation** tasks from Salesforce Research.
|
||||
These open code LLMs are introduced by the following papers:
|
||||
|
||||
**Title**: [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf)
|
||||
*Title*: [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf)
|
||||
|
||||
**Authors**: [Yue Wang](https://yuewang-cuhk.github.io/), [Weishi Wang](https://www.linkedin.com/in/weishi-wang/)
|
||||
, [Shafiq Joty](https://raihanjoty.github.io/), and [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home)
|
||||
*Authors*: [Yue Wang](https://yuewang-cuhk.github.io/), [Weishi Wang](https://www.linkedin.com/in/weishi-wang/)
|
||||
, [Shafiq Joty](https://raihanjoty.github.io/), [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home)
|
||||
|
||||
![CodeT5 demo](codet5.gif)
|
||||
*Title*: [CodeT5+: Open Code Large Language Models for Code Understanding and Generation](https://arxiv.org/pdf/2305.07922.pdf)
|
||||
|
||||
## Updates
|
||||
*Authors*: [Yue Wang](https://yuewang-cuhk.github.io/)\*, [Hung Le](https://sites.google.com/view/henryle2018/home?pli=1)\*, [Akhilesh Deepak Gotmare](https://akhileshgotmare.github.io/), [Nghi D.Q. Bui](https://bdqnghi.github.io/), [Junnan Li](https://sites.google.com/site/junnanlics), [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home) (* indicates equal contribution)
|
||||
|
||||
**July 06, 2022**
|
||||
|
||||
We release two large-sized CodeT5 checkpoints at Hugging Face: [Salesforce/codet5-large](https://huggingface.co/Salesforce/codet5-large) and [Salesforce/codet5-large-ntp-py](https://huggingface.co/Salesforce/codet5-large-ntp-py), which are introduced by the paper: [CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning](https://arxiv.org/pdf/2207.01780.pdf) by Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, Steven C.H. Hoi.
|
||||
|
||||
* CodeT5-large was pretrained using Masked Span Prediction (MSP) objective on CodeSearchNet and achieve new SOTA results on several CodeXGLUE benchmarks. The finetuned checkpoints are released at [here](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models). See Appendix A.1 of the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
|
||||
|
||||
* CodeT5-large-ntp-py was first pretrained using Masked Span Prediction (MSP) objective on CodeSearchNet and GCPY (the Python split of [Github Code](https://huggingface.co/datasets/codeparrot/github-code) data), followed by another 10 epochs on GCPY using Next Token Prediction (NTP) objective.
|
||||
|
||||
CodeT5-large-ntp-py is especially optimized for Python code generation tasks and employed as the foundation model for our [CodeRL](https://github.com/salesforce/CodeRL), yielding new SOTA results on the APPS Python competition-level program synthesis benchmark. See the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
|
||||
|
||||
**Oct 29, 2021**
|
||||
|
||||
We release [fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models)
|
||||
for all the downstream tasks covered in the paper.
|
||||
|
||||
**Oct 25, 2021**
|
||||
|
||||
We release a CodeT5-base fine-tuned
|
||||
checkpoint ([Salesforce/codet5-base-multi-sum](https://huggingface.co/Salesforce/codet5-base-multi-sum)) for
|
||||
multilingual code summarzation. Below is how to use this model:
|
||||
|
||||
```python
|
||||
from transformers import RobertaTokenizer, T5ForConditionalGeneration
|
||||
|
||||
if __name__ == '__main__':
|
||||
tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
|
||||
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base-multi-sum')
|
||||
|
||||
text = """def svg_to_image(string, size=None):
|
||||
if isinstance(string, unicode):
|
||||
string = string.encode('utf-8')
|
||||
renderer = QtSvg.QSvgRenderer(QtCore.QByteArray(string))
|
||||
if not renderer.isValid():
|
||||
raise ValueError('Invalid SVG data.')
|
||||
if size is None:
|
||||
size = renderer.defaultSize()
|
||||
image = QtGui.QImage(size, QtGui.QImage.Format_ARGB32)
|
||||
painter = QtGui.QPainter(image)
|
||||
renderer.render(painter)
|
||||
return image"""
|
||||
|
||||
input_ids = tokenizer(text, return_tensors="pt").input_ids
|
||||
|
||||
generated_ids = model.generate(input_ids, max_length=20)
|
||||
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
|
||||
# this prints: "Convert a SVG string to a QImage."
|
||||
```
|
||||
|
||||
**Oct 18, 2021**
|
||||
|
||||
We add a [model card](https://github.com/salesforce/CodeT5/blob/main/CodeT5_model_card.pdf) for CodeT5! Please reach out
|
||||
if you have any questions about it.
|
||||
|
||||
**Sep 24, 2021**
|
||||
|
||||
CodeT5 is now in [hugginface](https://huggingface.co/)!
|
||||
|
||||
You can simply load the model ([CodeT5-small](https://huggingface.co/Salesforce/codet5-small)
|
||||
and [CodeT5-base](https://huggingface.co/Salesforce/codet5-base)) and do the inference:
|
||||
|
||||
```python
|
||||
from transformers import RobertaTokenizer, T5ForConditionalGeneration
|
||||
|
||||
tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
|
||||
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')
|
||||
|
||||
text = "def greet(user): print(f'hello <extra_id_0>!')"
|
||||
input_ids = tokenizer(text, return_tensors="pt").input_ids
|
||||
|
||||
# simply generate one code span
|
||||
generated_ids = model.generate(input_ids, max_length=8)
|
||||
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
|
||||
# this prints "{user.username}"
|
||||
```
|
||||
|
||||
## Introduction
|
||||
|
||||
This repo provides the code for reproducing the experiments
|
||||
in [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf)
|
||||
. CodeT5 is a new pre-trained encoder-decoder model for programming languages, which is pre-trained on **8.35M**
|
||||
functions in 8 programming languages (Python, Java, JavaScript, PHP, Ruby, Go, C, and C#). In total, it achieves
|
||||
state-of-the-art results on **14 sub-tasks** in a code intelligence benchmark - [CodeXGLUE](https://github.com/microsoft/CodeXGLUE).
|
||||
|
||||
Paper link: https://arxiv.org/abs/2109.00859
|
||||
|
||||
Blog link: https://blog.salesforceairesearch.com/codet5/
|
||||
|
||||
The code currently includes two pre-trained checkpoints ([CodeT5-small](https://huggingface.co/Salesforce/codet5-small)
|
||||
and [CodeT5-base](https://huggingface.co/Salesforce/codet5-base)) and scripts to fine-tune them on 4 generation tasks (
|
||||
code summarization, code generation, translation, and refinement) plus 2 understanding tasks (code defect detection and
|
||||
clone detection) in CodeXGLUE. We also provide their fine-tuned checkpoints to facilitate the easy replication
|
||||
of our paper.
|
||||
|
||||
In practice, CodeT5 can be deployed as an AI-powered coding assistant to boost the productivity of software developers.
|
||||
At Salesforce, we build an [AI coding assistant demo](https://github.com/salesforce/CodeT5/raw/main/codet5.gif) using
|
||||
CodeT5 as a VS Code plugin to provide three capabilities for Apex developers:
|
||||
In practice, CodeT5 and CodeT5+ models can be deployed as an AI-powered coding assistant to boost the productivity of software developers.
|
||||
At Salesforce, we build an AI coding assistant demo using CodeT5 as a VS Code plugin to provide three capabilities for Apex developers:
|
||||
|
||||
- **Text-to-code generation**: generate code based on the natural language description.
|
||||
- **Code autocompletion**: complete the whole function of code given the target function name.
|
||||
- **Code summarization**: generate the summary of a function in natural language description.
|
||||
|
||||
## Table of Contents
|
||||
![CodeT5 demo](./codet5.gif)
|
||||
|
||||
## What's New: 🎉
|
||||
|
||||
**May 2023**
|
||||
|
||||
**CodeT5+** Paper and models released! ([paper](https://arxiv.org/pdf/2305.07922.pdf), [code](https://github.com/salesforce/CodeT5/tree/main/CodeT5+))
|
||||
|
||||
**July 2022**
|
||||
|
||||
We release two large-sized CodeT5 checkpoints at Hugging Face: [Salesforce/codet5-large](https://huggingface.co/Salesforce/codet5-large) and [Salesforce/codet5-large-ntp-py](https://huggingface.co/Salesforce/codet5-large-ntp-py), which are introduced by the [CodeRL paper](https://arxiv.org/pdf/2207.01780.pdf).
|
||||
|
||||
**Oct 2021**
|
||||
|
||||
We release [fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models)
|
||||
for all the downstream tasks covered in the paper.
|
||||
Besides, we release a CodeT5-base fine-tuned
|
||||
checkpoint ([Salesforce/codet5-base-multi-sum](https://huggingface.co/Salesforce/codet5-base-multi-sum)) for
|
||||
multilingual code summarization.
|
||||
|
||||
|
||||
**Sep, 2021**
|
||||
|
||||
CodeT5 is now in [hugginface](https://huggingface.co/)! ([CodeT5-small](https://huggingface.co/Salesforce/codet5-small) and [CodeT5-base](https://huggingface.co/Salesforce/codet5-base)).
|
||||
|
||||
We add a [model card](https://github.com/salesforce/CodeT5/blob/main/CodeT5_model_card.pdf) for CodeT5! Please reach out
|
||||
if you have any questions about it.
|
||||
|
||||
|
||||
1. [Citation](#citation)
|
||||
2. [License](#license)
|
||||
3. [Dependency](#dependency)
|
||||
4. [Download](#download)
|
||||
5. [Fine-tuning](#fine-tuning)
|
||||
6. [Get Involved](#get-involved)
|
||||
|
||||
## Citation
|
||||
|
||||
@ -130,15 +59,24 @@ If you find this code to be useful for your research, please consider citing:
|
||||
wang2021codet5,
|
||||
title={CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation},
|
||||
author={Yue Wang, Weishi Wang, Shafiq Joty, Steven C.H. Hoi},
|
||||
booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021},
|
||||
booktitle={EMNLP},
|
||||
year={2021},
|
||||
}
|
||||
|
||||
@article{coderl2022,
|
||||
title={CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
|
||||
author={Le, Hung and Wang, Yue and Gotmare, Akhilesh Deepak and Savarese, Silvio and Hoi, Steven C. H.},
|
||||
journal={arXiv preprint arXiv:2207.01780},
|
||||
year={2022}
|
||||
@inproceedings{
|
||||
le2022coderl,
|
||||
title={CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
|
||||
author={Le, Hung and Wang, Yue and Gotmare, Akhilesh Deepak and Savarese, Silvio and Hoi, Steven C. H.},
|
||||
journal={NeurIPS},
|
||||
year={2022}
|
||||
}
|
||||
|
||||
@article{
|
||||
wang2023codet5plus,
|
||||
title={CodeT5+: Open Code Large Language Models for Code Understanding and Generation},
|
||||
author={Wang, Yue and Le, Hung and Gotmare, Akhilesh Deepak and Bui, Nghi D.Q. and Li, Junnan and Hoi, Steven C. H.},
|
||||
journal={arXiv preprint},
|
||||
year={2023}
|
||||
}
|
||||
```
|
||||
|
||||
@ -162,84 +100,6 @@ codeT5@salesforce.com, and to
|
||||
use [appropriate](https://arxiv.org/abs/1810.03993) [documentation](https://www.partnershiponai.org/about-ml/) when
|
||||
developing high-stakes applications of this model.
|
||||
|
||||
## Dependency
|
||||
|
||||
- Pytorch 1.7.1
|
||||
- tensorboard 2.4.1
|
||||
- transformers 4.6.1
|
||||
- tree-sitter 0.2.2
|
||||
|
||||
## Download
|
||||
|
||||
* [Pre-trained checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/pretrained_models)
|
||||
* [Fine-tuning data](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/data)
|
||||
* [Fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models)
|
||||
|
||||
Instructions to download:
|
||||
|
||||
```
|
||||
# pip install gsutil
|
||||
cd your-cloned-codet5-path
|
||||
|
||||
gsutil -m cp -r "gs://sfr-codet5-data-research/pretrained_models" .
|
||||
gsutil -m cp -r "gs://sfr-codet5-data-research/data" .
|
||||
gsutil -m cp -r "gs://sfr-codet5-data-research/finetuned_models" .
|
||||
```
|
||||
|
||||
## Fine-tuning
|
||||
|
||||
Go to `sh` folder, set the `WORKDIR` in `exp_with_args.sh` to be your cloned CodeT5 repository path.
|
||||
|
||||
You can use `run_exp.py` to run a broad set of experiments by simply passing the `model_tag`, `task`, and `sub_task`
|
||||
arguments. In total, we support five models (i.e., ['roberta', 'codebert', 'bart_base', 'codet5_small', 'codet5_base'])
|
||||
and six tasks (i.e., ['summarize', 'concode', 'translate', 'refine', 'defect', 'clone']). For each task, we use
|
||||
the `sub_task` to specify which specific datasets to fine-tne on. Below is the full list:
|
||||
|
||||
| \--task | \--sub\_task | Description |
|
||||
| --------- | ---------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| summarize | ruby/javascript/go/python/java/php | code summarization task on [CodeSearchNet](https://arxiv.org/abs/1909.09436) data with six PLs |
|
||||
| concode | none | text-to-code generation on [Concode](https://aclanthology.org/D18-1192.pdf) data |
|
||||
| translate | java-cs/cs-java | code-to-code translation between [Java and C#](https://arxiv.org/pdf/2102.04664.pdf) |
|
||||
| refine | small/medium | code refinement on [code repair data](https://arxiv.org/pdf/1812.08693.pdf) with small/medium functions |
|
||||
| defect | none | code defect detection in [C/C++ data](https://proceedings.neurips.cc/paper/2019/file/49265d2447bc3bbfe9e76306ce40a31f-Paper.pdf) |
|
||||
| clone | none | code clone detection in [Java data](https://arxiv.org/pdf/2002.08653.pdf) |
|
||||
|
||||
For example, if you want to run CodeT5-base model on the code summarization task for Python, you can simply run:
|
||||
|
||||
```
|
||||
python run_exp.py --model_tag codet5_base --task summarize --sub_task python
|
||||
```
|
||||
|
||||
For multi-task training, you can type:
|
||||
|
||||
```
|
||||
python run_exp.py --model_tag codet5_base --task multi_task --sub_task none
|
||||
```
|
||||
|
||||
Besides, you can specify:
|
||||
|
||||
```
|
||||
model_dir: where to save fine-tuning checkpoints
|
||||
res_dir: where to save the performance results
|
||||
summary_dir: where to save the training curves
|
||||
data_num: how many data instances to use, the default -1 is for using the full data
|
||||
gpu: the index of the GPU to use in the cluster
|
||||
```
|
||||
|
||||
You can also revise the suggested
|
||||
arguments [here](https://github.com/salesforce/CodeT5/blob/0bf3c0c43e92fcf54d9df68c793ac22f2b60aad4/sh/run_exp.py#L14) or directly customize the [exp_with_args.sh](https://github.com/salesforce/CodeT5/blob/main/sh/exp_with_args.sh) bash file.
|
||||
Please refer to the argument flags in [configs.py](https://github.com/salesforce/CodeT5/blob/main/configs.py) for the full
|
||||
available options. The saved training curves in `summary_dir` can be visualized using [tensorboard](https://pypi.org/project/tensorboard/).
|
||||
Note that we employ one A100 GPU for all fine-tuning experiments.
|
||||
|
||||
### How to reproduce the results using the released finetuned checkpoints?
|
||||
|
||||
* Remove the `--do_train --do_eval --do_eval_bleu` and reserve only `--do_test` at [here](https://github.com/salesforce/CodeT5/blob/5b37c34f4bbbfcfd972c24a9dd1f45716568ecb5/sh/exp_with_args.sh#L84).
|
||||
* Pass the path of your downloaded finetuned checkpoint to load at [here](https://github.com/salesforce/CodeT5/blob/5b37c34f4bbbfcfd972c24a9dd1f45716568ecb5/run_gen.py#L366), e.g., `file = "CodeT5/finetuned_models/summarize_python_codet5_base.bin"`
|
||||
* Run the program: `python run_exp.py --model_tag codet5_base --task summarize --sub_task python`
|
||||
|
||||
### How to fine-tune on your own task and dataset?
|
||||
If you want to fine-tune on your dataset, you can add your own task and sub_task in `configs.py` ([here](https://github.com/salesforce/CodeT5/blob/d27512d23ba6130e089e571d8c3e399760db1c31/configs.py#L11)) and add your data path and the function to read in `utils.py` ([here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L103) and [here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L149)). The read function can be implemented in `_utils.py` similar to [this one](https://github.com/salesforce/CodeT5/blob/aaf9c4a920c4986abfd54a74f5456b056b6409e0/_utils.py#L213). If your task to add is a generation task, you can simply reuse or customize the `run_gen.py`. For understanding tasks, please refer to `run_defect.py` and `run_clone.py`.
|
||||
|
||||
## Get Involved
|
||||
|
||||
|
BIN
codet5.gif
BIN
codet5.gif
Binary file not shown.
Before Width: | Height: | Size: 4.0 MiB After Width: | Height: | Size: 4.7 MiB |
Loading…
Reference in New Issue
Block a user