diff --git a/CodeT5+/README.md b/CodeT5+/README.md index 3fce6a8..11bd53d 100644 --- a/CodeT5+/README.md +++ b/CodeT5+/README.md @@ -7,43 +7,61 @@ Official research release for the **CodeT5+** models (`220M`, `770M`, `2B`, `6B` *Authors*: [Yue Wang](https://yuewang-cuhk.github.io/)\*, [Hung Le](https://sites.google.com/view/henryle2018/home?pli=1)\*, [Akhilesh Deepak Gotmare](https://akhileshgotmare.github.io/), [Nghi D.Q. Bui](https://bdqnghi.github.io/), [Junnan Li](https://sites.google.com/site/junnanlics), [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home) (* indicates equal contribution) # What is this about? -CodeT5+ is a new family of open code large language models with an encoder-decoder architecture that can flexibly operate in different modes (i.e. _encoder-only_, _decoder-only_, and _encoder-decoder_) to support a wide range of code understanding and generation tasks. +CodeT5+ is a new family of open code large language models with an encoder-decoder architecture that can flexibly operate in different modes (i.e. _encoder-only_, _decoder-only_, and _encoder-decoder_) to support a wide range of code understanding and generation tasks. +See the overview of CodeT5+ below. +![CodeT5+ overview](codet5p_overview.png) To train CodeT5+, we introduce a diverse set of pretraining tasks including _span denoising_, _causal language modeling_, _contrastive learning_, and _text-code matching_ to learn rich representations from both unimodal code data and bimodal code-text data. Additionally, to efficiently scale up the model, we propose a simple yet effective _compute-efficient pretraining_ method to initialize our model with frozen off-the-shelf LLMs such as [CodeGen](https://github.com/salesforce/CodeGen). Furthermore, we explore instruction tuning to align the model with natural language instructions following [Code Alpaca](https://github.com/sahil280114/codealpaca). +![CodeT5+ architecture](codet5p_architecture.png) + We implemented a family of CodeT5+ models, with model size ranging from 220M to 16B. -Note that CodeT5+ 220M and 770M employ the same architecture of CodeT5-base and large respectively and are pretrained from scratch, while CodeT5+ 2B, 6B, 16B employ a "_shallow encoder and deep decoder_" architecture with the shallow encoder initialized from CodeGen-mono 350M and the deep decoder initialized from CodeGen-mono 2B, 6B, 16B, respectively. +Note that CodeT5+ `220M` and `770M` employ the same architectures as CodeT5-base and CodeT5-large, respectively, and are pretrained from scratch, while CodeT5+ `2B`, `6B`, `16B` employ a "_shallow encoder and deep decoder_" architecture with the shallow encoder initialized from CodeGen-mono 350M and the deep decoder initialized from CodeGen-mono 2B, 6B, 16B, respectively. InstructCodeT5+ 16B is our instruction-tuned model from CodeT5+ 16B. -![CodeT5+ overview](codet5p_overview.png) + # Released Models -We release the following CodeT5+ models: +We release the following CodeT5+ models at Huggingface: -* CodeT5+ `220M` and `770M` at Huggingface [here](https://huggingface.co/Salesforce/codet5p-220m) and [here](https://huggingface.co/Salesforce/codet5p-770m), respectively. -* CodeT5+ `220M` and `770M` that are further tuned on Python subset at Huggingface [here](https://huggingface.co/Salesforce/codet5p-220m-py) and [here](https://huggingface.co/Salesforce/codet5p-770m-py), respectively. -* CodeT5+ `2B`, `6B`, `16B` will be released soon.
+* CodeT5+ `220M` and `770M`: [Salesforce/codet5p-220m](https://huggingface.co/Salesforce/codet5p-220m) and [codet5p-770m](https://huggingface.co/Salesforce/codet5p-770m). +* CodeT5+ `220M` and `770M` that are further tuned on Python subset: [codet5p-220m-py](https://huggingface.co/Salesforce/codet5p-220m-py) and [codet5p-770m-py](https://huggingface.co/Salesforce/codet5p-770m-py). +* CodeT5+ `2B`, `6B`, `16B`: [Salesforce/codet5p-2b](https://huggingface.co/Salesforce/codet5p-2b), [Salesforce/codet5p-6b](https://huggingface.co/Salesforce/codet5p-6b), and [Salesforce/codet5p-16b](https://huggingface.co/Salesforce/codet5p-16b). +* InstructCodeT5+ `16B`: [Salesforce/instructcodet5p-16b](https://huggingface.co/Salesforce/instructcodet5p-16b). # How to Use? -CodeT5+ `220M` and `770M` models can be easily loaded using the `T5ForConditionalGeneration` functionality. They employ the same tokenizer as the original [CodeT5](https://github.com/salesforce/CodeT5). +All CodeT5+ models and tokenizers can be easily loaded using the `AutoModelForSeq2SeqLM` and `AutoTokenizer` functionality. +For tokenizers, CodeT5+ `220M` and `770M` employ the same tokenizer as the original [CodeT5](https://github.com/salesforce/CodeT5) while CodeT5+ `2B`, `6B`, `16B` employ the same tokenizer as [CodeGen]( https://github.com/salesforce/CodeGen). +To load CodeT5+ `2B`, `6B`, `16B`, please set `trust_remote_code=True` as the [model class](https://huggingface.co/Salesforce/codet5p-16b/blob/main/modeling_codet5p.py) is defined in the Huggingface repo. + ```python -from transformers import T5ForConditionalGeneration, AutoTokenizer +from transformers import AutoModelForSeq2SeqLM, AutoTokenizer +import torch -checkpoint = "Salesforce/codet5p-770m-py" +checkpoint = "Salesforce/instructcodet5p-16b" device = "cuda" # for GPU usage or "cpu" for CPU usage tokenizer = AutoTokenizer.from_pretrained(checkpoint) -model = T5ForConditionalGeneration.from_pretrained(checkpoint).to(device) +model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, + torch_dtype=torch.float16, + low_cpu_mem_usage=True, + trust_remote_code=True).to(device) -inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device) -outputs = model.generate(inputs, max_length=10) +inputs = tokenizer.encode("def print_hello():", return_tensors="pt").to(device) +outputs = model.generate(inputs, max_length=12) print(tokenizer.decode(outputs[0], skip_special_tokens=True)) -# ==> print('Hello World!') + ``` +# Reproduce the Results + +## HumanEval + +TBA + ## Citation ```bibtex diff --git a/CodeT5+/codet5p_architecture.png b/CodeT5+/codet5p_architecture.png new file mode 100644 index 0000000..58ee617 Binary files /dev/null and b/CodeT5+/codet5p_architecture.png differ diff --git a/CodeT5.png b/CodeT5.png deleted file mode 100644 index f0741bd..0000000 Binary files a/CodeT5.png and /dev/null differ diff --git a/CodeT5/README.md b/CodeT5/README.md new file mode 100644 index 0000000..4f66888 --- /dev/null +++ b/CodeT5/README.md @@ -0,0 +1,203 @@ +# CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation + +This is the official PyTorch implementation for the following EMNLP 2021 paper from Salesforce Research: + +**Title**: [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf) + +**Authors**: [Yue Wang](https://yuewang-cuhk.github.io/), [Weishi Wang](https://www.linkedin.com/in/weishi-wang/) +, [Shafiq 
Joty](https://raihanjoty.github.io/), and [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home) + +![CodeT5 demo](../codet5.gif) + +## Updates + +**July 06, 2022** + +We release two large-sized CodeT5 checkpoints at Hugging Face: [Salesforce/codet5-large](https://huggingface.co/Salesforce/codet5-large) and [Salesforce/codet5-large-ntp-py](https://huggingface.co/Salesforce/codet5-large-ntp-py), which are introduced by the paper: [CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning](https://arxiv.org/pdf/2207.01780.pdf) by Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, Steven C.H. Hoi. + +* CodeT5-large was pretrained using the Masked Span Prediction (MSP) objective on CodeSearchNet and achieves new SOTA results on several CodeXGLUE benchmarks. The finetuned checkpoints are released [here](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models). See Appendix A.1 of the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details. + +* CodeT5-large-ntp-py was first pretrained using the Masked Span Prediction (MSP) objective on CodeSearchNet and GCPY (the Python split of [Github Code](https://huggingface.co/datasets/codeparrot/github-code) data), followed by another 10 epochs on GCPY using the Next Token Prediction (NTP) objective. + +CodeT5-large-ntp-py is especially optimized for Python code generation tasks and is employed as the foundation model for our [CodeRL](https://github.com/salesforce/CodeRL), yielding new SOTA results on the APPS Python competition-level program synthesis benchmark. See the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details. + +**Oct 29, 2021** + +We release [fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models) +for all the downstream tasks covered in the paper. + +**Oct 25, 2021** + +We release a CodeT5-base fine-tuned +checkpoint ([Salesforce/codet5-base-multi-sum](https://huggingface.co/Salesforce/codet5-base-multi-sum)) for +multilingual code summarization. Below is how to use this model: + +```python +from transformers import RobertaTokenizer, T5ForConditionalGeneration + +if __name__ == '__main__': + tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base') + model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base-multi-sum') + + text = """def svg_to_image(string, size=None): + if isinstance(string, unicode): + string = string.encode('utf-8') + renderer = QtSvg.QSvgRenderer(QtCore.QByteArray(string)) + if not renderer.isValid(): + raise ValueError('Invalid SVG data.') + if size is None: + size = renderer.defaultSize() + image = QtGui.QImage(size, QtGui.QImage.Format_ARGB32) + painter = QtGui.QPainter(image) + renderer.render(painter) + return image""" + + input_ids = tokenizer(text, return_tensors="pt").input_ids + + generated_ids = model.generate(input_ids, max_length=20) + print(tokenizer.decode(generated_ids[0], skip_special_tokens=True)) + # this prints: "Convert a SVG string to a QImage." +``` + +**Oct 18, 2021** + +We add a [model card](https://github.com/salesforce/CodeT5/blob/main/CodeT5_model_card.pdf) for CodeT5! Please reach out +if you have any questions about it. + +**Sep 24, 2021** + +CodeT5 is now on [Hugging Face](https://huggingface.co/)!
+ +You can simply load the model ([CodeT5-small](https://huggingface.co/Salesforce/codet5-small) +and [CodeT5-base](https://huggingface.co/Salesforce/codet5-base)) and do the inference: + +```python +from transformers import RobertaTokenizer, T5ForConditionalGeneration + +tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base') +model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base') + +text = "def greet(user): print(f'hello <extra_id_0>!')" +input_ids = tokenizer(text, return_tensors="pt").input_ids + +# simply generate one code span for the masked <extra_id_0> position +generated_ids = model.generate(input_ids, max_length=8) +print(tokenizer.decode(generated_ids[0], skip_special_tokens=True)) +# this prints "{user.username}" +``` + +## Introduction + +This repo provides the code for reproducing the experiments +in [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf) +. CodeT5 is a new pre-trained encoder-decoder model for programming languages, which is pre-trained on **8.35M** +functions in 8 programming languages (Python, Java, JavaScript, PHP, Ruby, Go, C, and C#). In total, it achieves +state-of-the-art results on **14 sub-tasks** in a code intelligence benchmark - [CodeXGLUE](https://github.com/microsoft/CodeXGLUE). + +Paper link: https://arxiv.org/abs/2109.00859 + +Blog link: https://blog.salesforceairesearch.com/codet5/ + +The code currently includes two pre-trained checkpoints ([CodeT5-small](https://huggingface.co/Salesforce/codet5-small) +and [CodeT5-base](https://huggingface.co/Salesforce/codet5-base)) and scripts to fine-tune them on 4 generation tasks ( +code summarization, code generation, translation, and refinement) plus 2 understanding tasks (code defect detection and +clone detection) in CodeXGLUE. We also provide their fine-tuned checkpoints to facilitate the easy replication +of our paper. + +In practice, CodeT5 can be deployed as an AI-powered coding assistant to boost the productivity of software developers. +At Salesforce, we build an [AI coding assistant demo](https://github.com/salesforce/CodeT5/raw/main/codet5.gif) using +CodeT5 as a VS Code plugin to provide three capabilities for Apex developers: + +- **Text-to-code generation**: generate code based on the natural language description. +- **Code autocompletion**: complete the whole function of code given the target function name. +- **Code summarization**: generate the summary of a function in natural language description. + +## Table of Contents + +1. [Dependency](#dependency) +2. [Download](#download) +3. [Fine-tuning](#fine-tuning) + +## Dependency + +- PyTorch 1.7.1 +- tensorboard 2.4.1 +- transformers 4.6.1 +- tree-sitter 0.2.2 + +## Download + +* [Pre-trained checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/pretrained_models) +* [Fine-tuning data](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/data) +* [Fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models) + +Instructions to download: + +``` +# pip install gsutil +cd your-cloned-codet5-path + +gsutil -m cp -r "gs://sfr-codet5-data-research/pretrained_models" . +gsutil -m cp -r "gs://sfr-codet5-data-research/data" . +gsutil -m cp -r "gs://sfr-codet5-data-research/finetuned_models" . +``` + +## Fine-tuning + +Go to the `sh` folder and set the `WORKDIR` in `exp_with_args.sh` to your cloned CodeT5 repository path.
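Before launching the fine-tuning scripts, you may want to sanity-check that a downloaded fine-tuned checkpoint loads correctly. Below is a minimal sketch (illustrative, not part of the official scripts); it assumes the `.bin` file stores a plain `state_dict` compatible with `T5ForConditionalGeneration` and that the gsutil commands above placed it under `finetuned_models/`; adjust the path to your setup:

```python
# Minimal sanity check (illustrative, not part of the official scripts):
# load a downloaded fine-tuned summarization checkpoint into CodeT5-base.
# Assumes the .bin file is a plain state_dict for T5ForConditionalGeneration.
import torch
from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')

state_dict = torch.load('finetuned_models/summarize_python_codet5_base.bin', map_location='cpu')
model.load_state_dict(state_dict)
model.eval()

code = "def add(a, b):\n    return a + b"
input_ids = tokenizer(code, return_tensors='pt').input_ids
summary_ids = model.generate(input_ids, max_length=20)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```

If the state dict does not load cleanly, fall back to the `run_exp.py` workflow described next, which loads checkpoints through the official scripts.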
+ +You can use `run_exp.py` to run a broad set of experiments by simply passing the `model_tag`, `task`, and `sub_task` +arguments. In total, we support five models (i.e., ['roberta', 'codebert', 'bart_base', 'codet5_small', 'codet5_base']) +and six tasks (i.e., ['summarize', 'concode', 'translate', 'refine', 'defect', 'clone']). For each task, we use +the `sub_task` to specify which specific datasets to fine-tune on. Below is the full list: + +| \--task | \--sub\_task | Description | | --------- | ---------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- | | summarize | ruby/javascript/go/python/java/php | code summarization task on [CodeSearchNet](https://arxiv.org/abs/1909.09436) data with six PLs | | concode | none | text-to-code generation on [Concode](https://aclanthology.org/D18-1192.pdf) data | | translate | java-cs/cs-java | code-to-code translation between [Java and C#](https://arxiv.org/pdf/2102.04664.pdf) | | refine | small/medium | code refinement on [code repair data](https://arxiv.org/pdf/1812.08693.pdf) with small/medium functions | | defect | none | code defect detection in [C/C++ data](https://proceedings.neurips.cc/paper/2019/file/49265d2447bc3bbfe9e76306ce40a31f-Paper.pdf) | | clone | none | code clone detection in [Java data](https://arxiv.org/pdf/2002.08653.pdf) | + +For example, if you want to run the CodeT5-base model on the code summarization task for Python, you can simply run: + +``` +python run_exp.py --model_tag codet5_base --task summarize --sub_task python +``` + +For multi-task training, you can type: + +``` +python run_exp.py --model_tag codet5_base --task multi_task --sub_task none +``` + +Besides, you can specify: + +``` +model_dir: where to save fine-tuning checkpoints +res_dir: where to save the performance results +summary_dir: where to save the training curves +data_num: how many data instances to use, the default -1 is for using the full data +gpu: the index of the GPU to use in the cluster +``` + +You can also revise the suggested +arguments [here](https://github.com/salesforce/CodeT5/blob/0bf3c0c43e92fcf54d9df68c793ac22f2b60aad4/sh/run_exp.py#L14) or directly customize the [exp_with_args.sh](https://github.com/salesforce/CodeT5/blob/main/sh/exp_with_args.sh) bash file. +Please refer to the argument flags in [configs.py](https://github.com/salesforce/CodeT5/blob/main/configs.py) for the full +available options. The saved training curves in `summary_dir` can be visualized using [tensorboard](https://pypi.org/project/tensorboard/). +Note that we employ one A100 GPU for all fine-tuning experiments. + +### How to reproduce the results using the released finetuned checkpoints? + +* Remove `--do_train --do_eval --do_eval_bleu` and keep only `--do_test` [here](https://github.com/salesforce/CodeT5/blob/5b37c34f4bbbfcfd972c24a9dd1f45716568ecb5/sh/exp_with_args.sh#L84). +* Pass the path of your downloaded finetuned checkpoint to load [here](https://github.com/salesforce/CodeT5/blob/5b37c34f4bbbfcfd972c24a9dd1f45716568ecb5/run_gen.py#L366), e.g., `file = "CodeT5/finetuned_models/summarize_python_codet5_base.bin"` +* Run the program: `python run_exp.py --model_tag codet5_base --task summarize --sub_task python` + +### How to fine-tune on your own task and dataset?
+If you want to fine-tune on your own dataset, you can add your own task and sub_task in `configs.py` ([here](https://github.com/salesforce/CodeT5/blob/d27512d23ba6130e089e571d8c3e399760db1c31/configs.py#L11)) and register your data path and the corresponding read function in `utils.py` ([here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L103) and [here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L149)). The read function can be implemented in `_utils.py`, similar to [this one](https://github.com/salesforce/CodeT5/blob/aaf9c4a920c4986abfd54a74f5456b056b6409e0/_utils.py#L213). If the task you add is a generation task, you can simply reuse or customize `run_gen.py`. For understanding tasks, please refer to `run_defect.py` and `run_clone.py`. + +## Get Involved + +Please create a GitHub issue if you have any questions, suggestions, requests, or bug reports. We welcome PRs! + diff --git a/_utils.py b/CodeT5/_utils.py similarity index 100% rename from _utils.py rename to CodeT5/_utils.py diff --git a/configs.py b/CodeT5/configs.py similarity index 100% rename from configs.py rename to CodeT5/configs.py diff --git a/CodeT5/evaluator/CodeBLEU/__init__.py b/CodeT5/evaluator/CodeBLEU/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/evaluator/CodeBLEU/bleu.py b/CodeT5/evaluator/CodeBLEU/bleu.py similarity index 97% rename from evaluator/CodeBLEU/bleu.py rename to CodeT5/evaluator/CodeBLEU/bleu.py index b320252..e1150c6 100644 --- a/evaluator/CodeBLEU/bleu.py +++ b/CodeT5/evaluator/CodeBLEU/bleu.py @@ -15,7 +15,7 @@ from fractions import Fraction import warnings from collections import Counter -from evaluator.CodeBLEU.utils import ngrams +from CodeT5.evaluator.CodeBLEU.utils import ngrams def sentence_bleu( diff --git a/evaluator/CodeBLEU/calc_code_bleu.py b/CodeT5/evaluator/CodeBLEU/calc_code_bleu.py similarity index 95% rename from evaluator/CodeBLEU/calc_code_bleu.py rename to CodeT5/evaluator/CodeBLEU/calc_code_bleu.py index e8d3a12..6d915c8 100644 --- a/evaluator/CodeBLEU/calc_code_bleu.py +++ b/CodeT5/evaluator/CodeBLEU/calc_code_bleu.py @@ -5,7 +5,7 @@ # -*- coding:utf-8 -*- import argparse import os -from evaluator.CodeBLEU import bleu, weighted_ngram_match, syntax_match, dataflow_match +from CodeT5.evaluator.CodeBLEU import weighted_ngram_match, bleu, dataflow_match, syntax_match def get_codebleu(refs, hyp, lang, params='0.25,0.25,0.25,0.25'): diff --git a/evaluator/CodeBLEU/dataflow_match.py b/CodeT5/evaluator/CodeBLEU/dataflow_match.py similarity index 89% rename from evaluator/CodeBLEU/dataflow_match.py rename to CodeT5/evaluator/CodeBLEU/dataflow_match.py index cc866b4..d8c19d2 100644 --- a/evaluator/CodeBLEU/dataflow_match.py +++ b/CodeT5/evaluator/CodeBLEU/dataflow_match.py @@ -1,11 +1,10 @@ # Copyright (c) Microsoft Corporation. # Licensed under the MIT license.
-from evaluator.CodeBLEU.parser import DFG_python, DFG_java, DFG_ruby, DFG_go, DFG_php, DFG_javascript, DFG_csharp -from evaluator.CodeBLEU.parser import (remove_comments_and_docstrings, - tree_to_token_index, - index_to_code_token, - tree_to_variable_index) +from CodeT5.evaluator.CodeBLEU.parser import DFG_python, DFG_java, DFG_ruby, DFG_go, DFG_php, DFG_javascript, DFG_csharp +from CodeT5.evaluator.CodeBLEU.parser import (remove_comments_and_docstrings, + tree_to_token_index, + index_to_code_token) from tree_sitter import Language, Parser import os diff --git a/evaluator/CodeBLEU/keywords/c_sharp.txt b/CodeT5/evaluator/CodeBLEU/keywords/c_sharp.txt similarity index 100% rename from evaluator/CodeBLEU/keywords/c_sharp.txt rename to CodeT5/evaluator/CodeBLEU/keywords/c_sharp.txt diff --git a/evaluator/CodeBLEU/keywords/java.txt b/CodeT5/evaluator/CodeBLEU/keywords/java.txt similarity index 100% rename from evaluator/CodeBLEU/keywords/java.txt rename to CodeT5/evaluator/CodeBLEU/keywords/java.txt diff --git a/evaluator/CodeBLEU/parser/DFG.py b/CodeT5/evaluator/CodeBLEU/parser/DFG.py similarity index 100% rename from evaluator/CodeBLEU/parser/DFG.py rename to CodeT5/evaluator/CodeBLEU/parser/DFG.py diff --git a/evaluator/CodeBLEU/parser/__init__.py b/CodeT5/evaluator/CodeBLEU/parser/__init__.py similarity index 100% rename from evaluator/CodeBLEU/parser/__init__.py rename to CodeT5/evaluator/CodeBLEU/parser/__init__.py diff --git a/evaluator/CodeBLEU/parser/build.py b/CodeT5/evaluator/CodeBLEU/parser/build.py similarity index 100% rename from evaluator/CodeBLEU/parser/build.py rename to CodeT5/evaluator/CodeBLEU/parser/build.py diff --git a/evaluator/CodeBLEU/parser/build.sh b/CodeT5/evaluator/CodeBLEU/parser/build.sh similarity index 100% rename from evaluator/CodeBLEU/parser/build.sh rename to CodeT5/evaluator/CodeBLEU/parser/build.sh diff --git a/evaluator/CodeBLEU/parser/my-languages.so b/CodeT5/evaluator/CodeBLEU/parser/my-languages.so similarity index 100% rename from evaluator/CodeBLEU/parser/my-languages.so rename to CodeT5/evaluator/CodeBLEU/parser/my-languages.so diff --git a/evaluator/CodeBLEU/parser/utils.py b/CodeT5/evaluator/CodeBLEU/parser/utils.py similarity index 100% rename from evaluator/CodeBLEU/parser/utils.py rename to CodeT5/evaluator/CodeBLEU/parser/utils.py diff --git a/evaluator/CodeBLEU/readme.txt b/CodeT5/evaluator/CodeBLEU/readme.txt similarity index 100% rename from evaluator/CodeBLEU/readme.txt rename to CodeT5/evaluator/CodeBLEU/readme.txt diff --git a/evaluator/CodeBLEU/syntax_match.py b/CodeT5/evaluator/CodeBLEU/syntax_match.py similarity index 84% rename from evaluator/CodeBLEU/syntax_match.py rename to CodeT5/evaluator/CodeBLEU/syntax_match.py index 57569b7..1dcc8e9 100644 --- a/evaluator/CodeBLEU/syntax_match.py +++ b/CodeT5/evaluator/CodeBLEU/syntax_match.py @@ -1,11 +1,8 @@ # Copyright (c) Microsoft Corporation. # Licensed under the MIT license. 
-from evaluator.CodeBLEU.parser import DFG_python, DFG_java, DFG_ruby, DFG_go, DFG_php, DFG_javascript, DFG_csharp -from evaluator.CodeBLEU.parser import (remove_comments_and_docstrings, - tree_to_token_index, - index_to_code_token, - tree_to_variable_index) +from CodeT5.evaluator.CodeBLEU.parser import DFG_python, DFG_java, DFG_ruby, DFG_go, DFG_php, DFG_javascript, DFG_csharp +from CodeT5.evaluator.CodeBLEU.parser import (remove_comments_and_docstrings) from tree_sitter import Language, Parser import os diff --git a/evaluator/CodeBLEU/utils.py b/CodeT5/evaluator/CodeBLEU/utils.py similarity index 100% rename from evaluator/CodeBLEU/utils.py rename to CodeT5/evaluator/CodeBLEU/utils.py diff --git a/evaluator/CodeBLEU/weighted_ngram_match.py b/CodeT5/evaluator/CodeBLEU/weighted_ngram_match.py similarity index 97% rename from evaluator/CodeBLEU/weighted_ngram_match.py rename to CodeT5/evaluator/CodeBLEU/weighted_ngram_match.py index f2d346d..edeeace 100644 --- a/evaluator/CodeBLEU/weighted_ngram_match.py +++ b/CodeT5/evaluator/CodeBLEU/weighted_ngram_match.py @@ -18,8 +18,7 @@ from fractions import Fraction import warnings from collections import Counter -from evaluator.CodeBLEU.utils import ngrams -import pdb +from CodeT5.evaluator.CodeBLEU.utils import ngrams def sentence_bleu( diff --git a/CodeT5/evaluator/__init__.py b/CodeT5/evaluator/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/evaluator/bleu.py b/CodeT5/evaluator/bleu.py similarity index 100% rename from evaluator/bleu.py rename to CodeT5/evaluator/bleu.py diff --git a/evaluator/smooth_bleu.py b/CodeT5/evaluator/smooth_bleu.py similarity index 100% rename from evaluator/smooth_bleu.py rename to CodeT5/evaluator/smooth_bleu.py diff --git a/models.py b/CodeT5/models.py similarity index 100% rename from models.py rename to CodeT5/models.py diff --git a/run_clone.py b/CodeT5/run_clone.py similarity index 100% rename from run_clone.py rename to CodeT5/run_clone.py diff --git a/run_defect.py b/CodeT5/run_defect.py similarity index 100% rename from run_defect.py rename to CodeT5/run_defect.py diff --git a/run_gen.py b/CodeT5/run_gen.py similarity index 100% rename from run_gen.py rename to CodeT5/run_gen.py diff --git a/run_multi_gen.py b/CodeT5/run_multi_gen.py similarity index 99% rename from run_multi_gen.py rename to CodeT5/run_multi_gen.py index bfa9b0f..c3dc221 100644 --- a/run_multi_gen.py +++ b/CodeT5/run_multi_gen.py @@ -23,14 +23,11 @@ import os import torch import logging import argparse -import math import numpy as np from tqdm import tqdm from itertools import cycle import multiprocessing import time -import sys -import pdb from torch.utils.tensorboard import SummaryWriter from torch.utils.data import DataLoader, SequentialSampler, RandomSampler diff --git a/sh/exp_with_args.sh b/CodeT5/sh/exp_with_args.sh similarity index 100% rename from sh/exp_with_args.sh rename to CodeT5/sh/exp_with_args.sh diff --git a/sh/run_exp.py b/CodeT5/sh/run_exp.py similarity index 100% rename from sh/run_exp.py rename to CodeT5/sh/run_exp.py diff --git a/tokenizer/apply_tokenizer.py b/CodeT5/tokenizer/apply_tokenizer.py similarity index 100% rename from tokenizer/apply_tokenizer.py rename to CodeT5/tokenizer/apply_tokenizer.py diff --git a/tokenizer/salesforce/codet5-merges.txt b/CodeT5/tokenizer/salesforce/codet5-merges.txt similarity index 100% rename from tokenizer/salesforce/codet5-merges.txt rename to CodeT5/tokenizer/salesforce/codet5-merges.txt diff --git a/tokenizer/salesforce/codet5-vocab.json 
b/CodeT5/tokenizer/salesforce/codet5-vocab.json similarity index 100% rename from tokenizer/salesforce/codet5-vocab.json rename to CodeT5/tokenizer/salesforce/codet5-vocab.json diff --git a/tokenizer/train_tokenizer.py b/CodeT5/tokenizer/train_tokenizer.py similarity index 100% rename from tokenizer/train_tokenizer.py rename to CodeT5/tokenizer/train_tokenizer.py diff --git a/utils.py b/CodeT5/utils.py similarity index 100% rename from utils.py rename to CodeT5/utils.py diff --git a/README.md b/README.md index ef4a9ba..e625c47 100644 --- a/README.md +++ b/README.md @@ -1,125 +1,54 @@ -# CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation +# CodeT5 and CodeT5+ -This is the official PyTorch implementation for the following EMNLP 2021 paper from Salesforce Research: +Official research release for **CodeT5** and **CodeT5+** models for a wide range of **Code Understanding and Generation** tasks from Salesforce Research. +These open code LLMs are introduced by the following papers: -**Title**: [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf) +*Title*: [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf) -**Authors**: [Yue Wang](https://yuewang-cuhk.github.io/), [Weishi Wang](https://www.linkedin.com/in/weishi-wang/) -, [Shafiq Joty](https://raihanjoty.github.io/), and [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home) +*Authors*: [Yue Wang](https://yuewang-cuhk.github.io/), [Weishi Wang](https://www.linkedin.com/in/weishi-wang/) +, [Shafiq Joty](https://raihanjoty.github.io/), [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home) -![CodeT5 demo](codet5.gif) +*Title*: [CodeT5+: Open Code Large Language Models for Code Understanding and Generation](https://arxiv.org/pdf/2305.07922.pdf) -## Updates +*Authors*: [Yue Wang](https://yuewang-cuhk.github.io/)\*, [Hung Le](https://sites.google.com/view/henryle2018/home?pli=1)\*, [Akhilesh Deepak Gotmare](https://akhileshgotmare.github.io/), [Nghi D.Q. Bui](https://bdqnghi.github.io/), [Junnan Li](https://sites.google.com/site/junnanlics), [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home) (* indicates equal contribution) -**July 06, 2022** -We release two large-sized CodeT5 checkpoints at Hugging Face: [Salesforce/codet5-large](https://huggingface.co/Salesforce/codet5-large) and [Salesforce/codet5-large-ntp-py](https://huggingface.co/Salesforce/codet5-large-ntp-py), which are introduced by the paper: [CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning](https://arxiv.org/pdf/2207.01780.pdf) by Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, Steven C.H. Hoi. - -* CodeT5-large was pretrained using Masked Span Prediction (MSP) objective on CodeSearchNet and achieve new SOTA results on several CodeXGLUE benchmarks. The finetuned checkpoints are released at [here](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models). See Appendix A.1 of the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details. - -* CodeT5-large-ntp-py was first pretrained using Masked Span Prediction (MSP) objective on CodeSearchNet and GCPY (the Python split of [Github Code](https://huggingface.co/datasets/codeparrot/github-code) data), followed by another 10 epochs on GCPY using Next Token Prediction (NTP) objective. 
- -CodeT5-large-ntp-py is especially optimized for Python code generation tasks and employed as the foundation model for our [CodeRL](https://github.com/salesforce/CodeRL), yielding new SOTA results on the APPS Python competition-level program synthesis benchmark. See the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details. - -**Oct 29, 2021** - -We release [fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models) -for all the downstream tasks covered in the paper. - -**Oct 25, 2021** - -We release a CodeT5-base fine-tuned -checkpoint ([Salesforce/codet5-base-multi-sum](https://huggingface.co/Salesforce/codet5-base-multi-sum)) for -multilingual code summarzation. Below is how to use this model: - -```python -from transformers import RobertaTokenizer, T5ForConditionalGeneration - -if __name__ == '__main__': - tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base') - model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base-multi-sum') - - text = """def svg_to_image(string, size=None): - if isinstance(string, unicode): - string = string.encode('utf-8') - renderer = QtSvg.QSvgRenderer(QtCore.QByteArray(string)) - if not renderer.isValid(): - raise ValueError('Invalid SVG data.') - if size is None: - size = renderer.defaultSize() - image = QtGui.QImage(size, QtGui.QImage.Format_ARGB32) - painter = QtGui.QPainter(image) - renderer.render(painter) - return image""" - - input_ids = tokenizer(text, return_tensors="pt").input_ids - - generated_ids = model.generate(input_ids, max_length=20) - print(tokenizer.decode(generated_ids[0], skip_special_tokens=True)) - # this prints: "Convert a SVG string to a QImage." -``` - -**Oct 18, 2021** - -We add a [model card](https://github.com/salesforce/CodeT5/blob/main/CodeT5_model_card.pdf) for CodeT5! Please reach out -if you have any questions about it. - -**Sep 24, 2021** - -CodeT5 is now in [hugginface](https://huggingface.co/)! - -You can simply load the model ([CodeT5-small](https://huggingface.co/Salesforce/codet5-small) -and [CodeT5-base](https://huggingface.co/Salesforce/codet5-base)) and do the inference: - -```python -from transformers import RobertaTokenizer, T5ForConditionalGeneration - -tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base') -model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base') - -text = "def greet(user): print(f'hello !')" -input_ids = tokenizer(text, return_tensors="pt").input_ids - -# simply generate one code span -generated_ids = model.generate(input_ids, max_length=8) -print(tokenizer.decode(generated_ids[0], skip_special_tokens=True)) -# this prints "{user.username}" -``` - -## Introduction - -This repo provides the code for reproducing the experiments -in [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf) -. CodeT5 is a new pre-trained encoder-decoder model for programming languages, which is pre-trained on **8.35M** -functions in 8 programming languages (Python, Java, JavaScript, PHP, Ruby, Go, C, and C#). In total, it achieves -state-of-the-art results on **14 sub-tasks** in a code intelligence benchmark - [CodeXGLUE](https://github.com/microsoft/CodeXGLUE). 
- -Paper link: https://arxiv.org/abs/2109.00859 - -Blog link: https://blog.salesforceairesearch.com/codet5/ - -The code currently includes two pre-trained checkpoints ([CodeT5-small](https://huggingface.co/Salesforce/codet5-small) -and [CodeT5-base](https://huggingface.co/Salesforce/codet5-base)) and scripts to fine-tune them on 4 generation tasks ( -code summarization, code generation, translation, and refinement) plus 2 understanding tasks (code defect detection and -clone detection) in CodeXGLUE. We also provide their fine-tuned checkpoints to facilitate the easy replication -of our paper. - -In practice, CodeT5 can be deployed as an AI-powered coding assistant to boost the productivity of software developers. -At Salesforce, we build an [AI coding assistant demo](https://github.com/salesforce/CodeT5/raw/main/codet5.gif) using -CodeT5 as a VS Code plugin to provide three capabilities for Apex developers: +In practice, CodeT5 and CodeT5+ models can be deployed as an AI-powered coding assistant to boost the productivity of software developers. +At Salesforce, we build an AI coding assistant demo using CodeT5 as a VS Code plugin to provide three capabilities for Apex developers: - **Text-to-code generation**: generate code based on the natural language description. - **Code autocompletion**: complete the whole function of code given the target function name. - **Code summarization**: generate the summary of a function in natural language description. -## Table of Contents +![CodeT5 demo](./codet5.gif) + +## What's New: 🎉 + +**May 2023** + +**CodeT5+** Paper and models released! ([paper](https://arxiv.org/pdf/2305.07922.pdf), [code](https://github.com/salesforce/CodeT5/tree/main/CodeT5+)) + +**July 2022** + +We release two large-sized CodeT5 checkpoints at Hugging Face: [Salesforce/codet5-large](https://huggingface.co/Salesforce/codet5-large) and [Salesforce/codet5-large-ntp-py](https://huggingface.co/Salesforce/codet5-large-ntp-py), which are introduced by the [CodeRL paper](https://arxiv.org/pdf/2207.01780.pdf). + +**Oct 2021** + +We release [fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models) +for all the downstream tasks covered in the paper. +Besides, we release a CodeT5-base fine-tuned +checkpoint ([Salesforce/codet5-base-multi-sum](https://huggingface.co/Salesforce/codet5-base-multi-sum)) for +multilingual code summarization. + + +**Sep 2021** + +CodeT5 is now on [Hugging Face](https://huggingface.co/)! ([CodeT5-small](https://huggingface.co/Salesforce/codet5-small) and [CodeT5-base](https://huggingface.co/Salesforce/codet5-base)). + +We add a [model card](https://github.com/salesforce/CodeT5/blob/main/CodeT5_model_card.pdf) for CodeT5! Please reach out +if you have any questions about it. + -1. [Citation](#citation) -2. [License](#license) -3. [Dependency](#dependency) -4. [Download](#download) -5. [Fine-tuning](#fine-tuning) -6. [Get Involved](#get-involved) ## Citation @@ -130,15 +59,24 @@ If you find this code to be useful for your research, please consider citing: wang2021codet5, title={CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation}, author={Yue Wang, Weishi Wang, Shafiq Joty, Steven C.H.
Hoi}, - booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021}, + booktitle={EMNLP}, year={2021}, } -@article{coderl2022, - title={CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning}, - author={Le, Hung and Wang, Yue and Gotmare, Akhilesh Deepak and Savarese, Silvio and Hoi, Steven C. H.}, - journal={arXiv preprint arXiv:2207.01780}, - year={2022} +@inproceedings{ + le2022coderl, + title={CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning}, + author={Le, Hung and Wang, Yue and Gotmare, Akhilesh Deepak and Savarese, Silvio and Hoi, Steven C. H.}, + journal={NeurIPS}, + year={2022} +} + +@article{ + wang2023codet5plus, + title={CodeT5+: Open Code Large Language Models for Code Understanding and Generation}, + author={Wang, Yue and Le, Hung and Gotmare, Akhilesh Deepak and Bui, Nghi D.Q. and Li, Junnan and Hoi, Steven C. H.}, + journal={arXiv preprint}, + year={2023} } ``` @@ -162,84 +100,6 @@ codeT5@salesforce.com, and to use [appropriate](https://arxiv.org/abs/1810.03993) [documentation](https://www.partnershiponai.org/about-ml/) when developing high-stakes applications of this model. -## Dependency - -- Pytorch 1.7.1 -- tensorboard 2.4.1 -- transformers 4.6.1 -- tree-sitter 0.2.2 - -## Download - -* [Pre-trained checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/pretrained_models) -* [Fine-tuning data](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/data) -* [Fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models) - -Instructions to download: - -``` -# pip install gsutil -cd your-cloned-codet5-path - -gsutil -m cp -r "gs://sfr-codet5-data-research/pretrained_models" . -gsutil -m cp -r "gs://sfr-codet5-data-research/data" . -gsutil -m cp -r "gs://sfr-codet5-data-research/finetuned_models" . -``` - -## Fine-tuning - -Go to `sh` folder, set the `WORKDIR` in `exp_with_args.sh` to be your cloned CodeT5 repository path. - -You can use `run_exp.py` to run a broad set of experiments by simply passing the `model_tag`, `task`, and `sub_task` -arguments. In total, we support five models (i.e., ['roberta', 'codebert', 'bart_base', 'codet5_small', 'codet5_base']) -and six tasks (i.e., ['summarize', 'concode', 'translate', 'refine', 'defect', 'clone']). For each task, we use -the `sub_task` to specify which specific datasets to fine-tne on. 
Below is the full list: - -| \--task | \--sub\_task | Description | -| --------- | ---------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- | -| summarize | ruby/javascript/go/python/java/php | code summarization task on [CodeSearchNet](https://arxiv.org/abs/1909.09436) data with six PLs | -| concode | none | text-to-code generation on [Concode](https://aclanthology.org/D18-1192.pdf) data | -| translate | java-cs/cs-java | code-to-code translation between [Java and C#](https://arxiv.org/pdf/2102.04664.pdf) | -| refine | small/medium | code refinement on [code repair data](https://arxiv.org/pdf/1812.08693.pdf) with small/medium functions | -| defect | none | code defect detection in [C/C++ data](https://proceedings.neurips.cc/paper/2019/file/49265d2447bc3bbfe9e76306ce40a31f-Paper.pdf) | -| clone | none | code clone detection in [Java data](https://arxiv.org/pdf/2002.08653.pdf) | - -For example, if you want to run CodeT5-base model on the code summarization task for Python, you can simply run: - -``` -python run_exp.py --model_tag codet5_base --task summarize --sub_task python -``` - -For multi-task training, you can type: - -``` -python run_exp.py --model_tag codet5_base --task multi_task --sub_task none -``` - -Besides, you can specify: - -``` -model_dir: where to save fine-tuning checkpoints -res_dir: where to save the performance results -summary_dir: where to save the training curves -data_num: how many data instances to use, the default -1 is for using the full data -gpu: the index of the GPU to use in the cluster -``` - -You can also revise the suggested -arguments [here](https://github.com/salesforce/CodeT5/blob/0bf3c0c43e92fcf54d9df68c793ac22f2b60aad4/sh/run_exp.py#L14) or directly customize the [exp_with_args.sh](https://github.com/salesforce/CodeT5/blob/main/sh/exp_with_args.sh) bash file. -Please refer to the argument flags in [configs.py](https://github.com/salesforce/CodeT5/blob/main/configs.py) for the full -available options. The saved training curves in `summary_dir` can be visualized using [tensorboard](https://pypi.org/project/tensorboard/). -Note that we employ one A100 GPU for all fine-tuning experiments. - -### How to reproduce the results using the released finetuned checkpoints? - -* Remove the `--do_train --do_eval --do_eval_bleu` and reserve only `--do_test` at [here](https://github.com/salesforce/CodeT5/blob/5b37c34f4bbbfcfd972c24a9dd1f45716568ecb5/sh/exp_with_args.sh#L84). -* Pass the path of your downloaded finetuned checkpoint to load at [here](https://github.com/salesforce/CodeT5/blob/5b37c34f4bbbfcfd972c24a9dd1f45716568ecb5/run_gen.py#L366), e.g., `file = "CodeT5/finetuned_models/summarize_python_codet5_base.bin"` -* Run the program: `python run_exp.py --model_tag codet5_base --task summarize --sub_task python` - -### How to fine-tune on your own task and dataset? -If you want to fine-tune on your dataset, you can add your own task and sub_task in `configs.py` ([here](https://github.com/salesforce/CodeT5/blob/d27512d23ba6130e089e571d8c3e399760db1c31/configs.py#L11)) and add your data path and the function to read in `utils.py` ([here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L103) and [here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L149)). 
The read function can be implemented in `_utils.py` similar to [this one](https://github.com/salesforce/CodeT5/blob/aaf9c4a920c4986abfd54a74f5456b056b6409e0/_utils.py#L213). If your task to add is a generation task, you can simply reuse or customize the `run_gen.py`. For understanding tasks, please refer to `run_defect.py` and `run_clone.py`. ## Get Involved diff --git a/codet5.gif b/codet5.gif index 9accaaf..2205845 100644 Binary files a/codet5.gif and b/codet5.gif differ