reorganize the repo

2024-10-01 06:35:38 -04:00 · 2023-05-17 17:34:00 +08:00 · 71ccd12773
commit 71ccd12773
parent d5c9d81af1
39 changed files with 297 additions and 224 deletions
--- a/CodeT5+/README.md
+++ b/CodeT5+/README.md
@@ -8,42 +8,60 @@ Official research release for the **CodeT5+** models (`220M`, `770M`, `2B`, `6B`

 # What is this about?
 CodeT5+ is a new family of open code large language models with an encoder-decoder architecture that can flexibly operate in different modes (i.e. _encoder-only_, _decoder-only_, and _encoder-decoder_) to support a wide range of code understanding and generation tasks.
+See the overview of CodeT5+ below.
+![CodeT5+ overview](codet5p_overview.png)

 To train CodeT5+, we introduce a diverse set of pretraining tasks including _span denoising_, _causal language modeling_, _contrastive learning_, and _text-code matching_ to learn rich representations from both unimodal code data and bimodal code-text data. 
 Additionally, to efficiently scale up the model, we propose a simple yet effective _compute-efficient pretraining_ method to initialize our model with frozen off-the-shelf LLMs such as [CodeGen](https://github.com/salesforce/CodeGen). 
 Furthermore, we explore instruction tuning to align the model with natural language instructions following [Code Alpaca](https://github.com/sahil280114/codealpaca). 
+![CodeT5+ architecture](codet5p_architecture.png)
+

 We implemented a family of CodeT5+ models, with model size ranging from 220M to 16B. 
-Note that CodeT5+ 220M and 770M employ the same architecture of CodeT5-base and large respectively and are pretrained from scratch, while CodeT5+ 2B, 6B, 16B employ a "_shallow encoder and deep decoder_" architecture with the shallow encoder initialized from CodeGen-mono 350M and the deep decoder initialized from CodeGen-mono 2B, 6B, 16B, respectively.
+Note that CodeT5+ `220M` and `770M` employ the same architecture of CodeT5-base and large respectively and are pretrained from scratch, while CodeT5+ `2B`, `6B`, `16B` employ a "_shallow encoder and deep decoder_" architecture with the shallow encoder initialized from CodeGen-mono 350M and the deep decoder initialized from CodeGen-mono 2B, 6B, 16B, respectively.
 InstructCodeT5+ 16B is our instruction-tuned model from CodeT5+ 16B. 

-![CodeT5+ overview](codet5p_overview.png)
+

 # Released Models
-We release the following CodeT5+ models:
+We release the following CodeT5+ models at Huggingface:

-* CodeT5+ `220M` and `770M` at Huggingface [here](https://huggingface.co/Salesforce/codet5p-220m) and [here](https://huggingface.co/Salesforce/codet5p-770m), respectively.
-* CodeT5+ `220M` and `770M` that are further tuned on Python subset at Huggingface [here](https://huggingface.co/Salesforce/codet5p-220m-py) and [here](https://huggingface.co/Salesforce/codet5p-770m-py), respectively.
-* CodeT5+ `2B`, `6B`, `16B` will be released soon.
+* CodeT5+ `220M` and `770M`: [Salesforce/codet5p-220m](https://huggingface.co/Salesforce/codet5p-220m) and [codet5p-770m](https://huggingface.co/Salesforce/codet5p-770m).
+* CodeT5+ `220M` and `770M` that are further tuned on Python subset: [codet5p-220m-py](https://huggingface.co/Salesforce/codet5p-220m-py) and [codet5p-770m-py](https://huggingface.co/Salesforce/codet5p-770m-py).
+* CodeT5+ `2B`, `6B`, `16B`: [Salesforce/codet5p-2b](https://huggingface.co/Salesforce/codet5p-2b), [Salesforce/codet5p-6b](https://huggingface.co/Salesforce/codet5p-6b), and [Salesforce/codet5p-16b](https://huggingface.co/Salesforce/codet5p-16b).
+* InstructCodeT5+ `16B`: [Salesforce/instructcodet5p-16b](https://huggingface.co/Salesforce/instructcodet5p-16b).

 # How to Use?
-CodeT5+ `220M` and `770M` models can be easily loaded using the `T5ForConditionalGeneration` functionality. They employ the same tokenizer as the original [CodeT5](https://github.com/salesforce/CodeT5).
+All CodeT5+ models and tokenizers can be easily loaded using the `AutoModelForSeq2SeqLM` and `AutoTokenizer` functionality. 
+For tokenizers, CodeT5+ `220M` and `770M` employ the same tokenizer as the original [CodeT5](https://github.com/salesforce/CodeT5) while CodeT5+ `2B`, `6B`, `16B` employ the same tokenizer as [CodeGen](https://github.com/salesforce/CodeGen).
+To load CodeT5+ `2B`, `6B`, `16B`, please set `trust_remote_code=True` as the [model class](https://huggingface.co/Salesforce/codet5p-16b/blob/main/modeling_codet5p.py) is defined in the Huggingface repo.
+

 ```python
-from transformers import T5ForConditionalGeneration, AutoTokenizer
+from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+import torch

-checkpoint = "Salesforce/codet5p-770m-py"
+checkpoint = "Salesforce/instructcodet5p-16b"
 device = "cuda" # for GPU usage or "cpu" for CPU usage

 tokenizer = AutoTokenizer.from_pretrained(checkpoint)
-model = T5ForConditionalGeneration.from_pretrained(checkpoint).to(device)
+model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint,
+                                              torch_dtype=torch.float16,
+                                              low_cpu_mem_usage=True,
+                                              trust_remote_code=True).to(device)

-inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
-outputs = model.generate(inputs, max_length=10)
+inputs = tokenizer.encode("def print_hello():", return_tensors="pt").to(device)
+outputs = model.generate(inputs, max_length=12)
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
-# ==>     print('Hello World!')
+
 ```

+# Reproduce the Results
+
+## HumanEval
+
+TBA
+
 ## Citation

 ```bibtex
--- a/CodeT5+/codet5p_architecture.png
+++ b/CodeT5+/codet5p_architecture.png
--- a/CodeT5.png
+++ b/CodeT5.png
--- a/CodeT5/README.md
+++ b/CodeT5/README.md
@@ -0,0 +1,203 @@
+# CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
+
+This is the official PyTorch implementation for the following EMNLP 2021 paper from Salesforce Research:
+
+**Title**: [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf)
+
+**Authors**: [Yue Wang](https://yuewang-cuhk.github.io/), [Weishi Wang](https://www.linkedin.com/in/weishi-wang/)
+, [Shafiq Joty](https://raihanjoty.github.io/), and [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home)
+
+![CodeT5 demo](../codet5.gif)
+
+## Updates
+
+**July 06, 2022**
+
+We release two large-sized CodeT5 checkpoints at Hugging Face: [Salesforce/codet5-large](https://huggingface.co/Salesforce/codet5-large) and [Salesforce/codet5-large-ntp-py](https://huggingface.co/Salesforce/codet5-large-ntp-py), which are introduced by the paper: [CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning](https://arxiv.org/pdf/2207.01780.pdf) by Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, Steven C.H. Hoi.
+
+* CodeT5-large was pretrained using Masked Span Prediction (MSP) objective on CodeSearchNet and achieve new SOTA results on several CodeXGLUE benchmarks. The finetuned checkpoints are released at [here](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models). See Appendix A.1 of the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
+
+* CodeT5-large-ntp-py was first pretrained using Masked Span Prediction (MSP) objective on CodeSearchNet and GCPY (the Python split of [Github Code](https://huggingface.co/datasets/codeparrot/github-code) data), followed by another 10 epochs on GCPY using Next Token Prediction (NTP) objective. 
+
+CodeT5-large-ntp-py is especially optimized for Python code generation tasks and employed as the foundation model for our [CodeRL](https://github.com/salesforce/CodeRL), yielding new SOTA results on the APPS Python competition-level program synthesis benchmark. See the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
+
+**Oct 29, 2021**
+
+We release [fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models)
+for all the downstream tasks covered in the paper.
+
+**Oct 25, 2021**
+
+We release a CodeT5-base fine-tuned
+checkpoint ([Salesforce/codet5-base-multi-sum](https://huggingface.co/Salesforce/codet5-base-multi-sum)) for
+multilingual code summarization. Below is how to use this model:
+
+```python
+from transformers import RobertaTokenizer, T5ForConditionalGeneration
+
+if __name__ == '__main__':
+    tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
+    model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base-multi-sum')
+
+    text = """def svg_to_image(string, size=None):
+    if isinstance(string, unicode):
+        string = string.encode('utf-8')
+        renderer = QtSvg.QSvgRenderer(QtCore.QByteArray(string))
+    if not renderer.isValid():
+        raise ValueError('Invalid SVG data.')
+    if size is None:
+        size = renderer.defaultSize()
+        image = QtGui.QImage(size, QtGui.QImage.Format_ARGB32)
+        painter = QtGui.QPainter(image)
+        renderer.render(painter)
+    return image"""
+
+    input_ids = tokenizer(text, return_tensors="pt").input_ids
+
+    generated_ids = model.generate(input_ids, max_length=20)
+    print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
+    # this prints: "Convert a SVG string to a QImage."
+```
+
+**Oct 18, 2021**
+
+We add a [model card](https://github.com/salesforce/CodeT5/blob/main/CodeT5_model_card.pdf) for CodeT5! Please reach out
+if you have any questions about it.
+
+**Sep 24, 2021**
+
+CodeT5 is now on [Hugging Face](https://huggingface.co/)!
+
+You can simply load the model ([CodeT5-small](https://huggingface.co/Salesforce/codet5-small)
+and [CodeT5-base](https://huggingface.co/Salesforce/codet5-base)) and do the inference:
+
+```python
+from transformers import RobertaTokenizer, T5ForConditionalGeneration
+
+tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
+model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')
+
+text = "def greet(user): print(f'hello <extra_id_0>!')"
+input_ids = tokenizer(text, return_tensors="pt").input_ids
+
+# simply generate one code span
+generated_ids = model.generate(input_ids, max_length=8)
+print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
+# this prints "{user.username}"
+```
+
+## Introduction
+
+This repo provides the code for reproducing the experiments
+in [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf)
+. CodeT5 is a new pre-trained encoder-decoder model for programming languages, which is pre-trained on **8.35M**
+functions in 8 programming languages (Python, Java, JavaScript, PHP, Ruby, Go, C, and C#). In total, it achieves
+state-of-the-art results on **14 sub-tasks** in a code intelligence benchmark - [CodeXGLUE](https://github.com/microsoft/CodeXGLUE).
+
+Paper link: https://arxiv.org/abs/2109.00859
+
+Blog link: https://blog.salesforceairesearch.com/codet5/
+
+The code currently includes two pre-trained checkpoints ([CodeT5-small](https://huggingface.co/Salesforce/codet5-small)
+and [CodeT5-base](https://huggingface.co/Salesforce/codet5-base)) and scripts to fine-tune them on 4 generation tasks (
+code summarization, code generation, translation, and refinement) plus 2 understanding tasks (code defect detection and
+clone detection) in CodeXGLUE. We also provide their fine-tuned checkpoints to facilitate the easy replication
+of our paper.
+
+In practice, CodeT5 can be deployed as an AI-powered coding assistant to boost the productivity of software developers.
+At Salesforce, we build an [AI coding assistant demo](https://github.com/salesforce/CodeT5/raw/main/codet5.gif) using
+CodeT5 as a VS Code plugin to provide three capabilities for Apex developers:
+
+- **Text-to-code generation**: generate code based on the natural language description.
+- **Code autocompletion**: complete the whole function of code given the target function name.
+- **Code summarization**: generate the summary of a function in natural language description.
+
+## Table of Contents
+
+1. [Dependency](#dependency)
+2. [Download](#download)
+3. [Fine-tuning](#fine-tuning)
+
+## Dependency
+
+- Pytorch 1.7.1
+- tensorboard 2.4.1
+- transformers 4.6.1
+- tree-sitter 0.2.2
+
+## Download
+
+* [Pre-trained checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/pretrained_models)
+* [Fine-tuning data](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/data)
+* [Fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models)
+
+Instructions to download:
+
+```
+# pip install gsutil
+cd your-cloned-codet5-path
+
+gsutil -m cp -r "gs://sfr-codet5-data-research/pretrained_models" .
+gsutil -m cp -r "gs://sfr-codet5-data-research/data" .
+gsutil -m cp -r "gs://sfr-codet5-data-research/finetuned_models" .
+```
+
+## Fine-tuning
+
+Go to `sh` folder, set the `WORKDIR` in `exp_with_args.sh` to be your cloned CodeT5 repository path.
+
+You can use `run_exp.py` to run a broad set of experiments by simply passing the `model_tag`, `task`, and `sub_task`
+arguments. In total, we support five models (i.e., ['roberta', 'codebert', 'bart_base', 'codet5_small', 'codet5_base'])
+and six tasks (i.e., ['summarize', 'concode', 'translate', 'refine', 'defect', 'clone']). For each task, we use
+the `sub_task` to specify which specific datasets to fine-tune on. Below is the full list:
+
+| \--task   | \--sub\_task                       | Description                                                                                                                      |
+| --------- | ---------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- |
+| summarize | ruby/javascript/go/python/java/php | code summarization task on [CodeSearchNet](https://arxiv.org/abs/1909.09436) data with six PLs                                   |
+| concode   | none                               | text-to-code generation on [Concode](https://aclanthology.org/D18-1192.pdf) data                                                 |
+| translate | java-cs/cs-java                    | code-to-code translation between [Java and C#](https://arxiv.org/pdf/2102.04664.pdf)                                             |
+| refine    | small/medium                       | code refinement on [code repair data](https://arxiv.org/pdf/1812.08693.pdf) with small/medium functions                          |
+| defect    | none                               | code defect detection in [C/C++ data](https://proceedings.neurips.cc/paper/2019/file/49265d2447bc3bbfe9e76306ce40a31f-Paper.pdf) |
+| clone     | none                               | code clone detection in [Java data](https://arxiv.org/pdf/2002.08653.pdf)                                                        |
+
+For example, if you want to run CodeT5-base model on the code summarization task for Python, you can simply run:
+
+```
+python run_exp.py --model_tag codet5_base --task summarize --sub_task python
+```
+
+For multi-task training, you can type:
+
+```
+python run_exp.py --model_tag codet5_base --task multi_task --sub_task none
+```
+
+Besides, you can specify:
+
+```
+model_dir: where to save fine-tuning checkpoints
+res_dir: where to save the performance results 
+summary_dir: where to save the training curves
+data_num: how many data instances to use, the default -1 is for using the full data
+gpu: the index of the GPU to use in the cluster
+``` 
+
+You can also revise the suggested
+arguments [here](https://github.com/salesforce/CodeT5/blob/0bf3c0c43e92fcf54d9df68c793ac22f2b60aad4/sh/run_exp.py#L14) or directly customize the [exp_with_args.sh](https://github.com/salesforce/CodeT5/blob/main/sh/exp_with_args.sh) bash file.
+Please refer to the argument flags in [configs.py](https://github.com/salesforce/CodeT5/blob/main/configs.py) for the full
+available options. The saved training curves in `summary_dir` can be visualized using [tensorboard](https://pypi.org/project/tensorboard/).
+Note that we employ one A100 GPU for all fine-tuning experiments.
+
+### How to reproduce the results using the released finetuned checkpoints?
+
+* Remove the `--do_train --do_eval --do_eval_bleu` and reserve only `--do_test` at [here](https://github.com/salesforce/CodeT5/blob/5b37c34f4bbbfcfd972c24a9dd1f45716568ecb5/sh/exp_with_args.sh#L84). 
+* Pass the path of your downloaded finetuned checkpoint to load at [here](https://github.com/salesforce/CodeT5/blob/5b37c34f4bbbfcfd972c24a9dd1f45716568ecb5/run_gen.py#L366), e.g., `file = "CodeT5/finetuned_models/summarize_python_codet5_base.bin"`
+* Run the program: `python run_exp.py --model_tag codet5_base --task summarize --sub_task python`
+
+### How to fine-tune on your own task and dataset?
+If you want to fine-tune on your dataset, you can add your own task and sub_task in `configs.py` ([here](https://github.com/salesforce/CodeT5/blob/d27512d23ba6130e089e571d8c3e399760db1c31/configs.py#L11)) and add your data path and the function to read in `utils.py` ([here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L103) and [here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L149)). The read function can be implemented in `_utils.py` similar to [this one](https://github.com/salesforce/CodeT5/blob/aaf9c4a920c4986abfd54a74f5456b056b6409e0/_utils.py#L213). If your task to add is a generation task, you can simply reuse or customize the `run_gen.py`. For understanding tasks, please refer to `run_defect.py` and `run_clone.py`.
+
+## Get Involved
+
+Please create a GitHub issue if you have any questions, suggestions, requests or bug-reports. We welcome PRs!
+
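The `--task` / `--sub_task` table in the README added above can be read as a small lookup. The helper below is a hypothetical sketch (it is not part of `run_exp.py`) that checks a combination against that table and assembles the command line shown in the README. The `multi_task` entry is taken from the README's multi-task example command, and the name `build_command` is an assumption made for illustration.

```python
# Valid sub-tasks per task, transcribed from the README table above.
SUB_TASKS = {
    "summarize": ["ruby", "javascript", "go", "python", "java", "php"],
    "concode": ["none"],
    "translate": ["java-cs", "cs-java"],
    "refine": ["small", "medium"],
    "defect": ["none"],
    "clone": ["none"],
    "multi_task": ["none"],  # from the multi-task example command
}

# Model tags listed in the README's fine-tuning section.
MODEL_TAGS = ["roberta", "codebert", "bart_base", "codet5_small", "codet5_base"]


def build_command(model_tag, task, sub_task="none"):
    """Hypothetical helper: validate the arguments, then return an argv list."""
    if model_tag not in MODEL_TAGS:
        raise ValueError(f"unknown model_tag: {model_tag!r}")
    if task not in SUB_TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if sub_task not in SUB_TASKS[task]:
        raise ValueError(f"sub_task {sub_task!r} is not listed for task {task!r}")
    return ["python", "run_exp.py",
            "--model_tag", model_tag,
            "--task", task,
            "--sub_task", sub_task]


print(" ".join(build_command("codet5_base", "summarize", "python")))
```

Returning a list rather than a single string keeps the result usable with `subprocess.run` without shell quoting, should one want to launch the experiment from Python.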
--- a/CodeT5/_utils.py
+++ b/CodeT5/_utils.py
--- a/CodeT5/configs.py
+++ b/CodeT5/configs.py
--- a/CodeT5/evaluator/CodeBLEU/init.py
+++ b/CodeT5/evaluator/CodeBLEU/init.py
--- a/CodeT5/evaluator/CodeBLEU/bleu.py
+++ b/CodeT5/evaluator/CodeBLEU/bleu.py
@@ -15,7 +15,7 @@ from fractions import Fraction
 import warnings
 from collections import Counter

-from evaluator.CodeBLEU.utils import ngrams
+from CodeT5.evaluator.CodeBLEU.utils import ngrams


 def sentence_bleu(
--- a/CodeT5/evaluator/CodeBLEU/calc_code_bleu.py
+++ b/CodeT5/evaluator/CodeBLEU/calc_code_bleu.py
@@ -5,7 +5,7 @@
 # -*- coding:utf-8 -*-
 import argparse
 import os
-from evaluator.CodeBLEU import bleu, weighted_ngram_match, syntax_match, dataflow_match
+from CodeT5.evaluator.CodeBLEU import weighted_ngram_match, bleu, dataflow_match, syntax_match


 def get_codebleu(refs, hyp, lang, params='0.25,0.25,0.25,0.25'):
--- a/CodeT5/evaluator/CodeBLEU/dataflow_match.py
+++ b/CodeT5/evaluator/CodeBLEU/dataflow_match.py
@@ -1,11 +1,10 @@
 # Copyright (c) Microsoft Corporation. 
 # Licensed under the MIT license.

-from evaluator.CodeBLEU.parser import DFG_python, DFG_java, DFG_ruby, DFG_go, DFG_php, DFG_javascript, DFG_csharp
-from evaluator.CodeBLEU.parser import (remove_comments_and_docstrings,
-                                       tree_to_token_index,
-                                       index_to_code_token,
-                                       tree_to_variable_index)
+from CodeT5.evaluator.CodeBLEU.parser import DFG_python, DFG_java, DFG_ruby, DFG_go, DFG_php, DFG_javascript, DFG_csharp
+from CodeT5.evaluator.CodeBLEU.parser import (remove_comments_and_docstrings,
+                                              tree_to_token_index,
+                                              index_to_code_token)
 from tree_sitter import Language, Parser
 import os

--- a/CodeT5/evaluator/CodeBLEU/keywords/c_sharp.txt
+++ b/CodeT5/evaluator/CodeBLEU/keywords/c_sharp.txt
--- a/CodeT5/evaluator/CodeBLEU/keywords/java.txt
+++ b/CodeT5/evaluator/CodeBLEU/keywords/java.txt
--- a/CodeT5/evaluator/CodeBLEU/parser/DFG.py
+++ b/CodeT5/evaluator/CodeBLEU/parser/DFG.py
--- a/CodeT5/evaluator/CodeBLEU/parser/init.py
+++ b/CodeT5/evaluator/CodeBLEU/parser/init.py
--- a/CodeT5/evaluator/CodeBLEU/parser/build.py
+++ b/CodeT5/evaluator/CodeBLEU/parser/build.py
--- a/CodeT5/evaluator/CodeBLEU/parser/build.sh
+++ b/CodeT5/evaluator/CodeBLEU/parser/build.sh
--- a/CodeT5/evaluator/CodeBLEU/parser/my-languages.so
+++ b/CodeT5/evaluator/CodeBLEU/parser/my-languages.so
--- a/CodeT5/evaluator/CodeBLEU/parser/utils.py
+++ b/CodeT5/evaluator/CodeBLEU/parser/utils.py
--- a/CodeT5/evaluator/CodeBLEU/readme.txt
+++ b/CodeT5/evaluator/CodeBLEU/readme.txt
--- a/CodeT5/evaluator/CodeBLEU/syntax_match.py
+++ b/CodeT5/evaluator/CodeBLEU/syntax_match.py
@@ -1,11 +1,8 @@
 # Copyright (c) Microsoft Corporation. 
 # Licensed under the MIT license.

-from evaluator.CodeBLEU.parser import DFG_python, DFG_java, DFG_ruby, DFG_go, DFG_php, DFG_javascript, DFG_csharp
-from evaluator.CodeBLEU.parser import (remove_comments_and_docstrings,
-                                       tree_to_token_index,
-                                       index_to_code_token,
-                                       tree_to_variable_index)
+from CodeT5.evaluator.CodeBLEU.parser import DFG_python, DFG_java, DFG_ruby, DFG_go, DFG_php, DFG_javascript, DFG_csharp
+from CodeT5.evaluator.CodeBLEU.parser import (remove_comments_and_docstrings)
 from tree_sitter import Language, Parser
 import os

--- a/CodeT5/evaluator/CodeBLEU/utils.py
+++ b/CodeT5/evaluator/CodeBLEU/utils.py
--- a/CodeT5/evaluator/CodeBLEU/weighted_ngram_match.py
+++ b/CodeT5/evaluator/CodeBLEU/weighted_ngram_match.py
@@ -18,8 +18,7 @@ from fractions import Fraction
 import warnings
 from collections import Counter

-from evaluator.CodeBLEU.utils import ngrams
-import pdb
+from CodeT5.evaluator.CodeBLEU.utils import ngrams


 def sentence_bleu(
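The CodeBLEU hunks above rewrite `from evaluator.CodeBLEU...` as `from CodeT5.evaluator.CodeBLEU...`. Under ordinary Python import rules, the new absolute form resolves only when the directory that contains `CodeT5/` is on `sys.path`. The sketch below illustrates this with a throwaway stand-in tree rather than the real repository; the `ngrams` body is a placeholder written for this example, not the actual CodeBLEU implementation.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a stand-in for the reorganized layout:
#   <root>/CodeT5/evaluator/CodeBLEU/utils.py
root = Path(tempfile.mkdtemp())
pkg = root / "CodeT5" / "evaluator" / "CodeBLEU"
pkg.mkdir(parents=True)
# Placeholder body; the real utils.ngrams is not reproduced here.
(pkg / "utils.py").write_text(
    "def ngrams(seq, n):\n"
    "    return list(zip(*[seq[i:] for i in range(n)]))\n"
)


def can_import(name):
    """Return True if `name` is importable with the current sys.path."""
    importlib.invalidate_caches()
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False


# Before the stand-in root is on sys.path, the new-style import fails.
before = can_import("CodeT5.evaluator.CodeBLEU.utils")

# After adding the root (the parent of CodeT5/), it resolves.
sys.path.insert(0, str(root))
after = can_import("CodeT5.evaluator.CodeBLEU.utils")

print(before, after)
```

In practice this suggests running the scripts from the repository root, or adding that root to `PYTHONPATH`, although the commit itself does not state how the scripts are meant to be launched.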
--- a/CodeT5/evaluator/init.py
+++ b/CodeT5/evaluator/init.py
--- a/CodeT5/evaluator/bleu.py
+++ b/CodeT5/evaluator/bleu.py
--- a/CodeT5/evaluator/smooth_bleu.py
+++ b/CodeT5/evaluator/smooth_bleu.py
--- a/CodeT5/models.py
+++ b/CodeT5/models.py
--- a/CodeT5/run_clone.py
+++ b/CodeT5/run_clone.py
--- a/CodeT5/run_defect.py
+++ b/CodeT5/run_defect.py
--- a/CodeT5/run_gen.py
+++ b/CodeT5/run_gen.py
--- a/CodeT5/run_multi_gen.py
+++ b/CodeT5/run_multi_gen.py
@@ -23,14 +23,11 @@ import os
 import torch
 import logging
 import argparse
-import math
 import numpy as np
 from tqdm import tqdm
 from itertools import cycle
 import multiprocessing
 import time
-import sys
-import pdb

 from torch.utils.tensorboard import SummaryWriter
 from torch.utils.data import DataLoader, SequentialSampler, RandomSampler
--- a/CodeT5/sh/exp_with_args.sh
+++ b/CodeT5/sh/exp_with_args.sh
--- a/CodeT5/sh/run_exp.py
+++ b/CodeT5/sh/run_exp.py
--- a/CodeT5/tokenizer/apply_tokenizer.py
+++ b/CodeT5/tokenizer/apply_tokenizer.py
--- a/CodeT5/tokenizer/salesforce/codet5-merges.txt
+++ b/CodeT5/tokenizer/salesforce/codet5-merges.txt
--- a/CodeT5/tokenizer/salesforce/codet5-vocab.json
+++ b/CodeT5/tokenizer/salesforce/codet5-vocab.json
--- a/CodeT5/tokenizer/train_tokenizer.py
+++ b/CodeT5/tokenizer/train_tokenizer.py
--- a/CodeT5/utils.py
+++ b/CodeT5/utils.py
--- a/README.md
+++ b/README.md
@@ -1,125 +1,54 @@
-# CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
+# CodeT5 and CodeT5+

-This is the official PyTorch implementation for the following EMNLP 2021 paper from Salesforce Research:
+Official research release for  **CodeT5** and **CodeT5+** models for a wide range of **Code Understanding and Generation** tasks from Salesforce Research.
+These open code LLMs are introduced by the following papers:

-**Title**: [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf)
+*Title*: [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf)

-**Authors**: [Yue Wang](https://yuewang-cuhk.github.io/), [Weishi Wang](https://www.linkedin.com/in/weishi-wang/)
-, [Shafiq Joty](https://raihanjoty.github.io/), and [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home)
+*Authors*: [Yue Wang](https://yuewang-cuhk.github.io/), [Weishi Wang](https://www.linkedin.com/in/weishi-wang/)
+, [Shafiq Joty](https://raihanjoty.github.io/), [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home)

-![CodeT5 demo](codet5.gif)
+*Title*: [CodeT5+: Open Code Large Language Models for Code Understanding and Generation](https://arxiv.org/pdf/2305.07922.pdf)

-## Updates
+*Authors*: [Yue Wang](https://yuewang-cuhk.github.io/)\*, [Hung Le](https://sites.google.com/view/henryle2018/home?pli=1)\*, [Akhilesh Deepak Gotmare](https://akhileshgotmare.github.io/), [Nghi D.Q. Bui](https://bdqnghi.github.io/), [Junnan Li](https://sites.google.com/site/junnanlics), [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home) (* indicates equal contribution)

-**July 06, 2022**

-We release two large-sized CodeT5 checkpoints at Hugging Face: [Salesforce/codet5-large](https://huggingface.co/Salesforce/codet5-large) and [Salesforce/codet5-large-ntp-py](https://huggingface.co/Salesforce/codet5-large-ntp-py), which are introduced by the paper: [CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning](https://arxiv.org/pdf/2207.01780.pdf) by Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, Steven C.H. Hoi.
-
-* CodeT5-large was pretrained using Masked Span Prediction (MSP) objective on CodeSearchNet and achieve new SOTA results on several CodeXGLUE benchmarks. The finetuned checkpoints are released at [here](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models). See Appendix A.1 of the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
-
-* CodeT5-large-ntp-py was first pretrained using Masked Span Prediction (MSP) objective on CodeSearchNet and GCPY (the Python split of [Github Code](https://huggingface.co/datasets/codeparrot/github-code) data), followed by another 10 epochs on GCPY using Next Token Prediction (NTP) objective. 
-
-CodeT5-large-ntp-py is especially optimized for Python code generation tasks and employed as the foundation model for our [CodeRL](https://github.com/salesforce/CodeRL), yielding new SOTA results on the APPS Python competition-level program synthesis benchmark. See the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
-
-**Oct 29, 2021**
-
-We release [fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models)
-for all the downstream tasks covered in the paper.
-
-**Oct 25, 2021**
-
-We release a CodeT5-base fine-tuned
-checkpoint ([Salesforce/codet5-base-multi-sum](https://huggingface.co/Salesforce/codet5-base-multi-sum)) for
-multilingual code summarzation. Below is how to use this model:
-
-```python
-from transformers import RobertaTokenizer, T5ForConditionalGeneration
-
-if __name__ == '__main__':
-    tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
-    model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base-multi-sum')
-
-    text = """def svg_to_image(string, size=None):
-    if isinstance(string, unicode):
-        string = string.encode('utf-8')
-        renderer = QtSvg.QSvgRenderer(QtCore.QByteArray(string))
-    if not renderer.isValid():
-        raise ValueError('Invalid SVG data.')
-    if size is None:
-        size = renderer.defaultSize()
-        image = QtGui.QImage(size, QtGui.QImage.Format_ARGB32)
-        painter = QtGui.QPainter(image)
-        renderer.render(painter)
-    return image"""
-
-    input_ids = tokenizer(text, return_tensors="pt").input_ids
-
-    generated_ids = model.generate(input_ids, max_length=20)
-    print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
-    # this prints: "Convert a SVG string to a QImage."
-```
-
-**Oct 18, 2021**
-
-We add a [model card](https://github.com/salesforce/CodeT5/blob/main/CodeT5_model_card.pdf) for CodeT5! Please reach out
-if you have any questions about it.
-
-**Sep 24, 2021**
-
-CodeT5 is now in [hugginface](https://huggingface.co/)!
-
-You can simply load the model ([CodeT5-small](https://huggingface.co/Salesforce/codet5-small)
-and [CodeT5-base](https://huggingface.co/Salesforce/codet5-base)) and do the inference:
-
-```python
-from transformers import RobertaTokenizer, T5ForConditionalGeneration
-
-tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
-model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')
-
-text = "def greet(user): print(f'hello <extra_id_0>!')"
-input_ids = tokenizer(text, return_tensors="pt").input_ids
-
-# simply generate one code span
-generated_ids = model.generate(input_ids, max_length=8)
-print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
-# this prints "{user.username}"
-```
-
-## Introduction
-
-This repo provides the code for reproducing the experiments
-in [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf)
-. CodeT5 is a new pre-trained encoder-decoder model for programming languages, which is pre-trained on **8.35M**
-functions in 8 programming languages (Python, Java, JavaScript, PHP, Ruby, Go, C, and C#). In total, it achieves
-state-of-the-art results on **14 sub-tasks** in a code intelligence benchmark - [CodeXGLUE](https://github.com/microsoft/CodeXGLUE).
-
-Paper link: https://arxiv.org/abs/2109.00859
-
-Blog link: https://blog.salesforceairesearch.com/codet5/
-
-The code currently includes two pre-trained checkpoints ([CodeT5-small](https://huggingface.co/Salesforce/codet5-small)
-and [CodeT5-base](https://huggingface.co/Salesforce/codet5-base)) and scripts to fine-tune them on 4 generation tasks (
-code summarization, code generation, translation, and refinement) plus 2 understanding tasks (code defect detection and
-clone detection) in CodeXGLUE. We also provide their fine-tuned checkpoints to facilitate the easy replication
-of our paper.
-
-In practice, CodeT5 can be deployed as an AI-powered coding assistant to boost the productivity of software developers.
-At Salesforce, we build an [AI coding assistant demo](https://github.com/salesforce/CodeT5/raw/main/codet5.gif) using
-CodeT5 as a VS Code plugin to provide three capabilities for Apex developers:
+In practice, CodeT5 and CodeT5+ models can be deployed as an AI-powered coding assistant to boost the productivity of software developers.
+At Salesforce, we build an AI coding assistant demo using CodeT5 as a VS Code plugin to provide three capabilities for Apex developers:

 - **Text-to-code generation**: generate code based on the natural language description.
 - **Code autocompletion**: complete the whole function of code given the target function name.
 - **Code summarization**: generate the summary of a function in natural language description.

-## Table of Contents
+![CodeT5 demo](./codet5.gif)
+
+## What's New: 🎉 
+
+**May 2023**
+
+**CodeT5+** Paper and models released! ([paper](https://arxiv.org/pdf/2305.07922.pdf), [code](https://github.com/salesforce/CodeT5/tree/main/CodeT5+))
+
+**July 2022**
+
+We release two large-sized CodeT5 checkpoints at Hugging Face: [Salesforce/codet5-large](https://huggingface.co/Salesforce/codet5-large) and [Salesforce/codet5-large-ntp-py](https://huggingface.co/Salesforce/codet5-large-ntp-py), which are introduced by the [CodeRL paper](https://arxiv.org/pdf/2207.01780.pdf).
+
+**Oct 2021**
+
+We release [fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models)
+for all the downstream tasks covered in the paper.
+We also release a CodeT5-base fine-tuned
+checkpoint ([Salesforce/codet5-base-multi-sum](https://huggingface.co/Salesforce/codet5-base-multi-sum)) for
+multilingual code summarization. 
+
+
+**Sep 2021**
+
+CodeT5 is now on [Hugging Face](https://huggingface.co/)! ([CodeT5-small](https://huggingface.co/Salesforce/codet5-small) and [CodeT5-base](https://huggingface.co/Salesforce/codet5-base)).
+
+We add a [model card](https://github.com/salesforce/CodeT5/blob/main/CodeT5_model_card.pdf) for CodeT5! Please reach out
+if you have any questions about it.
+

-1. [Citation](#citation)
-2. [License](#license)
-3. [Dependency](#dependency)
-4. [Download](#download)
-5. [Fine-tuning](#fine-tuning)
-6. [Get Involved](#get-involved)

 ## Citation

@ -130,15 +59,24 @@ If you find this code to be useful for your research, please consider citing:
    wang2021codet5,
    title={CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation}, 
    author={Wang, Yue and Wang, Weishi and Joty, Shafiq and Hoi, Steven C.H.},
-    booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021},
+    booktitle={EMNLP},
    year={2021},
 }

-@article{coderl2022,
-  title={CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
-  author={Le, Hung and Wang, Yue and Gotmare, Akhilesh Deepak and Savarese, Silvio and Hoi, Steven C. H.},
-  journal={arXiv preprint arXiv:2207.01780},
-  year={2022}
+@inproceedings{
+    le2022coderl,
+    title={CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
+    author={Le, Hung and Wang, Yue and Gotmare, Akhilesh Deepak and Savarese, Silvio and Hoi, Steven C. H.},
+    booktitle={NeurIPS},
+    year={2022}
+}
+
+@article{
+    wang2023codet5plus,
+    title={CodeT5+: Open Code Large Language Models for Code Understanding and Generation},
+    author={Wang, Yue and Le, Hung and Gotmare, Akhilesh Deepak and Bui, Nghi D.Q. and Li, Junnan and Hoi, Steven C. H.},
+    journal={arXiv preprint arXiv:2305.07922},
+    year={2023}
 }
 ```

@ -162,84 +100,6 @@ codeT5@salesforce.com, and to
 use [appropriate](https://arxiv.org/abs/1810.03993) [documentation](https://www.partnershiponai.org/about-ml/) when
 developing high-stakes applications of this model.

-## Dependency
-
- Pytorch 1.7.1
- tensorboard 2.4.1
- transformers 4.6.1
- tree-sitter 0.2.2
-
-## Download
-
-* [Pre-trained checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/pretrained_models)
-* [Fine-tuning data](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/data)
-* [Fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models)
-
-Instructions to download:
-
-```
-# pip install gsutil
-cd your-cloned-codet5-path
-
-gsutil -m cp -r "gs://sfr-codet5-data-research/pretrained_models" .
-gsutil -m cp -r "gs://sfr-codet5-data-research/data" .
-gsutil -m cp -r "gs://sfr-codet5-data-research/finetuned_models" .
-```
-
-## Fine-tuning
-
-Go to `sh` folder, set the `WORKDIR` in `exp_with_args.sh` to be your cloned CodeT5 repository path.
-
-You can use `run_exp.py` to run a broad set of experiments by simply passing the `model_tag`, `task`, and `sub_task`
-arguments. In total, we support five models (i.e., ['roberta', 'codebert', 'bart_base', 'codet5_small', 'codet5_base'])
-and six tasks (i.e., ['summarize', 'concode', 'translate', 'refine', 'defect', 'clone']). For each task, we use
-the `sub_task` to specify which specific datasets to fine-tune on. Below is the full list:
-
-| \--task   | \--sub\_task                       | Description                                                                                                                      |
-| --------- | ---------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- |
-| summarize | ruby/javascript/go/python/java/php | code summarization task on [CodeSearchNet](https://arxiv.org/abs/1909.09436) data with six PLs                                   |
-| concode   | none                               | text-to-code generation on [Concode](https://aclanthology.org/D18-1192.pdf) data                                                 |
-| translate | java-cs/cs-java                    | code-to-code translation between [Java and C#](https://arxiv.org/pdf/2102.04664.pdf)                                             |
-| refine    | small/medium                       | code refinement on [code repair data](https://arxiv.org/pdf/1812.08693.pdf) with small/medium functions                          |
-| defect    | none                               | code defect detection in [C/C++ data](https://proceedings.neurips.cc/paper/2019/file/49265d2447bc3bbfe9e76306ce40a31f-Paper.pdf) |
-| clone     | none                               | code clone detection in [Java data](https://arxiv.org/pdf/2002.08653.pdf)                                                        |
-
-For example, if you want to run CodeT5-base model on the code summarization task for Python, you can simply run:
-
-```
-python run_exp.py --model_tag codet5_base --task summarize --sub_task python
-```
-
-For multi-task training, you can type:
-
-```
-python run_exp.py --model_tag codet5_base --task multi_task --sub_task none
-```
-
-Besides, you can specify:
-
-```
-model_dir: where to save fine-tuning checkpoints
-res_dir: where to save the performance results 
-summary_dir: where to save the training curves
-data_num: how many data instances to use, the default -1 is for using the full data
-gpu: the index of the GPU to use in the cluster
-``` 
-
-You can also revise the suggested
-arguments [here](https://github.com/salesforce/CodeT5/blob/0bf3c0c43e92fcf54d9df68c793ac22f2b60aad4/sh/run_exp.py#L14) or directly customize the [exp_with_args.sh](https://github.com/salesforce/CodeT5/blob/main/sh/exp_with_args.sh) bash file.
-Please refer to the argument flags in [configs.py](https://github.com/salesforce/CodeT5/blob/main/configs.py) for the full
-available options. The saved training curves in `summary_dir` can be visualized using [tensorboard](https://pypi.org/project/tensorboard/).
-Note that we employ one A100 GPU for all fine-tuning experiments.
-
-### How to reproduce the results using the released finetuned checkpoints?
-
-* Remove the `--do_train --do_eval --do_eval_bleu` and reserve only `--do_test` at [here](https://github.com/salesforce/CodeT5/blob/5b37c34f4bbbfcfd972c24a9dd1f45716568ecb5/sh/exp_with_args.sh#L84). 
-* Pass the path of your downloaded finetuned checkpoint to load at [here](https://github.com/salesforce/CodeT5/blob/5b37c34f4bbbfcfd972c24a9dd1f45716568ecb5/run_gen.py#L366), e.g., `file = "CodeT5/finetuned_models/summarize_python_codet5_base.bin"`
-* Run the program: `python run_exp.py --model_tag codet5_base --task summarize --sub_task python`
-
-### How to fine-tune on your own task and dataset?
-If you want to fine-tune on your dataset, you can add your own task and sub_task in `configs.py` ([here](https://github.com/salesforce/CodeT5/blob/d27512d23ba6130e089e571d8c3e399760db1c31/configs.py#L11)) and add your data path and the function to read in `utils.py` ([here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L103) and [here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L149)). The read function can be implemented in `_utils.py` similar to [this one](https://github.com/salesforce/CodeT5/blob/aaf9c4a920c4986abfd54a74f5456b056b6409e0/_utils.py#L213). If your task to add is a generation task, you can simply reuse or customize the `run_gen.py`. For understanding tasks, please refer to `run_defect.py` and `run_clone.py`.

 ## Get Involved

--- a/codet5.gif
+++ b/codet5.gif