reorganize the repo

2024-10-01 06:35:38 -04:00 · 2023-05-17 17:34:00 +08:00 · 2023-05-17 17:34:00 +08:00 · 71ccd12773
commit 71ccd12773
parent d5c9d81af1
39 changed files with 297 additions and 224 deletions
--- a/CodeT5+/README.md
+++ b/CodeT5+/README.md
@@ -7,43 +7,61 @@ Official research release for the **CodeT5+** models (`220M`, `770M`, `2B`, `6B`
 *Authors*: [Yue Wang](https://yuewang-cuhk.github.io/)\*, [Hung Le](https://sites.google.com/view/henryle2018/home?pli=1)\*, [Akhilesh Deepak Gotmare](https://akhileshgotmare.github.io/), [Nghi D.Q. Bui](https://bdqnghi.github.io/), [Junnan Li](https://sites.google.com/site/junnanlics), [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home) (* indicates equal contribution)
 # What is this about?
-CodeT5+ is a new family of open code large language models with an encoder-decoder architecture that can flexibly operate in different modes (i.e. _encoder-only_, _decoder-only_, and _encoder-decoder_) to support a wide range of code understanding and generation tasks. 
+CodeT5+ is a new family of open code large language models with an encoder-decoder architecture that can flexibly operate in different modes (i.e. _encoder-only_, _decoder-only_, and _encoder-decoder_) to support a wide range of code understanding and generation tasks.
 See the below overview of CodeT5+.
 ![CodeT5+ overview](codet5p_overview.png)
 To train CodeT5+, we introduce a diverse set of pretraining tasks including _span denoising_, _causal language modeling_, _contrastive learning_, and _text-code matching_ to learn rich representations from both unimodal code data and bimodal code-text data. 
 Additionally, to efficiently scale up the model, we propose a simple yet effective _compute-efficient pretraining_ method to initialize our model with frozen off-the-shelf LLMs such as [CodeGen](https://github.com/salesforce/CodeGen). 
 Furthermore, we explore instruction tuning to align the model with natural language instructions following [Code Alpaca](https://github.com/sahil280114/codealpaca). 
 ![CodeT5+ architecture](codet5p_architecture.png)
 We implemented a family of CodeT5+ models, with model size ranging from 220M to 16B. 
-Note that CodeT5+ 220M and 770M employ the same architecture of CodeT5-base and large respectively and are pretrained from scratch, while CodeT5+ 2B, 6B, 16B employ a "_shallow encoder and deep decoder_" architecture with the shallow encoder initialized from CodeGen-mono 350M and the deep decoder initialized from CodeGen-mono 2B, 6B, 16B, respectively.
+Note that CodeT5+ `220M` and `770M` employ the same architecture of CodeT5-base and large respectively and are pretrained from scratch, while CodeT5+ `2B`, `6B`, `16B` employ a "_shallow encoder and deep decoder_" architecture with the shallow encoder initialized from CodeGen-mono 350M and the deep decoder initialized from CodeGen-mono 2B, 6B, 16B, respectively.
 InstructCodeT5+ 16B is our instruction-tuned model from CodeT5+ 16B. 
-![CodeT5+ overview](codet5p_overview.png)
+
 # Released Models
-We release the following CodeT5+ models:
+We release the following CodeT5+ models at Huggingface:
-* CodeT5+ `220M` and `770M` at Huggingface [here](https://huggingface.co/Salesforce/codet5p-220m) and [here](https://huggingface.co/Salesforce/codet5p-770m), respectively.
+* CodeT5+ `220M` and `770M`: [Salesforce/codet5p-220m](https://huggingface.co/Salesforce/codet5p-220m) and [codet5p-770m](https://huggingface.co/Salesforce/codet5p-770m).
-* CodeT5+ `220M` and `770M` that are further tuned on Python subset at Huggingface [here](https://huggingface.co/Salesforce/codet5p-220m-py) and [here](https://huggingface.co/Salesforce/codet5p-770m-py), respectively.
+* CodeT5+ `220M` and `770M` that are further tuned on Python subset: [codet5p-220m-py](https://huggingface.co/Salesforce/codet5p-220m-py) and [codet5p-770m-py](https://huggingface.co/Salesforce/codet5p-770m-py).
-* CodeT5+ `2B`, `6B`, `16B` will be released soon.
+* CodeT5+ `2B`, `6B`, `16B`: [Salesforce/codet5p-2b](https://huggingface.co/Salesforce/codet5p-2b), [Salesforce/codet5p-6b](https://huggingface.co/Salesforce/codet5p-6b), and [Salesforce/codet5p-16b](https://huggingface.co/Salesforce/codet5p-16b).
 * InstructCodeT5+ `16B`: [Salesforce/instructcodet5p-16b](https://huggingface.co/Salesforce/instructcodet5p-16b).
 # How to Use?
-CodeT5+ `220M` and `770M` models can be easily loaded using the `T5ForConditionalGeneration` functionality. They employ the same tokenizer as the original [CodeT5](https://github.com/salesforce/CodeT5).
+All CodeT5+ models and tokenizers can be easily loaded using the `AutoModelForSeq2SeqLM` and `AutoTokenizer` functionality. 
 For tokenizers, CodeT5+ `220M` and `770M` employ the same tokenizer as the original [CodeT5](https://github.com/salesforce/CodeT5) while CodeT5+ `2B`, `6B`, `16B` employ the same tokenizer as [CodeGen](https://github.com/salesforce/CodeGen).
 To load CodeT5+ `2B`, `6B`, `16B`, please set `trust_remote_code=True` as the [model class](https://huggingface.co/Salesforce/codet5p-16b/blob/main/modeling_codet5p.py) is defined in the Huggingface repo.
 ```python
-from transformers import T5ForConditionalGeneration, AutoTokenizer
+from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
 import torch
-checkpoint = "Salesforce/codet5p-770m-py"
+checkpoint = "Salesforce/instructcodet5p-16b"
 device = "cuda" # for GPU usage or "cpu" for CPU usage
 tokenizer = AutoTokenizer.from_pretrained(checkpoint)
-model = T5ForConditionalGeneration.from_pretrained(checkpoint).to(device)
+model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint,
                                              torch_dtype=torch.float16,
                                              low_cpu_mem_usage=True,
                                              trust_remote_code=True).to(device)
-inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
+inputs = tokenizer.encode("def print_hello():", return_tensors="pt").to(device)
-outputs = model.generate(inputs, max_length=10)
+outputs = model.generate(inputs, max_length=12)
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
-# ==>     print('Hello World!')
+
 ```
 # Reproduce the Results
 ## HumanEval
 TBA
 ## Citation
 ```bibtex
--- a/CodeT5+/codet5p_architecture.png
+++ b/CodeT5+/codet5p_architecture.png
--- a/CodeT5.png
+++ b/CodeT5.png
--- a/CodeT5/README.md
+++ b/CodeT5/README.md
@@ -0,0 +1,203 @@
 # CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
 This is the official PyTorch implementation for the following EMNLP 2021 paper from Salesforce Research:
 **Title**: [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf)
 **Authors**: [Yue Wang](https://yuewang-cuhk.github.io/), [Weishi Wang](https://www.linkedin.com/in/weishi-wang/)
 , [Shafiq Joty](https://raihanjoty.github.io/), and [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home)
 ![CodeT5 demo](../codet5.gif)
 ## Updates
 **July 06, 2022**
 We release two large-sized CodeT5 checkpoints at Hugging Face: [Salesforce/codet5-large](https://huggingface.co/Salesforce/codet5-large) and [Salesforce/codet5-large-ntp-py](https://huggingface.co/Salesforce/codet5-large-ntp-py), which are introduced by the paper: [CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning](https://arxiv.org/pdf/2207.01780.pdf) by Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, Steven C.H. Hoi.
 * CodeT5-large was pretrained using the Masked Span Prediction (MSP) objective on CodeSearchNet and achieves new SOTA results on several CodeXGLUE benchmarks. The finetuned checkpoints are released [here](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models). See Appendix A.1 of the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
 * CodeT5-large-ntp-py was first pretrained using Masked Span Prediction (MSP) objective on CodeSearchNet and GCPY (the Python split of [Github Code](https://huggingface.co/datasets/codeparrot/github-code) data), followed by another 10 epochs on GCPY using Next Token Prediction (NTP) objective. 
 CodeT5-large-ntp-py is especially optimized for Python code generation tasks and employed as the foundation model for our [CodeRL](https://github.com/salesforce/CodeRL), yielding new SOTA results on the APPS Python competition-level program synthesis benchmark. See the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
 **Oct 29, 2021**
 We release [fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models)
 for all the downstream tasks covered in the paper.
 **Oct 25, 2021**
 We release a CodeT5-base fine-tuned
 checkpoint ([Salesforce/codet5-base-multi-sum](https://huggingface.co/Salesforce/codet5-base-multi-sum)) for
 multilingual code summarization. Below is how to use this model:
 ```python
 from transformers import RobertaTokenizer, T5ForConditionalGeneration
 if __name__ == '__main__':
    tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
    model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base-multi-sum')
    text = """def svg_to_image(string, size=None):
    if isinstance(string, unicode):
        string = string.encode('utf-8')
        renderer = QtSvg.QSvgRenderer(QtCore.QByteArray(string))
    if not renderer.isValid():
        raise ValueError('Invalid SVG data.')
    if size is None:
        size = renderer.defaultSize()
        image = QtGui.QImage(size, QtGui.QImage.Format_ARGB32)
        painter = QtGui.QPainter(image)
        renderer.render(painter)
    return image"""
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    generated_ids = model.generate(input_ids, max_length=20)
    print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
    # this prints: "Convert a SVG string to a QImage."
 ```
 **Oct 18, 2021**
 We add a [model card](https://github.com/salesforce/CodeT5/blob/main/CodeT5_model_card.pdf) for CodeT5! Please reach out
 if you have any questions about it.
 **Sep 24, 2021**
 CodeT5 is now on [Hugging Face](https://huggingface.co/)!
 You can simply load the model ([CodeT5-small](https://huggingface.co/Salesforce/codet5-small)
 and [CodeT5-base](https://huggingface.co/Salesforce/codet5-base)) and do the inference:
 ```python
 from transformers import RobertaTokenizer, T5ForConditionalGeneration
 tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
 model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')
 text = "def greet(user): print(f'hello <extra_id_0>!')"
 input_ids = tokenizer(text, return_tensors="pt").input_ids
 # simply generate one code span
 generated_ids = model.generate(input_ids, max_length=8)
 print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
 # this prints "{user.username}"
 ```
 ## Introduction
 This repo provides the code for reproducing the experiments
 in [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf)
 . CodeT5 is a new pre-trained encoder-decoder model for programming languages, which is pre-trained on **8.35M**
 functions in 8 programming languages (Python, Java, JavaScript, PHP, Ruby, Go, C, and C#). In total, it achieves
 state-of-the-art results on **14 sub-tasks** in a code intelligence benchmark - [CodeXGLUE](https://github.com/microsoft/CodeXGLUE).
 Paper link: https://arxiv.org/abs/2109.00859
 Blog link: https://blog.salesforceairesearch.com/codet5/
 The code currently includes two pre-trained checkpoints ([CodeT5-small](https://huggingface.co/Salesforce/codet5-small)
 and [CodeT5-base](https://huggingface.co/Salesforce/codet5-base)) and scripts to fine-tune them on 4 generation tasks (
 code summarization, code generation, translation, and refinement) plus 2 understanding tasks (code defect detection and
 clone detection) in CodeXGLUE. We also provide their fine-tuned checkpoints to facilitate the easy replication
 of our paper.
 In practice, CodeT5 can be deployed as an AI-powered coding assistant to boost the productivity of software developers.
 At Salesforce, we build an [AI coding assistant demo](https://github.com/salesforce/CodeT5/raw/main/codet5.gif) using
 CodeT5 as a VS Code plugin to provide three capabilities for Apex developers:
 - **Text-to-code generation**: generate code based on the natural language description.
 - **Code autocompletion**: complete the whole function of code given the target function name.
 - **Code summarization**: generate the summary of a function in natural language description.
 ## Table of Contents
 1. [Dependency](#dependency)
 2. [Download](#download)
 3. [Fine-tuning](#fine-tuning)
 ## Dependency
 - Pytorch 1.7.1
 - tensorboard 2.4.1
 - transformers 4.6.1
 - tree-sitter 0.2.2
 ## Download
 * [Pre-trained checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/pretrained_models)
 * [Fine-tuning data](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/data)
 * [Fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models)
 Instructions to download:
 ```
 # pip install gsutil
 cd your-cloned-codet5-path
 gsutil -m cp -r "gs://sfr-codet5-data-research/pretrained_models" .
 gsutil -m cp -r "gs://sfr-codet5-data-research/data" .
 gsutil -m cp -r "gs://sfr-codet5-data-research/finetuned_models" .
 ```
 ## Fine-tuning
 Go to the `sh` folder and set `WORKDIR` in `exp_with_args.sh` to your cloned CodeT5 repository path.
 You can use `run_exp.py` to run a broad set of experiments by simply passing the `model_tag`, `task`, and `sub_task`
 arguments. In total, we support five models (i.e., ['roberta', 'codebert', 'bart_base', 'codet5_small', 'codet5_base'])
 and six tasks (i.e., ['summarize', 'concode', 'translate', 'refine', 'defect', 'clone']). For each task, we use
 the `sub_task` to specify which datasets to fine-tune on. Below is the full list:
 | \--task   | \--sub\_task                       | Description                                                                                                                      |
 | --------- | ---------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- |
 | summarize | ruby/javascript/go/python/java/php | code summarization task on [CodeSearchNet](https://arxiv.org/abs/1909.09436) data with six PLs                                   |
 | concode   | none                               | text-to-code generation on [Concode](https://aclanthology.org/D18-1192.pdf) data                                                 |
 | translate | java-cs/cs-java                    | code-to-code translation between [Java and C#](https://arxiv.org/pdf/2102.04664.pdf)                                             |
 | refine    | small/medium                       | code refinement on [code repair data](https://arxiv.org/pdf/1812.08693.pdf) with small/medium functions                          |
 | defect    | none                               | code defect detection in [C/C++ data](https://proceedings.neurips.cc/paper/2019/file/49265d2447bc3bbfe9e76306ce40a31f-Paper.pdf) |
 | clone     | none                               | code clone detection in [Java data](https://arxiv.org/pdf/2002.08653.pdf)                                                        |
 For example, if you want to run CodeT5-base model on the code summarization task for Python, you can simply run:
 ```
 python run_exp.py --model_tag codet5_base --task summarize --sub_task python
 ```
 For multi-task training, you can type:
 ```
 python run_exp.py --model_tag codet5_base --task multi_task --sub_task none
 ```
 Besides, you can specify:
 ```
 model_dir: where to save fine-tuning checkpoints
 res_dir: where to save the performance results 
 summary_dir: where to save the training curves
 data_num: how many data instances to use, the default -1 is for using the full data
 gpu: the index of the GPU to use in the cluster
 ``` 
 You can also revise the suggested
 arguments [here](https://github.com/salesforce/CodeT5/blob/0bf3c0c43e92fcf54d9df68c793ac22f2b60aad4/sh/run_exp.py#L14) or directly customize the [exp_with_args.sh](https://github.com/salesforce/CodeT5/blob/main/sh/exp_with_args.sh) bash file.
 Please refer to the argument flags in [configs.py](https://github.com/salesforce/CodeT5/blob/main/configs.py) for the full
 available options. The saved training curves in `summary_dir` can be visualized using [tensorboard](https://pypi.org/project/tensorboard/).
 Note that we employ one A100 GPU for all fine-tuning experiments.
 ### How to reproduce the results using the released finetuned checkpoints?
 * Remove `--do_train --do_eval --do_eval_bleu` and keep only `--do_test` [here](https://github.com/salesforce/CodeT5/blob/5b37c34f4bbbfcfd972c24a9dd1f45716568ecb5/sh/exp_with_args.sh#L84).
 * Pass the path of your downloaded finetuned checkpoint to load at [here](https://github.com/salesforce/CodeT5/blob/5b37c34f4bbbfcfd972c24a9dd1f45716568ecb5/run_gen.py#L366), e.g., `file = "CodeT5/finetuned_models/summarize_python_codet5_base.bin"`
 * Run the program: `python run_exp.py --model_tag codet5_base --task summarize --sub_task python`
 ### How to fine-tune on your own task and dataset?
 If you want to fine-tune on your dataset, you can add your own task and sub_task in `configs.py` ([here](https://github.com/salesforce/CodeT5/blob/d27512d23ba6130e089e571d8c3e399760db1c31/configs.py#L11)) and add your data path and the function to read in `utils.py` ([here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L103) and [here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L149)). The read function can be implemented in `_utils.py` similar to [this one](https://github.com/salesforce/CodeT5/blob/aaf9c4a920c4986abfd54a74f5456b056b6409e0/_utils.py#L213). If your task to add is a generation task, you can simply reuse or customize the `run_gen.py`. For understanding tasks, please refer to `run_defect.py` and `run_clone.py`.
 ## Get Involved
 Please create a GitHub issue if you have any questions, suggestions, requests or bug-reports. We welcome PRs!
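The `--model_tag`, `--task`, and `--sub_task` combinations listed in the relocated README above can be checked before launching a run. The following is a minimal sketch, assuming only the values printed in the README text and table; `build_command` is a hypothetical helper that is not part of the repository, and whether `multi_task` accepts every model tag is not confirmed by the source.

```python
import shlex

# Values copied from the README text and table above; nothing is read from run_exp.py.
MODEL_TAGS = {"roberta", "codebert", "bart_base", "codet5_small", "codet5_base"}
SUB_TASKS = {
    "summarize": {"ruby", "javascript", "go", "python", "java", "php"},
    "concode": {"none"},
    "translate": {"java-cs", "cs-java"},
    "refine": {"small", "medium"},
    "defect": {"none"},
    "clone": {"none"},
    "multi_task": {"none"},  # taken from the multi-task example command
}


def build_command(model_tag: str, task: str, sub_task: str = "none") -> str:
    """Hypothetical helper: validate the arguments, then format the run_exp.py call."""
    if model_tag not in MODEL_TAGS:
        raise ValueError(f"unknown model_tag: {model_tag!r}")
    if task not in SUB_TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if sub_task not in SUB_TASKS[task]:
        raise ValueError(f"sub_task {sub_task!r} is not listed for task {task!r}")
    args = ["python", "run_exp.py",
            "--model_tag", model_tag,
            "--task", task,
            "--sub_task", sub_task]
    return shlex.join(args)


if __name__ == "__main__":
    print(build_command("codet5_base", "summarize", "python"))
    print(build_command("codet5_base", "translate", "java-cs"))
```

Rejecting an unlisted pair up front (for example `translate` with `python`) avoids starting a job that would only fail once data loading begins.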
--- a/CodeT5/_utils.py
+++ b/CodeT5/_utils.py
--- a/CodeT5/configs.py
+++ b/CodeT5/configs.py
--- a/CodeT5/evaluator/CodeBLEU/__init__.py
+++ b/CodeT5/evaluator/CodeBLEU/__init__.py
--- a/CodeT5/evaluator/CodeBLEU/bleu.py
+++ b/CodeT5/evaluator/CodeBLEU/bleu.py
@@ -15,7 +15,7 @@ from fractions import Fraction
 import warnings
 from collections import Counter
-from evaluator.CodeBLEU.utils import ngrams
+from CodeT5.evaluator.CodeBLEU.utils import ngrams
 def sentence_bleu(
--- a/CodeT5/evaluator/CodeBLEU/calc_code_bleu.py
+++ b/CodeT5/evaluator/CodeBLEU/calc_code_bleu.py
@@ -5,7 +5,7 @@
 # -*- coding:utf-8 -*-
 import argparse
 import os
-from evaluator.CodeBLEU import bleu, weighted_ngram_match, syntax_match, dataflow_match
+from CodeT5.evaluator.CodeBLEU import weighted_ngram_match, bleu, dataflow_match, syntax_match
 def get_codebleu(refs, hyp, lang, params='0.25,0.25,0.25,0.25'):
--- a/CodeT5/evaluator/CodeBLEU/dataflow_match.py
+++ b/CodeT5/evaluator/CodeBLEU/dataflow_match.py
@@ -1,11 +1,10 @@
 # Copyright (c) Microsoft Corporation. 
 # Licensed under the MIT license.
-from evaluator.CodeBLEU.parser import DFG_python, DFG_java, DFG_ruby, DFG_go, DFG_php, DFG_javascript, DFG_csharp
+from CodeT5.evaluator.CodeBLEU.parser import DFG_python, DFG_java, DFG_ruby, DFG_go, DFG_php, DFG_javascript, DFG_csharp
-from evaluator.CodeBLEU.parser import (remove_comments_and_docstrings,
+from CodeT5.evaluator.CodeBLEU.parser import (remove_comments_and_docstrings,
-                                       tree_to_token_index,
+                                              tree_to_token_index,
-                                       index_to_code_token,
+                                              index_to_code_token)
-                                       tree_to_variable_index)
 from tree_sitter import Language, Parser
 import os
--- a/CodeT5/evaluator/CodeBLEU/keywords/c_sharp.txt
+++ b/CodeT5/evaluator/CodeBLEU/keywords/c_sharp.txt
--- a/CodeT5/evaluator/CodeBLEU/keywords/java.txt
+++ b/CodeT5/evaluator/CodeBLEU/keywords/java.txt
--- a/CodeT5/evaluator/CodeBLEU/parser/DFG.py
+++ b/CodeT5/evaluator/CodeBLEU/parser/DFG.py
--- a/CodeT5/evaluator/CodeBLEU/parser/__init__.py
+++ b/CodeT5/evaluator/CodeBLEU/parser/__init__.py
--- a/CodeT5/evaluator/CodeBLEU/parser/build.py
+++ b/CodeT5/evaluator/CodeBLEU/parser/build.py
--- a/CodeT5/evaluator/CodeBLEU/parser/build.sh
+++ b/CodeT5/evaluator/CodeBLEU/parser/build.sh
--- a/CodeT5/evaluator/CodeBLEU/parser/my-languages.so
+++ b/CodeT5/evaluator/CodeBLEU/parser/my-languages.so
--- a/CodeT5/evaluator/CodeBLEU/parser/utils.py
+++ b/CodeT5/evaluator/CodeBLEU/parser/utils.py
--- a/CodeT5/evaluator/CodeBLEU/readme.txt
+++ b/CodeT5/evaluator/CodeBLEU/readme.txt
--- a/CodeT5/evaluator/CodeBLEU/syntax_match.py
+++ b/CodeT5/evaluator/CodeBLEU/syntax_match.py
@@ -1,11 +1,8 @@
 # Copyright (c) Microsoft Corporation. 
 # Licensed under the MIT license.
-from evaluator.CodeBLEU.parser import DFG_python, DFG_java, DFG_ruby, DFG_go, DFG_php, DFG_javascript, DFG_csharp
+from CodeT5.evaluator.CodeBLEU.parser import DFG_python, DFG_java, DFG_ruby, DFG_go, DFG_php, DFG_javascript, DFG_csharp
-from evaluator.CodeBLEU.parser import (remove_comments_and_docstrings,
+from CodeT5.evaluator.CodeBLEU.parser import (remove_comments_and_docstrings)
-                                       tree_to_token_index,
-                                       index_to_code_token,
-                                       tree_to_variable_index)
 from tree_sitter import Language, Parser
 import os
--- a/CodeT5/evaluator/CodeBLEU/utils.py
+++ b/CodeT5/evaluator/CodeBLEU/utils.py
--- a/CodeT5/evaluator/CodeBLEU/weighted_ngram_match.py
+++ b/CodeT5/evaluator/CodeBLEU/weighted_ngram_match.py
@@ -18,8 +18,7 @@ from fractions import Fraction
 import warnings
 from collections import Counter
-from evaluator.CodeBLEU.utils import ngrams
+from CodeT5.evaluator.CodeBLEU.utils import ngrams
 import pdb
 def sentence_bleu(
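The import rewrites in the CodeBLEU files above (`evaluator.CodeBLEU...` to `CodeT5.evaluator.CodeBLEU...`) change which directory must be on `sys.path`: before, the directory containing `evaluator/`; after, the directory containing `CodeT5/`, which is the repository root. A minimal sketch of that effect, assuming a throwaway directory tree with a stub `ngrams` and `__init__.py` files at each level; `build_layout` and `can_import` are hypothetical helpers, not code from this commit.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_layout(root: Path) -> None:
    """Create a stand-in tree: <root>/CodeT5/evaluator/CodeBLEU/utils.py."""
    pkg = root / "CodeT5" / "evaluator" / "CodeBLEU"
    pkg.mkdir(parents=True)
    for directory in (root / "CodeT5", root / "CodeT5" / "evaluator", pkg):
        (directory / "__init__.py").write_text("")
    # Stub only; the real utils.ngrams is not reproduced here.
    (pkg / "utils.py").write_text(
        "def ngrams(seq, n):\n"
        "    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]\n"
    )


def can_import(module_name: str, path_entry: Path) -> bool:
    """Return True if module_name resolves with path_entry prepended to sys.path."""
    saved_path = list(sys.path)
    saved_modules = set(sys.modules)
    sys.path.insert(0, str(path_entry))
    importlib.invalidate_caches()
    try:
        importlib.import_module(module_name)
        return True
    except ModuleNotFoundError:
        return False
    finally:
        # Restore the interpreter state so each check starts clean.
        sys.path[:] = saved_path
        for name in set(sys.modules) - saved_modules:
            del sys.modules[name]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        build_layout(root)
        for name in ("CodeT5.evaluator.CodeBLEU.utils", "evaluator.CodeBLEU.utils"):
            for entry in (root, root / "CodeT5"):
                print(f"{name} with {entry.name or 'root'} on sys.path:",
                      can_import(name, entry))
```

Under these assumptions, the new-style path resolves only when the stand-in repository root is on `sys.path`, and the old-style path only when the `CodeT5/` directory itself is.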
--- a/CodeT5/evaluator/__init__.py
+++ b/CodeT5/evaluator/__init__.py
--- a/CodeT5/evaluator/bleu.py
+++ b/CodeT5/evaluator/bleu.py
--- a/CodeT5/evaluator/smooth_bleu.py
+++ b/CodeT5/evaluator/smooth_bleu.py
--- a/CodeT5/models.py
+++ b/CodeT5/models.py
--- a/CodeT5/run_clone.py
+++ b/CodeT5/run_clone.py
--- a/CodeT5/run_defect.py
+++ b/CodeT5/run_defect.py
--- a/CodeT5/run_gen.py
+++ b/CodeT5/run_gen.py
--- a/CodeT5/run_multi_gen.py
+++ b/CodeT5/run_multi_gen.py
@@ -23,14 +23,11 @@ import os
 import torch
 import logging
 import argparse
 import math
 import numpy as np
 from tqdm import tqdm
 from itertools import cycle
 import multiprocessing
 import time
 import sys
 import pdb
 from torch.utils.tensorboard import SummaryWriter
 from torch.utils.data import DataLoader, SequentialSampler, RandomSampler
--- a/CodeT5/sh/exp_with_args.sh
+++ b/CodeT5/sh/exp_with_args.sh
--- a/CodeT5/sh/run_exp.py
+++ b/CodeT5/sh/run_exp.py
--- a/CodeT5/tokenizer/apply_tokenizer.py
+++ b/CodeT5/tokenizer/apply_tokenizer.py
--- a/CodeT5/tokenizer/salesforce/codet5-merges.txt
+++ b/CodeT5/tokenizer/salesforce/codet5-merges.txt
--- a/CodeT5/tokenizer/salesforce/codet5-vocab.json
+++ b/CodeT5/tokenizer/salesforce/codet5-vocab.json
--- a/CodeT5/tokenizer/train_tokenizer.py
+++ b/CodeT5/tokenizer/train_tokenizer.py
--- a/CodeT5/utils.py
+++ b/CodeT5/utils.py
--- a/README.md
+++ b/README.md
@@ -1,125 +1,54 @@
-# CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
+# CodeT5 and CodeT5+
-This is the official PyTorch implementation for the following EMNLP 2021 paper from Salesforce Research:
+Official research release for  **CodeT5** and **CodeT5+** models for a wide range of **Code Understanding and Generation** tasks from Salesforce Research.
 These open code LLMs are introduced by the following papers:
-**Title**: [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf)
+*Title*: [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf)
-**Authors**: [Yue Wang](https://yuewang-cuhk.github.io/), [Weishi Wang](https://www.linkedin.com/in/weishi-wang/)
+*Authors*: [Yue Wang](https://yuewang-cuhk.github.io/), [Weishi Wang](https://www.linkedin.com/in/weishi-wang/)
-, [Shafiq Joty](https://raihanjoty.github.io/), and [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home)
+, [Shafiq Joty](https://raihanjoty.github.io/), [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home)
-![CodeT5 demo](codet5.gif)
+*Title*: [CodeT5+: Open Code Large Language Models for Code Understanding and Generation](https://arxiv.org/pdf/2305.07922.pdf)
-## Updates
+*Authors*: [Yue Wang](https://yuewang-cuhk.github.io/)\*, [Hung Le](https://sites.google.com/view/henryle2018/home?pli=1)\*, [Akhilesh Deepak Gotmare](https://akhileshgotmare.github.io/), [Nghi D.Q. Bui](https://bdqnghi.github.io/), [Junnan Li](https://sites.google.com/site/junnanlics), [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home) (* indicates equal contribution)
 **July 06, 2022**
-We release two large-sized CodeT5 checkpoints at Hugging Face: [Salesforce/codet5-large](https://huggingface.co/Salesforce/codet5-large) and [Salesforce/codet5-large-ntp-py](https://huggingface.co/Salesforce/codet5-large-ntp-py), which are introduced by the paper: [CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning](https://arxiv.org/pdf/2207.01780.pdf) by Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, Steven C.H. Hoi.
+In practice, CodeT5 and CodeT5+ models can be deployed as an AI-powered coding assistant to boost the productivity of software developers.
-
+At Salesforce, we build an AI coding assistant demo using CodeT5 as a VS Code plugin to provide three capabilities for Apex developers:
 * CodeT5-large was pretrained using the Masked Span Prediction (MSP) objective on CodeSearchNet and achieves new SOTA results on several CodeXGLUE benchmarks. The finetuned checkpoints are released [here](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models). See Appendix A.1 of the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
 * CodeT5-large-ntp-py was first pretrained using Masked Span Prediction (MSP) objective on CodeSearchNet and GCPY (the Python split of [Github Code](https://huggingface.co/datasets/codeparrot/github-code) data), followed by another 10 epochs on GCPY using Next Token Prediction (NTP) objective. 
 CodeT5-large-ntp-py is especially optimized for Python code generation tasks and employed as the foundation model for our [CodeRL](https://github.com/salesforce/CodeRL), yielding new SOTA results on the APPS Python competition-level program synthesis benchmark. See the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
 **Oct 29, 2021**
 We release [fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models)
 for all the downstream tasks covered in the paper.
 **Oct 25, 2021**
 We release a CodeT5-base fine-tuned
 checkpoint ([Salesforce/codet5-base-multi-sum](https://huggingface.co/Salesforce/codet5-base-multi-sum)) for
 multilingual code summarization. Below is how to use this model:
 ```python
 from transformers import RobertaTokenizer, T5ForConditionalGeneration
 if __name__ == '__main__':
    tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
    model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base-multi-sum')
    text = """def svg_to_image(string, size=None):
    if isinstance(string, unicode):
        string = string.encode('utf-8')
        renderer = QtSvg.QSvgRenderer(QtCore.QByteArray(string))
    if not renderer.isValid():
        raise ValueError('Invalid SVG data.')
    if size is None:
        size = renderer.defaultSize()
        image = QtGui.QImage(size, QtGui.QImage.Format_ARGB32)
        painter = QtGui.QPainter(image)
        renderer.render(painter)
    return image"""
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    generated_ids = model.generate(input_ids, max_length=20)
    print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
    # this prints: "Convert a SVG string to a QImage."
 ```
 **Oct 18, 2021**
 We add a [model card](https://github.com/salesforce/CodeT5/blob/main/CodeT5_model_card.pdf) for CodeT5! Please reach out
 if you have any questions about it.
 **Sep 24, 2021**
 CodeT5 is now on [Hugging Face](https://huggingface.co/)!
 You can simply load the model ([CodeT5-small](https://huggingface.co/Salesforce/codet5-small)
 and [CodeT5-base](https://huggingface.co/Salesforce/codet5-base)) and do the inference:
 ```python
 from transformers import RobertaTokenizer, T5ForConditionalGeneration
 tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
 model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')
 text = "def greet(user): print(f'hello <extra_id_0>!')"
 input_ids = tokenizer(text, return_tensors="pt").input_ids
 # simply generate one code span
 generated_ids = model.generate(input_ids, max_length=8)
 print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
 # this prints "{user.username}"
 ```
 ## Introduction
 This repo provides the code for reproducing the experiments
 in [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf)
 . CodeT5 is a new pre-trained encoder-decoder model for programming languages, which is pre-trained on **8.35M**
 functions in 8 programming languages (Python, Java, JavaScript, PHP, Ruby, Go, C, and C#). In total, it achieves
 state-of-the-art results on **14 sub-tasks** in a code intelligence benchmark - [CodeXGLUE](https://github.com/microsoft/CodeXGLUE).
 Paper link: https://arxiv.org/abs/2109.00859
 Blog link: https://blog.salesforceairesearch.com/codet5/
 The code currently includes two pre-trained checkpoints ([CodeT5-small](https://huggingface.co/Salesforce/codet5-small)
 and [CodeT5-base](https://huggingface.co/Salesforce/codet5-base)) and scripts to fine-tune them on 4 generation tasks (
 code summarization, code generation, translation, and refinement) plus 2 understanding tasks (code defect detection and
 clone detection) in CodeXGLUE. We also provide their fine-tuned checkpoints to facilitate the easy replication
 of our paper.
 In practice, CodeT5 can be deployed as an AI-powered coding assistant to boost the productivity of software developers.
 At Salesforce, we build an [AI coding assistant demo](https://github.com/salesforce/CodeT5/raw/main/codet5.gif) using
 CodeT5 as a VS Code plugin to provide three capabilities for Apex developers:
 - **Text-to-code generation**: generate code based on the natural language description.
 - **Code autocompletion**: complete the whole function of code given the target function name.
 - **Code summarization**: generate the summary of a function in natural language description.
-## Table of Contents
+![CodeT5 demo](./codet5.gif)
 ## What's New: 🎉 
 **May 2023**
 **CodeT5+** Paper and models released! ([paper](https://arxiv.org/pdf/2305.07922.pdf), [code](https://github.com/salesforce/CodeT5/tree/main/CodeT5+))
 **July 2022**
 We release two large-sized CodeT5 checkpoints at Hugging Face: [Salesforce/codet5-large](https://huggingface.co/Salesforce/codet5-large) and [Salesforce/codet5-large-ntp-py](https://huggingface.co/Salesforce/codet5-large-ntp-py), which are introduced by the [CodeRL paper](https://arxiv.org/pdf/2207.01780.pdf).
 **Oct 2021**
 We release [fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models)
 for all the downstream tasks covered in the paper.
 We also release a CodeT5-base fine-tuned
 checkpoint ([Salesforce/codet5-base-multi-sum](https://huggingface.co/Salesforce/codet5-base-multi-sum)) for
 multilingual code summarization.
 **Sep, 2021**
 CodeT5 is now on [Hugging Face](https://huggingface.co/)! ([CodeT5-small](https://huggingface.co/Salesforce/codet5-small) and [CodeT5-base](https://huggingface.co/Salesforce/codet5-base)).
 We add a [model card](https://github.com/salesforce/CodeT5/blob/main/CodeT5_model_card.pdf) for CodeT5! Please reach out
 if you have any questions about it.
 1. [Citation](#citation)
 2. [License](#license)
 3. [Dependency](#dependency)
 4. [Download](#download)
 5. [Fine-tuning](#fine-tuning)
 6. [Get Involved](#get-involved)
 ## Citation
@ -130,15 +59,24 @@ If you find this code to be useful for your research, please consider citing:
    wang2021codet5,
    title={CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation}, 
    author={Yue Wang and Weishi Wang and Shafiq Joty and Steven C.H. Hoi},
-    booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021},
+    booktitle={EMNLP},
    year={2021},
 }
-@article{coderl2022,
+@inproceedings{
-  title={CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
+    le2022coderl,
-  author={Le, Hung and Wang, Yue and Gotmare, Akhilesh Deepak and Savarese, Silvio and Hoi, Steven C. H.},
+    title={CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
-  journal={arXiv preprint arXiv:2207.01780},
+    author={Le, Hung and Wang, Yue and Gotmare, Akhilesh Deepak and Savarese, Silvio and Hoi, Steven C. H.},
-  year={2022}
+    booktitle={NeurIPS},
    year={2022}
 }
@article{
    wang2023codet5plus,
    title={CodeT5+: Open Code Large Language Models for Code Understanding and Generation},
    author={Wang, Yue and Le, Hung and Gotmare, Akhilesh Deepak and Bui, Nghi D.Q. and Li, Junnan and Hoi, Steven C. H.},
    journal={arXiv preprint},
    year={2023}
 }
 ```
@ -162,84 +100,6 @@ codeT5@salesforce.com, and to
 use [appropriate](https://arxiv.org/abs/1810.03993) [documentation](https://www.partnershiponai.org/about-ml/) when
 developing high-stakes applications of this model.
 ## Dependency
 - PyTorch 1.7.1
 - tensorboard 2.4.1
 - transformers 4.6.1
 - tree-sitter 0.2.2
 ## Download
 * [Pre-trained checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/pretrained_models)
 * [Fine-tuning data](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/data)
 * [Fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models)
 Instructions to download:
 ```
 # pip install gsutil
 cd your-cloned-codet5-path
 gsutil -m cp -r "gs://sfr-codet5-data-research/pretrained_models" .
 gsutil -m cp -r "gs://sfr-codet5-data-research/data" .
 gsutil -m cp -r "gs://sfr-codet5-data-research/finetuned_models" .
 ```
 ## Fine-tuning
 Go to the `sh` folder and set `WORKDIR` in `exp_with_args.sh` to the path of your cloned CodeT5 repository.
 You can use `run_exp.py` to run a broad set of experiments by simply passing the `model_tag`, `task`, and `sub_task`
 arguments. In total, we support five models (i.e., ['roberta', 'codebert', 'bart_base', 'codet5_small', 'codet5_base'])
 and six tasks (i.e., ['summarize', 'concode', 'translate', 'refine', 'defect', 'clone']). For each task, we use
 the `sub_task` argument to specify which dataset to fine-tune on. Below is the full list:
 | \--task   | \--sub\_task                       | Description                                                                                                                      |
 | --------- | ---------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- |
 | summarize | ruby/javascript/go/python/java/php | code summarization task on [CodeSearchNet](https://arxiv.org/abs/1909.09436) data with six PLs                                   |
 | concode   | none                               | text-to-code generation on [Concode](https://aclanthology.org/D18-1192.pdf) data                                                 |
 | translate | java-cs/cs-java                    | code-to-code translation between [Java and C#](https://arxiv.org/pdf/2102.04664.pdf)                                             |
 | refine    | small/medium                       | code refinement on [code repair data](https://arxiv.org/pdf/1812.08693.pdf) with small/medium functions                          |
 | defect    | none                               | code defect detection in [C/C++ data](https://proceedings.neurips.cc/paper/2019/file/49265d2447bc3bbfe9e76306ce40a31f-Paper.pdf) |
 | clone     | none                               | code clone detection in [Java data](https://arxiv.org/pdf/2002.08653.pdf)                                                        |
 For example, if you want to run CodeT5-base model on the code summarization task for Python, you can simply run:
 ```
 python run_exp.py --model_tag codet5_base --task summarize --sub_task python
 ```
 For multi-task training, you can type:
 ```
 python run_exp.py --model_tag codet5_base --task multi_task --sub_task none
 ```
 Besides, you can specify:
 ```
 model_dir: where to save fine-tuning checkpoints
 res_dir: where to save the performance results 
 summary_dir: where to save the training curves
 data_num: how many data instances to use, the default -1 is for using the full data
 gpu: the index of the GPU to use in the cluster
 ``` 
 You can also revise the suggested
 arguments [here](https://github.com/salesforce/CodeT5/blob/0bf3c0c43e92fcf54d9df68c793ac22f2b60aad4/sh/run_exp.py#L14) or directly customize the [exp_with_args.sh](https://github.com/salesforce/CodeT5/blob/main/sh/exp_with_args.sh) bash file.
 Please refer to the argument flags in [configs.py](https://github.com/salesforce/CodeT5/blob/main/configs.py) for the full
 list of available options. The training curves saved in `summary_dir` can be visualized using [tensorboard](https://pypi.org/project/tensorboard/).
 Note that we employ one A100 GPU for all fine-tuning experiments.
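
The `run_exp.py` invocations described above can also be assembled programmatically, which helps catch an invalid `task`/`sub_task` pair before a job is launched. The following is only a minimal sketch: `build_run_exp_command` is a hypothetical helper that is not part of this repository, the allowed values are copied from the table above, and it assumes that the optional settings (`model_dir`, `gpu`, and so on) are passed as `--name value` flags.

```python
import shlex

# Allowed sub_task values per task, copied from the table above.
# "multi_task" is taken from the multi-task example command.
SUB_TASKS = {
    "summarize": ["ruby", "javascript", "go", "python", "java", "php"],
    "concode": ["none"],
    "translate": ["java-cs", "cs-java"],
    "refine": ["small", "medium"],
    "defect": ["none"],
    "clone": ["none"],
    "multi_task": ["none"],
}
MODEL_TAGS = ["roberta", "codebert", "bart_base", "codet5_small", "codet5_base"]


def build_run_exp_command(model_tag, task, sub_task="none", **options):
    """Return the run_exp.py command as a list of arguments.

    Hypothetical helper: extra keyword options are assumed to map to
    `--name value` flags (e.g. gpu=0 becomes `--gpu 0`).
    """
    if model_tag not in MODEL_TAGS:
        raise ValueError(f"unknown model_tag: {model_tag!r}")
    if task not in SUB_TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if sub_task not in SUB_TASKS[task]:
        raise ValueError(f"sub_task {sub_task!r} is not valid for task {task!r}")

    cmd = ["python", "run_exp.py",
           "--model_tag", model_tag,
           "--task", task,
           "--sub_task", sub_task]
    # Sort the options so the generated command is deterministic.
    for name, value in sorted(options.items()):
        cmd += [f"--{name}", str(value)]
    return cmd


if __name__ == "__main__":
    # Print the command for CodeT5-base on Python code summarization.
    print(shlex.join(build_run_exp_command("codet5_base", "summarize", "python")))
```

The returned list can be passed to `subprocess.run` from inside the `sh` folder, or printed and pasted into a shell.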
 ### How to reproduce the results using the released finetuned checkpoints?
 * Remove `--do_train --do_eval --do_eval_bleu` and keep only `--do_test` [here](https://github.com/salesforce/CodeT5/blob/5b37c34f4bbbfcfd972c24a9dd1f45716568ecb5/sh/exp_with_args.sh#L84).
 * Set the path of your downloaded fine-tuned checkpoint [here](https://github.com/salesforce/CodeT5/blob/5b37c34f4bbbfcfd972c24a9dd1f45716568ecb5/run_gen.py#L366), e.g., `file = "CodeT5/finetuned_models/summarize_python_codet5_base.bin"`
 * Run the program: `python run_exp.py --model_tag codet5_base --task summarize --sub_task python`
 ### How to fine-tune on your own task and dataset?
 To fine-tune on your own dataset, add your task and sub_task in `configs.py` ([here](https://github.com/salesforce/CodeT5/blob/d27512d23ba6130e089e571d8c3e399760db1c31/configs.py#L11)), then add your data path and the function that reads the data in `utils.py` ([here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L103) and [here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L149)). The read function can be implemented in `_utils.py`, similar to [this one](https://github.com/salesforce/CodeT5/blob/aaf9c4a920c4986abfd54a74f5456b056b6409e0/_utils.py#L213). If the task you are adding is a generation task, you can reuse or customize `run_gen.py`. For understanding tasks, please refer to `run_defect.py` and `run_clone.py`.
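
As a rough illustration of the read function mentioned above, the sketch below loads source/target pairs from a JSON Lines file. Everything in it is an assumption made for illustration: the `Example` container, the function name `read_my_task_examples`, and the `input`/`output` field names are hypothetical and should be adapted to the classes and conventions actually used in `_utils.py`. The `data_num` parameter mirrors the option described earlier, where `-1` means using the full data.

```python
import json
from dataclasses import dataclass


@dataclass
class Example:
    """Hypothetical container for one source/target pair."""
    idx: int
    source: str
    target: str


def read_my_task_examples(filename, data_num=-1):
    """Read examples from a JSON Lines file.

    Each non-empty line is assumed to be a JSON object with hypothetical
    "input" and "output" string fields. data_num=-1 reads the whole file;
    otherwise at most data_num examples are returned.
    """
    examples = []
    with open(filename, encoding="utf-8") as f:
        for idx, line in enumerate(f):
            # Stop once the requested number of examples has been read.
            if data_num != -1 and len(examples) >= data_num:
                break
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            examples.append(
                Example(
                    idx=idx,  # zero-based line index in the file
                    source=record["input"].strip(),
                    target=record["output"].strip(),
                )
            )
    return examples
```

A function of this shape would then be registered for the new task in `utils.py`, next to the existing readers.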
 ## Get Involved
--- a/codet5.gif
+++ b/codet5.gif