CodeT5+

Official research release for the CodeT5+ models (220M, 770M, 2B, 6B, and 16B) for a wide range of code understanding and generation tasks.

Title: CodeT5+: Open Code Large Language Models for Code Understanding and Generation

Authors: Yue Wang*, Hung Le*, Akhilesh Deepak Gotmare, Nghi D.Q. Bui, Junnan Li, Steven C.H. Hoi (* indicates equal contribution)

What is this about?

CodeT5+ is a new family of open code large language models with an encoder-decoder architecture that can flexibly operate in different modes (i.e., encoder-only, decoder-only, and encoder-decoder) to support a wide range of code understanding and generation tasks.

To train CodeT5+, we introduce a diverse set of pretraining tasks including span denoising, causal language modeling, contrastive learning, and text-code matching to learn rich representations from both unimodal code data and bimodal code-text data. Additionally, to efficiently scale up the model, we propose a simple yet effective compute-efficient pretraining method to initialize our model with frozen off-the-shelf LLMs such as CodeGen. Furthermore, we explore instruction tuning to align the model with natural language instructions following Code Alpaca.
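
To make the span-denoising objective concrete, the sketch below (an illustration we add here, not the released pretraining pipeline) shows how a masked input/target pair can be formed with T5-style sentinel tokens (<extra_id_0>, <extra_id_1>, ...), which the CodeT5 tokenizer provides; the span positions are hand-picked for readability, whereas the real pipeline samples them.

def make_denoising_pair(tokens, spans):
    """Illustrative only: build a span-denoising (input, target) pair.

    tokens: list of code tokens.
    spans: sorted, non-overlapping (start, end) index ranges to mask.
    """
    inp, tgt, sid, i = [], [], 0, 0
    while i < len(tokens):
        span = next((s for s in spans if s[0] == i), None)
        if span is None:
            inp.append(tokens[i])
            i += 1
        else:
            sentinel = f"<extra_id_{sid}>"
            inp.append(sentinel)                 # masked span becomes a sentinel in the input
            tgt.append(sentinel)                 # target lists each masked span after its sentinel
            tgt.extend(tokens[span[0]:span[1]])
            i, sid = span[1], sid + 1
    return " ".join(inp), " ".join(tgt)

code_tokens = "def add ( a , b ) : return a + b".split()
inp, tgt = make_denoising_pair(code_tokens, [(3, 4), (11, 12)])
print(inp)   # def add ( <extra_id_0> , b ) : return a + <extra_id_1>
print(tgt)   # <extra_id_0> a <extra_id_1> b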

We implement a family of CodeT5+ models with model sizes ranging from 220M to 16B. Note that CodeT5+ 220M and 770M employ the same architectures as CodeT5-base and CodeT5-large, respectively, and are pretrained from scratch, while CodeT5+ 2B, 6B, and 16B employ a "shallow encoder and deep decoder" architecture, with the shallow encoder initialized from CodeGen-mono 350M and the deep decoder initialized from CodeGen-mono 2B, 6B, and 16B, respectively.

[Figure: CodeT5+ overview]

Released Models

We release the following CodeT5+ models:

  • CodeT5+ 220M and 770M on Hugging Face as Salesforce/codet5p-220m and Salesforce/codet5p-770m, respectively.
  • CodeT5+ 220M and 770M further tuned on a Python subset, available as Salesforce/codet5p-220m-py and Salesforce/codet5p-770m-py, respectively.
  • CodeT5+ 2B, 6B, 16B will be released soon.

How to Use?

CodeT5+ 220M and 770M models can be loaded with the standard T5ForConditionalGeneration class from Hugging Face Transformers. They use the same tokenizer as the original CodeT5.

from transformers import T5ForConditionalGeneration, AutoTokenizer

checkpoint = "Salesforce/codet5p-770m-py"
device = "cuda" # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint).to(device)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs, max_length=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# ==>     print('Hello World!')
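
Because CodeT5+ is an encoder-decoder model, the same checkpoint can also be run in encoder-only mode, for example to embed code snippets. The sketch below is a minimal, unofficial recipe (mean-pooling is our choice for illustration, not a documented default) and reuses the tokenizer, model, and device from the snippet above.

import torch

code = "def add(a, b):\n    return a + b"
enc = tokenizer(code, return_tensors="pt").to(device)

with torch.no_grad():
    # T5ForConditionalGeneration exposes its encoder stack; run it on its own.
    hidden = model.encoder(input_ids=enc.input_ids,
                           attention_mask=enc.attention_mask).last_hidden_state

# Mean-pool over non-padding tokens to obtain one vector per snippet.
mask = enc.attention_mask.unsqueeze(-1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # e.g. torch.Size([1, 1024]) for the 770M (T5-large-sized) checkpoint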

Citation

@article{wang2023codet5plus,
  title={CodeT5+: Open Code Large Language Models for Code Understanding and Generation},
  author={Wang, Yue and Le, Hung and Gotmare, Akhilesh Deepak and Bui, Nghi D.Q. and Li, Junnan and Hoi, Steven C. H.},
  journal={arXiv preprint},
  year={2023}
}