# CodeT5+
Official research release for the **CodeT5+** models (`220M`, `770M`, `2B`, `6B`, `16B`) for a wide range of **Code Understanding and Generation** tasks.
*Title*: [CodeT5+: Open Code Large Language Models for Code Understanding and Generation](https://arxiv.org/pdf/2305.07922.pdf)
*Authors*: [Yue Wang](https://yuewang-cuhk.github.io/)\*, [Hung Le](https://sites.google.com/view/henryle2018/home?pli=1)\*, [Akhilesh Deepak Gotmare](https://akhileshgotmare.github.io/), [Nghi D.Q. Bui](https://bdqnghi.github.io/), [Junnan Li](https://sites.google.com/site/junnanlics), [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home) (* indicates equal contribution)
# What is this about?
CodeT5+ is a new family of open code large language models with an encoder-decoder architecture that can flexibly operate in different modes (i.e. _encoder-only_, _decoder-only_, and _encoder-decoder_) to support a wide range of code understanding and generation tasks.
See the overview of CodeT5+ below.
![CodeT5+ overview](codet5p_overview.png)
To train CodeT5+, we introduce a diverse set of pretraining tasks including _span denoising_, _causal language modeling_, _contrastive learning_, and _text-code matching_ to learn rich representations from both unimodal code data and bimodal code-text data.
Additionally, to efficiently scale up the model, we propose a simple yet effective _compute-efficient pretraining_ method to initialize our model with frozen off-the-shelf LLMs such as [CodeGen](https://github.com/salesforce/CodeGen).
Furthermore, we explore instruction tuning to align the model with natural language instructions following [Code Alpaca](https://github.com/sahil280114/codealpaca).
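For intuition, here is a minimal sketch of what a span-denoising training pair can look like, assuming T5-style sentinel tokens (the exact corruption scheme used in pretraining is described in the paper):

```python
# Hypothetical span-denoising pair (T5-style sentinel tokens, for illustration only).
# The encoder sees code with spans replaced by sentinel tokens;
# the decoder is trained to reconstruct the masked spans in order.
encoder_input = "def add(a, b):\n    <extra_id_0> a + b"
decoder_target = "<extra_id_0> return <extra_id_1>"
```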
![CodeT5+ architecture](codet5p_architecture.png)
We implement a family of CodeT5+ models with model sizes ranging from 220M to 16B parameters.
Note that CodeT5+ `220M` and `770M` employ the same architectures as CodeT5-base and CodeT5-large, respectively, and are pretrained from scratch, while CodeT5+ `2B`, `6B`, and `16B` employ a "_shallow encoder and deep decoder_" architecture, with the shallow encoder initialized from CodeGen-mono 350M and the deep decoder initialized from CodeGen-mono 2B, 6B, and 16B, respectively.
InstructCodeT5+ `16B` is our instruction-tuned model derived from CodeT5+ `16B`.
# Released Models
We release the following CodeT5+ models on Hugging Face:
* CodeT5+ `220M` and `770M`: [Salesforce/codet5p-220m](https://huggingface.co/Salesforce/codet5p-220m) and [Salesforce/codet5p-770m](https://huggingface.co/Salesforce/codet5p-770m).
* CodeT5+ `220M` and `770M` further tuned on the Python subset: [Salesforce/codet5p-220m-py](https://huggingface.co/Salesforce/codet5p-220m-py) and [Salesforce/codet5p-770m-py](https://huggingface.co/Salesforce/codet5p-770m-py).
* CodeT5+ `2B`, `6B`, `16B`: [Salesforce/codet5p-2b](https://huggingface.co/Salesforce/codet5p-2b), [Salesforce/codet5p-6b](https://huggingface.co/Salesforce/codet5p-6b), and [Salesforce/codet5p-16b](https://huggingface.co/Salesforce/codet5p-16b).
* InstructCodeT5+ `16B`: [Salesforce/instructcodet5p-16b](https://huggingface.co/Salesforce/instructcodet5p-16b).
# How to Use?
All CodeT5+ models and tokenizers can be easily loaded via the `AutoModelForSeq2SeqLM` and `AutoTokenizer` classes.
For tokenizers, CodeT5+ `220M` and `770M` employ the same tokenizer as the original [CodeT5](https://github.com/salesforce/CodeT5), while CodeT5+ `2B`, `6B`, and `16B` employ the same tokenizer as [CodeGen](https://github.com/salesforce/CodeGen).
To load CodeT5+ `2B`, `6B`, or `16B`, please set `trust_remote_code=True`, as the [model class](https://huggingface.co/Salesforce/codet5p-16b/blob/main/modeling_codet5p.py) is defined in the Hugging Face model repo.
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

checkpoint = "Salesforce/instructcodet5p-16b"
device = "cuda"  # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint,
                                              torch_dtype=torch.float16,
                                              low_cpu_mem_usage=True,
                                              trust_remote_code=True).to(device)

inputs = tokenizer.encode("def print_hello():", return_tensors="pt").to(device)
outputs = model.generate(inputs, max_length=12)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
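Since CodeT5+ `220M` and `770M` follow the standard CodeT5 architecture, they can be loaded without `trust_remote_code`. Below is a minimal sketch using the `Salesforce/codet5p-220m` checkpoint:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "Salesforce/codet5p-220m"
device = "cpu"  # or "cuda" for GPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# No trust_remote_code needed: the 220M/770M checkpoints use the standard
# T5-style encoder-decoder implementation shipped with transformers.
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint).to(device)

inputs = tokenizer.encode("def print_hello():", return_tensors="pt").to(device)
outputs = model.generate(inputs, max_length=12)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```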
# Reproduce the Results
## HumanEval
TBA
## Citation
```bibtex
@article{wang2023codet5plus,
  title={CodeT5+: Open Code Large Language Models for Code Understanding and Generation},
  author={Wang, Yue and Le, Hung and Gotmare, Akhilesh Deepak and Bui, Nghi D.Q. and Li, Junnan and Hoi, Steven C. H.},
  journal={arXiv preprint arXiv:2305.07922},
  year={2023}
}
```