Official research release for the **CodeT5+** models (`220M`, `770M`, `2B`, `6B`, `16B`) for a wide range of **Code Understanding and Generation** tasks.
CodeT5+ is a new family of open code large language models with an encoder-decoder architecture that can flexibly operate in different modes (i.e. _encoder-only_, _decoder-only_, and _encoder-decoder_) to support a wide range of code understanding and generation tasks.
To train CodeT5+, we introduce a diverse set of pretraining tasks including _span denoising_, _causal language modeling_, _contrastive learning_, and _text-code matching_ to learn rich representations from both unimodal code data and bimodal code-text data.
Additionally, to efficiently scale up the model, we propose a simple yet effective _compute-efficient pretraining_ method to initialize our model with frozen off-the-shelf LLMs such as [CodeGen](
Furthermore, we explore instruction tuning to align the model with natural language instructions following [Code Alpaca](
Note that CodeT5+ `220M` and `770M` use the same architectures as CodeT5-base and CodeT5-large, respectively, and are pretrained from scratch, while CodeT5+ `2B`, `6B`, and `16B` use a "_shallow encoder and deep decoder_" architecture, with the shallow encoder initialized from CodeGen-mono 350M and the deep decoder initialized from CodeGen-mono 2B, 6B, and 16B, respectively.
* CodeT5+ `220M` and `770M`: [Salesforce/codet5p-220m]( and [codet5p-770m](
* CodeT5+ `220M` and `770M` further tuned on the Python subset: [codet5p-220m-py]( and [codet5p-770m-py](
* CodeT5+ `2B`, `6B`, `16B`: [Salesforce/codet5p-2b](, [Salesforce/codet5p-6b](, and [Salesforce/codet5p-16b](
All CodeT5+ models and tokenizers can be easily loaded using the `AutoModelForSeq2SeqLM` and `AutoTokenizer` functionality.
For tokenizers, CodeT5+ `220M` and `770M` employ the same tokenizer as the original [CodeT5]( while CodeT5+ `2B`, `6B`, `16B` employ the same tokenizer as [CodeGen](
To load CodeT5+ `2B`, `6B`, or `16B`, set `trust_remote_code=True`, because the [model class]( is defined in the Hugging Face repo.
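
The loading rules above can be sketched as follows. This is a minimal illustration, not the official usage snippet: the helper names `needs_remote_code`, `load_codet5p`, and `complete` are hypothetical, and the size check assumes checkpoint names follow the `codet5p-<size>` pattern used in the list above.

```python
def needs_remote_code(checkpoint: str) -> bool:
    """Return True for the 2B/6B/16B checkpoints, whose model class is
    defined in the Hub repo (hypothetical helper; assumes the name
    contains 'codet5p-<size>')."""
    size = checkpoint.lower().rsplit("codet5p-", 1)[-1].split("-")[0]
    return size in {"2b", "6b", "16b"}


def load_codet5p(checkpoint: str, device: str = "cpu"):
    """Load a tokenizer and model with the Auto* classes."""
    # Imported lazily so needs_remote_code() works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(
        checkpoint,
        # Only the larger checkpoints need code from the model repo.
        trust_remote_code=needs_remote_code(checkpoint),
    ).to(device)
    return tokenizer, model


def complete(tokenizer, model, prompt: str, max_length: int = 10) -> str:
    """Generate text for a prompt and decode the first returned sequence."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_length=max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

For example, `tokenizer, model = load_codet5p("Salesforce/codet5p-220m")` followed by `complete(tokenizer, model, "def print_hello_world():<extra_id_0>")` runs one generation on CPU. The larger checkpoints may need extra inputs or dtype settings beyond this sketch.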