From 3b529e206d578453a126b5614ece5a84d3d45b87 Mon Sep 17 00:00:00 2001
From: WANG Yue <337111657@qq.com>
Date: Wed, 12 Jul 2023 15:47:00 +0800
Subject: [PATCH] update table of contents

---
 CodeT5+/README.md | 11 ++++++
 CodeT5/README.md  | 87 ++++++++++++++++++++++++++---------------------
 2 files changed, 59 insertions(+), 39 deletions(-)

diff --git a/CodeT5+/README.md b/CodeT5+/README.md
index b3636d3..b34265b 100644
--- a/CodeT5+/README.md
+++ b/CodeT5+/README.md
@@ -8,6 +8,16 @@ Find out more via our [blog post](https://blog.salesforceairesearch.com/codet5-o

*Authors*: [Yue Wang](https://yuewang-cuhk.github.io/)\*, [Hung Le](https://sites.google.com/view/henryle2018/home?pli=1)\*, [Akhilesh Deepak Gotmare](https://akhileshgotmare.github.io/), [Nghi D.Q. Bui](https://bdqnghi.github.io/), [Junnan Li](https://sites.google.com/site/junnanlics), [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home) (* indicates equal contribution)

+## Table of Contents
+
+1. [What is this about?](#what-is-this-about)
+2. [Released Models](#released-models)
+3. [How to Use?](#how-to-use)
+4. [Instruction Tuning to Align with Natural Language Instructions](#instruction-tuning-to-align-with-natural-language-instructions)
+5. [How to Finetune Using Your Own Data?](#how-to-finetune-using-your-own-data)
+6. [Reproduce the Results](#reproduce-the-results)
+7. [Citation](#citation)
+
# What is this about?
CodeT5+ is a new family of open code large language models with an encoder-decoder architecture that can flexibly operate in different modes (i.e. _encoder-only_, _decoder-only_, and _encoder-decoder_) to support a wide range of code understanding and generation tasks.
@@ -106,6 +116,7 @@ Our CodeT5+ models achieves strong results on HumanEval benchmark in zero-shot s
| code-cushman-001 | 33.5 | 54.3 | 77.4 |
| StarCoder 15B | 33.6 | - | - |
| InstructCodeT5+ 16B | **36.1** | **57.1** | **80.7** |
+
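+For a quick sanity check of a local setup before reproducing these numbers, a small CodeT5+ checkpoint can
+be exercised directly with `transformers`. A minimal sketch, assuming the `Salesforce/codet5p-220m` seq2seq
+checkpoint on the HuggingFace hub (the billion-parameter variants may require `trust_remote_code=True`):
+
+```python
+from transformers import AutoTokenizer, T5ForConditionalGeneration
+
+checkpoint = "Salesforce/codet5p-220m"  # assumed small seq2seq checkpoint id
+tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+model = T5ForConditionalGeneration.from_pretrained(checkpoint)
+
+# span-denoising style prompt: the model fills in the <extra_id_0> slot
+inputs = tokenizer("def print_hello_world():<extra_id_0>", return_tensors="pt")
+outputs = model.generate(inputs.input_ids, max_length=10)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+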
Please follow the instructions below to reproduce the results.

diff --git a/CodeT5/README.md b/CodeT5/README.md
index 77c4bdc..d117d95 100644
--- a/CodeT5/README.md
+++ b/CodeT5/README.md
@@ -9,6 +9,45 @@ This is the official PyTorch implementation for the following EMNLP 2021 paper f

![CodeT5 demo](../codet5.gif)

+
+## Table of Contents
+
+1. [Introduction](#introduction)
+2. [Updates](#updates)
+3. [Download Pretrained and Fine-tuned Checkpoints](#download-pretrained-and-fine-tuned-checkpoints)
+4. [Fine-tuning](#fine-tuning)
+   1. [How to run?](#how-to-run)
+   2. [How to reproduce the results using the released finetuned checkpoints?](#how-to-reproduce-the-results-using-the-released-finetuned-checkpoints)
+   3. [How to fine-tune on your own task and dataset?](#how-to-fine-tune-on-your-own-task-and-dataset)
+5. [Citation](#citation)
+
+## Introduction
+
+This repo provides the code for reproducing the experiments in
+[CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf).
+CodeT5 is a pre-trained encoder-decoder model for programming languages, trained on **8.35M** functions
+in 8 programming languages (Python, Java, JavaScript, PHP, Ruby, Go, C, and C#). It achieves
+state-of-the-art results on **14 sub-tasks** of the code intelligence benchmark [CodeXGLUE](https://github.com/microsoft/CodeXGLUE).
+
+Paper link: https://arxiv.org/abs/2109.00859
+
+Blog link: https://blog.salesforceairesearch.com/codet5/
+
+The code currently includes two pre-trained checkpoints ([CodeT5-small](https://huggingface.co/Salesforce/codet5-small)
+and [CodeT5-base](https://huggingface.co/Salesforce/codet5-base)) and scripts to fine-tune them on 4 generation tasks
+(code summarization, code generation, translation, and refinement) plus 2 understanding tasks (code defect detection
+and clone detection) in CodeXGLUE. We also provide their fine-tuned checkpoints to facilitate easy replication
+of our paper.
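+
+As a taste of what the fine-tuned checkpoints do, the multilingual code-summarization model runs in a few
+lines. A minimal sketch, assuming the `Salesforce/codet5-base-multi-sum` checkpoint on the HuggingFace hub:
+
+```python
+from transformers import RobertaTokenizer, T5ForConditionalGeneration
+
+tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base-multi-sum")
+model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base-multi-sum")
+
+code = "def add(a, b):\n    return a + b"
+input_ids = tokenizer(code, return_tensors="pt").input_ids
+# beam search tends to give cleaner one-line summaries than greedy decoding
+summary_ids = model.generate(input_ids, max_length=20, num_beams=4)
+print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
+```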
+
+In practice, CodeT5 can be deployed as an AI-powered coding assistant to boost the productivity of software developers.
+At Salesforce, we built an [AI coding assistant demo](https://github.com/salesforce/CodeT5/raw/main/codet5.gif) using
+CodeT5 as a VS Code plugin that provides three capabilities for Apex developers:
+
+- **Text-to-code generation**: generate code from a natural language description.
+- **Code autocompletion**: complete a whole function given the target function name.
+- **Code summarization**: generate a natural language summary of a function.
+
+
## Updates

**July 06, 2022**
@@ -86,46 +125,8 @@ print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
# this prints "{user.username}"
```

-## Introduction
-
-This repo provides the code for reproducing the experiments
-in [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf)
-. CodeT5 is a new pre-trained encoder-decoder model for programming languages, which is pre-trained on **8.35M**
-functions in 8 programming languages (Python, Java, JavaScript, PHP, Ruby, Go, C, and C#). In total, it achieves
-state-of-the-art results on **14 sub-tasks** in a code intelligence benchmark - [CodeXGLUE](https://github.com/microsoft/CodeXGLUE).
-
-Paper link: https://arxiv.org/abs/2109.00859
-
-Blog link: https://blog.salesforceairesearch.com/codet5/
-
-The code currently includes two pre-trained checkpoints ([CodeT5-small](https://huggingface.co/Salesforce/codet5-small)
-and [CodeT5-base](https://huggingface.co/Salesforce/codet5-base)) and scripts to fine-tune them on 4 generation tasks (
-code summarization, code generation, translation, and refinement) plus 2 understanding tasks (code defect detection and
-clone detection) in CodeXGLUE. We also provide their fine-tuned checkpoints to facilitate the easy replication
-of our paper.
-
-In practice, CodeT5 can be deployed as an AI-powered coding assistant to boost the productivity of software developers.
-At Salesforce, we build an [AI coding assistant demo](https://github.com/salesforce/CodeT5/raw/main/codet5.gif) using
-CodeT5 as a VS Code plugin to provide three capabilities for Apex developers:
-
-- **Text-to-code generation**: generate code based on the natural language description.
-- **Code autocompletion**: complete the whole function of code given the target function name.
-- **Code summarization**: generate the summary of a function in natural language description.
-
-## Table of Contents
-
-1. [Dependency](#dependency)
-2. [Download](#download)
-3. [Fine-tuning](#fine-tuning)
-
-## Dependency
-
-- Pytorch 1.7.1
-- tensorboard 2.4.1
-- transformers 4.6.1
-- tree-sitter 0.2.2
-
-## Download
+## Download Pretrained and Fine-tuned Checkpoints

* [Pre-trained checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/pretrained_models)
* [Fine-tuning data](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/data)
@@ -144,6 +145,14 @@ gsutil -m cp -r "gs://sfr-codet5-data-research/finetuned_models" .

## Fine-tuning

+### Dependency
+
+- PyTorch 1.7.1
+- tensorboard 2.4.1
+- transformers 4.6.1
+- tree-sitter 0.2.2
+
+### How to run?
Go to `sh` folder, set the `WORKDIR` in `exp_with_args.sh` to be your cloned CodeT5 repository path.
You can use `run_exp.py` to run a broad set of experiments by simply passing the `model_tag`, `task`, and `sub_task`
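+
+For example, a single summarization run on the Python subset might be launched along the lines of
+`python run_exp.py --model_tag codet5_base --task summarize --sub_task python` (illustrative values;
+see `run_exp.py` for the exact model tags, tasks, and sub-tasks it accepts).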