# TurboPilot 🚀
[![Mastodon Follow](https://img.shields.io/mastodon/follow/000117012?domain=https%3A%2F%2Ffosstodon.org%2F&style=social)](https://fosstodon.org/@jamesravey) ![BSD Licensed](https://img.shields.io/github/license/ravenscroftj/turbopilot) ![Time Spent](https://img.shields.io/endpoint?url=https://wakapi.nopro.be/api/compat/shields/v1/jamesravey/all_time/label%3Aturbopilot)
TurboPilot is a self-hosted [copilot](https://github.com/features/copilot) clone which uses the library behind [llama.cpp](https://github.com/ggerganov/llama.cpp) to run the [6 Billion Parameter Salesforce Codegen model](https://github.com/salesforce/CodeGen) in 4GiB of RAM. It is heavily based on and inspired by the [fauxpilot](https://github.com/fauxpilot/fauxpilot) project.
***NB: This is a proof of concept right now rather than a stable tool. Autocompletion is quite slow in this version of the project. Feel free to play with it, but your mileage may vary.***
![a screen recording of turbopilot running through fauxpilot plugin](assets/vscode-status.gif)
## 🤝 Contributing
PRs to this project and the corresponding [GGML fork](https://github.com/ravenscroftj/ggml) are very welcome.
Make a fork, make your changes and then open a [PR](https://github.com/ravenscroftj/turbopilot/pulls).
## 👋 Getting Started
The easiest way to try the project out is to grab the pre-processed models and then run the server in docker.
### Getting The Models
You have two options for getting the models:
#### Option A: Direct Download - Easy, Quickstart
You can download the pre-converted, pre-quantized models from Hugging Face.
The `multi` flavour models can provide auto-complete suggestions for `C`, `C++`, `Go`, `Java`, `JavaScript`, and `Python`.
The `mono` flavour models can provide auto-complete suggestions for `Python` only (but the quality of Python-specific suggestions may be higher).
Pre-converted and pre-quantized models are available for download from the links below:
| Model Name | RAM Requirement | Supported Languages | Direct Download | HF Project Link |
|---------------------|-----------------|---------------------------|-----------------|-----------------|
| CodeGen 350M multi | ~800MiB | `C`, `C++`, `Go`, `Java`, `JavaScript`, `Python` | [:arrow_down:](https://huggingface.co/ravenscroftj/CodeGen-350M-multi-ggml-quant/resolve/main/codegen-350M-multi-ggml-4bit-quant.bin) | [:hugs:](https://huggingface.co/ravenscroftj/CodeGen-350M-multi-ggml-quant) |
| CodeGen 350M mono | ~800MiB | `Python` | [:arrow_down:](https://huggingface.co/Guglielmo/CodeGen-350M-mono-ggml-quant/resolve/main/ggml-model-quant.bin) | [:hugs:](https://huggingface.co/Guglielmo/CodeGen-350M-mono-ggml-quant) |
| CodeGen 2B multi | ~4GiB | `C`, `C++`, `Go`, `Java`, `JavaScript`, `Python` | [:arrow_down:](https://huggingface.co/ravenscroftj/CodeGen-2B-multi-ggml-quant/resolve/main/codegen-2B-multi-ggml-4bit-quant.bin) | [:hugs:](https://huggingface.co/ravenscroftj/CodeGen-2B-multi-ggml-quant) |
| CodeGen 2B mono | ~4GiB | `Python` | [:arrow_down:](https://huggingface.co/Guglielmo/CodeGen-2B-mono-ggml-quant/resolve/main/ggml-model-quant.bin) | [:hugs:](https://huggingface.co/Guglielmo/CodeGen-2B-mono-ggml-quant/) |
| CodeGen 6B multi | ~8GiB | `C`, `C++`, `Go`, `Java`, `JavaScript`, `Python` | [:arrow_down:](https://huggingface.co/ravenscroftj/CodeGen-6B-multi-ggml-quant/resolve/main/codegen-6B-multi-ggml-4bit-quant.bin) | [:hugs:](https://huggingface.co/ravenscroftj/CodeGen-6B-multi-ggml-quant) |
| CodeGen 6B mono | ~8GiB | `Python` | [:arrow_down:](https://huggingface.co/Guglielmo/CodeGen-6B-mono-ggml-quant/resolve/main/ggml-model-quant.bin) | [:hugs:](https://huggingface.co/Guglielmo/CodeGen-6B-mono-ggml-quant/) |
#### Option B: Convert The Models Yourself - Hard, More Flexible
Follow [this guide](https://github.com/ravenscroftj/turbopilot/wiki/Converting-and-Quantizing-The-Models) if you want to experiment with quantizing the models yourself.
### ⚙️ Running TurboPilot Server
Download the [latest binary](https://github.com/ravenscroftj/turbopilot/releases) and extract it to the root project folder. If a binary is not provided for your OS, or you'd prefer to build it yourself, follow the [build instructions](BUILD.md).
Run:
```bash
./codegen-serve -m ./models/codegen-6B-multi-ggml-4bit-quant.bin
```
The application should start a server on port `18080`.

If you have a multi-core system, you can control how many CPUs are used with the `-t` option. For example, on my AMD Ryzen 5000, which has 6 cores/12 threads, I use:
```bash
./codegen-serve -t 6 -m ./models/codegen-6B-multi-ggml-4bit-quant.bin
```
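As a rule of thumb, a sensible thread count can be derived from the machine itself. This is a sketch of one heuristic (it assumes GNU coreutils' `nproc` is available and halves the reported hardware threads to approximate physical cores):

```shell
# nproc reports hardware threads; halving approximates physical cores
# on a hyper-threaded CPU (a heuristic - tune to taste).
THREADS=$(( $(nproc) / 2 ))
# Guard against rounding down to zero on single-threaded machines.
if [ "$THREADS" -lt 1 ]; then THREADS=1; fi
echo "Using ${THREADS} threads"
```

You can then pass the value along with `./codegen-serve -t "$THREADS" -m <model>`.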
### 📦 Running From Docker
You can also run TurboPilot from the pre-built Docker image supplied [here](https://github.com/users/ravenscroftj/packages/container/package/turbopilot).
You will still need to download the models separately, then you can run:
```bash
docker run --rm -it \
  -v "$(pwd)/models:/models" \
  -e THREADS=6 \
  -e MODEL="/models/codegen-2B-multi-ggml-4bit-quant.bin" \
  -p 18080:18080 \
  ghcr.io/ravenscroftj/turbopilot:latest
```
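If you prefer a declarative setup, the same container can be described in a Compose file. This is a sketch under my own assumptions (the service name and the relative `./models` layout are not part of the project):

```yaml
version: "3.8"
services:
  turbopilot:
    image: ghcr.io/ravenscroftj/turbopilot:latest
    ports:
      - "18080:18080"
    environment:
      THREADS: "6"
      MODEL: "/models/codegen-2B-multi-ggml-4bit-quant.bin"
    volumes:
      - ./models:/models
```

Bring it up with `docker compose up`; Compose resolves `./models` relative to the file, so the bind mount works without an absolute path.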
### 🌐 Using the API
#### Support for the official Copilot Plugin
Support for the official VS Code Copilot plugin is underway (see ticket #11). The API should now be broadly compatible with OpenAI's.
#### Using the API with FauxPilot Plugin
To use the API from VS Code, I recommend the [vscode-fauxpilot](https://github.com/Venthe/vscode-fauxpilot) plugin. Once you install it, you will need to change a few settings in your `settings.json` file.
- Open settings (CTRL/CMD + SHIFT + P) and select `Preferences: Open User Settings (JSON)`
- Add the following values:
```json
{
    ... // other settings
    "fauxpilot.enabled": true,
    "fauxpilot.server": "http://localhost:18080/v1/engines",
}
```
Now you can enable fauxpilot with `CTRL + SHIFT + P` and select `Enable Fauxpilot`.
The plugin will send API calls to the running `codegen-serve` process when you make a keystroke. It will then wait for each request to complete before sending further requests.
#### Calling the API Directly
You can make requests to `http://localhost:18080/v1/engines/codegen/completions`, which behaves just like the official Copilot endpoint.
For example:
```bash
curl --request POST \
  --url http://localhost:18080/v1/engines/codegen/completions \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "codegen",
    "prompt": "def main():",
    "max_tokens": 100
  }'
```
That should get you something like this:
```json
{
    "choices": [
        {
            "logprobs": null,
            "index": 0,
            "finish_reason": "length",
            "text": "\n \"\"\"Main entry point for this script.\"\"\"\n logging.getLogger().setLevel(logging.INFO)\n logging.basicConfig(format=('%(levelname)s: %(message)s'))\n\n parser = argparse.ArgumentParser(\n description=__doc__,\n formatter_class=argparse.RawDescriptionHelpFormatter,\n epilog=__doc__)\n "
        }
    ],
    "created": 1681113078,
    "usage": {
        "total_tokens": 105,
        "prompt_tokens": 3,
        "completion_tokens": 102
    },
    "object": "text_completion",
    "model": "codegen",
    "id": "01d7a11b-f87c-4261-8c03-8c78cbe4b067"
}
```
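To pull just the suggestion text out of that response on the command line, the JSON can be filtered with `jq`. A minimal sketch - the `complete_prompt` helper name is my own invention, and it assumes `jq` is installed and a server is listening on `localhost:18080`:

```shell
# Hypothetical helper: POST a prompt and print only the suggestion text.
# Assumes jq is installed and TurboPilot is running on localhost:18080.
complete_prompt() {
  curl --silent --request POST \
    --url http://localhost:18080/v1/engines/codegen/completions \
    --header 'Content-Type: application/json' \
    --data "{\"model\": \"codegen\", \"prompt\": \"$1\", \"max_tokens\": 100}" \
    | jq -r '.choices[0].text'
}
```

For example, `complete_prompt "def main():"` would print only the `text` field from a response like the one above. (Prompts containing double quotes would need escaping before being spliced into the JSON.)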
## 👉 Known Limitations
Again, I want to set expectations: this is a proof-of-concept project rather than a polished tool. With that in mind, here are some current known limitations.
As of **v0.0.2**:
- The models can be quite slow - especially the 6B ones. It can take ~30-40s to make suggestions across 4 CPU cores.
- I've only tested the system on Ubuntu 22.04, but I am now supplying ARM Docker images and will soon provide ARM binary releases.
- Sometimes suggestions get truncated in nonsensical places - e.g. partway through a variable name or string. This is due to a hard limit of 2048 tokens on the context length (prompt + suggestion).
## 👏 Acknowledgements
- This project would not have been possible without [Georgi Gerganov's work on GGML and llama.cpp](https://github.com/ggerganov/ggml).
- It was completely inspired by [fauxpilot](https://github.com/fauxpilot/fauxpilot), which I experimented with for a little while before deciding to try to make the models work without a GPU.
- The frontend of the project is powered by [Venthe's vscode-fauxpilot plugin](https://github.com/Venthe/vscode-fauxpilot).
- The project uses the [Salesforce Codegen](https://github.com/salesforce/CodeGen) models.
- Thanks to [Moyix](https://huggingface.co/moyix) for his work on converting the Salesforce models to run in a GPT-J architecture. Not only does this [confer some speed benefits](https://gist.github.com/moyix/7896575befbe1b99162ccfec8d135566) but it also made it much easier for me to port the models to GGML using the [existing gpt-j example code](https://github.com/ggerganov/ggml/tree/master/examples/gpt-j).
- The model server uses [CrowCPP](https://crowcpp.org/master/) to serve suggestions.
- Check out the [original scientific paper](https://arxiv.org/pdf/2203.13474.pdf) for CodeGen for more info.