Turbopilot is an open source large-language-model based code completion engine that runs locally on CPU
Go to file
2023-05-14 09:29:57 +01:00
.github/workflows add cuda to cmake args 2023-05-14 09:29:57 +01:00
.vscode attempt arm builD 2023-04-15 10:17:54 +01:00
assets update readme and changelog with vscode plugin that has progress notifier 2023-04-15 12:29:09 +01:00
ggml@d3b152bf2d reset batch size in ggml to 8 2023-04-23 18:35:17 +01:00
models add models readme 2023-04-10 09:41:09 +01:00
.dockerignore add docker build stuff 2023-04-10 08:51:48 +01:00
.gitmodules add ggml 2023-04-09 17:49:03 +01:00
BUILD.md add build instructions for blas and cublas 2023-04-23 08:38:17 +01:00
CHANGELOG.md update changelog to reflect changes 2023-05-08 10:29:18 +01:00
convert-codegen-to-ggml.py Set UTF-8 encoding on vocab.json 2023-04-12 19:29:53 +01:00
Dockerfile.cuda update batch size arg in run.sh 2023-05-08 14:24:49 +01:00
Dockerfile.default update batch size arg in run.sh 2023-05-08 14:24:49 +01:00
LICENSE.md add readme and license 2023-04-10 08:16:12 +01:00
README.md add cuda stuff to readme 2023-05-08 14:21:18 +01:00
requirements.txt add requirements file for python 2023-04-10 09:31:54 +01:00
run.sh update batch size arg in run.sh 2023-05-08 14:24:49 +01:00

TurboPilot 🚀

Mastodon Follow BSD Licensed Time Spent

TurboPilot is a self-hosted copilot clone which uses the library behind llama.cpp to run the 6 Billion Parameter Salesforce Codegen model in 4GiB of RAM. It is heavily based and inspired by on the fauxpilot project.

NB: This is a proof of concept right now rather than a stable tool. Autocompletion is quite slow in this version of the project. Feel free to play with it, but your mileage may vary.

a screen recording of turbopilot running through fauxpilot plugin

NEW: As of v0.0.5 turbopilot supports cuda inference which greatly accelerates suggestions when working with longer prompts (i.e. longer existing code files).

🤝 Contributing

PRs to this project and the corresponding GGML fork are very welcome.

Make a fork, make your changes and then open a PR.

👋 Getting Started

The easiest way to try the project out is to grab the pre-processed models and then run the server in docker.

Getting The Models

You have 2 options for getting the model

Option A: Direct Download - Easy, Quickstart

You can download the pre-converted, pre-quantized models from Huggingface.

The multi flavour models can provide auto-complete suggestions for C, C++, Go, Java, JavaScript, and Python.

The mono flavour models can provide auto-complete suggestions for Python only (but the quality of Python-specific suggestions may be higher).

Pre-converted and pre-quantized models are available for download from here:

Model Name RAM Requirement Supported Languages Direct Download HF Project Link
CodeGen 350M multi ~800MiB C, C++, Go, Java, JavaScript, Python ⬇️ 🤗
CodeGen 350M mono ~800MiB Python ⬇️ 🤗
CodeGen 2B multi ~4GiB C, C++, Go, Java, JavaScript, Python ⬇️ 🤗
CodeGen 2B mono ~4GiB Python ⬇️ 🤗
CodeGen 6B multi ~8GiB C, C++, Go, Java, JavaScript, Python ⬇️ 🤗
CodeGen 6B mono ~8GiB Python ⬇️ 🤗

Option B: Convert The Models Yourself - Hard, More Flexible

Follow this guide if you want to experiment with quantizing the models yourself.

⚙️ Running TurboPilot Server

Download the latest binary and extract it to the root project folder. If a binary is not provided for your OS or you'd prefer to build it yourself follow the build instructions

Run:

./codegen-serve -m ./models/codegen-6B-multi-ggml-4bit-quant.bin

The application should start a server on port 18080

If you have a multi-core system you can control how many CPUs are used with the -t option - for example, on my AMD Ryzen 5000 which has 6 cores/12 threads I use:

./codegen-serve -t 6 -m ./models/codegen-6B-multi-ggml-4bit-quant.bin

📦 Running From Docker

You can also run Turbopilot from the pre-built docker image supplied here

You will still need to download the models separately, then you can run:

docker run --rm -it \
  -v ./models:/models \
  -e THREADS=6 \
  -e MODEL="/models/codegen-2B-multi-ggml-4bit-quant.bin" \
  -p 18080:18080 \
  ghcr.io/ravenscroftj/turbopilot:latest

Docker and CUDA

As of release v0.0.5 turbocode now supports CUDA inference. In order to run the cuda-enabled container you will need to have nvidia-docker enabled, use the cuda tagged versions and pass in --gpus=all to docker with access to your GPU like so:

docker run --gpus=all --rm -it \
  -v ./models:/models \
  -e THREADS=6 \
  -e MODEL="/models/codegen-2B-multi-ggml-4bit-quant.bin" \
  -p 18080:18080 \
  ghcr.io/ravenscroftj/turbopilot:v0.0.5-cuda

You should be able to see /app/codegen-serve listed when you run nvidia-smi.

🌐 Using the API

Support for the official Copilot Plugin

Support for the official VS Code copilot plugin is underway (See ticket #11). The API should now be broadly compatible with OpenAI.

Using the API with FauxPilot Plugin

To use the API from VSCode, I recommend the vscode-fauxpilot plugin. Once you install it, you will need to change a few settings in your settings.json file.

  • Open settings (CTRL/CMD + SHIFT + P) and select Preferences: Open User Settings (JSON)
  • Add the following values:
{
    ... // other settings

    "fauxpilot.enabled": true,
    "fauxpilot.server": "http://localhost:18080/v1/engines",
}

Now you can enable fauxpilot with CTRL + SHIFT + P and select Enable Fauxpilot

The plugin will send API calls to the running codegen-serve process when you make a keystroke. It will then wait for each request to complete before sending further requests.

Calling the API Directly

You can make requests to http://localhost:18080/v1/engines/codegen/completions which will behave just like the same Copilot endpoint.

For example:

curl --request POST \
  --url http://localhost:18080/v1/engines/codegen/completions \
  --header 'Content-Type: application/json' \
  --data '{
 "model": "codegen",
 "prompt": "def main():",
 "max_tokens": 100
}'

Should get you something like this:

{
 "choices": [
  {
   "logprobs": null,
   "index": 0,
   "finish_reason": "length",
   "text": "\n  \"\"\"Main entry point for this script.\"\"\"\n  logging.getLogger().setLevel(logging.INFO)\n  logging.basicConfig(format=('%(levelname)s: %(message)s'))\n\n  parser = argparse.ArgumentParser(\n      description=__doc__,\n      formatter_class=argparse.RawDescriptionHelpFormatter,\n      epilog=__doc__)\n  "
  }
 ],
 "created": 1681113078,
 "usage": {
  "total_tokens": 105,
  "prompt_tokens": 3,
  "completion_tokens": 102
 },
 "object": "text_completion",
 "model": "codegen",
 "id": "01d7a11b-f87c-4261-8c03-8c78cbe4b067"
}

👉 Known Limitations

Again I want to set expectations around this being a proof-of-concept project. With that in mind. Here are some current known limitations.

As of v0.0.2:

  • The models can be quite slow - especially the 6B ones. It can take ~30-40s to make suggestions across 4 CPU cores.
  • I've only tested the system on Ubuntu 22.04 but I am now supplying ARM docker images and soon I'll be providing ARM binary releases.
  • Sometimes suggestions get truncated in nonsensical places - e.g. part way through a variable name or string name. This is due to a hard limit of 2048 on the context length (prompt + suggestion).

👏 Acknowledgements