Turbopilot is an open source large-language-model based code completion engine that runs locally on CPU
James Ravenscroft d79e05bef8 update readme
2023-04-10 10:45:47 +01:00

TurboPilot

TurboPilot is a self-hosted copilot clone which uses the library behind llama.cpp to run the 6 billion parameter Salesforce Codegen model in 4GiB of RAM. It is heavily based on and inspired by the fauxpilot project.

NB: This is a proof of concept right now rather than a stable tool. Autocompletion is quite slow in this version of the project. Feel free to play with it, but your mileage may vary.

Screen recording: TurboPilot running through the fauxpilot plugin.

Getting Started

The easiest way to try the project out is to grab the pre-processed models and then run the server in Docker.

Getting The Models

You have two options for getting the models:

Option A: Direct Download - Easy, Quickstart

You can download the pre-converted, pre-quantized models from Google Drive. I've made the multi-flavour models with 2B and 6B parameters available. These models are pre-trained on C, C++, Go, Java, JavaScript, and Python.

Option B: Convert The Models Yourself - Hard, More Flexible

Follow this guide if you want to experiment with quantizing the models yourself.

Running TurboPilot Server

Download the latest binary and extract it to the root project folder. If a binary is not provided for your OS, or you'd prefer to build it yourself, follow the build instructions.

Run:

./codegen-serve -m ./models/codegen-6B-multi-ggml-4bit-quant.bin

The application should start a server on port 18080.

If you have a multi-core system, you can control how many CPUs are used with the -t option. For example, on my AMD Ryzen 5000, which has 6 cores/12 threads, I use:

./codegen-serve -t 6 -m ./models/codegen-6B-multi-ggml-4bit-quant.bin
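
As a rough sketch of picking a value for -t, the snippet below builds the codegen-serve command line from Python. It assumes, as in the example above, that one thread per physical core is a sensible starting point, and it approximates the physical core count as half of os.cpu_count(), which only holds on CPUs with two hardware threads per core. The helper names are hypothetical; only the -t and -m flags come from the commands above.

```python
import os


def guess_thread_count(logical_cpus=None):
    """Approximate physical cores as half the logical CPUs (assumes 2-way SMT)."""
    if logical_cpus is None:
        logical_cpus = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, logical_cpus // 2)


def build_serve_command(model_path, threads):
    """Return the argument list for codegen-serve using the -t and -m flags."""
    return ["./codegen-serve", "-t", str(threads), "-m", model_path]


if __name__ == "__main__":
    cmd = build_serve_command(
        "./models/codegen-6B-multi-ggml-4bit-quant.bin",
        guess_thread_count(),
    )
    print(" ".join(cmd))
```

Treat the guessed value as a starting point and adjust it by hand if completions are slower than expected on your machine.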

Using the API

Using the API with FauxPilot Plugin

To use the API from VSCode, I recommend the vscode-fauxpilot plugin. Once you install it, you will need to change a few settings in your settings.json file.

  • Open settings (CTRL/CMD + SHIFT + P) and select Preferences: Open User Settings (JSON)
  • Add the following values:
{
    ... // other settings

    "fauxpilot.enabled": true,
    "fauxpilot.server": "http://localhost:18080/v1/engines",
}

Now you can enable fauxpilot by pressing CTRL + SHIFT + P and selecting Enable Fauxpilot.

The plugin will send API calls to the running codegen-serve process when you make a keystroke. It will then wait for each request to complete before sending further requests.

Calling the API Directly

You can make requests to http://localhost:18080/v1/engines/codegen/completions, which will behave just like the equivalent Copilot endpoint.

For example:

curl --request POST \
  --url http://localhost:18080/v1/engines/codegen/completions \
  --header 'Content-Type: application/json' \
  --data '{
 "model": "codegen",
 "prompt": "def main():",
 "max_tokens": 100
}'

This should get you something like this:

{
 "choices": [
  {
   "logprobs": null,
   "index": 0,
   "finish_reason": "length",
   "text": "\n  \"\"\"Main entry point for this script.\"\"\"\n  logging.getLogger().setLevel(logging.INFO)\n  logging.basicConfig(format=('%(levelname)s: %(message)s'))\n\n  parser = argparse.ArgumentParser(\n      description=__doc__,\n      formatter_class=argparse.RawDescriptionHelpFormatter,\n      epilog=__doc__)\n  "
  }
 ],
 "created": 1681113078,
 "usage": {
  "total_tokens": 105,
  "prompt_tokens": 3,
  "completion_tokens": 102
 },
 "object": "text_completion",
 "model": "codegen",
 "id": "01d7a11b-f87c-4261-8c03-8c78cbe4b067"
}
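
The same request can be made from Python using only the standard library. This is a minimal sketch: the URL, request body and response fields are taken from the curl example and the sample response above, while the function names and the 120 second timeout are assumptions chosen for illustration (the timeout is generous because completions can be slow).

```python
import json
import urllib.request

# Endpoint taken from the curl example; adjust if your server runs elsewhere.
COMPLETIONS_URL = "http://localhost:18080/v1/engines/codegen/completions"


def build_request(prompt, max_tokens=100, url=COMPLETIONS_URL):
    """Build a POST request with the same JSON body as the curl example."""
    body = json.dumps(
        {"model": "codegen", "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_completion(response):
    """Return the text of the first choice, or an empty string if there is none."""
    choices = response.get("choices") or []
    return choices[0].get("text", "") if choices else ""


def complete(prompt, max_tokens=100, timeout=120):
    """Send the prompt to a running codegen-serve and return the suggested text."""
    request = build_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return first_completion(json.load(resp))


if __name__ == "__main__":
    # Requires codegen-serve to be running on port 18080.
    print(complete("def main():"))
```

Keeping request building and response parsing separate from the network call makes both easy to check without a running server.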

Known Limitations

Again, I want to set expectations around this being a proof-of-concept project. With that in mind, here are some current known limitations.

As of v0.0.1:

  • The models can be quite slow - especially the 6B ones. It can take ~30-40s to make suggestions across 4 CPU cores.
  • I've only tested the system on Ubuntu 22.04. Your mileage may vary on other operating systems. Please let me know if you try it elsewhere. I'm particularly interested in performance on Apple Silicon.
  • Sometimes suggestions get truncated in nonsensical places - e.g. part way through a variable name or string name. This is due to a hard limit on suggestion length.
  • Sometimes the server will run out of memory and crash. This is because it will try to use everything above your current location as context during generation. I'm working on a fix.
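
A possible client-side mitigation for the out-of-memory limitation, if you are calling the API directly, is to cap how much of the file is sent as the prompt. The sketch below keeps only the tail of the text above the cursor. The function name and the 64-line and 2000-character limits are arbitrary assumptions for illustration, not values used by TurboPilot or the fauxpilot plugin, and this is not the fix mentioned above.

```python
def truncate_context(text_before_cursor, max_lines=64, max_chars=2000):
    """Keep only the tail of the text above the cursor to bound the prompt size."""
    if max_lines <= 0 or max_chars <= 0:
        return ""
    lines = text_before_cursor.splitlines(keepends=True)
    tail = "".join(lines[-max_lines:])
    # If the kept lines are still too long, drop characters from the front.
    return tail[-max_chars:]


if __name__ == "__main__":
    source = "".join(f"line {i}\n" for i in range(100))
    print(truncate_context(source, max_lines=3))
```

Trimming from the front keeps the code nearest the cursor, which is usually the most relevant context for a completion, at the cost of losing earlier definitions.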

Acknowledgements