mirror of https://github.com/ravenscroftj/turbopilot.git synced 2024-10-01 01:06:01 -04:00

Turbopilot is an open source large-language-model based code completion engine that runs locally on CPU

code-completion cpp language-model machine-learning


James Ravenscroft 8fd357e0a5 Merge pull request #65 from ravenscroftj/ravenscroftj-patch-1 Update README.md		2023-08-26 17:11:21 +01:00


TurboPilot 🚀

TurboPilot is a self-hosted copilot clone which uses the library behind llama.cpp to run the 6 Billion Parameter Salesforce Codegen model in 4GiB of RAM. It is heavily based on and inspired by the fauxpilot project.

NB: This is a proof of concept right now rather than a stable tool. Autocompletion is quite slow in this version of the project. Feel free to play with it, but your mileage may vary.

✨ Now supports StableCode 3B Instruct: simply use TheBloke's quantized GGML models and set -m stablecode.

✨ New: Refactored + Simplified: The source code has been improved to make it easier to extend Turbopilot and add new models. The system now supports multiple flavours of model.

✨ New: Wizardcoder, Starcoder, Santacoder support: Turbopilot now supports state-of-the-art local code completion models which cover more programming languages and provide "fill in the middle" support.

🤝 Contributing

PRs to this project and the corresponding GGML fork are very welcome.

Make a fork, make your changes and then open a PR.

👋 Getting Started

The easiest way to try the project out is to grab the pre-processed models and then run the server in docker.

Getting The Models

You have two options for getting the models:

Option A: Direct Download - Easy, Quickstart

You can download the pre-converted, pre-quantized models from Huggingface.

For low RAM users (4-8 GiB), I recommend StableCode, and for high power users (16+ GiB RAM, discrete GPU or Apple silicon) I recommend WizardCoder.

Turbopilot still supports the first generation codegen models from v0.0.5 and earlier builds, although old models do need to be requantized.

You can find a full catalogue of models in MODELS.md.

Option B: Convert The Models Yourself - Hard, More Flexible

Follow this guide if you want to experiment with quantizing the models yourself.

⚙️ Running TurboPilot Server

Download the latest binary and extract it to the root project folder. If a binary is not provided for your OS, or you'd prefer to build it yourself, follow the build instructions.

Run:

./turbopilot -m starcoder -f ./models/santacoder-q4_0.bin

The application should start a server on port 18080. You can change this with the -p option, but 18080 is the default port that vscode-fauxpilot tries to connect to, so you probably want to leave it alone unless you are sure you know what you're doing.

If you have a multi-core system you can control how many CPUs are used with the -t option - for example, on my AMD Ryzen 5000 which has 6 cores/12 threads I use:

./turbopilot -t 6 -m starcoder -f ./models/santacoder-q4_0.bin

To run the legacy codegen models, just change the model type flag -m to codegen instead.

NOTE: Turbopilot 0.1.0 and newer require you to re-quantize old codegen models from v0.0.5 and older. I am working on providing updated quantized codegen models.
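If you launch the server from a script, the flags described above can be assembled programmatically. The following is a minimal sketch, not part of Turbopilot itself: the helper name turbopilot_command is hypothetical, and it only uses the -m, -f, -t and -p flags mentioned in this README.

```python
import shlex


def turbopilot_command(model_type, model_path, threads=None, port=None,
                       binary="./turbopilot"):
    """Assemble the argument list for launching the server.

    Hypothetical helper: it only knows the flags this README mentions
    (-m model type, -f model file, -t threads, -p port).
    """
    args = [binary, "-m", model_type, "-f", model_path]
    if threads is not None:
        args += ["-t", str(threads)]
    if port is not None:
        args += ["-p", str(port)]
    return args


# Reproduce the 6-thread example from above as a shell-ready string.
cmd = turbopilot_command("starcoder", "./models/santacoder-q4_0.bin", threads=6)
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting issues if your model path contains spaces. Switching to a legacy model is then just a matter of passing "codegen" as the model type.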

📦 Running From Docker

You can also run Turbopilot from the pre-built docker image supplied here

You will still need to download the models separately, then you can run:

docker run --rm -it \
  -v ./models:/models \
  -e THREADS=6 \
  -e MODEL_TYPE=starcoder \
  -e MODEL="/models/santacoder-q4_0.bin" \
  -p 18080:18080 \
  ghcr.io/ravenscroftj/turbopilot:latest

Docker and CUDA

As of release v0.0.5, turbopilot supports CUDA inference. To run the CUDA-enabled container you will need to have nvidia-docker enabled, use the cuda-tagged image versions, and pass --gpus=all to docker so that the container can access your GPU, like so:

docker run --gpus=all --rm -it \
  -v ./models:/models \
  -e THREADS=6 \
  -e MODEL_TYPE=starcoder \
  -e MODEL="/models/santacoder-q4_0.bin" \
  -e GPU_LAYERS=32 \
  -p 18080:18080 \
  ghcr.io/ravenscroftj/turbopilot:v0.2.0-cuda11-7

If you have a big enough GPU then setting GPU_LAYERS will allow turbopilot to fully offload computation onto your GPU rather than copying data backwards and forwards, dramatically speeding up inference.

Swap ghcr.io/ravenscroftj/turbopilot:v0.2.0-cuda11-7 for ghcr.io/ravenscroftj/turbopilot:v0.2.0-cuda12-0 or ghcr.io/ravenscroftj/turbopilot:v0.2.0-cuda12-2 if you are using CUDA 12.0 or 12.2 respectively.

You will need CUDA 11 or CUDA 12 or later to run this container. You should be able to see /app/turbopilot listed when you run nvidia-smi.
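Picking the right image tag for your CUDA version can be scripted. The sketch below is illustrative only: the helper name docker_run_args is hypothetical, the tags are the ones quoted in this section (other releases may use different tags), and mapping the cuda11-7 tag to CUDA 11.7 is an assumption based on the tag name.

```python
IMAGE = "ghcr.io/ravenscroftj/turbopilot"

# Tags copied from the examples in this section.
# Assumption: the "cuda11-7" tag corresponds to CUDA 11.7.
CUDA_TAGS = {
    "11.7": "v0.2.0-cuda11-7",
    "12.0": "v0.2.0-cuda12-0",
    "12.2": "v0.2.0-cuda12-2",
}


def docker_run_args(cuda_version, model_file, model_type="starcoder",
                    threads=6, gpu_layers=32, port=18080):
    """Build the `docker run` argument list for a CUDA-enabled container."""
    if cuda_version not in CUDA_TAGS:
        raise ValueError(f"no known image tag for CUDA {cuda_version}")
    return [
        "docker", "run", "--gpus=all", "--rm", "-it",
        "-v", "./models:/models",
        "-e", f"THREADS={threads}",
        "-e", f"MODEL_TYPE={model_type}",
        "-e", f"MODEL=/models/{model_file}",
        "-e", f"GPU_LAYERS={gpu_layers}",
        "-p", f"{port}:{port}",
        f"{IMAGE}:{CUDA_TAGS[cuda_version]}",
    ]


args = docker_run_args("12.2", "santacoder-q4_0.bin")
print(" ".join(args))
```

Raising an error for an unknown CUDA version, rather than guessing a tag, keeps the script from pulling an image that does not match your driver.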

Executable and CUDA

As of v0.0.5 a CUDA version of the Linux executable is available. It requires libcublas 11 to be installed on the machine. I might build Ubuntu debs at some point, but for now running in Docker may be more convenient if you want to use a CUDA GPU.

You can use GPU offloading via the --ngl option.

🌐 Using the API

Support for the official Copilot Plugin

Support for the official VS Code copilot plugin is underway (See ticket #11). The API should now be broadly compatible with OpenAI.

Using the API with FauxPilot Plugin

To use the API from VSCode, I recommend the vscode-fauxpilot plugin. Once you install it, you will need to change a few settings in your settings.json file.

Open settings (CTRL/CMD + SHIFT + P) and select Preferences: Open User Settings (JSON)
Add the following values:

{
    ... // other settings

    "fauxpilot.enabled": true,
    "fauxpilot.server": "http://localhost:18080/v1/engines",
}

Now you can enable fauxpilot with CTRL + SHIFT + P and select Enable Fauxpilot

The plugin will send API calls to the running turbopilot server process when you make a keystroke. It will then wait for each request to complete before sending further requests.

Calling the API Directly

You can make requests to http://localhost:18080/v1/engines/codegen/completions, which will behave just like the equivalent Copilot endpoint.

For example:

curl --request POST \
  --url http://localhost:18080/v1/engines/codegen/completions \
  --header 'Content-Type: application/json' \
  --data '{
 "model": "codegen",
 "prompt": "def main():",
 "max_tokens": 100
}'

This should return something like:

{
 "choices": [
  {
   "logprobs": null,
   "index": 0,
   "finish_reason": "length",
   "text": "\n  \"\"\"Main entry point for this script.\"\"\"\n  logging.getLogger().setLevel(logging.INFO)\n  logging.basicConfig(format=('%(levelname)s: %(message)s'))\n\n  parser = argparse.ArgumentParser(\n      description=__doc__,\n      formatter_class=argparse.RawDescriptionHelpFormatter,\n      epilog=__doc__)\n  "
  }
 ],
 "created": 1681113078,
 "usage": {
  "total_tokens": 105,
  "prompt_tokens": 3,
  "completion_tokens": 102
 },
 "object": "text_completion",
 "model": "codegen",
 "id": "01d7a11b-f87c-4261-8c03-8c78cbe4b067"
}
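The same call can be made from Python using only the standard library. This is a minimal sketch that mirrors the curl example above; the function names are hypothetical, and it assumes the server is running locally on the default port with the response shaped like the sample shown.

```python
import json
import urllib.request

# Assumption: a server is running locally on the default port 18080.
BASE_URL = "http://localhost:18080/v1/engines/codegen/completions"


def build_request(prompt, max_tokens=100, model="codegen", url=BASE_URL):
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_completion(response_body):
    """Return the text of the first choice from a JSON response string."""
    parsed = json.loads(response_body)
    choices = parsed.get("choices", [])
    return choices[0]["text"] if choices else ""


def complete(prompt, max_tokens=100, timeout=60):
    """Send the request and return the completion text (needs a running server)."""
    request = build_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return first_completion(resp.read().decode("utf-8"))
```

With the server running, complete("def main():") should return the generated continuation as a string. Inference on CPU can be slow, so the timeout is deliberately generous.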

👉 Known Limitations

Currently Turbopilot only supports one GPU device at a time (it will not try to make use of multiple devices).

👏 Acknowledgements

This project would not have been possible without Georgi Gerganov's work on GGML and llama.cpp.
It was completely inspired by fauxpilot, which I experimented with for a little while, but I wanted to try to make the models work without a GPU.
The frontend of the project is powered by Venthe's vscode-fauxpilot plugin.
The project uses the Salesforce Codegen models.
Thanks to Moyix for his work on converting the Salesforce models to run in a GPT-J architecture. Not only does this confer some speed benefits but it also made it much easier for me to port the models to GGML using the existing gpt-j example code
The model server uses CrowCPP to serve suggestions.
Check out the original scientific paper for CodeGen for more info.