# llama.cpp

llama.cpp is the best backend in two important scenarios:

1) You don't have a GPU.
2) You want to run a model that doesn't fit into your GPU's memory (VRAM).

## Setting up the models

#### Pre-converted

Download a GGUF model directly into your `text-generation-webui/models` folder. Each model is a single file.

* Make sure its name ends in `.gguf`.
* `q4_K_M` quantization is recommended.
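
As a quick sanity check on a downloaded file, the sketch below tests both the `.gguf` extension and the 4-byte ASCII magic `GGUF` that GGUF files start with. The helper name `is_gguf` is hypothetical and not part of text-generation-webui or llama.cpp; it only guards against renamed or truncated downloads and does not validate the rest of the file.

```shell
# Hypothetical helper (not part of any tool mentioned here).
# Returns 0 if the name ends in .gguf AND the file starts with "GGUF".
is_gguf() {
    file="$1"

    # Check the extension first.
    case "$file" in
        *.gguf) ;;
        *) return 1 ;;
    esac

    # The file must exist and be a regular file.
    [ -f "$file" ] || return 1

    # Read the first 4 bytes; strip NUL bytes so the shell does not
    # complain when the file turns out to be some other binary format.
    magic=$(head -c 4 "$file" | tr -d '\0')
    [ "$magic" = "GGUF" ]
}

# Example usage (path is a placeholder):
#   is_gguf text-generation-webui/models/my-model.q4_K_M.gguf && echo "looks like GGUF"
```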

#### Convert Llama yourself

Follow the instructions in the llama.cpp README to generate a GGUF: https://github.com/ggerganov/llama.cpp#prepare-data--run

## GPU acceleration

GPU acceleration is enabled with the `--n-gpu-layers` parameter, which sets how many model layers are offloaded to the GPU.

* If you have enough VRAM, use a high number like `--n-gpu-layers 1000` to offload all layers to the GPU.
* Otherwise, start with a low number like `--n-gpu-layers 10` and gradually increase it until you run out of memory, then go back to the last value that worked.
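
The trial-and-error approach above can be sketched as a list of launch commands to try one at a time. This section does not show the launch command itself, so `python server.py` and the `--model` flag are assumptions about how the web UI is started, and `my-model.q4_K_M.gguf` is a placeholder file name. The helper `launch_cmd` is hypothetical and only prints commands; it does not run them.

```shell
# Hypothetical helper: build (but do not run) a launch command for a
# given model file and number of GPU layers.
# Assumption: the web UI is started with `python server.py --model ...`.
launch_cmd() {
    model="$1"
    layers="$2"
    printf 'python server.py --model %s --n-gpu-layers %s\n' "$model" "$layers"
}

# Print candidate commands with increasing layer counts. Run them one
# at a time and stop at the last value that loads without running out
# of memory.
for n in 10 20 30 40; do
    launch_cmd "my-model.q4_K_M.gguf" "$n"
done
```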

This feature works out of the box for NVIDIA GPUs on Linux (amd64) or Windows. For other GPUs, you need to uninstall `llama-cpp-python` with

```
pip uninstall -y llama-cpp-python
```

and then recompile it using the commands here: https://pypi.org/project/llama-cpp-python/

#### macOS

On macOS, uninstall the package and reinstall it with Metal support enabled:

```
pip uninstall -y llama-cpp-python
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir
```