mirror of
https://github.com/oobabooga/text-generation-webui.git
synced 2024-10-01 01:26:03 -04:00
Streamline GPTQ-for-LLaMa support (#3526 from jllllll/gptqllama)
Commit e3d2ddd170
@@ -280,9 +280,6 @@ Optionally, you can use the following command-line flags:
 | `--pre_layer PRE_LAYER [PRE_LAYER ...]` | The number of layers to allocate to the GPU. Setting this parameter enables CPU offloading for 4-bit models. For multi-gpu, write the numbers separated by spaces, eg `--pre_layer 30 60`. |
 | `--checkpoint CHECKPOINT` | The path to the quantized checkpoint file. If not specified, it will be automatically detected. |
 | `--monkey-patch` | Apply the monkey patch for using LoRAs with quantized models. |
-| `--quant_attn` | (triton) Enable quant attention. |
-| `--warmup_autotune` | (triton) Enable warmup autotune. |
-| `--fused_mlp` | (triton) Enable fused mlp. |

 #### DeepSpeed
@@ -64,59 +64,19 @@ python server.py --autogptq --gpu-memory 3000MiB 6000MiB --model model_name

 ### Using LoRAs with AutoGPTQ

-Not supported yet.
+Works fine for a single LoRA.

 ## GPTQ-for-LLaMa

 GPTQ-for-LLaMa is the original adaptation of GPTQ for the LLaMA model. It was made possible by [@qwopqwop200](https://github.com/qwopqwop200/GPTQ-for-LLaMa): https://github.com/qwopqwop200/GPTQ-for-LLaMa

-Different branches of GPTQ-for-LLaMa are currently available, including:
+A Python package containing both major CUDA versions of GPTQ-for-LLaMa is used to simplify installation and compatibility: https://github.com/jllllll/GPTQ-for-LLaMa-CUDA

-| Branch | Comment |
-|----|----|
-| [Old CUDA branch (recommended)](https://github.com/oobabooga/GPTQ-for-LLaMa/) | The fastest branch, works on Windows and Linux. |
-| [Up-to-date triton branch](https://github.com/qwopqwop200/GPTQ-for-LLaMa) | Slightly more precise than the old CUDA branch from 13b upwards, significantly more precise for 7b. 2x slower for small context size and only works on Linux. |
-| [Up-to-date CUDA branch](https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda) | As precise as the up-to-date triton branch, 10x slower than the old cuda branch for small context size. |
-
-Overall, I recommend using the old CUDA branch. It is included by default in the one-click-installer for this web UI.
-
-### Installation
-
-Start by cloning GPTQ-for-LLaMa into your `text-generation-webui/repositories` folder:
-
-```
-mkdir repositories
-cd repositories
-git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda
-```
-
-If you want to use the up-to-date CUDA or triton branches instead of the old CUDA branch, use these commands:
-
-```
-git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b cuda
-```
-
-```
-git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b triton
-```
-
-Next you need to install the CUDA extensions. You can do that either by installing the precompiled wheels, or by compiling the wheels yourself.

 ### Precompiled wheels

-Kindly provided by our friend jllllll: https://github.com/jllllll/GPTQ-for-LLaMa-Wheels
+Kindly provided by our friend jllllll: https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/releases

-Windows:
+Wheels are included in requirements.txt and are installed with the webui on supported systems.

-```
-pip install https://github.com/jllllll/GPTQ-for-LLaMa-Wheels/raw/main/quant_cuda-0.0.0-cp310-cp310-win_amd64.whl
-```
-
-Linux:
-
-```
-pip install https://github.com/jllllll/GPTQ-for-LLaMa-Wheels/raw/Linux-x64/quant_cuda-0.0.0-cp310-cp310-linux_x86_64.whl
-```

 ### Manual installation
@@ -124,20 +84,19 @@ pip install https://github.com/jllllll/GPTQ-for-LLaMa-Wheels/raw/Linux-x64/quant

 ```
 conda activate textgen
-conda install -c conda-forge cudatoolkit-dev
+conda install cuda -c nvidia/label/cuda-11.7.1
 ```

 The command above takes some 10 minutes to run and shows no progress bar or updates along the way.

-You are also going to need to have a C++ compiler installed. On Linux, `sudo apt install build-essential` or equivalent is enough.
+You are also going to need to have a C++ compiler installed. On Linux, `sudo apt install build-essential` or equivalent is enough. On Windows, Visual Studio or Visual Studio Build Tools is required.

-If you're using an older version of CUDA toolkit (e.g. 11.7) but the latest version of `gcc` and `g++` (12.0+), you should downgrade with: `conda install -c conda-forge gxx==11.3.0`. Kernel compilation will fail otherwise.
+If you're using an older version of CUDA toolkit (e.g. 11.7) but the latest version of `gcc` and `g++` (12.0+) on Linux, you should downgrade with: `conda install -c conda-forge gxx==11.3.0`. Kernel compilation will fail otherwise.

 #### Step 2: compile the CUDA extensions

 ```
-cd repositories/GPTQ-for-LLaMa
-python setup_cuda.py install
+python -m pip install git+https://github.com/jllllll/GPTQ-for-LLaMa-CUDA -v
 ```

 ### Getting pre-converted LLaMA weights
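As a quick aside, the C++ compiler requirement for the compile step can be sanity-checked from Python before attempting the build. A minimal sketch (the candidate compiler names are illustrative assumptions, not part of the webui):

```python
import shutil


def find_cxx_compiler():
    # Look for a usable C++ compiler on PATH: g++/clang++ on Linux,
    # cl.exe (Visual Studio Build Tools) on Windows. Illustrative only.
    for name in ("g++", "clang++", "cl"):
        path = shutil.which(name)
        if path:
            return name, path
    return None
```

If this returns `None`, kernel compilation will fail with a missing-compiler error, which is the case `sudo apt install build-essential` (or the Visual Studio Build Tools installer) addresses.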
@@ -1,6 +1,5 @@
 import inspect
 import re
-import sys
 from pathlib import Path

 import accelerate
@@ -11,26 +10,9 @@ from transformers import AutoConfig, AutoModelForCausalLM
 import modules.shared as shared
 from modules.logging_colors import logger

-sys.path.insert(0, str(Path("repositories/GPTQ-for-LLaMa")))
-
-try:
-    import llama_inference_offload
-except ImportError:
-    logger.error('Failed to load GPTQ-for-LLaMa')
-    logger.error('See https://github.com/oobabooga/text-generation-webui/blob/main/docs/GPTQ-models-(4-bit-mode).md')
-    sys.exit(-1)
-
-try:
-    from modelutils import find_layers
-except ImportError:
-    from utils import find_layers
-
-try:
-    from quant import make_quant
-    is_triton = False
-except ImportError:
-    import quant
-    is_triton = True
+from gptq_for_llama import llama_inference_offload
+from gptq_for_llama.modelutils import find_layers
+from gptq_for_llama.quant import make_quant


 # This function is a replacement for the load_quant function in the
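The old loader resolved its imports through try/except fallback chains (`modelutils`, then `utils`), which this commit replaces with direct `gptq_for_llama` package imports. The removed pattern can be sketched generically with stdlib modules (the helper name and module names here are illustrative, not from the codebase):

```python
import importlib


def import_first(*names):
    # Try each candidate module in order and return the first one that
    # imports, mirroring the fallback chain the old loader used.
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError(f"none of {names} could be imported")


# The first name is absent here, so the fallback is used:
mod = import_first("modelutils", "json")
```

Packaging GPTQ-for-LLaMa as an installable module makes this machinery unnecessary: a plain `from gptq_for_llama.modelutils import find_layers` either works or fails loudly at startup.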
@@ -59,7 +41,6 @@ def _load_quant(model, checkpoint, wbits, groupsize=-1, faster_kernel=False, exc
         if name in layers:
             del layers[name]

-    if not is_triton:
     gptq_args = inspect.getfullargspec(make_quant).args

     make_quant_kwargs = {
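The `inspect.getfullargspec(make_quant).args` line above is used to filter keyword arguments down to what the target function actually declares, so optional settings don't crash GPTQ-for-LLaMa versions that lack them. A self-contained sketch of that trick (`make_quant_stub` is a hypothetical stand-in for the real `make_quant`):

```python
import inspect


def call_with_supported_kwargs(func, **kwargs):
    # Keep only the keyword arguments that func declares; unknown
    # options are silently dropped instead of raising TypeError.
    supported = inspect.getfullargspec(func).args
    return func(**{k: v for k, v in kwargs.items() if k in supported})


def make_quant_stub(model, wbits, groupsize=-1):
    # Hypothetical stand-in for make_quant, for illustration only.
    return (model, wbits, groupsize)
```

With this helper, passing `kernel_switch_threshold=128` to a stub that doesn't accept it simply drops the option rather than failing, which is exactly why the loader builds `make_quant_kwargs` this way.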
@@ -75,8 +56,6 @@ def _load_quant(model, checkpoint, wbits, groupsize=-1, faster_kernel=False, exc
         make_quant_kwargs['kernel_switch_threshold'] = kernel_switch_threshold

     make_quant(**make_quant_kwargs)
-    else:
-        quant.make_quant_linear(model, layers, wbits, groupsize)

     del layers
     if checkpoint.endswith('.safetensors'):
@@ -85,18 +64,6 @@ def _load_quant(model, checkpoint, wbits, groupsize=-1, faster_kernel=False, exc
     else:
         model.load_state_dict(torch.load(checkpoint), strict=False)

-    if is_triton:
-        if shared.args.quant_attn:
-            quant.make_quant_attn(model)
-
-        if eval and shared.args.fused_mlp:
-            quant.make_fused_mlp(model)
-
-        if shared.args.warmup_autotune:
-            quant.autotune_warmup_linear(model, transpose=not eval)
-            if eval and shared.args.fused_mlp:
-                quant.autotune_warmup_fused(model)
-
     model.seqlen = 2048
     return model
@@ -138,9 +138,6 @@ parser.add_argument('--groupsize', type=int, default=-1, help='Group size.')
 parser.add_argument('--pre_layer', type=int, nargs="+", help='The number of layers to allocate to the GPU. Setting this parameter enables CPU offloading for 4-bit models. For multi-gpu, write the numbers separated by spaces, eg --pre_layer 30 60.')
 parser.add_argument('--checkpoint', type=str, help='The path to the quantized checkpoint file. If not specified, it will be automatically detected.')
 parser.add_argument('--monkey-patch', action='store_true', help='Apply the monkey patch for using LoRAs with quantized models.')
-parser.add_argument('--quant_attn', action='store_true', help='(triton) Enable quant attention.')
-parser.add_argument('--warmup_autotune', action='store_true', help='(triton) Enable warmup autotune.')
-parser.add_argument('--fused_mlp', action='store_true', help='(triton) Enable fused mlp.')

 # AutoGPTQ
 parser.add_argument('--triton', action='store_true', help='Use triton.')
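For reference, `nargs="+"` on `--pre_layer` is what enables the multi-GPU form `--pre_layer 30 60`, and argparse exposes `--monkey-patch` as the attribute `monkey_patch`. A trimmed reconstruction of just these three surviving flags (a sketch, not the webui's full parser):

```python
import argparse

# Minimal sketch of the GPTQ-for-LLaMa flags that remain after this commit.
parser = argparse.ArgumentParser()
parser.add_argument('--pre_layer', type=int, nargs="+",
                    help='Layers to allocate to the GPU(s); enables CPU offloading.')
parser.add_argument('--checkpoint', type=str,
                    help='Path to the quantized checkpoint file.')
parser.add_argument('--monkey-patch', action='store_true',
                    help='Monkey patch for using LoRAs with quantized models.')

# Parse a sample multi-GPU command line.
args = parser.parse_args(['--pre_layer', '30', '60', '--monkey-patch'])
```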
@@ -110,7 +110,7 @@ def create_ui():
 shared.gradio['mlock'] = gr.Checkbox(label="mlock", value=shared.args.mlock)
 shared.gradio['llama_cpp_seed'] = gr.Number(label='Seed (0 for random)', value=shared.args.llama_cpp_seed)
 shared.gradio['trust_remote_code'] = gr.Checkbox(label="trust-remote-code", value=shared.args.trust_remote_code, info='Make sure to inspect the .py files inside the model folder before loading it with this option enabled.')
-shared.gradio['gptq_for_llama_info'] = gr.Markdown('GPTQ-for-LLaMa is currently 2x faster than AutoGPTQ on some systems. It is installed by default with the one-click installers. Otherwise, it has to be installed manually following the instructions here: [instructions](https://github.com/oobabooga/text-generation-webui/blob/main/docs/GPTQ-models-(4-bit-mode).md#installation-1).')
+shared.gradio['gptq_for_llama_info'] = gr.Markdown('GPTQ-for-LLaMa support is currently only kept for compatibility with older GPUs. AutoGPTQ or ExLlama is preferred when compatible. GPTQ-for-LLaMa is installed by default with the webui on supported systems. Otherwise, it has to be installed manually following the instructions here: [instructions](https://github.com/oobabooga/text-generation-webui/blob/main/docs/GPTQ-models-(4-bit-mode).md#installation-1).')
 shared.gradio['exllama_info'] = gr.Markdown('For more information, consult the [docs](https://github.com/oobabooga/text-generation-webui/blob/main/docs/ExLlama.md).')
 shared.gradio['exllama_HF_info'] = gr.Markdown('ExLlama_HF is a wrapper that lets you use ExLlama like a Transformers model, which means it can use the Transformers samplers. It\'s a bit slower than the regular ExLlama.')
 shared.gradio['llamacpp_HF_info'] = gr.Markdown('llamacpp_HF is a wrapper that lets you use llama.cpp like a Transformers model, which means it can use the Transformers samplers. To use it, make sure to first download oobabooga/llama-tokenizer under "Download custom model or LoRA".')
@@ -36,3 +36,7 @@ https://github.com/abetlen/llama-cpp-python/releases/download/v0.1.77/llama_cpp_
 # llama-cpp-python with CUDA support
 https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.1.77+cu117-cp310-cp310-win_amd64.whl; platform_system == "Windows"
 https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.1.77+cu117-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
+
+# GPTQ-for-LLaMa
+https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/releases/download/0.1.0/gptq_for_llama-0.1.0+cu117-cp310-cp310-win_amd64.whl; platform_system == "Windows"
+https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/releases/download/0.1.0/gptq_for_llama-0.1.0+cu117-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
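The `platform_system`/`platform_machine` suffixes after the `;` in these requirement lines are PEP 508 environment markers: pip evaluates them against the running interpreter and skips wheels whose marker is false, which is how one requirements.txt serves both Windows and Linux. A rough stdlib sketch of that gating (simplified; real marker evaluation is done by pip/packaging):

```python
import platform


def marker_matches(platform_system=None, platform_machine=None):
    # Approximate how pip gates a requirement line on environment markers:
    # every supplied marker must match the current machine.
    ok = True
    if platform_system is not None:
        ok = ok and platform.system() == platform_system
    if platform_machine is not None:
        ok = ok and platform.machine() == platform_machine
    return ok
```

On a Linux x86_64 box, only the `linux_x86_64` wheel's markers evaluate true, so only that wheel is installed.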