diff --git a/docs/Custom-chat-characters.md b/docs/Custom-chat-characters.md new file mode 100644 index 00000000..b25a2294 --- /dev/null +++ b/docs/Custom-chat-characters.md @@ -0,0 +1,30 @@ +Custom chat mode characters are defined by `.yaml` files inside the `characters` folder. An example is included: [Example.yaml](https://github.com/oobabooga/text-generation-webui/blob/main/characters/Example.yaml) + +The following fields may be defined: + +| Field | Description | +|-------|-------------| +| `name` | The character's name. | +| `context` | A string that appears at the top of the prompt. It usually contains a description of the character's personality. | +| `greeting` (optional) | The character's opening message when a new conversation is started. | +| `example_dialogue` (optional) | A few example messages to guide the model. | +| `your_name` (optional) | Your name. This overwrites what you had previously written in the `Your name` field in the interface. | + +#### Special tokens + +* `{{char}}` or ``: are replaced with the character's name +* `{{user}}` or ``: are replaced with your name + +These replacements happen when the character is loaded, and they apply to the `context`, `greeting`, and `example_dialogue` fields. + +#### How do I add a profile picture for my character? + +Put an image with the same name as your character's yaml file into the `characters` folder. For example, if your bot is `Character.yaml`, add `Character.jpg` or `Character.png` to the folder. + +#### Is the chat history truncated in the prompt? + +Once your prompt reaches the 2048 token limit, old messages will be removed one at a time. The context string will always stay at the top of the prompt and will never get truncated. + +#### Pygmalion format characters + +These are also supported out of the box. Simply put the JSON file in the `characters` folder, or upload it directly from the web UI by clicking on the "Upload character" tab at the bottom. \ No newline at end of file diff --git a/docs/DeepSpeed.md b/docs/DeepSpeed.md new file mode 100644 index 00000000..70cd8151 --- /dev/null +++ b/docs/DeepSpeed.md @@ -0,0 +1,23 @@ +An alternative way of reducing the GPU memory usage of models is to use the `DeepSpeed ZeRO-3` optimization. + +With this, I have been able to load a 6b model (GPT-J 6B) with less than 6GB of VRAM. The speed of text generation is very decent and much better than what would be accomplished with `--auto-devices --gpu-memory 6`. + +As far as I know, DeepSpeed is only available for Linux at the moment. + +### How to use it + +1. Install DeepSpeed: + +``` +pip install deepspeed +``` + +2. Start the web UI replacing `python` with `deepspeed --num_gpus=1` and adding the `--deepspeed` flag. Example: + +``` +deepspeed --num_gpus=1 server.py --deepspeed --chat --model gpt-j-6B +``` + +### Learn more + +For more information, check out [this comment](https://github.com/oobabooga/text-generation-webui/issues/40#issuecomment-1412038622) by 81300, who came up with the DeepSpeed support in this web UI. \ No newline at end of file diff --git a/docs/Extensions.md b/docs/Extensions.md new file mode 100644 index 00000000..184ace55 --- /dev/null +++ b/docs/Extensions.md @@ -0,0 +1,157 @@ +This web UI supports extensions. They are simply files under + +``` +extensions/your_extension_name/script.py +``` + +which can be invoked with the + +``` +--extension your_extension_name +``` + +command-line flag. + +## [text-generation-webui-extensions](https://github.com/oobabooga/text-generation-webui-extensions) + +The link above contains a directory of user extensions for text-generation-webui. + +## Built-in extensions + +|Extension|Description| +|---------|-----------| +|[google_translate](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/google_translate)| Automatically translates inputs and outputs using Google Translate.| +|[character_bias](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/character_bias)| Just a very simple example that biases the bot's responses in chat mode.| +|[gallery](https://github.com/oobabooga/text-generation-webui/blob/main/extensions/gallery/)| Creates a gallery with the chat characters and their pictures. | +|[silero_tts](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/silero_tts)| Text-to-speech extension using [Silero](https://github.com/snakers4/silero-models). When used in chat mode, it replaces the responses with an audio widget. | +|[elevenlabs_tts](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/elevenlabs_tts)| Text-to-speech extension using the [ElevenLabs](https://beta.elevenlabs.io/) API. You need an API key to use it. Author: [@MetaIX](https://github.com/MetaIX). | +|[send_pictures](https://github.com/oobabooga/text-generation-webui/blob/main/extensions/send_pictures/)| Creates an image upload field that can be used to send images to the bot in chat mode. Captions are automatically generated using BLIP. Author: [@SillyLossy](https://github.com/sillylossy).| +|[api](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/api)| Creates an API similar to the one provided by KoboldAI. Works with TavernAI: start the web UI with `python server.py --no-stream --extensions api` and set the API URL to `http://127.0.0.1:5000/api`. Author: [@mayaeary](https://github.com/mayaeary).| +|[whisper_stt](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/whisper_stt)| Allows you to enter your inputs in chat mode using your microphone. Author: [@EliasVincent](https://github.com/EliasVincent).| +|[sd_api_pictures](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/sd_api_pictures)| Allows you to request pictures from the bot in chat mode, which will be generated using the AUTOMATIC1111 Stable Diffusion API. See examples [here](https://github.com/oobabooga/text-generation-webui/pull/309). Author: [@Brawlence](https://github.com/Brawlence).| + +## How to write an extension + +`script.py` has access to all variables in the UI through the `modules.shared` module, and it may define the following functions: + +| Function | Description | +|-------------|-------------| +| `def ui()` | Creates custom gradio elements when the UI is launched. | +| `def input_modifier(string)` | Modifies the input string before it enters the model. In chat mode, it is applied to the user message. Otherwise, it is applied to the entire prompt. | +| `def output_modifier(string)` | Modifies the output string before it is presented in the UI. In chat mode, it is applied to the bot's reply. Otherwise, it is applied to the entire output. | +| `def bot_prefix_modifier(string)` | Applied in chat mode to the prefix for the bot's reply (more on that below). | +| `def custom_generate_chat_prompt(...)` | Overrides the prompt generator in chat mode. | + +Additionally, the script may define two special global variables: + +#### `params` dictionary + +```python +params = { + "language string": "ja", +} +``` + +This dicionary can be used to make the extension parameters customizable by adding entries to a `settings.json` file like this: + +```python +"google_translate-language string": "fr", +``` + +#### `input_hijack` dictionary + +```python +input_hijack = { + 'state': False, + 'value': ["", ""] +} +``` +This is only relevant in chat mode. If your extension sets `input_hijack['state']` to `True` at any moment, the next call to `modules.chat.chatbot_wrapper` will use the vales inside `input_hijack['value']` as the user input for text generation. See the `send_pictures` extension above for an example. + +## The `bot_prefix_modifier` + +In chat mode, this function modifies the prefix for a new bot message. For instance, if your bot is named `Marie Antoinette`, the default prefix for a new message will be + +``` +Marie Antoinette: +``` + +Using `bot_prefix_modifier`, you can change it to: + +``` +Marie Antoinette: *I am very enthusiastic* +``` + +Marie Antoinette will become very enthusiastic in all her messages. + +## Using multiple extensions at the same time + +In order to use your extension, you must start the web UI with the `--extensions` flag followed by the name of your extension (the folder under `text-generation-webui/extension` where `script.py` resides). + +You can activate more than one extension at a time by providing their names separated by spaces. The input, output and bot prefix modifiers will be applied in the specified order. For `custom_generate_chat_prompt`, only the first declaration encountered will be used and the rest will be ignored. + +``` +python server.py --extensions enthusiasm translate # First apply enthusiasm, then translate +python server.py --extensions translate enthusiasm # First apply translate, then enthusiasm +``` + +## `custom_generate_chat_prompt` example + +Below is an extension that just reproduces the default prompt generator in `modules/chat.py`. You can modify it freely to come up with your own prompts in chat mode. + +```python +def custom_generate_chat_prompt(user_input, state, **kwargs): + impersonate = kwargs['impersonate'] if 'impersonate' in kwargs else False + _continue = kwargs['_continue'] if '_continue' in kwargs else False + also_return_rows = kwargs['also_return_rows'] if 'also_return_rows' in kwargs else False + is_instruct = state['mode'] == 'instruct' + rows = [f"{state['context'].strip()}\n"] + + # Finding the maximum prompt size + chat_prompt_size = state['chat_prompt_size'] + if shared.soft_prompt: + chat_prompt_size -= shared.soft_prompt_tensor.shape[1] + max_length = min(get_max_prompt_length(state), chat_prompt_size) + + if is_instruct: + prefix1 = f"{state['name1']}\n" + prefix2 = f"{state['name2']}\n" + else: + prefix1 = f"{state['name1']}: " + prefix2 = f"{state['name2']}: " + + i = len(shared.history['internal']) - 1 + while i >= 0 and len(encode(''.join(rows))[0]) < max_length: + if _continue and i == len(shared.history['internal']) - 1: + rows.insert(1, f"{prefix2}{shared.history['internal'][i][1]}") + else: + rows.insert(1, f"{prefix2}{shared.history['internal'][i][1].strip()}{state['end_of_turn']}\n") + string = shared.history['internal'][i][0] + if string not in ['', '<|BEGIN-VISIBLE-CHAT|>']: + rows.insert(1, f"{prefix1}{string.strip()}{state['end_of_turn']}\n") + i -= 1 + + if impersonate: + rows.append(f"{prefix1.strip() if not is_instruct else prefix1}") + limit = 2 + elif _continue: + limit = 3 + else: + # Adding the user message + user_input = fix_newlines(user_input) + if len(user_input) > 0: + rows.append(f"{prefix1}{user_input}{state['end_of_turn']}\n") + + # Adding the Character prefix + rows.append(apply_extensions(f"{prefix2.strip() if not is_instruct else prefix2}", "bot_prefix")) + limit = 3 + + while len(rows) > limit and len(encode(''.join(rows))[0]) >= max_length: + rows.pop(1) + prompt = ''.join(rows) + + if also_return_rows: + return prompt, rows + else: + return prompt +``` diff --git a/docs/FlexGen.md b/docs/FlexGen.md new file mode 100644 index 00000000..dce71f9e --- /dev/null +++ b/docs/FlexGen.md @@ -0,0 +1,64 @@ +>FlexGen is a high-throughput generation engine for running large language models with limited GPU memory (e.g., a 16GB T4 GPU or a 24GB RTX3090 gaming card!). + +https://github.com/FMInference/FlexGen + +## Installation + +No additional installation steps are necessary. FlexGen is in the `requirements.txt` file for this project. + +## Converting a model + +FlexGen only works with the OPT model, and it needs to be converted to numpy format before starting the web UI: + +``` +python convert-to-flexgen.py models/opt-1.3b/ +``` + +The output will be saved to `models/opt-1.3b-np/`. + +## Usage + +The basic command is the following: + +``` +python server.py --model opt-1.3b --flexgen +``` + +For large models, the RAM usage may be too high and your computer may freeze. If that happens, you can try this: + +``` +python server.py --model opt-1.3b --flexgen --compress-weight +``` + +With this second command, I was able to run both OPT-6.7b and OPT-13B with **2GB VRAM**, and the speed was good in both cases. + +You can also manually set the offload strategy with + +``` +python server.py --model opt-1.3b --flexgen --percent 0 100 100 0 100 0 +``` + +where the six numbers after `--percent` are: + +``` +the percentage of weight on GPU +the percentage of weight on CPU +the percentage of attention cache on GPU +the percentage of attention cache on CPU +the percentage of activations on GPU +the percentage of activations on CPU +``` + +You should typically only change the first two numbers. If their sum is less than 100, the remaining layers will be offloaded to the disk, by default into the `text-generation-webui/cache` folder. + +## Performance + +In my experiments with OPT-30B using a RTX 3090 on Linux, I have obtained these results: + +* `--flexgen --compress-weight --percent 0 100 100 0 100 0`: 0.99 seconds per token. +* `--flexgen --compress-weight --percent 100 0 100 0 100 0`: 0.765 seconds per token. + +## Limitations + +* Only works with the OPT models. +* Only two generation parameters are available: `temperature` and `do_sample`. \ No newline at end of file diff --git a/docs/GPTQ-models-(4-bit-mode).md b/docs/GPTQ-models-(4-bit-mode).md new file mode 100644 index 00000000..9ed7cc37 --- /dev/null +++ b/docs/GPTQ-models-(4-bit-mode).md @@ -0,0 +1,128 @@ +In 4-bit mode, models are loaded with just 25% of their regular VRAM usage. So LLaMA-7B fits into a 6GB GPU, and LLaMA-30B fits into a 24GB GPU. + +This is possible thanks to [@qwopqwop200](https://github.com/qwopqwop200/GPTQ-for-LLaMa)'s adaptation of the GPTQ algorithm for LLaMA: https://github.com/qwopqwop200/GPTQ-for-LLaMa + +GPTQ is a clever quantization algorithm that lightly reoptimizes the weights during quantization so that the accuracy loss is compensated relative to a round-to-nearest quantization. See the paper for more details: https://arxiv.org/abs/2210.17323 + +## Installation + +### Step 0: install nvcc + +``` +conda activate textgen +conda install -c conda-forge cudatoolkit-dev +``` + +The command above takes some 10 minutes to run and shows no progress bar or updates along the way. + +See this issue for more details: https://github.com/oobabooga/text-generation-webui/issues/416#issuecomment-1475078571 + +### Step 1: install GPTQ-for-LLaMa + +Clone the GPTQ-for-LLaMa repository into the `text-generation-webui/repositories` subfolder and install it: + +``` +mkdir repositories +cd repositories +git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda +cd GPTQ-for-LLaMa +python setup_cuda.py install +``` + +You are going to need to have a C++ compiler installed into your system for the last command. On Linux, `sudo apt install build-essential` or equivalent is enough. + +https://github.com/oobabooga/GPTQ-for-LLaMa corresponds to commit `a6f363e3f93b9fb5c26064b5ac7ed58d22e3f773` in the `cuda` branch of the original repository and is recommended by default for stability. Some models might require you to use the up-to-date CUDA or triton branches: + +``` +cd repositories +rm -r GPTQ-for-LLaMa +pip uninstall -y quant-cuda +git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b cuda +... +``` + +``` +cd repositories +rm -r GPTQ-for-LLaMa +pip uninstall -y quant-cuda +git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b triton +... +``` + + +https://github.com/qwopqwop200/GPTQ-for-LLaMa + +### Step 2: get the pre-converted weights + +* Converted without `group-size` (better for the 7b model): https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483891617 +* Converted with `group-size` (better from 13b upwards): https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483941105 + +Note: the tokenizer files in those torrents are not up to date. + +### Step 3: Start the web UI: + +For the models converted without `group-size`: + +``` +python server.py --model llama-7b-4bit +``` + +For the models converted with `group-size`: + +``` +python server.py --model llama-13b-4bit-128g +``` + +The command-line flags `--wbits` and `--groupsize` are automatically detected based on the folder names, but you can also specify them manually like + +``` +python server.py --model llama-13b-4bit-128g --wbits 4 --groupsize 128 +``` + +## CPU offloading + +It is possible to offload part of the layers of the 4-bit model to the CPU with the `--pre_layer` flag. The higher the number after `--pre_layer`, the more layers will be allocated to the GPU. + +With this command, I can run llama-7b with 4GB VRAM: + +``` +python server.py --model llama-7b-4bit --pre_layer 20 +``` + +This is the performance: + +``` +Output generated in 123.79 seconds (1.61 tokens/s, 199 tokens) +``` + +## Using LoRAs in 4-bit mode + +At the moment, this feature is not officially supported by the relevant libraries, but a patch exists and is supported by this web UI: https://github.com/johnsmith0031/alpaca_lora_4bit + +In order to use it: + +1. Make sure that your requirements are up to date: + +``` +cd text-generation-webui +pip install -r requirements.txt --upgrade +``` + +2. Clone `johnsmith0031/alpaca_lora_4bit` into the repositories folder: + +``` +cd text-generation-webui/repositories +git clone https://github.com/johnsmith0031/alpaca_lora_4bit +``` + +3. Install https://github.com/sterlind/GPTQ-for-LLaMa with this command: + +``` +pip install git+https://github.com/sterlind/GPTQ-for-LLaMa.git@lora_4bit +``` + +4. Start the UI with the `--monkey-patch` flag: + +``` +python server.py --model llama-7b-4bit-128g --listen --lora tloen_alpaca-lora-7b --monkey-patch +``` diff --git a/docs/Home.md b/docs/Home.md new file mode 100644 index 00000000..8448d13c --- /dev/null +++ b/docs/Home.md @@ -0,0 +1 @@ +Welcome to the text-generation-webui wiki! diff --git a/docs/LLaMA-model.md b/docs/LLaMA-model.md new file mode 100644 index 00000000..95e0d35e --- /dev/null +++ b/docs/LLaMA-model.md @@ -0,0 +1,45 @@ +LLaMA is a Large Language Model developed by Meta AI. + +It was trained on more tokens than previous models. The result is that the smallest version with 7 billion parameters has similar performance to GPT-3 with 175 billion parameters. + +This guide will cover usage through the official `transformers` implementation. For 4-bit mode, head over to [GPTQ models (4 bit mode) +](https://github.com/oobabooga/text-generation-webui/wiki/GPTQ-models-(4-bit-mode)). + +## Getting the weights + +### Option 1: pre-converted weights + +* Torrent: https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1484235789 +* Direct download: https://huggingface.co/Neko-Institute-of-Science + +⚠️ The tokenizers for the sources above and also for many LLaMA fine-tunes available on Hugging Face may be outdated, so I recommend downloading the following universal LLaMA tokenizer: + +``` +python download-model.py oobabooga/llama-tokenizer +``` + +Once downloaded, it will be automatically applied to **every** `LlamaForCausalLM` model that you try to load. + +### Option 2: convert the weights yourself + +1. Install the `protobuf` library: + +``` +pip install protobuf +``` + +2. Use the script below to convert the model in `.pth` format that you, a fellow academic, downloaded using Meta's official link: + +### [convert_llama_weights_to_hf.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py) + +``` +python convert_llama_weights_to_hf.py --input_dir /path/to/LLaMA --model_size 7B --output_dir /tmp/outputs/llama-7b +``` + +3. Move the `llama-7b` folder inside your `text-generation-webui/models` folder. + +## Starting the web UI + +```python +python server.py --model llama-7b +``` \ No newline at end of file diff --git a/docs/Low-VRAM-guide.md b/docs/Low-VRAM-guide.md new file mode 100644 index 00000000..d504d4e7 --- /dev/null +++ b/docs/Low-VRAM-guide.md @@ -0,0 +1,51 @@ +If you GPU is not large enough to fit a model, try these in the following order: + +### Load the model in 8-bit mode + +``` +python server.py --load-in-8bit +``` + +This reduces the memory usage by half with no noticeable loss in quality. Only newer GPUs support 8-bit mode. + +### Split the model across your GPU and CPU + +``` +python server.py --auto-devices +``` + +If you can load the model with this command but it runs out of memory when you try to generate text, try increasingly limiting the amount of memory allocated to the GPU until the error stops happening: + +``` +python server.py --auto-devices --gpu-memory 10 +python server.py --auto-devices --gpu-memory 9 +python server.py --auto-devices --gpu-memory 8 +... +``` + +where the number is in GiB. + +For finer control, you can also specify the unit in MiB explicitly: + +``` +python server.py --auto-devices --gpu-memory 8722MiB +python server.py --auto-devices --gpu-memory 4725MiB +python server.py --auto-devices --gpu-memory 3500MiB +... +``` + +Additionally, you can also set the `--no-cache` value to reduce the GPU usage while generating text at a performance cost. This may allow you to set a higher value for `--gpu-memory`, resulting in a net performance gain. + +### Send layers to a disk cache + +As a desperate last measure, you can split the model across your GPU, CPU, and disk: + +``` +python server.py --auto-devices --disk +``` + +With this, I am able to load a 30b model into my RTX 3090, but it takes 10 seconds to generate 1 word. + +### DeepSpeed (experimental) + +An experimental alternative to all of the above is to use DeepSpeed: [guide](https://github.com/oobabooga/text-generation-webui/wiki/DeepSpeed). \ No newline at end of file diff --git a/docs/RWKV-model.md b/docs/RWKV-model.md new file mode 100644 index 00000000..27db3d10 --- /dev/null +++ b/docs/RWKV-model.md @@ -0,0 +1,54 @@ +> RWKV: RNN with Transformer-level LLM Performance +> +> It combines the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding (using the final hidden state). + +https://github.com/BlinkDL/RWKV-LM + +https://github.com/BlinkDL/ChatRWKV + +## Using RWKV in the web UI + +#### 1. Download the model + +It is available in different sizes: + +* https://huggingface.co/BlinkDL/rwkv-4-pile-3b/ +* https://huggingface.co/BlinkDL/rwkv-4-pile-7b/ +* https://huggingface.co/BlinkDL/rwkv-4-pile-14b/ + +There are also older releases with smaller sizes like: + +* https://huggingface.co/BlinkDL/rwkv-4-pile-169m/resolve/main/RWKV-4-Pile-169M-20220807-8023.pth + +Download the chosen `.pth` and put it directly in the `models` folder. + +#### 2. Download the tokenizer + +[20B_tokenizer.json](https://raw.githubusercontent.com/BlinkDL/ChatRWKV/main/v2/20B_tokenizer.json) + +Also put it directly in the `models` folder. Make sure to not rename it. It should be called `20B_tokenizer.json`. + +#### 3. Launch the web UI + +No additional steps are required. Just launch it as you would with any other model. + +``` +python server.py --listen --no-stream --model RWKV-4-Pile-169M-20220807-8023.pth +``` + +## Setting a custom strategy + +It is possible to have very fine control over the offloading and precision for the model with the `--rwkv-strategy` flag. Possible values include: + +``` +"cpu fp32" # CPU mode +"cuda fp16" # GPU mode with float16 precision +"cuda fp16 *30 -> cpu fp32" # GPU+CPU offloading. The higher the number after *, the higher the GPU allocation. +"cuda fp16i8" # GPU mode with 8-bit precision +``` + +See the README for the PyPl package for more details: https://pypi.org/project/rwkv/ + +## Compiling the CUDA kernel + +You can compile the CUDA kernel for the model with `--rwkv-cuda-on`. This should improve the performance a lot but I haven't been able to get it to work yet. \ No newline at end of file diff --git a/docs/Spell-book.md b/docs/Spell-book.md new file mode 100644 index 00000000..1361a612 --- /dev/null +++ b/docs/Spell-book.md @@ -0,0 +1,111 @@ +You have now entered a hidden corner of the internet. + +A confusing yet intriguing realm of paradoxes and contradictions. + +A place where you will find out that what you thought you knew, you in fact didn't know, and what you didn't know was in front of you all along. + +![](https://i.pinimg.com/originals/6e/e2/7b/6ee27bad351d3aca470d80f1033ba9c6.jpg) + +*In other words, here I will document little-known facts about this web UI that I could not find another place for in the wiki.* + +#### You can train LoRAs in CPU mode + +Load the web UI with + +``` +python server.py --cpu +``` + +and start training the LoRA from the training tab as usual. + +#### 8-bit mode works with CPU offloading + +``` +python server.py --load-in-8bit --gpu-memory 4000MiB +``` + +#### `--pre_layer`, and not `--gpu-memory`, is the right way to do CPU offloading with 4-bit models + +``` +python server.py --wbits 4 --groupsize 128 --pre_layer 20 +``` + +#### Models can be loaded in 32-bit, 16-bit, 8-bit, and 4-bit modes + +``` +python server.py --cpu +python server.py +python server.py --load-in-8bit +python server.py --wbits 4 +``` + +#### The web UI works with any version of GPTQ-for-LLaMa + +Including the up to date triton and cuda branches. But you have to delete the `repositories/GPTQ-for-LLaMa` folder and reinstall the new one every time: + +``` +cd text-generation-webui/repositories +rm -r GPTQ-for-LLaMa +pip uninstall quant-cuda +git clone https://github.com/oobabooga/GPTQ-for-LLaMa -b cuda # or any other repository and branch +cd GPTQ-for-LLaMa +python setup_cuda.py install +``` + +#### Instruction-following templates are represented as chat characters + +https://github.com/oobabooga/text-generation-webui/tree/main/characters/instruction-following + +#### The right way to run Alpaca, Open Assistant, Vicuna, etc is Instruct mode, not normal chat mode + +Otherwise the prompt will not be formatted correctly. + +1. Start the web UI with + +``` +python server.py --chat +``` + +2. Click on the "instruct" option under "Chat modes" + +3. Select the correct template in the hidden dropdown menu that will become visible. + +#### Notebook mode is best mode + +Ascended individuals have realized that notebook mode is the superset of chat mode and can do chats with ultimate flexibility, including group chats, editing replies, starting a new bot reply in a given way, and impersonating. + +#### RWKV is a RNN + +Most models are transformers, but not RWKV, which is a RNN. It's a great model. + +#### `--gpu-memory` is not a hard limit on the GPU memory + +It is simply a parameter that is passed to the `accelerate` library while loading the model. More memory will be allocated during generation. That's why this parameter has to be set to less than your total GPU memory. + +#### Contrastive search perhaps the best preset + +But it uses a ton of VRAM. + +#### You can check the sha256sum of downloaded models with the download script + +``` +python download-model.py facebook/galactica-125m --check +``` + +#### The download script continues interrupted downloads by default + +It doesn't start over. + +#### You can download models with multiple threads + +``` +python download-model.py facebook/galactica-125m --threads 8 +``` + +#### LoRAs work in 4-bit mode + +You need to follow these instructions + +https://github.com/oobabooga/text-generation-webui/wiki/GPTQ-models-(4-bit-mode)#using-loras-in-4-bit-mode + +and then start the web UI with the `--monkey-patch` flag. \ No newline at end of file diff --git a/docs/System-requirements.md b/docs/System-requirements.md new file mode 100644 index 00000000..3a88416d --- /dev/null +++ b/docs/System-requirements.md @@ -0,0 +1,42 @@ +These are the VRAM and RAM requirements (in MiB) to run some examples of models **in 16-bit (default) precision**: + +| model | VRAM (GPU) | RAM | +|:-----------------------|-------------:|--------:| +| arxiv_ai_gpt2 | 1512.37 | 5824.2 | +| blenderbot-1B-distill | 2441.75 | 4425.91 | +| opt-1.3b | 2509.61 | 4427.79 | +| gpt-neo-1.3b | 2605.27 | 5851.58 | +| opt-2.7b | 5058.05 | 4863.95 | +| gpt4chan_model_float16 | 11653.7 | 4437.71 | +| gpt-j-6B | 11653.7 | 5633.79 | +| galactica-6.7b | 12697.9 | 4429.89 | +| opt-6.7b | 12700 | 4368.66 | +| bloomz-7b1-p3 | 13483.1 | 4470.34 | + +#### GPU mode with 8-bit precision + +Allows you to load models that would not normally fit into your GPU. Enabled by default for 13b and 20b models in this web UI. + +| model | VRAM (GPU) | RAM | +|:---------------|-------------:|--------:| +| opt-13b | 12528.1 | 1152.39 | +| gpt-neox-20b | 20384 | 2291.7 | + +#### CPU mode (32-bit precision) + +A lot slower, but does not require a GPU. + +On my i5-12400F, 6B models take around 10-20 seconds to respond in chat mode, and around 5 minutes to generate a 200 tokens completion. + +| model | RAM | +|:-----------------------|---------:| +| arxiv_ai_gpt2 | 4430.82 | +| gpt-neo-1.3b | 6089.31 | +| opt-1.3b | 8411.12 | +| blenderbot-1B-distill | 8508.16 | +| opt-2.7b | 14969.3 | +| bloomz-7b1-p3 | 21371.2 | +| gpt-j-6B | 24200.3 | +| gpt4chan_model | 24246.3 | +| galactica-6.7b | 26561.4 | +| opt-6.7b | 29596.6 | diff --git a/docs/Using-LoRAs.md b/docs/Using-LoRAs.md new file mode 100644 index 00000000..8049e96b --- /dev/null +++ b/docs/Using-LoRAs.md @@ -0,0 +1,88 @@ +Based on https://github.com/tloen/alpaca-lora + +## Instructions + +1. Download a LoRA, for instance: + +``` +python download-model.py tloen/alpaca-lora-7b +``` + +2. Load the LoRA. 16-bit, 8-bit, and CPU modes work: + +``` +python server.py --model llama-7b-hf --lora alpaca-lora-7b +python server.py --model llama-7b-hf --lora alpaca-lora-7b --load-in-8bit +python server.py --model llama-7b-hf --lora alpaca-lora-7b --cpu +``` + +* For using LoRAs in 4-bit mode, follow these special instructions: https://github.com/oobabooga/text-generation-webui/wiki/GPTQ-models-(4-bit-mode)#using-loras-in-4-bit-mode + +* Instead of using the `--lora` command-line flag, you can also select the LoRA in the "Parameters" tab of the interface. + +## Prompt +For the Alpaca LoRA in particular, the prompt must be formatted like this: + +``` +Below is an instruction that describes a task. Write a response that appropriately completes the request. +### Instruction: +Write a Python script that generates text using the transformers library. +### Response: +``` + +Sample output: + +``` +Below is an instruction that describes a task. Write a response that appropriately completes the request. +### Instruction: +Write a Python script that generates text using the transformers library. +### Response: + +import transformers +from transformers import AutoTokenizer, AutoModelForCausalLM +tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") +model = AutoModelForCausalLM.from_pretrained("bert-base-uncased") +texts = ["Hello world", "How are you"] +for sentence in texts: +sentence = tokenizer(sentence) +print(f"Generated {len(sentence)} tokens from '{sentence}'") +output = model(sentences=sentence).predict() +print(f"Predicted {len(output)} tokens for '{sentence}':\n{output}") +``` + +## Training a LoRA + +The Training tab in the interface can be used to train a LoRA. The parameters are self-documenting and good defaults are included. + +This was contributed by [mcmonkey4eva](https://github.com/mcmonkey4eva) in PR [#570](https://github.com/oobabooga/text-generation-webui/pull/570). + + +#### Using the original alpaca-lora code + +Kept here for reference. The Training tab has much more features than this method. + +``` +conda activate textgen +git clone https://github.com/tloen/alpaca-lora +``` + +Edit those two lines in `alpaca-lora/finetune.py` to use your existing model folder instead of downloading everything from decapoda: + +``` +model = LlamaForCausalLM.from_pretrained( + "models/llama-7b", + load_in_8bit=True, + device_map="auto", +) +tokenizer = LlamaTokenizer.from_pretrained( + "models/llama-7b", add_eos_token=True +) +``` + +Run the script with: + +``` +python finetune.py +``` + +It just works. It runs at 22.32s/it, with 1170 iterations in total, so about 7 hours and a half for training a LoRA. RTX 3090, 18153MiB VRAM used, drawing maximum power (350W, room heater mode). \ No newline at end of file diff --git a/docs/WSL-installation-guide.md b/docs/WSL-installation-guide.md new file mode 100644 index 00000000..eb06123c --- /dev/null +++ b/docs/WSL-installation-guide.md @@ -0,0 +1,73 @@ +Guide created by [@jfryton](https://github.com/jfryton). Thank you jfryton. + +----- + +Here's an easy-to-follow, step-by-step guide for installing Windows Subsystem for Linux (WSL) with Ubuntu on Windows 10/11: + +## Step 1: Enable WSL + +1. Press the Windows key + X and click on "Windows PowerShell (Admin)" or "Windows Terminal (Admin)" to open PowerShell or Terminal with administrator privileges. +2. In the PowerShell window, type the following command and press Enter: + +``` +wsl --install +``` + +If this command doesn't work, you can enable WSL with the following command for Windows 10: + +``` +wsl --set-default-version 1 +``` + +For Windows 11, you can use: + +``` +wsl --set-default-version 2 +``` + +You may be prompted to restart your computer. If so, save your work and restart. + +## Step 2: Install Ubuntu + +1. Open the Microsoft Store. +2. Search for "Ubuntu" in the search bar. +3. Choose the desired Ubuntu version (e.g., Ubuntu 20.04 LTS) and click "Get" or "Install" to download and install the Ubuntu app. +4. Once the installation is complete, click "Launch" or search for "Ubuntu" in the Start menu and open the app. + +## Step 3: Set up Ubuntu + +1. When you first launch the Ubuntu app, it will take a few minutes to set up. Be patient as it installs the necessary files and sets up your environment. +2. Once the setup is complete, you will be prompted to create a new UNIX username and password. Choose a username and password, and make sure to remember them, as you will need them for future administrative tasks within the Ubuntu environment. + +## Step 4: Update and upgrade packages + +1. After setting up your username and password, it's a good idea to update and upgrade your Ubuntu system. Run the following commands in the Ubuntu terminal: + +``` +sudo apt update +sudo apt upgrade +``` + +2. Enter your password when prompted. This will update the package list and upgrade any outdated packages. + +Congratulations! You have now installed WSL with Ubuntu on your Windows 10/11 system. You can use the Ubuntu terminal for various tasks, like running Linux commands, installing packages, or managing files. + +You can launch your WSL Ubuntu installation by selecting the Ubuntu app (like any other program installed on your computer) or typing 'ubuntu' into Powershell or Terminal. + +## Step 5: Proceed with Linux instructions + +1. You can now follow the Linux setup instructions. If you receive any error messages about a missing tool or package, just install them using apt: + +``` +sudo apt install [missing package] +``` + +If you face any issues or need to troubleshoot, you can always refer to the official Microsoft documentation for WSL: https://docs.microsoft.com/en-us/windows/wsl/ + +## Bonus: Port Forwarding + +By default, you won't be able to access the webui from another device on your local network. You will need to setup the appropriate port forwarding using the following command (using PowerShell or Terminal with administrator privileges). + +``` +netsh interface portproxy add v4tov4 listenaddress=0.0.0.0 listenport=7860 connectaddress=localhost connectport=7860 +``` \ No newline at end of file diff --git a/docs/Windows-installation-guide.md b/docs/Windows-installation-guide.md new file mode 100644 index 00000000..83b22efa --- /dev/null +++ b/docs/Windows-installation-guide.md @@ -0,0 +1,9 @@ +If you are having trouble following the installation instructions in the README, Reddit user [Technical_Leather949](https://www.reddit.com/user/Technical_Leather949/) has created a more detailed, step-by-step guide covering: + +* Windows installation +* 8-bit mode on Windows +* LLaMA +* LLaMA 4-bit + +The guide can be found here: https://www.reddit.com/r/LocalLLaMA/comments/11o6o3f/how_to_install_llama_8bit_and_4bit/ + diff --git a/docs/llama.cpp-models.md b/docs/llama.cpp-models.md new file mode 100644 index 00000000..7c1553a2 --- /dev/null +++ b/docs/llama.cpp-models.md @@ -0,0 +1,35 @@ +## Using llama.cpp in the web UI + +1. Re-install the requirements.txt: + +``` +pip install -r requirements.txt -U +``` + +2. Follow the instructions in the llama.cpp README to generate the `ggml-model-q4_0.bin` file: https://github.com/ggerganov/llama.cpp#usage + +3. Create a folder inside `models/` for your model and put `ggml-model-q4_0.bin` in it. For instance, `models/llamacpp-7b/ggml-model-q4_0.bin`. + +4. Start the web UI normally: + +``` +python server.py --model llamacpp-7b +``` + +* This procedure should work for any `ggml*.bin` file. Just put it in a folder, and use the name of this folder as the argument after `--model` or as the model loaded inside the interface. +* You can change the number of threads with `--threads N`. + +## Performance + +This was the performance of llama-7b int4 on my i5-12400F: + +> Output generated in 33.07 seconds (6.05 tokens/s, 200 tokens, context 17) + +## Limitations + +~* The parameter sliders in the interface (temperature, top_p, top_k, etc) are completely ignored. So only the default parameters in llama.cpp can be used.~ + +~* Only 512 tokens of context can be used.~ + +~Both of these should be improved soon when llamacpp-python receives an update.~ +