Add proper documentation (#3885)

oobabooga 2023-10-21 19:15:54 -03:00 committed by GitHub
parent 5a5bc135e9
commit 6efb990b60
30 changed files with 707 additions and 932 deletions


@ -24,7 +24,7 @@ Its goal is to become the [AUTOMATIC1111/stable-diffusion-webui](https://github.
* Markdown output with LaTeX rendering, to use for instance with [GALACTICA](https://github.com/paperswithcode/galai)
* API, including endpoints for websocket streaming ([see the examples](https://github.com/oobabooga/text-generation-webui/blob/main/api-examples))
To learn how to use the various features, check out the Documentation: https://github.com/oobabooga/text-generation-webui/wiki
## Installation

docs/01 ‐ Chat Tab.md

@ -0,0 +1,151 @@
Used to have multi-turn conversations with the model.
## Input area
The following buttons can be found. Note that the hover menu can be replaced with always-visible buttons using the `--chat-buttons` flag.
* **Generate**: sends your message and makes the model start a reply.
* **Stop**: stops an ongoing generation as soon as the next token is generated (which can take a while for a slow model).
* **Continue**: makes the model attempt to continue the existing reply. In some cases, the model may simply end the existing turn immediately without generating anything new, but in other cases, it may generate a longer reply.
* **Regenerate**: similar to Generate, but your last message is used as input instead of the text in the input field. Note that if the temperature/top_p/top_k parameters are low in the "Parameters" tab of the UI, the new reply may end up identical to the previous one.
* **Remove last reply**: removes the last input/output pair from the history and sends your last message back into the input field.
* **Replace last reply**: replaces the last reply with whatever you typed into the input field. Useful in conjunction with "Copy last reply" if you want to edit the bot response.
* **Copy last reply**: sends the contents of the bot's last reply to the input field.
* **Impersonate**: makes the model generate a new message on your behalf in the input field, taking into consideration the existing chat history.
* **Send dummy message**: adds a new message to the chat history without causing the model to generate a reply.
* **Send dummy reply**: adds a new reply to the chat history as if the model had generated this reply. Useful in conjunction with "Send dummy message".
* **Start new chat**: starts a new conversation while keeping the old one saved. If you are talking to a character that has a "Greeting" message defined, this message will be automatically added to the new history.
* **Send to default**: sends the entire chat prompt up to now to the "Default" tab.
* **Send to notebook**: sends the entire chat prompt up to now to the "Notebook" tab.
Unchecking the **Show controls** checkbox hides the input fields below the input textbox. This is useful for making the page fit entirely into view without scrolling.
## Past chats
Allows you to switch between the current and previous conversations with the current character, or between the current and previous instruct conversations (if in "instruct" mode). The **Rename** menu can be used to give a unique name to the selected conversation, and the 🗑️ button allows you to delete it.
## Start reply with
Whatever you type there will appear at the start of every reply by the bot. This is useful to guide the response in the desired direction.
## Mode
The most important input field. It defines how the chat prompt is formatted. There are 3 options: chat, chat-instruct, and instruct. It is worth going into more detail about this because it seems to not be obvious to a lot of people.
### Instruction-following models
There are two kinds of models: base models, like Llama and GPT-J, and fine-tuned models, like Alpaca and Vicuna. Fine-tuned models are trained starting from base models, most often with the goal of getting the model to understand and respond to instructions just like ChatGPT does. Let's call such models *instruction-following models*.
Each instruction-following model was trained on a specific prompt format, and you have to use that exact prompt format if you want the model to follow your instructions as accurately as it can.
As an example, this is the Alpaca format:
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Hi there!
### Response:
Hello! It's nice to meet you. What can I help with?
### Instruction:
How are you?
### Response:
I'm doing well, thank you for asking! Is there something specific you would like to talk about or ask me? I'm here to help answer any questions you may have.
```
This format is characterized by a context string at the top, and alternating turns where each user input starts with `### Instruction:` and each bot turn starts with `### Response:`. There are also weirder formats, like the one used by the Llama-2-chat models released by Meta AI:
```
[INST] <<SYS>>
Answer the questions.
<</SYS>>
Hi there! [/INST] Hello! It's nice to meet you. What can I help with? </s><s>[INST] How are you? [/INST] I'm doing well, thank you for asking! Is there something specific you would like to talk about or ask me? I'm here to help answer any questions you may have.
```
In this format, there are special tokens at the end of each bot reply (`</s>`, the end of sequence token, and `<s>`, the beginning of sequence token); no new lines separating the turns; and the context string is written between `<<SYS>>` and `<</SYS>>`. Despite the intimidating look of this format, the logic is the same: there are user turns and bot turns, and each one appears in a specific place in the template.
It is important to emphasize that instruction-following models **have to be used with the exact prompt format that they were trained on**. Using those models with any other prompt format should be considered undefined behavior. The model will still generate replies, but they will be less accurate to your inputs.
Now that we have defined what an instruction-following model is, we can move on to describing the 3 chat modes.
### Chat
Used for talking to the character defined under "Parameters" > "Character" using a simple chat prompt in this format:
```
Chiharu Yamada's Persona: Chiharu Yamada is a young, computer engineer-nerd with a knack for problem solving and a passion for technology.
You: Hi there!
Chiharu Yamada: Hello! It's nice to meet you. What can I help with?
You: How are you?
Chiharu Yamada: I'm doing well, thank you for asking! Is there something specific you would like to talk about or ask me? I'm here to help answer any questions you may have.
```
There are 3 adjustable parameters in "Parameters" > "Character" being used in this prompt:
* The **Context** string appears at the top of the prompt. Most often it describes the bot's personality and adds a few example messages to guide the model towards the desired reply length and format.
* The **Your name** string appears at the beginning of each user message. By default, this string is "You".
* The **Character's name** string appears at the beginning of each bot reply.
Additionally, the **Greeting** string appears as the bot's opening message whenever the history is cleared.
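A minimal sketch of how a chat-mode prompt like the one above can be assembled from these fields (simplified, for illustration only; this is not the web UI's exact implementation):

```python
def build_chat_prompt(context, your_name, char_name, history, new_message):
    # context: the "Context" field; history: list of (user message, bot message) pairs
    prompt = context.rstrip() + "\n"
    for user_msg, bot_msg in history:
        prompt += f"{your_name}: {user_msg}\n{char_name}: {bot_msg}\n"
    # End with the character's name so the model writes the next bot turn.
    prompt += f"{your_name}: {new_message}\n{char_name}:"
    return prompt

print(build_chat_prompt(
    "Chiharu Yamada's Persona: Chiharu Yamada is a young, computer engineer-nerd...",
    "You", "Chiharu Yamada",
    [("Hi there!", "Hello! It's nice to meet you. What can I help with?")],
    "How are you?",
))
```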
The "Chat" option should typically be used only for base models or non-instruct fine tunes, and should not be used for instruction-following models.
### Instruct
Used for talking to an instruction-following model using the prompt format defined under "Parameters" > "Instruction template". Think of this option as an offline ChatGPT.
The prompt format is defined by the following adjustable parameters in "Parameters" > "Instruction template":
* **Context**: appears at the top of the prompt exactly as it is written, including the newline characters at the end (if any). Often the context includes a customizable system message. For instance, instead of "Answer the questions." for Llama-2-chat, you can write "Answer the questions as if you were a pirate.", and the model will comply.
* **Turn template**: defines a single input/reply turn. In this string, `<|user|>` and `<|bot|>` are placeholders that get replaced with whatever you type in the **User string** and **Bot string** fields respectively; they are mandatory and should be present even if those fields are empty. `<|user-message|>` and `<|bot-message|>` get replaced with the user and bot messages at that turn. If the prompt format uses newline characters, they should be written inline as `\n` in the turn template.
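To make the placeholder mechanics concrete, here is a rough sketch of how a single turn can be rendered from these fields. The template values below mimic the Alpaca format and are only illustrative; the real ones come from the "Instruction template" fields:

```python
# Illustrative values resembling the Alpaca format.
turn_template = "<|user|>\n<|user-message|>\n\n<|bot|>\n<|bot-message|>\n\n"
user_string = "### Instruction:"
bot_string = "### Response:"

def render_turn(user_message, bot_message=""):
    # Substitute the mandatory role placeholders first, then the message placeholders.
    return (turn_template
            .replace("<|user|>", user_string)
            .replace("<|bot|>", bot_string)
            .replace("<|user-message|>", user_message)
            .replace("<|bot-message|>", bot_message))

print(render_turn("Hi there!", "Hello! It's nice to meet you."))
```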
Note that when you load a model in the "Model" tab, the web UI will try to automatically detect its instruction template (if any), and will update the values under "Parameters" > "Instruction template" accordingly. This is done using a set of regular expressions defined in `models/config.yaml`. This detection is not guaranteed to be accurate. You should check the model card on Hugging Face to see if you are using the correct prompt format.
### Chat-instruct
As said above, instruction-following models are meant to be used with their specific prompt templates. The chat-instruct mode allows you to use those templates to generate a chat reply, thus mixing Chat and Instruct modes (hence the name).
It works by creating a single instruction-following turn where a command is given followed by the regular chat prompt. Here is an example in Alpaca format:
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Continue the chat dialogue below. Write a single reply for the character "Chiharu Yamada".
Chiharu Yamada's Persona: Chiharu Yamada is a young, computer engineer-nerd with a knack for problem solving and a passion for technology.
You: Hi there!
Chiharu Yamada: Hello! It's nice to meet you. What can I help with?
You: How are you?
### Response:
Chiharu Yamada:
```
Here, the command is
> Continue the chat dialogue below. Write a single reply for the character "Chiharu Yamada".
Below this command, the regular chat prompt is added, including its Context string and the chat history, and then the user turn ends. The bot turn starts with the "Character's name" string followed by `:`, thus prompting the instruction-following model to write a single reply for the character.
The chat-instruct command can be customized under "Parameters" > "Instruction template" > "Command for chat-instruct mode". Inside that command string, `<|character|>` is a placeholder that gets replaced with the bot name, and `<|prompt|>` is a placeholder that gets replaced with the full chat prompt.
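In other words, the chat-instruct prompt is produced by simple placeholder substitution, roughly like this (the command string below approximates the default; check the UI field for the real value):

```python
# Approximation of the default chat-instruct command.
command = ('Continue the chat dialogue below. Write a single reply for the '
           'character "<|character|>".\n\n<|prompt|>')

character = "Chiharu Yamada"
chat_prompt = "Chiharu Yamada's Persona: ...\nYou: Hi there!\n..."  # the regular chat prompt

# The result is then wrapped in a single turn of the instruction template.
instruction = command.replace("<|character|>", character).replace("<|prompt|>", chat_prompt)
```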
Note that you can get creative: instead of writing something trivial like "Write a single reply for the character", you could add more complex instructions like
> This is an adventure game, and your task is to write a reply in the name of "<|character|>" where 3 options are given for the user to then choose from.
And it works:
![chat-instruct](https://github.com/oobabooga/text-generation-webui/assets/112222186/e38e3469-8263-4a10-b1a1-3c955026b8e7)
## Chat style
This defines the visual style of the chat UI. Each option is a CSS file defined under `text-generation-webui/css/chat_style-name.css`, where "name" is the name that appears in the dropdown menu. You can add new styles by simply copying `chat_style-cai-chat.css` to `chat_style-myNewStyle.css` and editing the contents of the new file. If you end up with a style that you like, you are highly encouraged to submit it to the repository.
The styles are only applied to the chat and chat-instruct modes. Instruct mode has its own separate style, defined in `text-generation-webui/css/html_instruct_style.css`.
## Character gallery
This menu is a built-in extension defined under `text-generation-webui/extensions/gallery`. It displays a gallery with your characters, and if you click on a character, it will be automatically selected in the menu under "Parameters" > "Character".


@ -0,0 +1,35 @@
Used to generate raw completions starting from your prompt.
## Default tab
This tab contains two main text boxes: Input, where you enter your prompt, and Output, where the model output will appear.
### Input
The number on the lower right of the Input box counts the number of tokens in the input. It gets updated whenever you update the input text as long as a model is loaded (otherwise there is no tokenizer to count the tokens).
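As a rough illustration, this is how such a token count can be obtained with a Hugging Face tokenizer (the model name below is only an example; the web UI uses whatever tokenizer belongs to the loaded model):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example tokenizer
text = "Common sense questions and answers"
token_ids = tokenizer(text)["input_ids"]
print(len(token_ids))  # the number shown in the corner of the Input box
```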
Below the Input box, the following buttons can be found:
* **Generate**: starts a new generation.
* **Stop**: stops an ongoing generation as soon as the next token is generated (which can take a while for a slow model).
* **Continue**: starts a new generation taking as input the text in the "Output" box.
In the **Prompt** menu, you can select from some predefined prompts defined under `text-generation-webui/prompts`. The 💾 button saves your current input as a new prompt, the 🗑️ button deletes the selected prompt, and the 🔄 button refreshes the list. If you come up with an interesting prompt for a certain task, you are welcome to submit it to the repository.
### Output
Four tabs can be found:
* **Raw**: where the raw text generated by the model appears.
* **Markdown**: contains a "Render" button. You can click on it at any time to render the current output as Markdown. This is particularly useful for models that generate LaTeX equations, like GALACTICA.
* **HTML**: displays the output in an HTML style that is meant to be easier to read. Its style is defined under `text-generation-webui/css/html_readable_style.css`.
* **Logits**: when you click on "Get next token probabilities", this tab displays the 50 most likely next tokens and their probabilities based on your current input. If "Use samplers" is checked, the probabilities will be the ones after the sampling parameters in the "Parameters" > "Generation" tab are applied. Otherwise, they will be the raw probabilities generated by the model. A sketch of this computation appears after this list.
* **Tokens**: allows you to tokenize your prompt and see the ID numbers for the individual tokens.
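For reference, this is roughly what the raw (no samplers) next-token probability computation looks like with the `transformers` library; the model name is just an example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # example model; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # logits for the next token only
probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, k=50)                # the 50 most likely next tokens
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode([int(idx)])!r}: {p.item():.4f}")
```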
## Notebook tab
Precisely the same thing as the Default tab, with the difference that the output appears in the same text box as the input.
It contains the following additional button:
* **Regenerate**: uses your previous input for generation while discarding the last output.


@ -0,0 +1,120 @@
## Generation
Contains parameters that control the text generation.
### Quick rundown
LLMs work by generating one token at a time. Given your prompt, the model calculates the probabilities for every possible next token. The actual token generation is done after that.
* In *greedy decoding*, the most likely token is always picked.
* Most commonly, *sampling* techniques are used to choose from the next-token distribution in a less deterministic way, with the goal of improving the quality of the generated text.
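In code, the difference between the two strategies boils down to something like this (a toy distribution, for illustration only):

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])    # one logit per candidate token
probs = torch.softmax(logits, dim=-1)

greedy_token = torch.argmax(probs)              # greedy decoding: always the most likely token
sampled_token = torch.multinomial(probs, 1)     # sampling: a random draw from the distribution
print(int(greedy_token), int(sampled_token))
```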
### Preset menu
Can be used to save combinations of parameters for reuse.
The built-in presets were not manually chosen. They were obtained after a blind contest called "Preset Arena" where hundreds of people voted. The full results can be found [here](https://github.com/oobabooga/oobabooga.github.io/blob/main/arena/results.md).
A key takeaway is that the best presets are:
* **For Instruct**: Divine Intellect, Big O, simple-1, Space Alien, StarChat, Titanic, tfs-with-top-a, Asterism, Contrastive Search (only works for the Transformers loader at the moment).
* **For Chat**: Midnight Enigma, Yara, Shortwave.
The other presets are:
* Mirostat: a special decoding technique first implemented in llama.cpp and then adapted into this repository for all loaders. Many people have obtained positive results with it for chat.
* LLaMA-Precise: a legacy preset that was the default for the web UI before the Preset Arena.
* Debug-deterministic: disables sampling. It is useful for debugging, or if you intentionally want to use greedy decoding.
### Parameters description
For more information about the parameters, the [transformers documentation](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig) is a good reference.
* **max_new_tokens**: Maximum number of tokens to generate. Don't set it higher than necessary: it is used in the truncation calculation through the formula `(prompt_length) = min(truncation_length - max_new_tokens, prompt_length)`, so your prompt will get truncated if you set it too high.
* **temperature**: Primary factor to control the randomness of outputs. 0 = deterministic (only the most likely token is used). Higher value = more randomness.
* **top_p**: If not set to 1, select tokens with probabilities adding up to less than this number. Higher value = higher range of possible random results.
* **top_k**: Similar to top_p, but select instead only the top_k most likely tokens. Higher value = higher range of possible random results.
* **repetition_penalty**: Penalty factor for repeating prior tokens. 1 means no penalty, higher value = less repetition, lower value = more repetition.
* **repetition_penalty_range**: The number of most recent tokens to consider for repetition penalty. 0 makes all tokens be used.
* **typical_p**: If not set to 1, select only tokens that are at least this much more likely to appear than random tokens, given the prior text.
* **tfs**: Tries to detect a tail of low-probability tokens in the distribution and removes those tokens. See [this blog post](https://www.trentonbricken.com/Tail-Free-Sampling/) for details. The closer to 0, the more discarded tokens.
* **top_a**: Tokens with probability smaller than `(top_a) * (probability of the most likely token)^2` are discarded.
* **epsilon_cutoff**: In units of 1e-4; a reasonable value is 3. This sets a probability floor below which tokens are excluded from being sampled.
* **eta_cutoff**: In units of 1e-4; a reasonable value is 3. The main parameter of the special Eta Sampling technique. See [this paper](https://arxiv.org/pdf/2210.15191.pdf) for a description.
* **guidance_scale**: The main parameter for Classifier-Free Guidance (CFG). [The paper](https://arxiv.org/pdf/2306.17806.pdf) suggests that 1.5 is a good value. It can be used in conjunction with a negative prompt or not.
* **Negative prompt**: Only used when `guidance_scale != 1`. It is most useful for instruct models and custom system messages. You place your full prompt with the default system message for the model (like "You are Llama, a helpful assistant...") in this field to make the model pay more attention to your custom message.
* **penalty_alpha**: Contrastive Search is enabled by setting this to greater than zero and unchecking "do_sample". It should be used with a low value of top_k, for instance, top_k = 4.
* **mirostat_mode**: Activates the Mirostat sampling technique. It aims to control perplexity during sampling. See the [paper](https://arxiv.org/abs/2007.14966).
* **mirostat_tau**: No idea, see the paper for details. According to the Preset Arena, 8 is a good value.
* **mirostat_eta**: No idea, see the paper for details. According to the Preset Arena, 0.1 is a good value.
* **do_sample**: When unchecked, sampling is entirely disabled, and greedy decoding is used instead (the most likely token is always picked).
* **Seed**: Set the Pytorch seed to this number. Note that some loaders do not use Pytorch (notably llama.cpp), and others are not deterministic (notably ExLlama v1 and v2). For these loaders, the seed has no effect.
* **encoder_repetition_penalty**: Also known as the "Hallucinations filter". Used to penalize tokens that are *not* in the prior text. Higher value = more likely to stay in context, lower value = more likely to diverge.
* **no_repeat_ngram_size**: If not set to 0, specifies the length of token sets that are completely blocked from repeating at all. Higher values = blocks larger phrases, lower values = blocks words or letters from repeating. Only 0 or high values are a good idea in most cases.
* **min_length**: Minimum generation length in tokens. This is a built-in parameter in the transformers library that has never been very useful. Typically you want to check "Ban the eos_token" instead.
* **num_beams**: Number of beams for beam search. 1 means no beam search.
* **length_penalty**: Used by beam search only. `length_penalty > 0.0` promotes longer sequences, while `length_penalty < 0.0` encourages shorter sequences.
* **early_stopping**: Used by beam search only. When checked, the generation stops as soon as there are "num_beams" complete candidates; otherwise, a heuristic is applied and the generation stops when it is very unlikely to find better candidates (I just copied this from the transformers documentation and have never gotten beam search to generate good results).
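To make the most common parameters above more concrete, here is a simplified sketch of how temperature, top_k, and top_p are typically applied to the next-token distribution. This is illustrative only; real implementations (for instance in the transformers library) handle more parameters and edge cases:

```python
import torch

def sample_next_token(logits, temperature=0.7, top_k=40, top_p=0.9):
    logits = logits / max(temperature, 1e-8)          # temperature: flatten or sharpen the distribution
    probs = torch.softmax(logits, dim=-1)

    if top_k > 0:                                     # top_k: keep only the k most likely tokens
        kth_prob = torch.topk(probs, top_k).values[-1]
        probs[probs < kth_prob] = 0.0

    if top_p < 1.0:                                   # top_p: keep the smallest set of tokens whose
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)  # probabilities add up to top_p
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        sorted_probs[cumulative - sorted_probs > top_p] = 0.0
        probs = torch.zeros_like(probs).scatter(0, sorted_idx, sorted_probs)

    probs = probs / probs.sum()                       # renormalize and draw one token
    return torch.multinomial(probs, num_samples=1)
```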
To the right (or below if you are on mobile), the following parameters are present:
* **Truncate the prompt up to this length**: Used to prevent the prompt from getting bigger than the model's context length. In the case of the transformers loader, which allocates memory dynamically, this parameter can also be used to set a VRAM ceiling and prevent out-of-memory errors. This parameter is automatically updated with the model's context length (from "n_ctx" or "max_seq_len" for loaders that use these parameters, and from the model metadata directly for loaders that do not) when you load a model.
* **Maximum number of tokens/second**: to make text readable in real-time in case the model is generating too fast. Good if you want to flex and tell everyone how good your GPU is.
* **Custom stopping strings**: The model stops generating as soon as any of the strings set in this field is generated. Note that when generating text in the Chat tab, some default stopping strings are set regardless of this parameter, like "\nYour Name:" and "\nBot name:" for chat mode. That's why this parameter has a "Custom" in its name.
* **Custom token bans**: Allows you to ban the model from generating certain tokens altogether. You need to find the token IDs under "Default" > "Tokens" or "Notebook" > "Tokens", or by looking at the `tokenizer.json` for the model directly.
* **auto_max_new_tokens**: When checked, the max_new_tokens parameter is expanded in the backend to the available context length. The maximum length is given by the "truncation_length" parameter. This is useful for getting long replies in the Chat tab without having to click on "Continue" many times.
* **Ban the eos_token**: One of the possible tokens that a model can generate is the EOS (End of Sequence) token. When it is generated, the generation stops prematurely. When this parameter is checked, that token is banned from being generated, and the generation will always generate "max_new_tokens" tokens.
* **Add the bos_token to the beginning of prompts**: By default, the tokenizer will add a BOS (Beginning of Sequence) token to your prompt. During training, BOS tokens are used to separate different documents. If unchecked, no BOS token will be added, and the model will interpret your prompt as being in the middle of a document instead of at the start of one. This significantly changes the output and can make it more creative.
* **Skip special tokens**: When decoding the generated tokens, skip special tokens from being converted to their text representation. Otherwise, BOS appears as `<s>`, EOS as `</s>`, etc.
* **Activate text streaming**: When unchecked, the full response is output all at once, without streaming the words one at a time. I recommend unchecking this parameter on high-latency networks, such as when running the webui on Google Colab or using `--share`.
* **Load grammar from file**: Loads a GBNF grammar from a file under `text-generation-webui/grammars`. The output is written to the "Grammar" box below. You can also save and delete custom grammars using this menu.
* **Grammar**: Allows you to constrain the model output to a particular format. For instance, you can make the model generate lists, JSON, specific words, etc. Grammar is extremely powerful and I highly recommend it. The syntax looks a bit daunting at first sight, but it gets very easy once you understand it. See the [GBNF Guide](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md) for details.
## Character
Parameters that define the character that is used in the Chat tab when "chat" or "chat-instruct" are selected under "Mode".
* **Character**: A dropdown menu where you can select from saved characters, save a new character (💾 button), and delete the selected character (🗑️).
* **Your name**: Your name as it appears in the prompt.
* **Character's name**: The bot name as it appears in the prompt.
* **Context**: A string that is always at the top of the prompt. It never gets truncated. It usually defines the bot's personality and some key elements of the conversation.
* **Greeting**: An opening message for the bot. When set, it appears whenever you start a new chat.
* **Character picture**: A profile picture for the bot. To make it apply, you need to save the bot by clicking on 💾.
* **Your picture**: Your profile picture. It will be used in all conversations.
Note: the following replacements take place in the context and greeting fields when the chat prompt is generated:
* `{{char}}` and `<BOT>` get replaced with "Character's name".
* `{{user}}` and `<USER>` get replaced with "Your name".
So you can use those special placeholders in your character definitions. They are commonly found in TavernAI character cards.
## Instruction template
Defines the instruction template that is used in the Chat tab when "instruct" or "chat-instruct" are selected under "Mode".
* **Instruction template**: A dropdown menu where you can select from saved templates, save a new template (💾 button), and delete the currently selected template (🗑️).
* **User string**: In the turn template, `<|user|>` gets replaced with this string.
* **Bot string**: In the turn template, `<|bot|>` gets replaced with this string.
* **Context**: A string that appears as-is at the top of the prompt, including the new line characters at the end (if any). The system message for the model can be edited inside this string to customize its behavior.
* **Turn template**: Defines the positioning of spaces and new line characters in a single turn of the dialogue. `<|user-message|>` gets replaced with the user input and `<|bot-message|>` gets replaced with the bot reply. It is necessary to include `<|user|>` and `<|bot|>` even if "User string" and "Bot string" above are empty, as those placeholders are used to split the template in parts in the backend.
* **Send to default**: Send the full instruction template in string format to the Default tab.
* **Send to notebook**: Send the full instruction template in string format to the Notebook tab.
* **Send to negative prompt**: Send the full instruction template in string format to the "Negative prompt" field under "Parameters" > "Generation".
* **Command for chat-instruct mode**: The command that is used in chat-instruct mode to query the model to generate a reply on behalf of the character. Can be used creatively to generate specific kinds of responses.
## Chat history
In this tab, you can download the current chat history in JSON format and upload a previously saved chat history.
When a history is uploaded, a new chat is created to hold it. That is, you don't lose your current chat in the Chat tab.
## Upload character
### YAML or JSON
Allows you to upload characters in the YAML format used by the web UI, including optionally a profile picture.
### TavernAI PNG
Allows you to upload a TavernAI character card. It will be converted to the internal YAML format of the web UI after upload.

docs/04 ‐ Model Tab.md

@ -0,0 +1,145 @@
This is where you load models, apply LoRAs to a loaded model, and download new models.
## Model loaders
### Transformers
Loads: full precision (16-bit or 32-bit) models. The repository usually has a clean name without GGUF, EXL2, GPTQ, or AWQ in it, and the model files are named `pytorch_model.bin` or `model.safetensors`.
Example: [https://huggingface.co/lmsys/vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5).
Full precision models use a ton of VRAM, so you will usually want to select the "load_in_4bit" and "use_double_quant" options to load the model in 4-bit precision using bitsandbytes.
This loader can also load GPTQ models and train LoRAs with them. For that, make sure to check the "auto-devices" and "disable_exllama" options before loading the model.
Options:
* **gpu-memory**: When set to greater than 0, activates CPU offloading using the accelerate library, where part of the layers go to the CPU. The performance is very bad. Note that accelerate doesn't treat this parameter very literally, so if you want the VRAM usage to be at most 10 GiB, you may need to set this parameter to 9 GiB or 8 GiB. It can be used in conjunction with "load_in_8bit" but not with "load-in-4bit" as far as I'm aware.
* **cpu-memory**: Similarly to the parameter above, you can also set a limit on the amount of CPU memory used. Whatever doesn't fit either in the GPU or the CPU will go to a disk cache, so to use this option you should also check the "disk" checkbox.
* **compute_dtype**: Used when "load-in-4bit" is checked. I recommend leaving the default value.
* **quant_type**: Used when "load-in-4bit" is checked. I recommend leaving the default value.
* **alpha_value**: Used to extend the context length of a model with a minor loss in quality. I have measured 1.75 to be optimal for 1.5x context, and 2.5 for 2x context. That is, with alpha = 2.5 you can make a model with 4096 context length go to 8192 context length.
* **rope_freq_base**: Originally another way to write "alpha_value", it ended up becoming a necessary parameter for some models like CodeLlama, which was fine-tuned with this set to 1000000 and hence needs to be loaded with it set to 1000000 as well.
* **compress_pos_emb**: The first and original context-length extension method, discovered by [kaiokendev](https://kaiokendev.github.io/til). When set to 2, the context length is doubled; when set to 3, it is tripled, and so on. It should only be used for models that have been fine-tuned with this parameter set to a value other than 1. For models that have not been tuned to have a greater context length, alpha_value will lead to a smaller accuracy loss. A sketch of how these context-extension parameters relate appears after this list.
* **cpu**: Loads the model in CPU mode using Pytorch. The model will be loaded in 32-bit precision, so a lot of RAM will be used. CPU inference with transformers is older than llama.cpp and it works, but it's a lot slower. Note: this parameter has a different interpretation in the llama.cpp loader (see below).
* **load-in-8bit**: Load the model in 8-bit precision using bitsandbytes. The 8-bit kernel in that library has been optimized for training and not inference, so load-in-8bit is slower than load-in-4bit (but more accurate).
* **bf16**: Use bfloat16 precision instead of float16 (the default). Only applies when quantization is not used.
* **auto-devices**: When checked, the backend will try to guess a reasonable value for "gpu-memory" to allow you to load a model with CPU offloading. I recommend just setting "gpu-memory" manually instead. This parameter is also needed for loading GPTQ models, in which case it needs to be checked before loading the model.
* **disk**: Enable disk offloading for layers that don't fit into the GPU and CPU combined.
* **load-in-4bit**: Load the model in 4-bit precision using bitsandbytes.
* **trust-remote-code**: Some models use custom Python code to load the model or the tokenizer. For such models, this option needs to be set. It doesn't download any remote content: all it does is execute the .py files that get downloaded with the model. Those files can potentially include malicious code; I have never seen it happen, but it is in principle possible.
* **use_fast**: Use the "fast" version of the tokenizer. Especially useful for Llama models, which originally had a "slow" tokenizer that received an update. If your local files are in the old "slow" format, checking this option may trigger a conversion that takes several minutes. The fast tokenizer is mostly useful if you are generating 50+ tokens/second using ExLlama_HF or if you are tokenizing a huge dataset for training.
* **disable_exllama**: Only applies when you are loading a GPTQ model through the transformers loader. It needs to be checked if you intend to train LoRAs with the model.
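A rough sketch of how the context-extension parameters relate to each other is shown below. The 64/63 exponent (i.e. dim/(dim-2) with an assumed head dimension of 128) follows the common NTK-aware scaling convention and is an assumption, not something every loader is guaranteed to use:

```python
def rope_freq_base_from_alpha(alpha_value, original_base=10000):
    # NTK-aware scaling: raising the RoPE base is (approximately) equivalent to an alpha_value.
    return original_base * alpha_value ** (64 / 63)

def compressed_position(position, compress_pos_emb=1):
    # compress_pos_emb instead linearly squeezes the position indices,
    # which is why it only works well on models fine-tuned with it.
    return position / compress_pos_emb

print(rope_freq_base_from_alpha(2.5))   # roughly the base corresponding to ~2x context
```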
### ExLlama_HF
Loads: GPTQ models. They usually have GPTQ in the model name, or alternatively something like "-4bit-128g" in the name.
Example: https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ
ExLlama_HF is the v1 of ExLlama (https://github.com/turboderp/exllama) connected to the transformers library for sampling, tokenizing, and detokenizing. It is very fast and memory-efficient.
* **gpu-split**: If you have multiple GPUs, the amount of memory to allocate per GPU should be set in this field. Make sure to set a lower value for the first GPU, as that's where the cache is allocated.
* **max_seq_len**: The maximum sequence length for the model. In ExLlama, the cache is preallocated, so the higher this value, the higher the VRAM usage. It is automatically set to the maximum sequence length for the model based on its metadata, but you may need to lower this value to be able to fit the model into your GPU. After loading the model, the "Truncate the prompt up to this length" parameter under "Parameters" > "Generation" is automatically set to your chosen "max_seq_len" so that you don't have to set the same thing twice.
* **cfg-cache**: Creates a second cache to hold the CFG negative prompts. You need to set this if and only if you intend to use CFG in the "Parameters" > "Generation" tab. Checking this parameter doubles the cache VRAM usage.
### ExLlamav2_HF
Loads: GPTQ and EXL2 models. EXL2 models usually have "EXL2" in the model name.
Example: https://huggingface.co/turboderp/Llama2-70B-exl2
The parameters are the same as in ExLlama_HF.
### ExLlama
The same as ExLlama_HF but using the internal samplers of ExLlama instead of the ones in the Transformers library.
### ExLlamav2
The same as ExLlamav2_HF but using the internal samplers of ExLlamav2 instead of the ones in the Transformers library.
### AutoGPTQ
Loads: GPTQ models.
* **wbits**: For ancient models without proper metadata, sets the model precision in bits manually. Can usually be ignored.
* **groupsize**: For ancient models without proper metadata, sets the model group size manually. Can usually be ignored.
* **triton**: Only available on Linux. Necessary to use models with both act-order and groupsize simultaneously. Note that ExLlama can load these same models on Windows without triton.
* **no_inject_fused_attention**: Disables fused attention, which reduces VRAM usage at the cost of some performance.
* **no_inject_fused_mlp**: Similar to the previous parameter but for Triton only.
* **no_use_cuda_fp16**: On some systems, the performance can be very bad with this unset. Can usually be ignored.
* **desc_act**: For ancient models without proper metadata, sets the model "act-order" parameter manually. Can usually be ignored.
### GPTQ-for-LLaMa
Loads: GPTQ models.
Ancient loader, the first one to implement 4-bit quantization. It works on older GPUs for which ExLlama and AutoGPTQ do not work, and it doesn't work with "act-order", so you should use it with simple 4-bit-128g models.
* **pre_layer**: Used for CPU offloading. The higher the number, the more layers will be sent to the GPU. GPTQ-for-LLaMa CPU offloading was faster than the one implemented in AutoGPTQ the last time I checked.
### llama.cpp
Loads: GGUF models. Note: GGML models have been deprecated and do not work anymore.
Example: https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF
* **n-gpu-layers**: The number of layers to allocate to the GPU. If set to 0, only the CPU will be used. If you want to offload all layers, you can simply set this to the maximum value.
* **n-ctx**: Context length of the model. In llama.cpp, the context is preallocated, so the higher this value, the higher the RAM/VRAM usage will be. It gets automatically updated with the value in the GGUF metadata for the model when you select it in the Model dropdown.
* **threads**: Number of threads. Recommended value: your number of physical cores.
* **threads_batch**: Number of threads for batch processing. Recommended value: your total number of cores (physical + virtual).
* **n_batch**: Batch size for prompt processing. Higher values are supposed to make generation faster, but I have never obtained any benefit from changing this value.
* **mul_mat_q**: Use the mul_mat_q kernel. This usually improves generation speed significantly.
* **no-mmap**: Loads the model into memory at once, possibly preventing I/O operations later on at the cost of a longer load time.
* **mlock**: Force the system to keep the model in RAM rather than swapping or compressing (no idea what this means, never used it).
* **numa**: May improve performance on certain multi-cpu systems.
* **cpu**: Force a version of llama.cpp compiled without GPU acceleration to be used. Can usually be ignored. Only set this if you want to use CPU only and llama.cpp doesn't work otherwise.
* **tensor_split**: For multi-gpu only. Sets the amount of memory to allocate per GPU.
* **Seed**: The seed for the llama.cpp random number generator. Not very useful as it can only be set once (that I'm aware).
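For reference, most of these options map directly onto the llama-cpp-python bindings. A hedged sketch of loading a GGUF model with those bindings directly (the file name and values are placeholders):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-7b-chat.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=35,    # n-gpu-layers: 0 = CPU only
    n_ctx=4096,         # n-ctx: preallocated context length
    n_threads=8,        # threads: number of physical cores
    n_batch=512,        # n_batch: prompt-processing batch size
    use_mmap=True,      # no-mmap corresponds to use_mmap=False
    use_mlock=False,    # mlock
    seed=-1,            # -1 = random seed
)
print(llm("Hello, my name is", max_tokens=16)["choices"][0]["text"])
```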
### llamacpp_HF
The same as llama.cpp but with transformers samplers, and using the transformers tokenizer instead of the internal llama.cpp tokenizer.
To use it, you need to download a tokenizer. There are two options:
1) Download `oobabooga/llama-tokenizer` under "Download model or LoRA". That's a default Llama tokenizer.
2) Place your .gguf in a subfolder of `models/` along with these 3 files: `tokenizer.model`, `tokenizer_config.json`, and `special_tokens_map.json`. This takes precedence over Option 1.
### ctransformers
Loads: GGUF/GGML models.
Similar to llama.cpp but it works for certain GGUF/GGML models not originally supported by llama.cpp like Falcon, StarCoder, StarChat, and GPT-J.
### AutoAWQ
Loads: AWQ models.
Example: https://huggingface.co/TheBloke/Phind-CodeLlama-34B-v2-AWQ
The parameters are overall similar to AutoGPTQ.
## Model dropdown
Here you can select a model to be loaded, refresh the list of available models (🔄), load/unload/reload the selected model, and save the settings for the model. The "settings" are the values in the input fields (checkboxes, sliders, dropdowns) below this dropdown.
After saving, those settings will get restored whenever you select that model again in the dropdown menu.
If the **Autoload the model** checkbox is selected, the model will be loaded as soon as it is selected in this menu. Otherwise, you will have to click on the "Load" button.
## LoRA dropdown
Used to apply LoRAs to the model. Note that LoRA support is not implemented for all loaders. Check this [page](https://github.com/oobabooga/text-generation-webui/wiki) for details.
## Download model or LoRA
Here you can download a model or LoRA directly from the https://huggingface.co/ website.
* Models will be saved to `text-generation-webui/models`.
* LoRAs will be saved to `text-generation-webui/loras`.
In the input field, you can enter either the Hugging Face username/model path (like `facebook/galactica-125m`) or the full model URL (like `https://huggingface.co/facebook/galactica-125m`). To specify a branch, add it at the end after a ":" character like this: `facebook/galactica-125m:main`.
To download a single file, as necessary for models in GGUF format, you can click on "Get file list" after entering the model path in the input field, and then copy and paste the desired file name in the "File name" field before clicking on "Download".
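If you prefer to script this, a single file can also be fetched with the huggingface_hub library; the repository and file names below are examples only:

```python
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7b-Chat-GGUF",     # example repository
    filename="llama-2-7b-chat.Q4_K_M.gguf",      # example file taken from "Get file list"
    local_dir="models",
)
print(path)
```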


@ -4,7 +4,6 @@ The WebUI seeks to make training your own LoRAs as easy as possible. It comes do
### **Step 1**: Make a plan.
- What base model do you want to use? The LoRA you make has to be matched up to a single architecture (eg LLaMA-13B) and cannot be transferred to others (eg LLaMA-7B, StableLM, etc. would all be different). Derivatives of the same model (eg Alpaca finetune of LLaMA-13B) might be transferrable, but even then it's best to train exactly on what you plan to use.
- What model format do you want? At time of writing, 8-bit models are most stable, and 4-bit are supported but experimental. In the near future it is likely that 4-bit will be the best option for most users.
- What are you training it on? Do you want it to learn real information, a simple format, ...?
### **Step 2**: Gather a dataset.
@ -138,37 +137,3 @@ The [4-bit LoRA monkeypatch](GPTQ-models-(4-bit-mode).md#using-loras-in-4-bit-mo
- Models do funky things. LoRAs apply themselves, or refuse to apply, or spontaneously error out, or etc. It can be helpful to reload base model or restart the WebUI between training/usage to minimize chances of anything going haywire.
- Loading or working with multiple LoRAs at the same time doesn't currently work.
- Generally, recognize and treat the monkeypatch as the dirty temporary hack it is - it works, but isn't very stable. It will get better in time when everything is merged upstream for full official support.
## Legacy notes
LoRA training was contributed by [mcmonkey4eva](https://github.com/mcmonkey4eva) in PR [#570](https://github.com/oobabooga/text-generation-webui/pull/570).
### Using the original alpaca-lora code
Kept here for reference. The Training tab has much more features than this method.
```
conda activate textgen
git clone https://github.com/tloen/alpaca-lora
```
Edit those two lines in `alpaca-lora/finetune.py` to use your existing model folder instead of downloading everything from decapoda:
```
model = LlamaForCausalLM.from_pretrained(
    "models/llama-7b",
    load_in_8bit=True,
    device_map="auto",
)
tokenizer = LlamaTokenizer.from_pretrained(
    "models/llama-7b", add_eos_token=True
)
```
Run the script with:
```
python finetune.py
```
It just works. It runs at 22.32s/it, with 1170 iterations in total, so about 7 hours and a half for training a LoRA. RTX 3090, 18153MiB VRAM used, drawing maximum power (350W, room heater mode).


@ -0,0 +1,32 @@
Here you can restart the UI with new settings.
* **Available extensions**: shows a list of extensions available under `text-generation-webui/extensions`.
* **Boolean command-line flags**: shows command-line flags of bool (true/false) type.
After selecting your desired flags and extensions, you can restart the UI by clicking on **Apply flags/extensions and restart**.
## Install or update an extension
In this field, you can enter the GitHub URL for an extension and press Enter to either install it (i.e., clone it into `text-generation-webui/extensions`) or update it with `git pull` if it is already cloned.
Note that some extensions may include additional Python requirements. In this case, to install those you have to run the command
```
pip install -r extensions/extension-name/requirements.txt
```
or
```
pip install -r extensions\extension-name\requirements.txt
```
if you are on Windows.
If you used the one-click installer, this command should be executed in the terminal window that appears when you run the "cmd_" script for your OS.
## Saving UI defaults
The **Save UI defaults to settings.yaml** button gathers the visible values in the UI and saves them to settings.yaml so that your settings will persist across multiple restarts of the UI.
Note that preset parameters like temperature are not individually saved, so you need to first save your preset and select it in the preset menu before saving the defaults.


@ -1,3 +1,63 @@
## Audio notification
If your computer takes a long time to generate each response for the model that you are using, you can enable an audio notification for when the response is completed. This feature was kindly contributed by HappyWorldGames in [#1277](https://github.com/oobabooga/text-generation-webui/pull/1277).
### Installation
Simply place a file called "notification.mp3" in the same folder as `server.py`. Here you can find some examples:
* https://pixabay.com/sound-effects/search/ding/?duration=0-30
* https://pixabay.com/sound-effects/search/notification/?duration=0-30
Source: https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/1126
This file will be automatically detected the next time you start the web UI.
## Using LoRAs with GPTQ-for-LLaMa
This requires using a monkey patch that is supported by this web UI: https://github.com/johnsmith0031/alpaca_lora_4bit
To use it:
Install alpaca_lora_4bit using pip:
```
git clone https://github.com/johnsmith0031/alpaca_lora_4bit.git
cd alpaca_lora_4bit
git fetch origin winglian-setup_pip
git checkout winglian-setup_pip
pip install .
```
Start the UI with the --monkey-patch flag:
```
python server.py --model llama-7b-4bit-128g --listen --lora tloen_alpaca-lora-7b --monkey-patch
```
## DeepSpeed
`DeepSpeed ZeRO-3` is an alternative offloading strategy for full-precision (16-bit) transformers models.
With this, I have been able to load a 6b model (GPT-J 6B) with less than 6GB of VRAM. The speed of text generation is very decent and much better than what would be accomplished with `--auto-devices --gpu-memory 6`.
As far as I know, DeepSpeed is only available for Linux at the moment.
### How to use it
1. Install DeepSpeed:
```
conda install -c conda-forge mpi4py mpich
pip install -U deepspeed
```
2. Start the web UI replacing `python` with `deepspeed --num_gpus=1` and adding the `--deepspeed` flag. Example:
```
deepspeed --num_gpus=1 server.py --deepspeed --chat --model gpt-j-6B
```
> RWKV: RNN with Transformer-level LLM Performance
>
> It combines the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding (using the final hidden state).
@ -70,3 +130,26 @@ See the README for the PyPl package for more details: https://pypi.org/project/r
#### Compiling the CUDA kernel
You can compile the CUDA kernel for the model with `--rwkv-cuda-on`. This should improve the performance a lot but I haven't been able to get it to work yet.
## Miscellaneous info
### You can train LoRAs in CPU mode
Load the web UI with
```
python server.py --cpu
```
and start training the LoRA from the training tab as usual.
### You can check the sha256sum of downloaded models with the download script
```
python download-model.py facebook/galactica-125m --check
```
### The download script continues interrupted downloads by default
It doesn't start over.


@ -17,12 +17,12 @@ cp docker/.env.example .env
docker compose up --build
```
## Table of contents
* [Docker Compose installation instructions](#docker-compose-installation-instructions)
* [Repository with additional Docker files](#dedicated-docker-repository)
## Docker Compose installation instructions
By [@loeken](https://github.com/loeken).
@ -52,21 +52,21 @@ By [@loeken](https://github.com/loeken).
- [7. startup](#7-startup)
- [notes](#notes)
### Ubuntu 22.04
#### 0. youtube video
A video walking you through the setup can be found here:
[![oobabooga text-generation-webui setup in docker on ubuntu 22.04](https://img.youtube.com/vi/ELkKWYh8qOk/0.jpg)](https://www.youtube.com/watch?v=ELkKWYh8qOk)
#### 1. update the drivers
in the "software updater", update the drivers to the latest version of the proprietary driver.
#### 2. reboot
to switch to the new driver
#### 3. install docker
```bash
sudo apt update
sudo apt-get install curl
@ -82,7 +82,7 @@ sudo usermod -aG docker $USER
newgrp docker
```
#### 4. docker & container toolkit
```bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/stable/ubuntu22.04/amd64 /" | \
@ -92,13 +92,13 @@ sudo apt install nvidia-docker2 nvidia-container-runtime -y
sudo systemctl restart docker
```
#### 5. clone the repo
```
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
```
#### 6. prepare models
download and place the models inside the models folder. tested with:
4bit
@ -108,30 +108,30 @@ https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483941
8bit:
https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1484235789
#### 7. prepare .env file
edit .env values to your needs.
```bash
cp .env.example .env
nano .env
```
#### 8. startup docker container
```bash
docker compose up --build
```
### Manjaro
manjaro/arch is similar to ubuntu, just the dependency installation is more convenient
#### update the drivers
```bash
sudo mhwd -a pci nonfree 0300
```
#### reboot
```bash
reboot
```
#### docker & container toolkit
```bash
yay -S docker docker-compose buildkit gcc nvidia-docker
sudo usermod -aG docker $USER
@ -139,32 +139,32 @@ newgrp docker
sudo systemctl restart docker # required by nvidia-container-runtime
```
#### continue with ubuntu task
continue at [5. clone the repo](#5-clone-the-repo)
### Windows
#### 0. youtube video
A video walking you through the setup can be found here:
[![oobabooga text-generation-webui setup in docker on windows 11](https://img.youtube.com/vi/ejH4w5b5kFQ/0.jpg)](https://www.youtube.com/watch?v=ejH4w5b5kFQ)
#### 1. choco package manager
install package manager (https://chocolatey.org/)
```
Set-ExecutionPolicy Bypass -Scope Process -Force; [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))
```
#### 2. install drivers/dependencies
```
choco install nvidia-display-driver cuda git docker-desktop
```
#### 3. install wsl
wsl --install
#### 4. reboot
after reboot enter username/password in wsl
#### 5. git clone && startup
clone the repo and edit .env values to your needs.
```
cd Desktop
@ -174,19 +174,19 @@ COPY .env.example .env
notepad .env
```
#### 6. prepare models
download and place the models inside the models folder. tested with:
4bit https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483891617 https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483941105
8bit: https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1484235789
#### 7. startup
```
docker compose up
```
### notes
on older ubuntus you can manually install the docker compose plugin like this:
```
@ -197,7 +197,6 @@ chmod +x $DOCKER_CONFIG/cli-plugins/docker-compose
export PATH="$HOME/.docker/cli-plugins:$PATH"
```
## Dedicated docker repository
An external repository maintains a docker wrapper for this project as well as several pre-configured 'one-click' `docker compose` variants (e.g., updated branches of GPTQ). It can be found at: [Atinoda/text-generation-webui-docker](https://github.com/Atinoda/text-generation-webui-docker).
View File
@ -1,10 +1,69 @@
## WSL instructions
If you do not have WSL installed, follow the [instructions below](https://github.com/oobabooga/text-generation-webui/wiki/10-%E2%80%90-WSL#wsl-installation) first.
### Additional WSL setup info
If you want to install Linux to a drive other than C, open PowerShell and enter these commands:
```
cd D:\Path\To\Linux
$ProgressPreference = 'SilentlyContinue'
Invoke-WebRequest -Uri <LinuxDistroURL> -OutFile Linux.appx -UseBasicParsing
mv Linux.appx Linux.zip
```
Then open Linux.zip and you should see several .appx files inside.
The one with _x64.appx contains the exe installer that you need.
Extract the contents of that _x64.appx file and run <distro>.exe to install.
Linux Distro URLs: https://learn.microsoft.com/en-us/windows/wsl/install-manual#downloading-distributions
**ENSURE THAT THE WSL LINUX DISTRO THAT YOU WISH TO USE IS SET AS THE DEFAULT!**
Do this by using these commands:
```
wsl -l
wsl -s <DistroName>
```
### Web UI Installation
Run the "start" script. By default it will install the web UI in WSL:
/home/{username}/text-gen-install
To launch the web UI in the future after it is already installed, run
the same "start" script. Ensure that one_click.py and wsl.sh are next to it!
### Updating the web UI
As an alternative to running the "update" script, you can also run "wsl.sh update" in WSL.
### Running an interactive shell
As an alternative to running the "cmd" script, you can also run "wsl.sh cmd" in WSL.
### Changing the default install location
To change this, you will need to edit the scripts as follows:
wsl.sh: line ~22 INSTALL_DIR="/path/to/install/dir"
Keep in mind that there is a long-standing bug in WSL that significantly
slows drive read/write speeds when using a physical drive as opposed to
the virtual one that Linux is installed in.
## WSL installation
Guide created by [@jfryton](https://github.com/jfryton). Thank you jfryton.
-----
Here's an easy-to-follow, step-by-step guide for installing Windows Subsystem for Linux (WSL) with Ubuntu on Windows 10/11:
### Step 1: Enable WSL
1. Press the Windows key + X and click on "Windows PowerShell (Admin)" or "Windows Terminal (Admin)" to open PowerShell or Terminal with administrator privileges.
2. In the PowerShell window, type the following command and press Enter:
@ -27,19 +86,19 @@ wsl --set-default-version 2
You may be prompted to restart your computer. If so, save your work and restart.
### Step 2: Install Ubuntu
1. Open the Microsoft Store.
2. Search for "Ubuntu" in the search bar.
3. Choose the desired Ubuntu version (e.g., Ubuntu 20.04 LTS) and click "Get" or "Install" to download and install the Ubuntu app.
4. Once the installation is complete, click "Launch" or search for "Ubuntu" in the Start menu and open the app.
### Step 3: Set up Ubuntu
1. When you first launch the Ubuntu app, it will take a few minutes to set up. Be patient as it installs the necessary files and sets up your environment.
2. Once the setup is complete, you will be prompted to create a new UNIX username and password. Choose a username and password, and make sure to remember them, as you will need them for future administrative tasks within the Ubuntu environment.
### Step 4: Update and upgrade packages
1. After setting up your username and password, it's a good idea to update and upgrade your Ubuntu system. Run the following commands in the Ubuntu terminal:
@ -54,7 +113,7 @@ Congratulations! You have now installed WSL with Ubuntu on your Windows 10/11 sy
You can launch your WSL Ubuntu installation by selecting the Ubuntu app (like any other program installed on your computer) or typing 'ubuntu' into PowerShell or Terminal.
### Step 5: Proceed with Linux instructions
1. You can now follow the Linux setup instructions. If you receive any error messages about a missing tool or package, just install them using apt:
@ -70,13 +129,15 @@ sudo apt install build-essential
If you face any issues or need to troubleshoot, you can always refer to the official Microsoft documentation for WSL: https://docs.microsoft.com/en-us/windows/wsl/
### WSL2 performance using /mnt:
When you git clone a repository, put it inside WSL and not outside. To understand more, take a look at this [issue](https://github.com/microsoft/WSL/issues/4197#issuecomment-604592340)
### Bonus: Port Forwarding
By default, you won't be able to access the web UI from another device on your local network. You will need to set up the appropriate port forwarding using the following command (using PowerShell or Terminal with administrator privileges).
```
netsh interface portproxy add v4tov4 listenaddress=0.0.0.0 listenport=7860 connectaddress=localhost connectport=7860
```
13
docs/11 ‐ AMD Setup.md Normal file
View File
@ -0,0 +1,13 @@
## Using an AMD GPU in Linux
Requires ROCm SDK 5.4.2 or 5.4.3 to be installed. Some systems may also
need:
```
sudo apt-get install libstdc++-12-dev
```
Edit the "one_click.py" script using a text editor and un-comment and
modify the lines near the top of the script according to your setup. In
particular, modify the `os.environ["ROCM_PATH"] = '/opt/rocm'` line to
point to your ROCm installation.
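For illustration, here is a minimal sketch of what those un-commented lines might look like after editing. The `HSA_OVERRIDE_GFX_VERSION` line is an assumption that only applies to some consumer GPUs; omit or adjust it for your hardware.
```
import os

# Point this at your ROCm installation, as described above.
os.environ["ROCM_PATH"] = "/opt/rocm"
# Assumed example: some consumer GPUs need a GFX version override; omit if yours does not.
os.environ["HSA_OVERRIDE_GFX_VERSION"] = "10.3.0"
```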
View File
@ -1,14 +0,0 @@
# Audio notification
If your computer takes a long time to generate each response for the model that you are using, you can enable an audio notification for when the response is completed. This feature was kindly contributed by HappyWorldGames in [#1277](https://github.com/oobabooga/text-generation-webui/pull/1277).
### Installation
Simply place a file called "notification.mp3" in the same folder as `server.py`. Here you can find some examples:
* https://pixabay.com/sound-effects/search/ding/?duration=0-30
* https://pixabay.com/sound-effects/search/notification/?duration=0-30
Source: https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/1126
This file will be automatically detected the next time you start the web UI.
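As a rough sketch of what that detection amounts to (illustrative only, not the web UI's actual code):
```
from pathlib import Path

# The file only needs to sit next to server.py; nothing else has to be configured.
notification_file = Path("notification.mp3")
if notification_file.exists():
    print(f"Audio notification enabled: {notification_file.resolve()}")
else:
    print("No notification.mp3 found; the audio notification stays disabled.")
```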
View File
@ -1,39 +0,0 @@
## Chat characters
Custom chat mode characters are defined by `.yaml` files inside the `characters` folder. An example is included: [Example.yaml](https://github.com/oobabooga/text-generation-webui/blob/main/characters/Example.yaml).
The following fields may be defined:
| Field | Description |
|-------|-------------|
| `name` or `bot` | The character's name. |
| `context` | A string that appears at the top of the prompt. It usually contains a description of the character's personality and a few example messages. |
| `greeting` (optional) | The character's opening message. It appears when the character is first loaded or when the history is cleared. |
| `your_name` or `user` (optional) | Your name. This overwrites what you had previously written in the `Your name` field in the interface. |
#### Special tokens
The following replacements happen when the prompt is generated, and they apply to the `context` and `greeting` fields:
* `{{char}}` and `<BOT>` get replaced with the character's name.
* `{{user}}` and `<USER>` get replaced with your name.
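A minimal sketch of that substitution, using a hypothetical helper rather than the web UI's exact implementation:
```
def substitute_tokens(text, char_name, user_name):
    # Replace the character placeholders, then the user placeholders.
    for token in ("{{char}}", "<BOT>"):
        text = text.replace(token, char_name)
    for token in ("{{user}}", "<USER>"):
        text = text.replace(token, user_name)
    return text

print(substitute_tokens("{{user}}: Hi, {{char}}!", "Example", "You"))  # You: Hi, Example!
```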
#### How do I add a profile picture for my character?
Put an image with the same name as your character's `.yaml` file into the `characters` folder. For example, if your bot is `Character.yaml`, add `Character.jpg` or `Character.png` to the folder.
#### Is the chat history truncated in the prompt?
Once your prompt reaches the `truncation_length` parameter (2048 by default), old messages will be removed one at a time. The context string will always stay at the top of the prompt and will never get truncated.
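A simplified sketch of that behavior, using a stand-in tokenizer instead of the real one:
```
def count_tokens(text):
    return len(text.split())  # stand-in for the model's real tokenizer

def build_prompt(context, history, truncation_length=2048):
    # Drop the oldest messages one at a time until the prompt fits;
    # the context string always stays at the top and is never truncated.
    while history and count_tokens(context + "".join(history)) > truncation_length:
        history = history[1:]
    return context + "".join(history)
```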
## Chat styles
Custom chat styles can be defined in the `text-generation-webui/css` folder. Simply create a new file with a name starting with `chat_style-` and ending in `.css`, and it will automatically appear in the "Chat style" dropdown menu in the interface. Examples:
```
chat_style-cai-chat.css
chat_style-TheEncrypted777.css
chat_style-wpp.css
```
You should use the same class names as in `chat_style-cai-chat.css` in your custom style.
View File
@ -1,24 +0,0 @@
An alternative way of reducing the GPU memory usage of models is to use the `DeepSpeed ZeRO-3` optimization.
With this, I have been able to load a 6b model (GPT-J 6B) with less than 6GB of VRAM. The speed of text generation is very decent and much better than what would be accomplished with `--auto-devices --gpu-memory 6`.
As far as I know, DeepSpeed is only available for Linux at the moment.
### How to use it
1. Install DeepSpeed:
```
conda install -c conda-forge mpi4py mpich
pip install -U deepspeed
```
2. Start the web UI replacing `python` with `deepspeed --num_gpus=1` and adding the `--deepspeed` flag. Example:
```
deepspeed --num_gpus=1 server.py --deepspeed --chat --model gpt-j-6B
```
### Learn more
For more information, check out [this comment](https://github.com/oobabooga/text-generation-webui/issues/40#issuecomment-1412038622) by 81300, who came up with the DeepSpeed support in this web UI.
View File
@ -1,22 +0,0 @@
# ExLlama
### About
ExLlama is an extremely optimized GPTQ backend for LLaMA models. It features much lower VRAM usage and much higher speeds due to not relying on unoptimized transformers code.
### Usage
Configure text-generation-webui to use exllama via the UI or command line:
- In the "Model" tab, set "Loader" to "exllama"
- Specify `--loader exllama` on the command line
### Manual setup
No additional installation steps are necessary since an exllama package is already included in the requirements.txt. If this package fails to install for some reason, you can install it manually by cloning the original repository into your `repositories/` folder:
```
mkdir repositories
cd repositories
git clone https://github.com/turboderp/exllama
```
View File
@ -1,182 +0,0 @@
GPTQ is a clever quantization algorithm that lightly reoptimizes the weights during quantization so that the accuracy loss is compensated relative to a round-to-nearest quantization. See the paper for more details: https://arxiv.org/abs/2210.17323
4-bit GPTQ models reduce VRAM usage by about 75%. So LLaMA-7B fits into a 6GB GPU, and LLaMA-30B fits into a 24GB GPU.
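As a back-of-the-envelope check of those numbers (weights only, ignoring activations, the KV cache, and group-size overhead):
```
params = 7e9                          # LLaMA-7B
fp16_gib = params * 2 / 2**30         # ~13 GiB of weights at 16 bits each
gptq4_gib = params * 0.5 / 2**30      # ~3.3 GiB of weights at 4 bits each
print(f"16-bit: {fp16_gib:.1f} GiB | 4-bit: {gptq4_gib:.1f} GiB | reduction: {1 - gptq4_gib / fp16_gib:.0%}")
```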
## Overview
There are two ways of loading GPTQ models in the web UI at the moment:
* Using AutoGPTQ:
* supports more models
* standardized (no need to guess any parameter)
* is a proper Python library
* ~no wheels are presently available so it requires manual compilation~
* supports loading both triton and cuda models
* Using GPTQ-for-LLaMa directly:
* faster CPU offloading
* faster multi-GPU inference
* supports loading LoRAs using a monkey patch
* requires you to manually figure out the wbits/groupsize/model_type parameters for the model to be able to load it
* supports either only cuda or only triton depending on the branch
For creating new quantizations, I recommend using AutoGPTQ: https://github.com/PanQiWei/AutoGPTQ
## AutoGPTQ
### Installation
No additional steps are necessary as AutoGPTQ is already in the `requirements.txt` for the webui. If you still want or need to install it manually for whatever reason, these are the commands:
```
conda activate textgen
git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
pip install .
```
The last command requires `nvcc` to be installed (see the [instructions above](https://github.com/oobabooga/text-generation-webui/blob/main/docs/GPTQ-models-(4-bit-mode).md#step-1-install-nvcc)).
### Usage
When you quantize a model using AutoGPTQ, a folder containing a file called `quantize_config.json` will be generated. Place that folder inside your `models/` folder and load it with the `--autogptq` flag:
```
python server.py --autogptq --model model_name
```
Alternatively, check the `autogptq` box in the "Model" tab of the UI before loading the model.
### Offloading
In order to do CPU offloading or multi-gpu inference with AutoGPTQ, use the `--gpu-memory` flag. It is currently somewhat slower than offloading with the `--pre_layer` option in GPTQ-for-LLaMA.
For CPU offloading:
```
python server.py --autogptq --gpu-memory 3000MiB --model model_name
```
For multi-GPU inference:
```
python server.py --autogptq --gpu-memory 3000MiB 6000MiB --model model_name
```
### Using LoRAs with AutoGPTQ
Works fine for a single LoRA.
## GPTQ-for-LLaMa
GPTQ-for-LLaMa is the original adaptation of GPTQ for the LLaMA model. It was made possible by [@qwopqwop200](https://github.com/qwopqwop200/GPTQ-for-LLaMa): https://github.com/qwopqwop200/GPTQ-for-LLaMa
A Python package containing both major CUDA versions of GPTQ-for-LLaMa is used to simplify installation and compatibility: https://github.com/jllllll/GPTQ-for-LLaMa-CUDA
### Precompiled wheels
Kindly provided by our friend jllllll: https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/releases
Wheels are included in requirements.txt and are installed with the webui on supported systems.
### Manual installation
#### Step 1: install nvcc
```
conda activate textgen
conda install cuda -c nvidia/label/cuda-11.7.1
```
The command above takes some 10 minutes to run and shows no progress bar or updates along the way.
You are also going to need to have a C++ compiler installed. On Linux, `sudo apt install build-essential` or equivalent is enough. On Windows, Visual Studio or Visual Studio Build Tools is required.
If you're using an older version of CUDA toolkit (e.g. 11.7) but the latest version of `gcc` and `g++` (12.0+) on Linux, you should downgrade with: `conda install -c conda-forge gxx==11.3.0`. Kernel compilation will fail otherwise.
#### Step 2: compile the CUDA extensions
```
python -m pip install git+https://github.com/jllllll/GPTQ-for-LLaMa-CUDA -v
```
### Getting pre-converted LLaMA weights
* Direct download (recommended):
https://huggingface.co/Neko-Institute-of-Science/LLaMA-7B-4bit-128g
https://huggingface.co/Neko-Institute-of-Science/LLaMA-13B-4bit-128g
https://huggingface.co/Neko-Institute-of-Science/LLaMA-30B-4bit-128g
https://huggingface.co/Neko-Institute-of-Science/LLaMA-65B-4bit-128g
These models were converted with `desc_act=True`. They work just fine with ExLlama. For AutoGPTQ, they will only work on Linux with the `triton` option checked.
* Torrent:
https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483891617
https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483941105
These models were converted with `desc_act=False`. As such, they are less accurate, but they work with AutoGPTQ on Windows. The `128g` versions are better from 13b upwards, and worse for 7b. The tokenizer files in the torrents are outdated, in particular the files called `tokenizer_config.json` and `special_tokens_map.json`. Here you can find those files: https://huggingface.co/oobabooga/llama-tokenizer
### Starting the web UI:
Use the `--gptq-for-llama` flag.
For the models converted without `group-size`:
```
python server.py --model llama-7b-4bit --gptq-for-llama
```
For the models converted with `group-size`:
```
python server.py --model llama-13b-4bit-128g --gptq-for-llama --wbits 4 --groupsize 128
```
The command-line flags `--wbits` and `--groupsize` are automatically detected based on the folder names in many cases.
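A hedged sketch of how that folder-name detection can work; the pattern matching below is illustrative and not necessarily the web UI's exact rules:
```
import re

def detect_gptq_params(model_name):
    # e.g. "llama-13b-4bit-128g" -> (4, 128); missing values come back as None.
    wbits = re.search(r"(\d+)bit", model_name)
    groupsize = re.search(r"(\d+)g\b", model_name)
    return (
        int(wbits.group(1)) if wbits else None,
        int(groupsize.group(1)) if groupsize else None,
    )

print(detect_gptq_params("llama-13b-4bit-128g"))  # (4, 128)
print(detect_gptq_params("llama-7b-4bit"))        # (4, None)
```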
### CPU offloading
It is possible to offload part of the layers of the 4-bit model to the CPU with the `--pre_layer` flag. The higher the number after `--pre_layer`, the more layers will be allocated to the GPU.
With this command, I can run llama-7b with 4GB VRAM:
```
python server.py --model llama-7b-4bit --pre_layer 20
```
This is the performance:
```
Output generated in 123.79 seconds (1.61 tokens/s, 199 tokens)
```
You can also use multiple GPUs with `pre_layer` if using the oobabooga fork of GPTQ, e.g. `--pre_layer 30 60` will load a LLaMA-30B model half onto your first GPU and half onto your second, while `--pre_layer 20 40` will load 20 layers onto GPU-0, 20 layers onto GPU-1, and offload 20 layers to the CPU.
### Using LoRAs with GPTQ-for-LLaMa
This requires using a monkey patch that is supported by this web UI: https://github.com/johnsmith0031/alpaca_lora_4bit
To use it:
1. Install alpaca_lora_4bit using pip
```
git clone https://github.com/johnsmith0031/alpaca_lora_4bit.git
cd alpaca_lora_4bit
git fetch origin winglian-setup_pip
git checkout winglian-setup_pip
pip install .
```
2. Start the UI with the `--monkey-patch` flag:
```
python server.py --model llama-7b-4bit-128g --listen --lora tloen_alpaca-lora-7b --monkey-patch
```
View File
@ -1,71 +0,0 @@
# Generation Parameters
For a technical description of the parameters, the [transformers documentation](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig) is a good reference.
The best presets, according to the [Preset Arena](https://github.com/oobabooga/oobabooga.github.io/blob/main/arena/results.md) experiment, are:
**Instruction following:**
1) Divine Intellect
2) Big O
3) simple-1
4) Space Alien
5) StarChat
6) Titanic
7) tfs-with-top-a
8) Asterism
9) Contrastive Search
**Chat:**
1) Midnight Enigma
2) Yara
3) Shortwave
### Temperature
Primary factor to control randomness of outputs. 0 = deterministic (only the most likely token is used). Higher value = more randomness.
### top_p
If not set to 1, select tokens with probabilities adding up to less than this number. Higher value = higher range of possible random results.
### top_k
Similar to top_p, but select instead only the top_k most likely tokens. Higher value = higher range of possible random results.
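To make the two samplers above more concrete, here is a simplified sketch of how top_k and top_p filter a probability distribution before sampling (illustrative only, not the `transformers` implementation):
```
import numpy as np

def filter_top_k_top_p(probs, top_k=0, top_p=1.0):
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]          # most likely tokens first
    keep = np.ones_like(probs, dtype=bool)
    if top_k > 0:
        keep[order[top_k:]] = False          # keep only the k most likely tokens
    if top_p < 1.0:
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, top_p) + 1
        keep[order[cutoff:]] = False         # stop once cumulative probability reaches top_p
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()         # renormalize before sampling

print(filter_top_k_top_p([0.5, 0.3, 0.15, 0.05], top_p=0.8))  # the last two tokens are zeroed out
```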
### typical_p
If not set to 1, select only tokens that are at least this much more likely to appear than random tokens, given the prior text.
### epsilon_cutoff
In units of 1e-4; a reasonable value is 3. This sets a probability floor below which tokens are excluded from being sampled. Should be used with top_p, top_k, and eta_cutoff set to 0.
### eta_cutoff
In units of 1e-4; a reasonable value is 3. Should be used with top_p, top_k, and epsilon_cutoff set to 0.
### repetition_penalty
Exponential penalty factor for repeating prior tokens. 1 means no penalty, higher value = less repetition, lower value = more repetition.
### repetition_penalty_range
The number of most recent tokens to consider for repetition penalty. 0 makes all tokens be used.
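A simplified sketch of how these two settings interact, penalizing the logits of recently seen tokens (illustrative, not the exact `transformers` code):
```
def apply_repetition_penalty(logits, prior_tokens, penalty=1.15, penalty_range=0):
    # penalty_range = 0 means "consider the entire history", matching the description above.
    window = prior_tokens if penalty_range == 0 else prior_tokens[-penalty_range:]
    for token_id in set(window):
        if logits[token_id] > 0:
            logits[token_id] /= penalty      # make recently used tokens less likely
        else:
            logits[token_id] *= penalty
    return logits

print(apply_repetition_penalty([2.0, -1.0, 0.5], prior_tokens=[0, 1], penalty=1.2))  # roughly [1.67, -1.2, 0.5]
```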
### encoder_repetition_penalty
Also known as the "Hallucinations filter". Used to penalize tokens that are *not* in the prior text. Higher value = more likely to stay in context, lower value = more likely to diverge.
### no_repeat_ngram_size
If not set to 0, specifies the length of token sets that are completely blocked from repeating at all. Higher values = blocks larger phrases, lower values = blocks words or letters from repeating. Only 0 or high values are a good idea in most cases.
### min_length
Minimum generation length in tokens.
### penalty_alpha
Contrastive Search is enabled by setting this to greater than zero and unchecking "do_sample". It should be used with a low value of top_k, for instance, top_k = 4.
View File
@ -1,56 +0,0 @@
LLaMA is a Large Language Model developed by Meta AI.
It was trained on more tokens than previous models. The result is that the smallest version with 7 billion parameters has similar performance to GPT-3 with 175 billion parameters.
This guide will cover usage through the official `transformers` implementation. For 4-bit mode, head over to [GPTQ models (4-bit mode)](GPTQ-models-(4-bit-mode).md).
## Getting the weights
### Option 1: pre-converted weights
* Direct download (recommended):
https://huggingface.co/Neko-Institute-of-Science/LLaMA-7B-HF
https://huggingface.co/Neko-Institute-of-Science/LLaMA-13B-HF
https://huggingface.co/Neko-Institute-of-Science/LLaMA-30B-HF
https://huggingface.co/Neko-Institute-of-Science/LLaMA-65B-HF
* Torrent:
https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1484235789
The tokenizer files in the torrent above are outdated, in particular the files called `tokenizer_config.json` and `special_tokens_map.json`. Here you can find those files: https://huggingface.co/oobabooga/llama-tokenizer
### Option 2: convert the weights yourself
1. Install the `protobuf` library:
```
pip install protobuf==3.20.1
```
2. Use the script below to convert the model in `.pth` format that you, a fellow academic, downloaded using Meta's official link.
If you have `transformers` installed in place:
```
python -m transformers.models.llama.convert_llama_weights_to_hf --input_dir /path/to/LLaMA --model_size 7B --output_dir /tmp/outputs/llama-7b
```
Otherwise download [convert_llama_weights_to_hf.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py) first and run:
```
python convert_llama_weights_to_hf.py --input_dir /path/to/LLaMA --model_size 7B --output_dir /tmp/outputs/llama-7b
```
3. Move the `llama-7b` folder inside your `text-generation-webui/models` folder.
## Starting the web UI
```python
python server.py --model llama-7b
```
View File
@ -1,35 +0,0 @@
# LLaMA-v2
To convert LLaMA-v2 from the `.pth` format provided by Meta to transformers format, follow the steps below:
1) `cd` into your `llama` folder (the one containing `download.sh` and the models that you downloaded):
```
cd llama
```
2) Clone the transformers library:
```
git clone 'https://github.com/huggingface/transformers'
```
3) Create symbolic links from the downloaded folders to names that the conversion script can recognize:
```
ln -s llama-2-7b 7B
ln -s llama-2-13b 13B
```
4) Do the conversions:
```
mkdir llama-2-7b-hf llama-2-13b-hf
python ./transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir . --model_size 7B --output_dir llama-2-7b-hf --safe_serialization true
python ./transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir . --model_size 13B --output_dir llama-2-13b-hf --safe_serialization true
```
5) Move the output folders inside `text-generation-webui/models`
6) Have fun
View File
@ -1,71 +0,0 @@
# LoRA
LoRA (Low-Rank Adaptation) is an extremely powerful method for customizing a base model by training only a small number of parameters. LoRAs can be attached to models at runtime.
For instance, a 50 MB LoRA can teach LLaMA an entire new language, a given writing style, or give it instruction-following or chat abilities.
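For intuition, here is a minimal numpy sketch of the low-rank update behind LoRA (the usual `alpha/r` scaling is omitted for brevity):
```
import numpy as np

d, k, r = 4096, 4096, 8
W = np.random.randn(d, k)             # frozen base weight
A = np.random.randn(r, k) * 0.01      # trainable, small random init
B = np.zeros((d, r))                  # trainable, zero init so training starts as a no-op
W_adapted = W + B @ A                 # what the adapted layer effectively computes with

# Only r * (k + d) parameters are trained instead of d * k:
print(f"{(A.size + B.size) / W.size:.2%} of the original parameter count")  # ~0.39%
```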
This is the current state of LoRA integration in the web UI:
|Loader | Status |
|--------|------|
| Transformers | Full support in 16-bit, `--load-in-8bit`, `--load-in-4bit`, and CPU modes. |
| ExLlama | Single LoRA support. Fast to remove the LoRA afterwards. |
| AutoGPTQ | Single LoRA support. Removing the LoRA requires reloading the entire model.|
| GPTQ-for-LLaMa | Full support with the [monkey patch](https://github.com/oobabooga/text-generation-webui/blob/main/docs/GPTQ-models-(4-bit-mode).md#using-loras-with-gptq-for-llama). |
## Downloading a LoRA
The download script can be used. For instance:
```
python download-model.py tloen/alpaca-lora-7b
```
The files will be saved to `loras/tloen_alpaca-lora-7b`.
## Using the LoRA
The `--lora` command-line flag can be used. Examples:
```
python server.py --model llama-7b-hf --lora tloen_alpaca-lora-7b
python server.py --model llama-7b-hf --lora tloen_alpaca-lora-7b --load-in-8bit
python server.py --model llama-7b-hf --lora tloen_alpaca-lora-7b --load-in-4bit
python server.py --model llama-7b-hf --lora tloen_alpaca-lora-7b --cpu
```
Instead of using the `--lora` command-line flag, you can also select the LoRA in the "Parameters" tab of the interface.
## Prompt
For the Alpaca LoRA in particular, the prompt must be formatted like this:
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a Python script that generates text using the transformers library.
### Response:
```
Sample output:
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a Python script that generates text using the transformers library.
### Response:
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForCausalLM.from_pretrained("bert-base-uncased")
texts = ["Hello world", "How are you"]
for sentence in texts:
sentence = tokenizer(sentence)
print(f"Generated {len(sentence)} tokens from '{sentence}'")
output = model(sentences=sentence).predict()
print(f"Predicted {len(output)} tokens for '{sentence}':\n{output}")
```
## Training a LoRA
You can train your own LoRAs from the `Training` tab. See [Training LoRAs](Training-LoRAs.md) for details.
View File
@ -1,53 +0,0 @@
If your GPU is not large enough to fit a 16-bit model, try these in the following order:
### Load the model in 8-bit mode
```
python server.py --load-in-8bit
```
### Load the model in 4-bit mode
```
python server.py --load-in-4bit
```
### Split the model across your GPU and CPU
```
python server.py --auto-devices
```
If you can load the model with this command but it runs out of memory when you try to generate text, try increasingly limiting the amount of memory allocated to the GPU until the error stops happening:
```
python server.py --auto-devices --gpu-memory 10
python server.py --auto-devices --gpu-memory 9
python server.py --auto-devices --gpu-memory 8
...
```
where the number is in GiB.
For finer control, you can also specify the unit in MiB explicitly:
```
python server.py --auto-devices --gpu-memory 8722MiB
python server.py --auto-devices --gpu-memory 4725MiB
python server.py --auto-devices --gpu-memory 3500MiB
...
```
### Send layers to a disk cache
As a desperate last measure, you can split the model across your GPU, CPU, and disk:
```
python server.py --auto-devices --disk
```
With this, I am able to load a 30b model into my RTX 3090, but it takes 10 seconds to generate 1 word.
### DeepSpeed (experimental)
An experimental alternative to all of the above is to use DeepSpeed: [guide](DeepSpeed.md).
View File
@ -1,72 +0,0 @@
# Additional one-click installers info
## Installing nvcc
If you have an NVIDIA GPU and ever need to compile something, like ExLlamav2 (that currently doesn't have pre-built wheels), you can install `nvcc` by running the `cmd_` script for your OS and entering this command:
```
conda install cuda -c nvidia/label/cuda-11.7.1
```
## Using an AMD GPU in Linux
Requires ROCm SDK 5.4.2 or 5.4.3 to be installed. Some systems may also
need: sudo apt-get install libstdc++-12-dev
Edit the "one_click.py" script using a text editor and un-comment and
modify the lines near the top of the script according to your setup. In
particular, modify the os.environ["ROCM_PATH"] = '/opt/rocm' line to
point to your ROCm installation.
## WSL instructions
If you do not have WSL installed, see here:
https://learn.microsoft.com/en-us/windows/wsl/install
If you want to install Linux to a drive other than C
Open powershell and enter these commands:
cd D:\Path\To\Linux
$ProgressPreference = 'SilentlyContinue'
Invoke-WebRequest -Uri <LinuxDistroURL> -OutFile Linux.appx -UseBasicParsing
mv Linux.appx Linux.zip
Then open Linux.zip and you should see several .appx files inside.
The one with _x64.appx contains the exe installer that you need.
Extract the contents of that _x64.appx file and run <distro>.exe to install.
Linux Distro URLs:
https://learn.microsoft.com/en-us/windows/wsl/install-manual#downloading-distributions
******************************************************************************
*ENSURE THAT THE WSL LINUX DISTRO THAT YOU WISH TO USE IS SET AS THE DEFAULT!*
******************************************************************************
Do this by using these commands:
wsl -l
wsl -s <DistroName>
### Web UI Installation
Run the "start" script. By default it will install the web UI in WSL:
/home/{username}/text-gen-install
To launch the web UI in the future after it is already installed, run
the same "start" script. Ensure that one_click.py and wsl.sh are next to it!
### Updating the web UI
As an alternative to running the "update" script, you can also run "wsl.sh update" in WSL.
### Running an interactive shell
As an alternative to running the "cmd" script, you can also run "wsl.sh cmd" in WSL.
### Changing the default install location
To change this, you will need to edit the scripts as follows:
wsl.sh: line ~22 INSTALL_DIR="/path/to/install/dir"
Keep in mind that there is a long-standing bug in WSL that significantly
slows drive read/write speeds when using a physical drive as opposed to
the virtual one that Linux is installed in.
View File
@ -1,21 +1,5 @@
These files are a mirror of the documentation at:
# https://github.com/oobabooga/text-generation-webui/wiki
It is recommended to browse it there. Contributions can be sent here and will later be synced with the wiki.
* [Chat mode](Chat-mode.md)
* [DeepSpeed](DeepSpeed.md)
* [Docker](Docker.md)
* [ExLlama](ExLlama.md)
* [Extensions](Extensions.md)
* [GPTQ models (4 bit mode)](GPTQ-models-(4-bit-mode).md)
* [LLaMA model](LLaMA-model.md)
* [llama.cpp](llama.cpp.md)
* [LoRA](LoRA.md)
* [Low VRAM guide](Low-VRAM-guide.md)
* [RWKV model](RWKV-model.md)
* [Spell book](Spell-book.md)
* [System requirements](System-requirements.md)
* [Training LoRAs](Training-LoRAs.md)
* [Windows installation guide](Windows-installation-guide.md)
* [WSL installation guide](WSL-installation-guide.md)
View File
@ -1,107 +0,0 @@
You have now entered a hidden corner of the internet.
A confusing yet intriguing realm of paradoxes and contradictions.
A place where you will find out that what you thought you knew, you in fact didn't know, and what you didn't know was in front of you all along.
![](https://i.pinimg.com/originals/6e/e2/7b/6ee27bad351d3aca470d80f1033ba9c6.jpg)
*In other words, here I will document little-known facts about this web UI that I could not find another place for in the wiki.*
#### You can train LoRAs in CPU mode
Load the web UI with
```
python server.py --cpu
```
and start training the LoRA from the training tab as usual.
#### 8-bit mode works with CPU offloading
```
python server.py --load-in-8bit --gpu-memory 4000MiB
```
#### `--pre_layer`, and not `--gpu-memory`, is the right way to do CPU offloading with 4-bit models
```
python server.py --wbits 4 --groupsize 128 --pre_layer 20
```
#### Models can be loaded in 32-bit, 16-bit, 8-bit, and 4-bit modes
```
python server.py --cpu
python server.py
python server.py --load-in-8bit
python server.py --wbits 4
```
#### The web UI works with any version of GPTQ-for-LLaMa
Including the up to date triton and cuda branches. But you have to delete the `repositories/GPTQ-for-LLaMa` folder and reinstall the new one every time:
```
cd text-generation-webui/repositories
rm -r GPTQ-for-LLaMa
pip uninstall quant-cuda
git clone https://github.com/oobabooga/GPTQ-for-LLaMa -b cuda # or any other repository and branch
cd GPTQ-for-LLaMa
python setup_cuda.py install
```
#### Instruction-following templates are represented as chat characters
https://github.com/oobabooga/text-generation-webui/tree/main/characters/instruction-following
#### The right way to run Alpaca, Open Assistant, Vicuna, etc is Instruct mode, not normal chat mode
Otherwise the prompt will not be formatted correctly.
1. Start the web UI with
```
python server.py --chat
```
2. Click on the "instruct" option under "Chat modes"
3. Select the correct template in the hidden dropdown menu that will become visible.
#### Notebook mode is best mode
Ascended individuals have realized that notebook mode is the superset of chat mode and can do chats with ultimate flexibility, including group chats, editing replies, starting a new bot reply in a given way, and impersonating.
#### RWKV is an RNN
Most models are transformers, but not RWKV, which is an RNN. It's a great model.
#### `--gpu-memory` is not a hard limit on the GPU memory
It is simply a parameter that is passed to the `accelerate` library while loading the model. More memory will be allocated during generation. That's why this parameter has to be set to less than your total GPU memory.
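For reference, a sketch of how such a limit is typically expressed when loading through `accelerate`/`transformers`; the model name is just an example:
```
from transformers import AutoModelForCausalLM

# Roughly what --gpu-memory 10 translates to; generation can still allocate more on top.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",                       # example model
    device_map="auto",
    max_memory={0: "10GiB", "cpu": "30GiB"},   # GPU 0 capped at 10 GiB, the rest offloaded to CPU RAM
)
```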
#### Contrastive search is perhaps the best preset
But it uses a ton of VRAM.
#### You can check the sha256sum of downloaded models with the download script
```
python download-model.py facebook/galactica-125m --check
```
#### The download script continues interrupted downloads by default
It doesn't start over.
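A simplified sketch of how resuming typically works with an HTTP Range request; this is an assumed approach, not necessarily the script's exact implementation:
```
import os
import requests

def resume_download(url, filename):
    # Ask the server to send only the bytes we do not have yet.
    existing = os.path.getsize(filename) if os.path.exists(filename) else 0
    headers = {"Range": f"bytes={existing}-"} if existing else {}
    with requests.get(url, headers=headers, stream=True) as r:
        r.raise_for_status()
        with open(filename, "ab") as f:
            for chunk in r.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)
```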
#### You can download models with multiple threads
```
python download-model.py facebook/galactica-125m --threads 8
```
#### LoRAs work in 4-bit mode
You need to follow [these instructions](GPTQ-models-(4-bit-mode).md#using-loras-in-4-bit-mode) and then start the web UI with the `--monkey-patch` flag.
View File
@ -1,42 +0,0 @@
These are the VRAM and RAM requirements (in MiB) to run some examples of models **in 16-bit (default) precision**:
| model | VRAM (GPU) | RAM |
|:-----------------------|-------------:|--------:|
| arxiv_ai_gpt2 | 1512.37 | 5824.2 |
| blenderbot-1B-distill | 2441.75 | 4425.91 |
| opt-1.3b | 2509.61 | 4427.79 |
| gpt-neo-1.3b | 2605.27 | 5851.58 |
| opt-2.7b | 5058.05 | 4863.95 |
| gpt4chan_model_float16 | 11653.7 | 4437.71 |
| gpt-j-6B | 11653.7 | 5633.79 |
| galactica-6.7b | 12697.9 | 4429.89 |
| opt-6.7b | 12700 | 4368.66 |
| bloomz-7b1-p3 | 13483.1 | 4470.34 |
#### GPU mode with 8-bit precision
Allows you to load models that would not normally fit into your GPU. Enabled by default for 13b and 20b models in this web UI.
| model | VRAM (GPU) | RAM |
|:---------------|-------------:|--------:|
| opt-13b | 12528.1 | 1152.39 |
| gpt-neox-20b | 20384 | 2291.7 |
#### CPU mode (32-bit precision)
A lot slower, but does not require a GPU.
On my i5-12400F, 6B models take around 10-20 seconds to respond in chat mode, and around 5 minutes to generate a 200-token completion.
| model | RAM |
|:-----------------------|---------:|
| arxiv_ai_gpt2 | 4430.82 |
| gpt-neo-1.3b | 6089.31 |
| opt-1.3b | 8411.12 |
| blenderbot-1B-distill | 8508.16 |
| opt-2.7b | 14969.3 |
| bloomz-7b1-p3 | 21371.2 |
| gpt-j-6B | 24200.3 |
| gpt4chan_model | 24246.3 |
| galactica-6.7b | 26561.4 |
| opt-6.7b | 29596.6 |
23
docs/What Works.md Normal file
View File
@ -0,0 +1,23 @@
## What Works
| Loader | Loading 1 LoRA | Loading 2 or more LoRAs | Training LoRAs | Multimodal extension | Perplexity evaluation |
|----------------|----------------|-------------------------|----------------|----------------------|-----------------------|
| Transformers | ✅ | ❌ | ✅* | ✅ | ✅ |
| ExLlama_HF | ✅ | ❌ | ❌ | ❌ | ✅ |
| ExLlamav2_HF | ✅ | ✅ | ❌ | ❌ | ✅ |
| ExLlama | ✅ | ❌ | ❌ | ❌ | use ExLlama_HF |
| ExLlamav2 | ✅ | ✅ | ❌ | ❌ | use ExLlamav2_HF |
| AutoGPTQ | ✅ | ❌ | ❌ | ✅ | ✅ |
| GPTQ-for-LLaMa | ✅** | ❌ | ✅ | ✅ | ✅ |
| llama.cpp | ❌ | ❌ | ❌ | ❌ | use llamacpp_HF |
| llamacpp_HF | ❌ | ❌ | ❌ | ❌ | ✅ |
| ctransformers | ❌ | ❌ | ❌ | ❌ | ❌ |
| AutoAWQ | ? | ❌ | ? | ? | ✅ |
❌ = not implemented
✅ = implemented
\* Training LoRAs with GPTQ models also works with the Transformers loader. Make sure to check "auto-devices" and "disable_exllama" before loading the model.
\*\* Requires the monkey-patch. The instructions can be found [here](https://github.com/oobabooga/text-generation-webui/wiki/08-%E2%80%90-Additional-Tips#using-loras-with-gptq-for-llama).
View File
@ -1,9 +0,0 @@
If you are having trouble following the installation instructions in the README, Reddit user [Technical_Leather949](https://www.reddit.com/user/Technical_Leather949/) has created a more detailed, step-by-step guide covering:
* Windows installation
* 8-bit mode on Windows
* LLaMA
* LLaMA 4-bit
The guide can be found here: https://www.reddit.com/r/LocalLLaMA/comments/11o6o3f/how_to_install_llama_8bit_and_4bit/
View File
@ -1,43 +0,0 @@
# llama.cpp
llama.cpp is the best backend in two important scenarios:
1) You don't have a GPU.
2) You want to run a model that doesn't fit into your GPU.
## Setting up the models
#### Pre-converted
Download the GGUF models directly into your `text-generation-webui/models` folder. It will be a single file.
* Make sure its name ends in `.gguf`.
* `q4_K_M` quantization is recommended.
#### Convert Llama yourself
Follow the instructions in the llama.cpp README to generate a GGUF: https://github.com/ggerganov/llama.cpp#prepare-data--run
## GPU acceleration
Enabled with the `--n-gpu-layers` parameter.
* If you have enough VRAM, use a high number like `--n-gpu-layers 1000` to offload all layers to the GPU.
* Otherwise, start with a low number like `--n-gpu-layers 10` and then gradually increase it until you run out of memory.
This feature works out of the box for NVIDIA GPUs on Linux (amd64) or Windows. For other GPUs, you need to uninstall `llama-cpp-python` with
```
pip uninstall -y llama-cpp-python
```
and then recompile it using the commands here: https://pypi.org/project/llama-cpp-python/
#### macOS
For macOS, these are the commands:
```
pip uninstall -y llama-cpp-python
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir
```
View File
@ -182,6 +182,10 @@ def install_webui():
    while use_cuda118 not in 'YN':
        print("Invalid choice. Please try again.")
        use_cuda118 = input("Input> ").upper().strip('"\'').strip()
    if use_cuda118 == 'Y':
        print(f"CUDA: 11.8")
    else:
        print(f"CUDA: 12.1")
    install_pytorch = f"python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/{'cu121' if use_cuda118 == 'N' else 'cu118'}"
elif not is_macos() and choice == "B":