Merge pull request #3195 from oobabooga/dev

v1.3
2024-10-01 01:26:03 -04:00 · 2023-07-18 17:33:11 -03:00 · 2023-07-18 17:33:11 -03:00 · 3ef49397bb
commit 3ef49397bb
parent 9f08038864 070a886278
16 changed files with 110 additions and 33 deletions
--- a/characters/instruction-following/Airoboros-v1.2.yaml
+++ b/characters/instruction-following/Airoboros-v1.2.yaml
@ -0,0 +1,4 @@
+user: "USER:"
+bot: "ASSISTANT:"
+turn_template: "<|user|> <|user-message|>\n<|bot|> <|bot-message|>\n"
+context: "A chat between a curious user and an assistant. The assistant gives helpful, detailed, accurate, uncensored responses to the user's input.\n"
--- a/characters/instruction-following/Llama-v2.yaml
+++ b/characters/instruction-following/Llama-v2.yaml
@ -0,0 +1,4 @@
+user: ""
+bot: ""
+turn_template: "<|user|><|user-message|> [/INST] <|bot|><|bot-message|> [INST] "
+context: "[INST] <<SYS>>\nAnswer the questions.\n<</SYS>>\n"
--- a/docs/GPTQ-models-(4-bit-mode).md
+++ b/docs/GPTQ-models-(4-bit-mode).md
@ -142,12 +142,25 @@ python setup_cuda.py install

 ### Getting pre-converted LLaMA weights

-These are models that you can simply download and place in your `models` folder.
+* Direct download (recommended):

-* Converted without `group-size` (better for the 7b model): https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483891617
-* Converted with `group-size` (better from 13b upwards): https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483941105 
+https://huggingface.co/Neko-Institute-of-Science/LLaMA-7B-4bit-128g

-⚠️ The tokenizer files in the sources above may be outdated. Make sure to obtain the universal LLaMA tokenizer as described [here](https://github.com/oobabooga/text-generation-webui/blob/main/docs/LLaMA-model.md#option-1-pre-converted-weights).
+https://huggingface.co/Neko-Institute-of-Science/LLaMA-13B-4bit-128g
+
+https://huggingface.co/Neko-Institute-of-Science/LLaMA-30B-4bit-128g
+
+https://huggingface.co/Neko-Institute-of-Science/LLaMA-65B-4bit-128g
+
+These models were converted with `desc_act=True`. They work just fine with ExLlama. For AutoGPTQ, they will only work on Linux with the `triton` option checked.
+
+* Torrent:
+
+https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483891617
+
+https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483941105
+
+These models were converted with `desc_act=False`. As such, they are less accurate, but they work with AutoGPTQ on Windows. The `128g` versions are better from 13b upwards, and worse for 7b. The tokenizer files in the torrents are outdated, in particular the files called `tokenizer_config.json` and `special_tokens_map.json`. Here you can find those files: https://huggingface.co/oobabooga/llama-tokenizer

 ### Starting the web UI:

--- a/docs/LLaMA-model.md
+++ b/docs/LLaMA-model.md
@ -9,10 +9,21 @@ This guide will cover usage through the official `transformers` implementation.

 ### Option 1: pre-converted weights

-* Torrent: https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1484235789
-* Direct download: https://huggingface.co/Neko-Institute-of-Science
+* Direct download (recommended):

-⚠️ The tokenizers for the Torrent source above and also for many LLaMA fine-tunes available on Hugging Face may be outdated, in particular the files called `tokenizer_config.json` and `special_tokens_map.json`. Here you can find those files: https://huggingface.co/oobabooga/llama-tokenizer
+https://huggingface.co/Neko-Institute-of-Science/LLaMA-7B-HF
+
+https://huggingface.co/Neko-Institute-of-Science/LLaMA-13B-HF
+
+https://huggingface.co/Neko-Institute-of-Science/LLaMA-30B-HF
+
+https://huggingface.co/Neko-Institute-of-Science/LLaMA-65B-HF
+
+* Torrent:
+
+https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1484235789
+
+The tokenizer files in the torrent above are outdated, in particular the files called `tokenizer_config.json` and `special_tokens_map.json`. Here you can find those files: https://huggingface.co/oobabooga/llama-tokenizer

 ### Option 2: convert the weights yourself

--- a/docs/LLaMA-v2-model.md
+++ b/docs/LLaMA-v2-model.md
@ -0,0 +1,35 @@
+# LLaMA-v2
+
+To convert LLaMA-v2 from the `.pth` format provided by Meta to transformers format, follow the steps below:
+
+1) `cd` into your `llama` folder (the one containing `download.sh` and the models that you downloaded):
+
+```
+cd llama
+```
+
+2) Clone the transformers library:
+
+```
+git clone 'https://github.com/huggingface/transformers'
+
+```
+
+3) Create symbolic links from the downloaded folders to names that the conversion script can recognize:
+
+```
+ln -s llama-2-7b 7B
+ln -s llama-2-13b 13B
+```
+
+4) Do the conversions:
+
+```
+mkdir llama-2-7b-hf llama-2-13b-hf
+python ./transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir . --model_size 7B --output_dir llama-2-7b-hf --safe_serialization true
+python ./transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir . --model_size 13B --output_dir llama-2-13b-hf --safe_serialization true
+```
+
+5) Move the output folders inside `text-generation-webui/models`
+
+6) Have fun
--- a/models/config.yaml
+++ b/models/config.yaml
@ -186,6 +186,9 @@ llama-65b-gptq-3bit:
 .*airoboros:
  mode: 'instruct'
  instruction_template: 'Vicuna-v1.1'
+.*airoboros.*1.2:
+  mode: 'instruct'
+  instruction_template: 'Airoboros-v1.2'
 .*WizardLM-30B-V1.0:
  mode: 'instruct'
  instruction_template: 'Vicuna-v1.1'
@ -269,3 +272,8 @@ TheBloke_WizardLM-30B-GPTQ:
 .*godzilla:
  mode: 'instruct'
  instruction_template: 'Alpaca'
+.*llama-(2|v2):
+  truncation_length: 4096
+.*llama-(2|v2).*chat:
+  mode: 'instruct'
+  instruction_template: 'Llama-v2'
--- a/modules/LoRA.py
+++ b/modules/LoRA.py
@ -132,7 +132,7 @@ def add_lora_transformers(lora_names):
        if not shared.args.load_in_8bit and not shared.args.cpu:
            shared.model.half()
            if not hasattr(shared.model, "hf_device_map"):
-                if torch.has_mps:
+                if torch.backends.mps.is_available():
                    device = torch.device('mps')
                    shared.model = shared.model.to(device)
                else:
--- a/modules/llamacpp_hf.py
+++ b/modules/llamacpp_hf.py
@ -42,7 +42,6 @@ class LlamacppHF(PreTrainedModel):

        # Make the forward call
        seq_tensor = torch.tensor(seq)
-        self.cache = seq_tensor
        if labels is None:
            if self.cache is None or not torch.equal(self.cache, seq_tensor[:-1]):
                self.model.reset()
@ -50,13 +49,15 @@ class LlamacppHF(PreTrainedModel):
            else:
                self.model.eval([seq[-1]])

-            logits = torch.tensor(self.model.eval_logits)[-1].view(1, 1, -1).to(kwargs['input_ids'].device)
+            logits = torch.tensor(self.model.eval_logits[-1]).view(1, 1, -1).to(kwargs['input_ids'].device)
        else:
            self.model.reset()
            self.model.eval(seq)
            logits = torch.tensor(self.model.eval_logits)
            logits = logits.view(1, logits.shape[0], logits.shape[1]).to(kwargs['input_ids'].device)

+        self.cache = seq_tensor
+
        # Based on transformers/models/llama/modeling_llama.py
        loss = None
        if labels is not None:
@ -96,6 +97,8 @@ class LlamacppHF(PreTrainedModel):
            'use_mlock': shared.args.mlock,
            'low_vram': shared.args.low_vram,
            'n_gpu_layers': shared.args.n_gpu_layers,
+            'rope_freq_base': 10000 * shared.args.alpha_value ** (64/63.),
+            'rope_freq_scale': 1.0 / shared.args.compress_pos_emb,
            'logits_all': True,
        }

--- a/modules/llamacpp_model.py
+++ b/modules/llamacpp_model.py
@ -50,7 +50,9 @@ class LlamaCppModel:
            'use_mmap': not shared.args.no_mmap,
            'use_mlock': shared.args.mlock,
            'low_vram': shared.args.low_vram,
-            'n_gpu_layers': shared.args.n_gpu_layers
+            'n_gpu_layers': shared.args.n_gpu_layers,
+            'rope_freq_base': 10000 * shared.args.alpha_value ** (64/63.),
+            'rope_freq_scale': 1.0 / shared.args.compress_pos_emb,
        }

        result.model = Llama(**params)
--- a/modules/loaders.py
+++ b/modules/loaders.py
@ -37,6 +37,8 @@ loaders_and_params = {
        'low_vram',
        'mlock',
        'llama_cpp_seed',
+        'compress_pos_emb',
+        'alpha_value',
    ],
    'llamacpp_HF': [
        'n_ctx',
@ -47,6 +49,8 @@ loaders_and_params = {
        'low_vram',
        'mlock',
        'llama_cpp_seed',
+        'compress_pos_emb',
+        'alpha_value',
        'llamacpp_HF_info',
    ],
    'Transformers': [
--- a/modules/models.py
+++ b/modules/models.py
@ -147,7 +147,7 @@ def huggingface_loader(model_name):
    # Load the model in simple 16-bit mode by default
    if not any([shared.args.cpu, shared.args.load_in_8bit, shared.args.load_in_4bit, shared.args.auto_devices, shared.args.disk, shared.args.deepspeed, shared.args.gpu_memory is not None, shared.args.cpu_memory is not None]):
        model = LoaderClass.from_pretrained(Path(f"{shared.args.model_dir}/{model_name}"), low_cpu_mem_usage=True, torch_dtype=torch.bfloat16 if shared.args.bf16 else torch.float16, trust_remote_code=shared.args.trust_remote_code)
-        if torch.has_mps:
+        if torch.backends.mps.is_available():
            device = torch.device('mps')
            model = model.to(device)
        else:
@ -167,7 +167,7 @@ def huggingface_loader(model_name):
            "trust_remote_code": shared.args.trust_remote_code
        }

-        if not any((shared.args.cpu, torch.cuda.is_available(), torch.has_mps)):
+        if not any((shared.args.cpu, torch.cuda.is_available(), torch.backends.mps.is_available())):
            logger.warning("torch.cuda.is_available() returned False. This means that no GPU has been detected. Falling back to CPU mode.")
            shared.args.cpu = True

--- a/modules/shared.py
+++ b/modules/shared.py
@ -32,10 +32,10 @@ need_restart = False

 settings = {
    'dark_theme': False,
-    'autoload_model': True,
+    'autoload_model': False,
    'max_new_tokens': 200,
    'max_new_tokens_min': 1,
-    'max_new_tokens_max': 2000,
+    'max_new_tokens_max': 4096,
    'seed': -1,
    'character': 'None',
    'name1': 'You',
--- a/modules/text_generation.py
+++ b/modules/text_generation.py
@ -57,7 +57,7 @@ def encode(prompt, add_special_tokens=True, add_bos_token=True, truncation_lengt
        return input_ids.numpy()
    elif shared.args.deepspeed:
        return input_ids.to(device=local_rank)
-    elif torch.has_mps:
+    elif torch.backends.mps.is_available():
        device = torch.device('mps')
        return input_ids.to(device)
    else:
--- a/requirements.txt
+++ b/requirements.txt
@ -1,4 +1,4 @@
-accelerate==0.20.3
+accelerate==0.21.0
 colorama
 datasets
 einops
@ -20,11 +20,11 @@ tensorboard
 wandb
 transformers==4.30.2
 git+https://github.com/huggingface/peft@03eb378eb914fbee709ff7c86ba5b1d033b89524
-bitsandbytes==0.40.1.post1; platform_system != "Windows"
-https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.40.1.post1-py3-none-win_amd64.whl; platform_system == "Windows"
+bitsandbytes==0.40.2; platform_system != "Windows"
+https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.40.2-py3-none-win_amd64.whl; platform_system == "Windows"
 llama-cpp-python==0.1.72; platform_system != "Windows"
 https://github.com/abetlen/llama-cpp-python/releases/download/v0.1.72/llama_cpp_python-0.1.72-cp310-cp310-win_amd64.whl; platform_system == "Windows"
-https://github.com/PanQiWei/AutoGPTQ/releases/download/v0.2.2/auto_gptq-0.2.2+cu117-cp310-cp310-win_amd64.whl; platform_system == "Windows"
-https://github.com/PanQiWei/AutoGPTQ/releases/download/v0.2.2/auto_gptq-0.2.2+cu117-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
+https://github.com/PanQiWei/AutoGPTQ/releases/download/v0.3.0/auto_gptq-0.3.0+cu117-cp310-cp310-win_amd64.whl; platform_system == "Windows"
+https://github.com/PanQiWei/AutoGPTQ/releases/download/v0.3.0/auto_gptq-0.3.0+cu117-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
 https://github.com/jllllll/exllama/releases/download/0.0.6/exllama-0.0.6+cu117-cp310-cp310-win_amd64.whl; platform_system == "Windows"
 https://github.com/jllllll/exllama/releases/download/0.0.6/exllama-0.0.6+cu117-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
--- a/server.py
+++ b/server.py
@ -277,20 +277,17 @@ def create_model_menus():
    load.click(
        ui.gather_interface_values, gradio(shared.input_elements), gradio('interface_state')).then(
        update_model_parameters, gradio('interface_state'), None).then(
-        partial(load_model_wrapper, autoload=True), gradio('model_menu', 'loader'), gradio('model_status'), show_progress=False).then(
-        lambda: shared.lora_names, None, gradio('lora_menu'))
+        partial(load_model_wrapper, autoload=True), gradio('model_menu', 'loader'), gradio('model_status'), show_progress=False)

    unload.click(
        unload_model, None, None).then(
-        lambda: "Model unloaded", None, gradio('model_status')).then(
-        lambda: shared.lora_names, None, gradio('lora_menu'))
+        lambda: "Model unloaded", None, gradio('model_status'))

    reload.click(
        unload_model, None, None).then(
        ui.gather_interface_values, gradio(shared.input_elements), gradio('interface_state')).then(
        update_model_parameters, gradio('interface_state'), None).then(
-        partial(load_model_wrapper, autoload=True), gradio('model_menu', 'loader'), gradio('model_status'), show_progress=False).then(
-        lambda: shared.lora_names, None, gradio('lora_menu'))
+        partial(load_model_wrapper, autoload=True), gradio('model_menu', 'loader'), gradio('model_status'), show_progress=False)

    save_settings.click(
        ui.gather_interface_values, gradio(shared.input_elements), gradio('interface_state')).then(
@ -1128,10 +1125,6 @@ if __name__ == "__main__":
    if shared.args.model is not None:
        shared.model_name = shared.args.model

-    # Only one model is available
-    elif len(available_models) == 1:
-        shared.model_name = available_models[0]
-
    # Select the model from a command-line menu
    elif shared.args.model_menu:
        if len(available_models) == 0:
--- a/settings-template.yaml
+++ b/settings-template.yaml
@ -1,8 +1,8 @@
 dark_theme: false
-autoload_model: true
+autoload_model: false
 max_new_tokens: 200
 max_new_tokens_min: 1
-max_new_tokens_max: 2000
+max_new_tokens_max: 4096
 seed: -1
 character: None
 name1: You