* Added Dockerfile for inference
* Added instructions for Dockerfile
* Update README.md
* Update README.md
* Update README.md
* Pass env through Dockerfile
* Added docker compose setup and instructions
* Added more environment options
* Set a safer default mount point
* add docker-compose changes
* add to gitignore, update to new generate.py
* add docker ignore, simplify docker compose file
* add back missing requirements
* Adjustments to compose and generate.py, added Docker to README.md
* Adjusted linting to Black
* Adjusting import linting
* Update README.md
* Update README.md
* Removed comment by original Dockerfile creator.
Comment not necessary.
* cleanup README
Co-authored-by: Francesco Saverio Zuppichini <zuppif@usi.ch>
---------
Co-authored-by: Francesco Saverio Zuppichini <zuppif@usi.ch>
Co-authored-by: Chris Alexiuk <c.s.alexiuk@gmail.com>
Co-authored-by: ElRoberto538 <>
Co-authored-by: Sam Sipe <samsipe@gmail.com>
Co-authored-by: Eric J. Wang <eric.james.wang@gmail.com>
* Templated prompter
* fix dup import
* Set Verbose False by default
I forgot to disable after testing.
* Fix imports order
* Use Black Formatting
* lint
* Re-introduce lost line
* Cleanup
* template default
* isort
---------
Co-authored-by: Eric Wang <eric.james.wang@gmail.com>
* Print only on Rank 0
When training on multiple GPUs, the settings are printed once per GPU.
This change prints them only from rank 0 (see the sketch after this entry).
See https://github.com/tloen/alpaca-lora/issues/182#issuecomment-1485550636
for a sample output.
The same could apply to a few other prints further down as well.
* Typo
* Added failsafe
So this works whether or not LOCAL_RANK is defined.
* override broken data parallelism with model parallelism
* formatting
* formatting, again
---------
Co-authored-by: Eric Wang <eric.james.wang@gmail.com>
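A minimal sketch of the rank-0 guard described above, assuming `LOCAL_RANK` is set by the distributed launcher (e.g. `torchrun`) and absent in single-process runs; the printed message is illustrative, not the repository's exact output.

```python
import os

# LOCAL_RANK is set by torch.distributed launchers; fall back to 0 when it is
# undefined so single-GPU / CPU runs still print.
local_rank = int(os.environ.get("LOCAL_RANK", 0))

if local_rank == 0:
    print("Training parameters: ...")  # only rank 0 reports the settings
```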
Avoids the warning:
"Overriding torch_dtype=None with `torch_dtype=torch.float16` due to requirements of `bitsandbytes` to enable model loading in mixed int8. Either pass torch_dtype=torch.float16 or don't pass this argument at all to remove this warning."
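A hedged sketch of the fix: pass an explicit `torch_dtype` alongside 8-bit loading so there is nothing for `bitsandbytes` to override. The base model name here is a placeholder, not necessarily the repository's default.

```python
import torch
from transformers import AutoModelForCausalLM

# Passing torch_dtype explicitly matches what bitsandbytes would force anyway,
# so the override warning is no longer emitted.
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",      # placeholder base model
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
```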
Removes the warning:
`FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead`
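A minimal sketch of one way to silence this, assuming training goes through `transformers.TrainingArguments`; the output directory is illustrative.

```python
from transformers import TrainingArguments

# "adamw_torch" selects torch.optim.AdamW instead of the deprecated
# Hugging Face AdamW implementation, so the FutureWarning disappears.
args = TrainingArguments(
    output_dir="./lora-alpaca",  # illustrative path
    optim="adamw_torch",
)
```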
* Improve tokenization
This PR changes a few things related to tokenization:
- Sets padding to the left, which is required for batched inference with decoder-only models.
- Pads to the maximum length in each batch, rounded up to a multiple of 8 (tensor cores prefer multiples of 8), instead of always padding to CUTOFF_LEN. This should make training faster, since fewer tokens are fed into the model when they are not needed (~10% faster in my experiments). To implement this correctly I need to manually append the eos token (when the input is not truncated), so I have removed "add_eos" from the tokenizer load function.
- Returns the labels from the tokenize function, since some users seem to prefer it this way. This requires using DataCollatorForSeq2Seq to pad the labels as well as the input ids. The behavior of both data collators is the same when mlm=False; I can revert to DataCollatorForLanguageModeling if preferred. A sketch of these changes follows at the end of this entry.
* Experimental dynamic batching
* mask out user prompt, again
* Add options
* Remove BASE_MODEL again
* Small optimization
* Final tweaks
---------
Co-authored-by: Iker García-Ferrero <i.garciaferrerosanpelayo@gmail.com>
Co-authored-by: Sebastiaan <751205+SharkWipf@users.noreply.github.com>
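A hedged sketch of the tokenization changes described above (left padding, manual EOS append, labels returned from tokenize, and DataCollatorForSeq2Seq padding to a multiple of 8); the model name, pad token id, and cutoff length are placeholders rather than the repository's exact values.

```python
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # placeholder model
tokenizer.padding_side = "left"  # needed for batched inference with decoder-only models
tokenizer.pad_token_id = 0       # assumption: pad with token id 0

CUTOFF_LEN = 256  # placeholder cutoff length


def tokenize(prompt):
    # No padding here: the collator pads each batch dynamically instead of
    # always padding to CUTOFF_LEN.
    result = tokenizer(prompt, truncation=True, max_length=CUTOFF_LEN, padding=False)
    # Append EOS manually, but only if the sequence was not truncated.
    if (result["input_ids"][-1] != tokenizer.eos_token_id
            and len(result["input_ids"]) < CUTOFF_LEN):
        result["input_ids"].append(tokenizer.eos_token_id)
        result["attention_mask"].append(1)
    # Return labels so downstream code can compute the loss directly.
    result["labels"] = result["input_ids"].copy()
    return result


# Pads input_ids, attention_mask, and labels to the longest sequence in each
# batch, rounded up to a multiple of 8 for tensor-core efficiency.
data_collator = DataCollatorForSeq2Seq(
    tokenizer, padding=True, pad_to_multiple_of=8, return_tensors="pt"
)
```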