Also dynamically limit the GPU layers and context length fields to the maximum supported by the model.
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
* Handle edge cases when generating embeddings
* Improve Python handling & add llmodel_c.h note
- In the Python bindings, fail fast with a ValueError when text is empty
- Advise other bindings authors to do likewise in llmodel_c.h (see the sketch below)
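As a rough illustration of the guard that other bindings are advised to add, a minimal C++ sketch; `embed_text` and `backend_embed` are hypothetical names, not the actual llmodel_c.h entry points:

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for the real embedding call exposed by the C API.
std::vector<float> backend_embed(const std::string &text) {
    return std::vector<float>(4, static_cast<float>(text.size()));
}

// Guard a binding can add before handing text to the backend; mirrors the
// fail-fast ValueError behaviour of the Python bindings.
std::vector<float> embed_text(const std::string &text) {
    if (text.empty())
        throw std::invalid_argument("text must not be empty");
    return backend_embed(text);
}
```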
Most of these can just shortcut out of the model loading logic. Llama is a bit worse to deal with because we submodule it, so I have to at least parse the hparams; then I just use the size on disk as an estimate for the memory size (which seems reasonable since we mmap() the llama files anyway).
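A sketch of that size-on-disk estimate, assuming C++17 <filesystem>; `estimateRequiredMem` is an illustrative name, not the backend's actual interface:

```cpp
#include <cstddef>
#include <filesystem>
#include <string>
#include <system_error>

// Because the llama files are mmap()'d, the file size on disk is a reasonable
// approximation of the memory the loaded model will need.
size_t estimateRequiredMem(const std::string &modelPath) {
    std::error_code ec;
    auto size = std::filesystem::file_size(modelPath, ec);
    return ec ? 0 : static_cast<size_t>(size);
}
```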
Major change to the backend that allows for pluggable versions of llama.cpp/ggml. This was squash-merged from dlopen_backend_5, where the history is preserved.
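In rough outline, the pluggable scheme works by building each llama.cpp/ggml variant as its own shared library, dlopen()ing one at runtime, and resolving a factory symbol from it. The `construct` symbol and `LLModel` type below are assumptions for illustration, not the exact names the backend looks up:

```cpp
#include <dlfcn.h>
#include <stdexcept>
#include <string>

class LLModel;  // opaque model interface implemented by each plugin library

using construct_fn = LLModel *(*)();

// Load one llama.cpp/ggml build from its shared library and resolve its factory.
LLModel *loadImplementation(const std::string &libPath) {
    void *handle = dlopen(libPath.c_str(), RTLD_NOW | RTLD_LOCAL);
    if (!handle)
        throw std::runtime_error("dlopen failed for " + libPath);

    auto construct = reinterpret_cast<construct_fn>(dlsym(handle, "construct"));
    if (!construct)
        throw std::runtime_error("missing construct symbol in " + libPath);

    return construct();
}
```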
Fix occurrences of the prompt callback being incorrectly specified, or
the response callback's prototype being incorrectly used in its place.
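For context, the two callbacks have different prototypes, roughly along these lines; the parameter lists shown here are an approximation, not quoted from llmodel_c.h:

```cpp
#include <cstdint>

extern "C" {
// Called for each prompt token; return false to cancel processing.
typedef bool (*prompt_callback_t)(int32_t token_id);

// Called for each generated token along with its decoded text; return false
// to stop generation.
typedef bool (*response_callback_t)(int32_t token_id, const char *response);
}
```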
Signed-off-by: Juuso Alasuutari <juuso.alasuutari@gmail.com>