* rebase onto llama.cpp commit ggerganov/llama.cpp@d46dbc76f
* support for CUDA backend (enabled by default)
* partial support for Occam's Vulkan backend (disabled by default)
* partial support for HIP/ROCm backend (disabled by default)
* sync llama.cpp.cmake with upstream llama.cpp CMakeLists.txt
* changes to GPT4All backend, bindings, and chat UI to handle choice of llama.cpp backend (Kompute or CUDA)
* ship CUDA runtime with installed version
* make device selection in the UI on macOS actually do something
* model whitelist: remove dbrx, mamba, persimmon, plamo; add internlm and starcoder2
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
* backend: factor out common structs in model code
Preparation for further work on these models: sharing the structs should leave fewer places where the same bug has to be fixed.
rename
* use common buffer wrapper instead of manual malloc
* fix replit compile warnings
* Initial Library Loader
* Load library as part of Model factory
* Dynamically search for and find the DLLs
* Update tests to use locally built runtimes
* Fix dylib loading, add macOS runtime support for sample/tests
* Bypass automatic loading by default.
* Only set CMAKE_OSX_ARCHITECTURES if not already set, allow cross-compile
* Switch Loading again
* Update build scripts for macOS/Linux
* Update bindings to support newest breaking changes
* Fix build
* Use llmodel for Windows
* Actually, it does need to be libllmodel
* Name
* Remove TFMs, bypass loading by default
* Fix script
* Delete mac script
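
The dynamic library lookup described in the list above can be sketched roughly as follows. This is a minimal illustration in Python, not the code from this change: the helper names `library_filename` and `find_library` and the idea of passing an explicit list of search directories are assumptions made for the example. Only the base name `libllmodel` comes from the commit list.

```python
import sys
from pathlib import Path


def library_filename(base, platform=None):
    """Return the platform-specific file name for a shared library.

    `base` is the name without extension, e.g. "libllmodel".
    `platform` defaults to sys.platform; it is a parameter so the
    function can be exercised for other platforms.
    """
    platform = platform or sys.platform
    if platform.startswith("win"):
        return f"{base}.dll"
    if platform == "darwin":
        return f"{base}.dylib"
    return f"{base}.so"


def find_library(base, search_dirs, platform=None):
    """Return the first existing library path in search_dirs, or None."""
    name = library_filename(base, platform)
    for directory in search_dirs:
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None
```

A caller would pass its own candidate directories (for example the application directory and a locally built runtime folder) and fall back to an explicit path when `find_library` returns `None`.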
---------
Co-authored-by: Tim Miller <innerlogic4321@ghmail.com>
* porting over replit code model to gpt4all
* replaced memory with kv_self struct
* continuing debug
* it builds now, but a lot of things still look suspicious
* working model loading and somewhat working generate.. need to format response?
* revert back to semi working version
* finally got rid of weird formatting
* figured out problem is with python bindings - this is good to go for testing
* addressing PR feedback
* output refactor
* fixed prompt response collection
* cleanup
* addressing PR comments
* building replit backend with new ggmlver code
* chatllm replit and clean python files
* cleanup
* updated replit to match new llmodel api
* match llmodel api and change size_t to Token
* resolve PR comments
* replit model commit comment
Major change to the backend that allows for pluggable versions of llama.cpp/ggml. This was squash-merged from dlopen_backend_5, where the full history is preserved.
Improves output quality by making these tokenizers match more closely
the behavior of the Hugging Face `tokenizers`-based BPE tokenizers
that these models were trained with.
Featuring:
* Fixed unicode handling (via ICU)
* Fixed BPE token merge handling
* Complete handling of added vocabulary
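
For readers unfamiliar with BPE merge handling, the following is a minimal sketch of the general technique in Python. It is not the tokenizer code from this change: the function name `bpe_merge` and the toy merge table are assumptions for illustration. It merges one pair at a time (the lowest-ranked adjacent pair, leftmost on ties), which is a simplification of what production tokenizers do.

```python
def bpe_merge(symbols, merge_ranks):
    """Apply BPE merges to a sequence of symbols.

    `merge_ranks` maps a pair of adjacent symbols to its priority;
    a lower rank means the merge was learned earlier and is applied first.
    """
    symbols = list(symbols)
    while len(symbols) > 1:
        best_index = None
        best_rank = None
        # Find the adjacent pair with the lowest rank.
        for i in range(len(symbols) - 1):
            rank = merge_ranks.get((symbols[i], symbols[i + 1]))
            if rank is not None and (best_rank is None or rank < best_rank):
                best_index, best_rank = i, rank
        if best_index is None:
            break  # no known merge applies any more
        # Replace the pair with the merged symbol.
        merged = symbols[best_index] + symbols[best_index + 1]
        symbols[best_index:best_index + 2] = [merged]
    return symbols


# Toy merge table (hypothetical, not taken from any real vocabulary).
toy_ranks = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
print(bpe_merge("lower", toy_ranks))  # ['low', 'er']
```

Applying merges in rank order rather than in plain left-to-right order is what keeps the result consistent with the tokenization seen at training time.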