gpt4all/gpt4all-backend/llamamodel_impl.h

#ifndef LLAMAMODEL_H_I_KNOW_WHAT_I_AM_DOING_WHEN_INCLUDING_THIS_FILE
#error This file is NOT meant to be included outside of llamamodel.cpp. Doing so is DANGEROUS. Be sure to know what you are doing before proceeding to #define LLAMAMODEL_H_I_KNOW_WHAT_I_AM_DOING_WHEN_INCLUDING_THIS_FILE
#endif
#ifndef LLAMAMODEL_H
#define LLAMAMODEL_H

#include "llmodel.h"

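#include <cstddef>  // size_t
#include <cstdint>  // int32_t, uint8_t
#include <memory>
#include <optional> // std::optional (embed() prefix parameter)
#include <string>
#include <vector>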

struct LLamaPrivate;
struct EmbModelSpec;

class LLamaModel : public LLModel {
public:
    LLamaModel();
    ~LLamaModel();

    bool supportsEmbedding() const override { return m_supportsEmbedding; }
    bool supportsCompletion() const override { return m_supportsCompletion; }
    bool loadModel(const std::string &modelPath, int n_ctx, int ngl) override;
    bool isModelBlacklisted(const std::string &modelPath) const override;
    bool isEmbeddingModel(const std::string &modelPath) const override;
    bool isModelLoaded() const override;
    size_t requiredMem(const std::string &modelPath, int n_ctx, int ngl) override;
    size_t stateSize() const override;
    size_t saveState(uint8_t *dest) const override;
    size_t restoreState(const uint8_t *src) override;
    void setThreadCount(int32_t n_threads) override;
    int32_t threadCount() const override;
    std::vector<GPUDevice> availableGPUDevices(size_t memoryRequired = 0) const override;
    bool initializeGPUDevice(size_t memoryRequired, const std::string &name) const override;
    bool initializeGPUDevice(int device, std::string *unavail_reason = nullptr) const override;
    bool usingGPUDevice() const override;
    const char *backendName() const override;
    const char *gpuDeviceName() const override;

    size_t embeddingSize() const override;
    // user-specified prefix
    void embed(const std::vector<std::string> &texts, float *embeddings, std::optional<std::string> prefix,
               int dimensionality = -1, size_t *tokenCount = nullptr, bool doMean = true, bool atlas = false,
               EmbedCancelCallback *cancelCb = nullptr) override;
    // automatic prefix
    void embed(const std::vector<std::string> &texts, float *embeddings, bool isRetrieval, int dimensionality = -1,
               size_t *tokenCount = nullptr, bool doMean = true, bool atlas = false) override;

private:
    // Opaque implementation state; LLamaPrivate is only forward-declared in this header.
    std::unique_ptr<LLamaPrivate> d_ptr;
    bool m_supportsEmbedding = false;
    bool m_supportsCompletion = false;

protected:
    std::vector<Token> tokenize(PromptContext &ctx, const std::string &str, bool special) override;
    bool isSpecialToken(Token id) const override;
    std::string tokenToString(Token id) const override;
    Token sampleToken(PromptContext &ctx) const override;
    bool evalTokens(PromptContext &ctx, const std::vector<int32_t> &tokens) const override;
    void shiftContext(PromptContext &promptCtx) override;
    int32_t contextLength() const override;
    const std::vector<Token> &endTokens() const override;
    bool shouldAddBOS() const override;
    int32_t maxContextLength(std::string const &modelPath) const override;
    int32_t layerCount(std::string const &modelPath) const override;

    void embedInternal(const std::vector<std::string> &texts, float *embeddings, std::string prefix, int dimensionality,
                       size_t *tokenCount, bool doMean, bool atlas, EmbedCancelCallback *cancelCb,
                       const EmbModelSpec *spec);
};

#endif // LLAMAMODEL_H
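
The header above forward-declares `struct LLamaPrivate;`, holds it through `std::unique_ptr<LLamaPrivate> d_ptr;`, and only declares `~LLamaModel();` without defining it inline. The sketch below illustrates that general pattern (an opaque private struct behind a `unique_ptr`) in a single translation unit. `Widget` and `WidgetPrivate` are hypothetical names invented for this illustration, and the fields inside `WidgetPrivate` are assumptions; none of this is taken from llamamodel.cpp.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical stand-in for the private state; only forward-declared here,
// so it is an incomplete type at the point where Widget is defined.
struct WidgetPrivate;

class Widget {
public:
    Widget();
    ~Widget();  // declared only; defined below, where WidgetPrivate is complete

    void setThreadCount(int32_t n);
    int32_t threadCount() const;
    bool isLoaded() const;

private:
    std::unique_ptr<WidgetPrivate> d_ptr;
};

// ---- everything below would normally live in the .cpp file ----

// Assumed fields, chosen only to make the example do something observable.
struct WidgetPrivate {
    int32_t n_threads = 1;
    bool loaded = false;
};

Widget::Widget() : d_ptr(std::make_unique<WidgetPrivate>()) {}

// unique_ptr's default deleter needs the complete type, so the destructor
// is defined after WidgetPrivate rather than inline in the class body.
Widget::~Widget() = default;

void Widget::setThreadCount(int32_t n) {
    assert(n > 0 && "thread count must be positive");
    d_ptr->n_threads = n;
}

int32_t Widget::threadCount() const { return d_ptr->n_threads; }

bool Widget::isLoaded() const { return d_ptr->loaded; }
```

With this layout, callers that include only the class definition never see the private fields, so those fields can change without touching the public interface. The cost is one heap allocation per object and an out-of-line destructor.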