GPT4All Node.js API
```sh
yarn add gpt4all@alpha

npm install gpt4all@alpha

pnpm install gpt4all@alpha
```
The original GPT4All TypeScript bindings are now out of date.
- New bindings created by jacoobes, limez, and the Nomic AI community, for all to use.
- The Node.js API has made strides toward mirroring the Python API. It is not a 100% mirror, but many pieces of the API resemble their Python counterparts.
- Everything should work out of the box.
- See API Reference
Chat Completion
```js
import { createCompletion, loadModel } from '../src/gpt4all.js'

const model = await loadModel('ggml-vicuna-7b-1.1-q4_2', { verbose: true });

const response = await createCompletion(model, [
    { role: 'system', content: 'You are meant to be annoying and unhelpful.' },
    { role: 'user', content: 'What is 1 + 1?' }
]);
```
Embedding
```js
import { createEmbedding, loadModel } from '../src/gpt4all.js'

const model = await loadModel('ggml-all-MiniLM-L6-v2-f16', { verbose: true });

const fltArray = createEmbedding(model, "Pain is inevitable, suffering optional");
```
Build Instructions
- binding.gyp is the compile config
- Tested on Ubuntu. Everything seems to work fine
- Tested on Windows. Everything works fine.
- Sparse testing on macOS.
- MinGW also works for building the gpt4all-backend. HOWEVER, this package works only with MSVC-built DLLs.
Requirements
- git
- node.js >= 18.0.0
- yarn
- node-gyp
  - all of its requirements
- (unix) gcc version 12
- (win) msvc version 143
  - Can be obtained with Visual Studio 2022 build tools
- python 3
- On Windows and Linux, building GPT4All requires the complete Vulkan SDK. You may download it from here: https://vulkan.lunarg.com/sdk/home
- macOS users do not need Vulkan, as GPT4All will use Metal instead.
Build (from source)
```sh
git clone https://github.com/nomic-ai/gpt4all.git
cd gpt4all-bindings/typescript
```
- The shell commands below assume the current working directory is `typescript`.
- To build and rebuild:
```sh
yarn
```
- The llama.cpp git submodule for gpt4all may be absent. If this is the case, run the following in the llama.cpp parent directory:
```sh
git submodule update --init --depth 1 --recursive
```
```sh
yarn build:backend
```
This will build platform-dependent dynamic libraries, located in `runtimes/(platform)/native`. The only current way to use them is to put them in the current working directory of your application. That is, WHEREVER YOU RUN YOUR NODE APPLICATION.
- llama-xxxx.dll is required.
- Depending on the model you are using, you'll need to select the proper model loader.
- For example, if you are running a Mosaic MPT model, you will need to select the mpt-(buildvariant).(dynamiclibrary) loader.
Test
```sh
yarn test
```
Source Overview
src/
- Extra functions to aid developer experience
- Typings for the native node addon
- The JavaScript interface
test/
- Simple unit tests for some of the exported functions.
- More advanced AI testing is not handled.
spec/
- Examples showing the average look and feel of the API.
- Should work assuming a model and the libraries are installed locally in the working directory.
index.cc
- The bridge between Node.js and C++; this is where the bindings are defined.
prompt.cc
- Handles prompting and inference of models in a thread-safe, asynchronous way.
Known Issues
- why your model may be spewing bull 💩
- The downloaded model is broken (just reinstall or download from official site)
- That's it so far
Roadmap
This package is in active development, and breaking changes may happen until the API stabilizes. Here's the todo list:
- [x] prompt models via a threadsafe function in order to have proper non-blocking behavior in Node.js
- [ ] ~~createTokenStream, an async iterator that streams each token emitted from the model. Planning on following this [example](https://github.com/nodejs/node-addon-examples/tree/main/threadsafe-async-iterator)~~ May not implement unless someone else can complete
- [x] proper unit testing (integrate with CircleCI)
- [x] publish to npm under alpha tag `gpt4all@alpha`
- [x] have more people test on other platforms (mac tester needed)
- [x] switch to new pluggable backend
- [ ] NPM bundle size reduction via optionalDependencies strategy (need help)
  - Should include prebuilds to avoid painful node-gyp errors
- [ ] createChatSession (the Python equivalent to create\_chat\_session)
API Reference
Table of Contents
- ModelType
- ModelFile
- type
- LLModel
- loadModel
- createCompletion
- createEmbedding
- CompletionOptions
- PromptMessage
- prompt_tokens
- completion_tokens
- total_tokens
- CompletionReturn
- CompletionChoice
- LLModelPromptContext
- createTokenStream
- DEFAULT_DIRECTORY
- DEFAULT_LIBRARIES_DIRECTORY
- DEFAULT_MODEL_CONFIG
- DEFAULT_PROMT_CONTEXT
- DEFAULT_MODEL_LIST_URL
- downloadModel
- DownloadModelOptions
- DownloadController
ModelType
Type of the model
Type: ("gptj"
| "llama"
| "mpt"
| "replit"
)
ModelFile
Full list of models available. @deprecated These model names are outdated and this type will not be maintained; please use a string literal instead.
gptj
List of GPT-J Models
Type: ("ggml-gpt4all-j-v1.3-groovy.bin"
| "ggml-gpt4all-j-v1.2-jazzy.bin"
| "ggml-gpt4all-j-v1.1-breezy.bin"
| "ggml-gpt4all-j.bin"
)
llama
List Llama Models
Type: ("ggml-gpt4all-l13b-snoozy.bin"
| "ggml-vicuna-7b-1.1-q4_2.bin"
| "ggml-vicuna-13b-1.1-q4_2.bin"
| "ggml-wizardLM-7B.q4_2.bin"
| "ggml-stable-vicuna-13B.q4_2.bin"
| "ggml-nous-gpt4-vicuna-13b.bin"
| "ggml-v3-13b-hermes-q5_1.bin"
)
mpt
List of MPT Models
Type: ("ggml-mpt-7b-base.bin"
| "ggml-mpt-7b-chat.bin"
| "ggml-mpt-7b-instruct.bin"
)
replit
List of Replit Models
Type: "ggml-replit-code-v1-3b.bin"
type
Model architecture. This argument currently does not have any functionality and is just used as a descriptive identifier for the user.
Type: ModelType
LLModel
LLModel class representing a language model. This is a base class that provides common functionality for different types of language models.
constructor
Initialize a new LLModel.
Parameters
path
string Absolute path to the model file.
- Throws Error If the model file does not exist.
type
Either 'gptj', 'llama', 'mpt', 'replit', or undefined.
Returns (ModelType | undefined)
name
The name of the model.
Returns string
stateSize
Get the size of the internal state of the model. NOTE: This state data is specific to the type of model you have created.
Returns number the size in bytes of the internal state of the model
threadCount
Get the number of threads used for model inference. The default is the number of physical cores your computer has.
Returns number The number of threads used for model inference.
setThreadCount
Set the number of threads used for model inference.
Parameters
newNumber
number The new number of threads.
Returns void
raw_prompt
Prompt the model with a given input and optional parameters. This is the raw output from the model; use the exported prompt functions for a processed value.
Parameters
q
string The prompt input.
params
Partial<LLModelPromptContext> Optional parameters for the prompt context.
callback
function (res: string): void Callback invoked with each piece of generated output.
Returns void
embed
Embed text with the model. Keep in mind that not all models can embed text (only BERT can, as of 2023-07-16). Use the exported createEmbedding function for a processed value.
Parameters
text
string The text to embed.
Returns Float32Array The resulting embedding.
isModelLoaded
Whether the model is loaded or not.
Returns boolean
setLibraryPath
Where to search for the pluggable backend libraries
Parameters
s
string
Returns void
getLibraryPath
Where to get the pluggable backend libraries
Returns string
loadModel
Loads a machine learning model with the specified name. This is the de facto way to create a model. By default, this will download the model from the official GPT4All website if it is not present at the given path.
Parameters
modelName
string The name of the model to load.
options
(LoadModelOptions | undefined)? (Optional) Additional options for loading the model.
Returns Promise<(InferenceModel | EmbeddingModel)> A promise that resolves to an instance of the loaded LLModel.
createCompletion
The Node.js equivalent of the Python binding's chat_completion.
Parameters
model
InferenceModel The language model object.
messages
Array<PromptMessage> The array of messages for the conversation.
options
CompletionOptions The options for creating the completion.
Returns CompletionReturn The completion result.
createEmbedding
The Node.js equivalent of the Python binding's Embed4All().embed().
Parameters
model
EmbeddingModel The language model object.
text
string The text to embed.
Returns Float32Array The resulting embedding.
CompletionOptions
Extends Partial<LLModelPromptContext>
The options for creating the completion.
verbose
Indicates if verbose logging is enabled.
Type: boolean
systemPromptTemplate
Template for the system message. Will be put before the conversation with %1 being replaced by all system messages. Note that if this is not defined, system messages will not be included in the prompt.
Type: string
promptTemplate
Template for user messages, with %1 being replaced by the message.
Type: string
promptHeader
The initial instruction for the model, on top of the prompt.
Type: string
promptFooter
The last instruction for the model, appended to the end of the prompt.
Type: string
PromptMessage
A message in the conversation, identical to OpenAI's chat message.
role
The role of the message.
Type: ("system"
| "assistant"
| "user"
)
content
The message content.
Type: string
prompt_tokens
The number of tokens used in the prompt.
Type: number
completion_tokens
The number of tokens used in the completion.
Type: number
total_tokens
The total number of tokens used.
Type: number
CompletionReturn
The result of the completion, similar to OpenAI's format.
model
The model used for the completion.
Type: string
usage
Token usage report.
Type: {prompt_tokens: number, completion_tokens: number, total_tokens: number}
choices
The generated completions.
Type: Array<CompletionChoice>
CompletionChoice
A completion choice, similar to OpenAI's format.
message
Response message
Type: PromptMessage
LLModelPromptContext
Model inference arguments for generating completions.
logitsSize
The size of the raw logits vector.
Type: number
tokensSize
The size of the raw tokens vector.
Type: number
nPast
The number of tokens in the past conversation.
Type: number
nCtx
The number of tokens possible in the context window.
Type: number
nPredict
The number of tokens to predict.
Type: number
topK
The top-k logits to sample from. Top-K sampling selects the next token only from the top K most likely tokens predicted by the model. It helps reduce the risk of generating low-probability or nonsensical tokens, but it may also limit the diversity of the output. A higher value for top-K (e.g., 100) will consider more tokens and lead to more diverse text, while a lower value (e.g., 10) will focus on the most probable tokens and generate more conservative text. 30 - 60 is a good range for most tasks.
Type: number
topP
The nucleus sampling probability threshold. Top-P limits the selection of the next token to a subset of tokens with a cumulative probability above a threshold P. This method, also known as nucleus sampling, finds a balance between diversity and quality by considering both token probabilities and the number of tokens available for sampling. When using a higher value for top-P (e.g., 0.95), the generated text becomes more diverse. On the other hand, a lower value (e.g., 0.1) produces more focused and conservative text. The default value is 0.4, which aims to be the middle ground between focus and diversity, but for more creative tasks a higher top-P value will be beneficial; about 0.5 - 0.9 is a good range for that.
Type: number
temp
The temperature to adjust the model's output distribution. Temperature is like a knob that adjusts how creative or focused the output becomes. Higher temperatures (e.g., 1.2) increase randomness, resulting in more imaginative and diverse text. Lower temperatures (e.g., 0.5) make the output more focused, predictable, and conservative. When the temperature is set to 0, the output becomes completely deterministic, always selecting the most probable next token and producing identical results each time. A safe range would be around 0.6 - 0.85, but you are free to search for the value that fits you best.
Type: number
nBatch
The number of predictions to generate in parallel. By splitting the prompt every N tokens, prompt-batch-size reduces RAM usage during processing. However, this can increase the processing time as a trade-off. If the N value is set too low (e.g., 10), long prompts with 500+ tokens will be most affected, requiring numerous processing runs to complete the prompt processing. To ensure optimal performance, setting the prompt-batch-size to 2048 allows processing of all tokens in a single run.
Type: number
repeatPenalty
The penalty factor for repeated tokens. Repeat-penalty can help penalize tokens based on how frequently they occur in the text, including the input prompt. A token that has already appeared five times is penalized more heavily than a token that has appeared only one time. A value of 1 means that there is no penalty and values larger than 1 discourage repeated tokens.
Type: number
repeatLastN
The number of last tokens to penalize. The repeat-penalty-tokens N option controls the number of tokens in the history to consider for penalizing repetition. A larger value will look further back in the generated text to prevent repetitions, while a smaller value will only consider recent tokens.
Type: number
contextErase
The percentage of context to erase if the context window is exceeded.
Type: number
createTokenStream
TODO: Help wanted to implement this
Parameters
llmodel
LLModel
messages
Array<PromptMessage>
options
CompletionOptions
Returns function (ll: LLModel): AsyncGenerator<string>
DEFAULT_DIRECTORY
From the Python API: models will be stored in (homedir)/.cache/gpt4all/
Type: string
DEFAULT_LIBRARIES_DIRECTORY
From python api: The default path for dynamic libraries to be stored. You may separate paths by a semicolon to search in multiple areas. This searches DEFAULT_DIRECTORY/libraries, cwd/libraries, and finally cwd.
Type: string
DEFAULT_MODEL_CONFIG
Default model configuration.
Type: ModelConfig
DEFAULT_PROMT_CONTEXT
Default prompt context.
Type: LLModelPromptContext
DEFAULT_MODEL_LIST_URL
Default model list url.
Type: string
downloadModel
Initiates the download of a model file. By default this downloads without waiting; use the returned controller to alter this behavior.
Parameters
modelName
string The model to be downloaded.
options
DownloadOptions Options to pass into the downloader. Default is { location: (cwd), verbose: false }.
Examples
```js
const download = downloadModel('ggml-gpt4all-j-v1.3-groovy.bin')
download.promise.then(() => console.log('Downloaded!'))
```
- Throws Error If the model already exists in the specified location.
- Throws Error If the model cannot be found at the specified url.
Returns DownloadController object that allows controlling the download process.
DownloadModelOptions
Options for the model download process.
modelPath
Location to download the model to. Default is process.cwd(), the current working directory.
Type: string
verbose
Debug mode -- check how long it took to download in seconds
Type: boolean
url
Remote download url. Defaults to https://gpt4all.io/models/gguf/<modelName>
Type: string
md5sum
MD5 sum of the model file. If this is provided, the downloaded file will be checked against this sum. If the sums do not match, an error will be thrown and the file will be deleted.
Type: string
DownloadController
Model download controller.
cancel
Cancels the ongoing download request when called.
Type: function (): void
promise
A promise resolving to the downloaded model's config once the download is done.
Type: Promise<ModelConfig>