AI/gpt4all

mirror of https://github.com/nomic-ai/gpt4all.git synced 2024-10-01 01:06:10 -04:00

gpt4all: an ecosystem of open-source chatbots trained on a massive collection of clean assistant data including code, stories and dialogue


Richard Guo 79c236cca8 updated readme and storing old readme separately		2023-05-09 10:56:04 -04:00
chat	Merge branch 'main' into chat-windows-binary	2023-03-29 10:35:31 -04:00
configs	Update finetune.yaml	2023-04-13 18:04:30 -07:00
eval_data	started eval script and added eval data	2023-03-27 21:50:08 +00:00
figs	feat: wip training log	2023-04-13 18:41:39 +00:00
peft@098962fa65	chore: peft	2023-04-12 03:50:54 +00:00
.gitignore	Merge: main into gptj	2023-04-13 15:16:31 +00:00
.gitmodules	chore: remove transformers submodule	2023-04-13 20:30:01 +00:00
build_map.py	fix: rename	2023-04-13 20:58:27 +00:00
clean.py	fix: clean where prompt is randomly 1 char	2023-04-04 20:47:21 +00:00
create_hostname.sh	feat: multinode setup	2023-04-05 02:53:04 +00:00
data.py	Merge: main into gptj	2023-04-13 15:16:31 +00:00
env.yaml	feat: env for conda, pip	2023-03-25 16:16:40 +00:00
eval_figures.py	feat: evals on new gptj models	2023-04-10 02:14:20 +00:00
eval_self_instruct.py	feat: evals on new gptj models	2023-04-10 02:14:20 +00:00
generate.py	metrics run on configs now	2023-03-28 00:09:47 +00:00
gpt4all-lora-demo.gif	GIF	2023-03-28 15:54:44 -04:00
GPT-J_MAP.md	fix: rename	2023-04-13 20:58:27 +00:00
inference.py	fix: embeddings instead of logits!!!	2023-04-08 17:05:40 +00:00
launcher.sh	Merge: main into gptj	2023-04-13 15:16:31 +00:00
LICENSE.txt	Merge: main into gptj	2023-04-13 15:16:31 +00:00
read.py	feat: train and clean data	2023-03-25 16:17:48 +00:00
README_old.md	updated readme and storing old readme separately	2023-05-09 10:56:04 -04:00
README.md	updated readme and storing old readme separately	2023-05-09 10:56:04 -04:00
requirements.txt	chore: remove transformers submodule	2023-04-13 20:30:01 +00:00
train.py	fix: num training steps for lr decay	2023-04-10 02:15:31 +00:00
TRAINING_LOG.md	fix: format	2023-04-13 20:30:45 +00:00

README.md

GPT4All

Demo, data, and code to train an open-source assistant-style large language model based on GPT-J and LLaMa

📗 Technical Report 2: GPT4All-J

📗 Technical Report 1: GPT4All

🐍 Official Python Bindings

💻 Official Typescript Bindings

💬 Official Web Chat Interface

💬 Official Chat Interface

🦜️🔗 Official Langchain Backend

Discord

GPT4All is made possible by our compute partner Paperspace.

GPT4All-J: An Apache-2 Licensed GPT4All Model

Run on an M1 Mac (not sped up!)

gpt4all.io

Find the most up-to-date information including chat UI, installation, and model information on the GPT4All Website!

Training GPT4All-J

Please see GPT4All-J Technical Report for details.

GPT4All-J Training Data

We are releasing the curated training data for anyone to replicate GPT4All-J here: GPT4All-J Training Data
- Atlas Map of Prompts
- Atlas Map of Responses

We have released updated versions of our GPT4All-J model and training data.

v1.0: The original model trained on the v1.0 dataset
v1.1-breezy: Trained on a filtered dataset from which we removed all instances of "AI language model"
v1.2-jazzy: Trained on a filtered dataset from which we also removed instances like "I'm sorry, I can't answer..." and "AI language model"

The models and data versions can be specified by passing a revision argument.

For example, to load the v1.2-jazzy model and dataset, run:

from datasets import load_dataset
from transformers import AutoModelForCausalLM

# The dataset and the model live in separate Hugging Face repositories;
# the same revision tag selects the matching version of each.
dataset = load_dataset("nomic-ai/gpt4all-j-prompt-generations", revision="v1.2-jazzy")
model = AutoModelForCausalLM.from_pretrained("nomic-ai/gpt4all-j", revision="v1.2-jazzy")
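
It can help to validate the revision tag before calling the loaders. The following is a minimal sketch that uses only the three revision names listed above; `KNOWN_REVISIONS` and `resolve_revision` are hypothetical helper names for illustration and are not part of `datasets` or `transformers`.

```python
# Revision tags copied from the version list above; the helper is illustrative only.
KNOWN_REVISIONS = {
    "v1.0": "original model trained on the v1.0 dataset",
    "v1.1-breezy": 'filtered to remove "AI language model" responses',
    "v1.2-jazzy": "additionally filtered to remove refusal-style responses",
}


def resolve_revision(name: str) -> str:
    """Return the normalized revision tag, or raise ValueError if it is unknown."""
    tag = name.strip().lower()
    if tag not in KNOWN_REVISIONS:
        known = ", ".join(sorted(KNOWN_REVISIONS))
        raise ValueError(f"unknown revision {name!r}; expected one of: {known}")
    return tag


# The returned tag can then be passed as the `revision` argument.
revision = resolve_revision("v1.2-jazzy")
```

Failing early with a list of valid tags is cheaper than discovering a typo after a download has been requested.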

GPT4All-J Training Instructions

accelerate launch --dynamo_backend=inductor --num_processes=8 --num_machines=1 --machine_rank=0 --deepspeed_multinode_launcher standard --mixed_precision=bf16  --use_deepspeed --deepspeed_config_file=configs/deepspeed/ds_config_gptj.json train.py --config configs/train/finetune_gptj.yaml
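
The command above hard-codes eight processes on a single machine. As a rough sketch, the same command can be assembled from parameters, for example to match a different GPU count. `build_launch_command` is a hypothetical helper, not part of this repository, and it only uses flags that already appear in the command above; other settings, such as the DeepSpeed config, may also need adjusting for a different setup.

```python
import shlex


def build_launch_command(
    num_processes=8,
    num_machines=1,
    machine_rank=0,
    deepspeed_config="configs/deepspeed/ds_config_gptj.json",
    train_config="configs/train/finetune_gptj.yaml",
):
    """Return the accelerate launch command as a list of arguments."""
    return [
        "accelerate", "launch",
        "--dynamo_backend=inductor",
        f"--num_processes={num_processes}",
        f"--num_machines={num_machines}",
        f"--machine_rank={machine_rank}",
        "--deepspeed_multinode_launcher", "standard",
        "--mixed_precision=bf16",
        "--use_deepspeed",
        f"--deepspeed_config_file={deepspeed_config}",
        "train.py", "--config", train_config,
    ]


# Print a shell-ready command line for a hypothetical 4-GPU machine.
command = build_launch_command(num_processes=4)
print(shlex.join(command))
```

Returning a list rather than a string means the result can be passed directly to `subprocess.run` without shell quoting concerns.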

GPU Interface

There are two ways to get up and running with this model on GPU. The setup here is slightly more involved than for the CPU model.

Clone the nomic client repo and run pip install .[GPT4All] from the repository's root directory.
Run pip install nomic and install the additional deps from the wheels built here.

Once this is done, you can run the model on GPU with a script like the following:

from nomic.gpt4all import GPT4AllGPU
m = GPT4AllGPU(LLAMA_PATH)
config = {'num_beams': 2,
          'min_new_tokens': 10,
          'max_length': 100,
          'repetition_penalty': 2.0}
out = m.generate('write me a story about a lonely computer', config)
print(out)

Here, LLAMA_PATH is the path to a Hugging Face AutoModel-compliant LLaMA model. Nomic is unable to distribute this file at this time. We are currently working on a GPT4All model that does not have this limitation.

You can pass any of the Hugging Face generation config parameters in the config.
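
As a sketch of how such a config might be assembled, the helper below layers per-call overrides on top of the defaults from the script above. `DEFAULT_CONFIG` and `make_config` are hypothetical names used for illustration; only the four default keys come from the example, and the result is assumed to be passed to `generate` in the same way as the `config` dict above.

```python
# Defaults copied from the GPU example above.
DEFAULT_CONFIG = {
    "num_beams": 2,
    "min_new_tokens": 10,
    "max_length": 100,
    "repetition_penalty": 2.0,
}


def make_config(**overrides):
    """Return a new config dict: the defaults updated with the given overrides."""
    config = dict(DEFAULT_CONFIG)  # copy, so the defaults are never mutated
    config.update(overrides)
    return config


# `do_sample` and `temperature` are standard Hugging Face generation parameters.
config = make_config(max_length=200, do_sample=True, temperature=0.7)
```

Copying the defaults on every call keeps one experiment's overrides from leaking into the next.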

Raw Models

https://gpt4all.io/models/ggml-gpt4all-j-v1.3-groovy.bin (default) (md5sum 81a09a0ddf89690372fc296ff7f625af) Current best commercially licensable model based on GPT-J and trained by Nomic AI on the latest curated GPT4All dataset.
https://gpt4all.io/models/ggml-gpt4all-l13b-snoozy.bin (md5sum 91f886b68fbce697e9a3cd501951e455) Current best non-commercially licensable model based on Llama 13b and trained by Nomic AI on the latest curated GPT4All dataset.
https://gpt4all.io/models/ggml-gpt4all-j-v1.2-jazzy.bin (md5sum 879344aaa9d62fdccbda0be7a09e7976) A commercially licensable model based on GPT-J and trained by Nomic AI on the v2 GPT4All dataset.
https://gpt4all.io/models/ggml-gpt4all-j-v1.1-breezy.bin (md5sum 61d48a82cb188cceb14ebb8082bfec37) A commercially licensable model based on GPT-J and trained by Nomic AI on the v1 GPT4All dataset.
https://gpt4all.io/models/ggml-gpt4all-j.bin (md5sum 5b5a3f9b858d33b29b52b89692415595) A commercially licensable model based on GPT-J and trained by Nomic AI on the v0 GPT4All dataset.
https://gpt4all.io/models/ggml-vicuna-7b-1.1-q4_2.bin (md5sum 29119f8fa11712704c6b22ac5ab792ea) A non-commercially licensable model based on Llama 7b and trained by teams from UC Berkeley, CMU, Stanford, MBZUAI, and UC San Diego.
https://gpt4all.io/models/ggml-vicuna-13b-1.1-q4_2.bin (md5sum 95999b7b0699e2070af63bf5d34101a8) A non-commercially licensable model based on Llama 13b and trained by teams from UC Berkeley, CMU, Stanford, MBZUAI, and UC San Diego.
https://gpt4all.io/models/ggml-wizardLM-7B.q4_2.bin (md5sum 99e6d129745a3f1fb1121abed747b05a) A non-commercially licensable model based on Llama 7b and trained by Microsoft and Peking University.
https://gpt4all.io/models/ggml-stable-vicuna-13B.q4_2.bin (md5sum 6cb4ee297537c9133bddab9692879de0) A non-commercially licensable model based on Llama 13b and RLHF trained by Stability AI.
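
After downloading one of these files, the published md5sum can be used to check that the download is complete and uncorrupted. The following is a minimal sketch using Python's standard `hashlib`; `md5sum` and `verify_model` are hypothetical helper names, and the local file path in the usage comment is an assumption about where the file was saved.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so multi-gigabyte model files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path, expected_md5):
    """Return True if the file's MD5 digest matches the published checksum."""
    return md5sum(path) == expected_md5.strip().lower()


# Usage (assumes the default model was saved to the current directory):
# verify_model("ggml-gpt4all-j-v1.3-groovy.bin", "81a09a0ddf89690372fc296ff7f625af")
```

A mismatch usually indicates a truncated or corrupted download, in which case fetching the file again is the simplest fix.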

Note that these models are only compatible with the C++ bindings found here. They will not work with any existing llama.cpp bindings, because this backend is a substantial fork of llama.cpp. GPT4All will support the ecosystem around this new C++ backend going forward.

Python bindings are imminent and will be integrated into this repository. Stay tuned on the GPT4All discord for updates.

Roadmap

Short Term

(Done) Train a GPT4All model based on GPTJ to alleviate llama distribution issues.
(Done) Create improved CPU and GPU interfaces for this model.
(Done) Integrate llama.cpp bindings
(Done) Create a good conversational chat interface for the model.
(Done) Allow users to opt in and submit their chats for subsequent training runs

Medium Term

(NOT STARTED) Integrate GPT4All with Atlas to allow for document retrieval.
- BLOCKED by GPT4All based on GPTJ
(Done) Integrate GPT4All with Langchain.
(IN PROGRESS) Build easy custom training scripts to allow users to fine tune models.

Long Term

(NOT STARTED) Allow anyone to curate training data for subsequent GPT4All releases using Atlas.
(IN PROGRESS) Democratize AI.

Reproducibility

Raw Data:

Training Data Without P3
- Explorer: https://atlas.nomic.ai/map/gpt4all_data_clean_without_p3
Full Dataset with P3
- Explorer: https://atlas.nomic.ai/map/gpt4all_data_clean
GPT4All-J Dataset
- Explorer Indexed on Prompts: https://atlas.nomic.ai/map/gpt4all-j-prompts-curated
- Explorer Indexed on Responses: https://atlas.nomic.ai/map/gpt4all-j-response-curated

You can reproduce our trained model by doing the following:

Setup

Clone the repo

git clone --recurse-submodules https://github.com/nomic-ai/gpt4all.git
cd gpt4all
git submodule update --init

Set up the environment

python -m pip install -r requirements.txt

# peft is a submodule at the repository root
cd peft
pip install -e .
cd ..

Training

accelerate launch --dynamo_backend=inductor --num_processes=8 --num_machines=1 --machine_rank=0 --deepspeed_multinode_launcher standard --mixed_precision=bf16  --use_deepspeed --deepspeed_config_file=configs/deepspeed/ds_config.json train.py --config configs/train/finetune-7b.yaml

Generate

python generate.py --config configs/generate/generate.yaml --prompt "Write a script to reverse a string in Python"

Need Help?

Join the Discord and ask for help in #gpt4all-help

Sample Generations

Provide instructions for the given exercise. Leg Raises

Leg Raises

Stand with your feet shoulder-width apart and your knees slightly bent.

Raise your arms straight out in front of you.

Slowly bend your knees and raise your heels off the ground.

Hold this position for a count of 3, then slowly lower your heels back to the ground.

Repeat the exercise 10 times.

Do not let your knees go beyond your toes.

Keep your back straight and your abdominal muscles engaged.

Do not let your hips sag or your back arch.

Keep your arms straight and your elbows locked.

Do not let your head drop or your shoulders rise.

Breathe normally throughout the exercise.

Rest for 30 seconds before repeating the exercise.

Gradually increase the number of repetitions as you become stronger.

A color description has been provided. Find the CSS code associated with that color. A light red color with a medium light shade of pink

The CSS code associated with the color provided is: #FF6347

Come up with an interesting idea for a new movie plot. Your plot should be described with a title and a summary.

The title of your movie plot should be "The Last Stand".

The summary of your movie plot should be: A group of mercenaries are hired to protect a small town from a dangerous criminal who is on the run. However, the town's sheriff is not willing to let them take over and decides to take matters into his own hands.

Reverse a string in python.

In Python, you can reverse a list or tuple by using the reversed() function on it. Here's an example of how to use this method with strings:
my_string = "Hello World" # Define your original string here
reversed_str = my_string[::-1]   # Use a slice and negative index for reverse order
print(reversed_str)     # Output will be reversed version of the inputted string.

List 10 dogs.

Labrador Retriever

Golden Retriever

Beagle

Poodle

German Shepherd

Bulldog

Rottweiler

Boxer

Shih Tzu

Pug

Write me a poem about the fall of Julius Caesar into a caesar salad in iambic pentameter.

The fall of Julius Caesar into a caesar salad
In iambic pentameter,
The mighty Roman emperor
Fell into a salad of lettuce and croutons,
His empire crumbling around him,
As he was devoured by the hungry diners.
The once mighty emperor
Was now just a salad topping,
His legacy reduced to a mere garnish,
As he was consumed by the masses.
The fall of Julius Caesar into a caesar salad,
In iambic pentameter,
Was a tragic end to a great man,
Who had once ruled the world with an iron fist.
But now, he was just a salad topping,
His legacy reduced to a mere garnish,
As he was consumed by the masses.

What is a three word topic describing the following keywords: baseball, football, soccer:

Sports, athletics, games

Citation

If you use this repository, models, or data in a downstream project, please consider citing it with:

@misc{gpt4all,
  author = {Yuvanesh Anand and Zach Nussbaum and Brandon Duderstadt and Benjamin Schmidt and Andriy Mulyar},
  title = {GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/nomic-ai/gpt4all}},
}
