diff --git a/README.md b/README.md index 9e73eb6f..0d39fd21 100644 --- a/README.md +++ b/README.md @@ -40,8 +40,9 @@ A GPT4All model is a 3GB - 8GB file that you can download and plug into the GPT4 - Offline build support for running old versions of the GPT4All Local LLM Chat Client. - **September 18th, 2023**: [Nomic Vulkan](https://blog.nomic.ai/posts/gpt4all-gpu-inference-with-vulkan) launches supporting local LLM inference on NVIDIA and AMD GPUs. - **July 2023**: Stable support for LocalDocs, a feature that allows you to privately and locally chat with your data. -- **June 28th, 2023**: Docker-based API server launches allowing inference of local LLMs from an OpenAI-compatible HTTP endpoint. +- **June 28th, 2023**: [Docker-based API server] launches allowing inference of local LLMs from an OpenAI-compatible HTTP endpoint. +[Docker-based API server]: https://github.com/nomic-ai/gpt4all/tree/cef74c2be20f5b697055d5b8b506861c7b997fab/gpt4all-api ### Building From Source diff --git a/gpt4all-api/.gitignore b/gpt4all-api/.gitignore deleted file mode 100644 index 9b518d73..00000000 --- a/gpt4all-api/.gitignore +++ /dev/null @@ -1,112 +0,0 @@ -# Byte-compiled / optimized / DLL files -__pycache__/ -app/__pycache__/ -gpt4all_api/__pycache__/ -gpt4all_api/app/api_v1/__pycache__/ -*.py[cod] -*$py.class - -# C extensions -*.so - -# VS Code -.vscode/ - -# Distribution / packaging -.Python -build/ -develop-eggs/ -dist/ -downloads/ -eggs/ -.eggs/ -lib64/ -parts/ -sdist/ -var/ -wheels/ -*.egg-info/ -.installed.cfg -*.egg -MANIFEST - -# PyInstaller -# Usually these files are written by a python script from a template -# before PyInstaller builds the exe, so as to inject date/other infos into it. -*.manifest -*.spec - -# Installer logs -pip-log.txt -pip-delete-this-directory.txt - -# Unit test / coverage reports -htmlcov/ -.tox/ -.coverage -.coverage.* -.cache -nosetests.xml -coverage.xml -*.cover -.hypothesis/ -.pytest_cache/ - -# Translations -*.mo -*.pot - -# Django stuff: -*.log -local_settings.py -db.sqlite3 - -# Flask stuff: -instance/ -.webassets-cache - -# Scrapy stuff: -.scrapy - -# Sphinx documentation -docs/_build/ - -# PyBuilder -target/ - -# Jupyter Notebook -.ipynb_checkpoints - -# pyenv -.python-version - -# celery beat schedule file -celerybeat-schedule - -# SageMath parsed files -*.sage.py - -# Environments -.env -.venv -env/ -venv/ -ENV/ -env.bak/ -venv.bak/ - -# Spyder project settings -.spyderproject -.spyproject - -# Rope project settings -.ropeproject - -# mkdocs documentation -/site - -# mypy -.mypy_cache/ - -*.lock -*.cache \ No newline at end of file diff --git a/gpt4all-api/.isort.cfg b/gpt4all-api/.isort.cfg deleted file mode 100644 index 485c85a7..00000000 --- a/gpt4all-api/.isort.cfg +++ /dev/null @@ -1,7 +0,0 @@ -[settings] -known_third_party=geopy,nltk,np,numpy,pandas,pysbd,fire,torch - -line_length=120 -include_trailing_comma=True -multi_line_output=3 -use_parentheses=True \ No newline at end of file diff --git a/gpt4all-api/LICENSE b/gpt4all-api/LICENSE deleted file mode 100644 index e12d5ef4..00000000 --- a/gpt4all-api/LICENSE +++ /dev/null @@ -1,13 +0,0 @@ -Copyright 2023 Nomic, Inc. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. 
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
\ No newline at end of file
diff --git a/gpt4all-api/README.md b/gpt4all-api/README.md
deleted file mode 100644
index 29dff5c8..00000000
--- a/gpt4all-api/README.md
+++ /dev/null
@@ -1,90 +0,0 @@
-# GPT4All REST API
-
-NOTICE: We are considering deprecating this API, as it has become challenging to maintain and test. If you are interested in maintaining it, would like to take it over, or want to discuss the future of this API, please speak up in the Discord channel.
-
-This directory contains the source code to build and run Docker images that run a FastAPI app
-for serving inference from GPT4All models. The API matches the OpenAI API spec.
-
-## Tutorial
-
-The following tutorial assumes that you have checked out this repo and cd'd into it.
-
-### Starting the app
-
-First change your working directory to `gpt4all/gpt4all-api`.
-
-Now you can build the FastAPI Docker image. You only have to do this on the initial build or when you add new dependencies to the requirements.txt file:
-```bash
-DOCKER_BUILDKIT=1 docker build -t gpt4all_api --progress plain -f gpt4all_api/Dockerfile.buildkit .
-```
-
-Then, start the backend with:
-
-```bash
-docker compose up --build
-```
-
-This will run both the API and the locally hosted GPU inference server. If you want to run the API without the GPU inference server, you can run:
-
-```bash
-docker compose up --build gpt4all_api
-```
-
-To run the API with the GPU inference server, you will need to include environment variables (such as `MODEL_ID`). Edit the `.env` file and run:
-```bash
-docker compose --env-file .env up --build
-```
-
-
-#### Spinning up your app
-Run `docker compose up` to spin up the backend. Monitor the logs for errors in case you forgot to set an environment variable above.
-
-
-#### Development
-Run
-
-```bash
-docker compose up --build
-```
-and edit files in the `app` directory. The API will hot-reload on changes.
-
-You can run the unit tests with:
-
-```bash
-make test
-```
-
-#### Viewing API documentation
-
-Once the FastAPI app is started, you can access its documentation and test the search endpoint by going to:
-```
-localhost:80/docs
-```
-
-This documentation should match the OpenAI OpenAPI spec located at https://github.com/openai/openai-openapi/blob/master/openapi.yaml
-
-
-#### Running inference
-```python
-import openai
-openai.api_base = "http://localhost:4891/v1"
-
-openai.api_key = "not needed for a local LLM"
-
-
-def test_completion():
-    model = "gpt4all-j-v1.3-groovy"
-    prompt = "Who is Michael Jordan?"
- response = openai.Completion.create( - model=model, - prompt=prompt, - max_tokens=50, - temperature=0.28, - top_p=0.95, - n=1, - echo=True, - stream=False - ) - assert len(response['choices'][0]['text']) > len(prompt) - print(response) -``` diff --git a/gpt4all-api/docker-compose.gpu.yaml b/gpt4all-api/docker-compose.gpu.yaml deleted file mode 100644 index 0ceb86d2..00000000 --- a/gpt4all-api/docker-compose.gpu.yaml +++ /dev/null @@ -1,24 +0,0 @@ -version: "3.8" - -services: - gpt4all_gpu: - image: ghcr.io/huggingface/text-generation-inference:0.9.3 - container_name: gpt4all_gpu - restart: always #restart on error (usually code compilation from save during bad state) - environment: - - HUGGING_FACE_HUB_TOKEN=token - - USE_FLASH_ATTENTION=false - - MODEL_ID='' - - NUM_SHARD=1 - command: --model-id $MODEL_ID --num-shard $NUM_SHARD - volumes: - - ./:/data - ports: - - "8080:80" - shm_size: 1g - deploy: - resources: - reservations: - devices: - - driver: nvidia - capabilities: [gpu] \ No newline at end of file diff --git a/gpt4all-api/docker-compose.yaml b/gpt4all-api/docker-compose.yaml deleted file mode 100644 index 6c9ffcf6..00000000 --- a/gpt4all-api/docker-compose.yaml +++ /dev/null @@ -1,22 +0,0 @@ -version: "3.8" - -services: - gpt4all_api: - image: gpt4all_api - container_name: gpt4all_api - restart: always #restart on error (usually code compilation from save during bad state) - ports: - - "4891:4891" - env_file: - - .env - environment: - - APP_ENVIRONMENT=dev - - WEB_CONCURRENCY=2 - - LOGLEVEL=debug - - PORT=4891 - - model=${MODEL_BIN} # using variable from .env file - - inference_mode=cpu - volumes: - - './gpt4all_api/app:/app' - - './gpt4all_api/models:/models' # models are mounted in the container - command: ["/start-reload.sh"] \ No newline at end of file diff --git a/gpt4all-api/gpt4all_api/Dockerfile.buildkit b/gpt4all-api/gpt4all_api/Dockerfile.buildkit deleted file mode 100644 index a2ae80a9..00000000 --- a/gpt4all-api/gpt4all_api/Dockerfile.buildkit +++ /dev/null @@ -1,17 +0,0 @@ -# syntax=docker/dockerfile:1.0.0-experimental -FROM tiangolo/uvicorn-gunicorn:python3.11 - -# Put first so anytime this file changes other cached layers are invalidated. -COPY gpt4all_api/requirements.txt /requirements.txt - -RUN pip install --upgrade pip - -# Run various pip install commands with ssh keys from host machine. -RUN --mount=type=ssh pip install -r /requirements.txt && \ - rm -Rf /root/.cache && rm -Rf /tmp/pip-install* - -# Finally, copy app and client. 
-COPY gpt4all_api/app /app - -RUN mkdir -p /models - diff --git a/gpt4all-api/gpt4all_api/README.md b/gpt4all-api/gpt4all_api/README.md deleted file mode 100644 index 5219c39b..00000000 --- a/gpt4all-api/gpt4all_api/README.md +++ /dev/null @@ -1 +0,0 @@ -# FastAPI app for serving GPT4All models diff --git a/gpt4all-api/gpt4all_api/app/api_v1/api.py b/gpt4all-api/gpt4all_api/app/api_v1/api.py deleted file mode 100644 index e68af796..00000000 --- a/gpt4all-api/gpt4all_api/app/api_v1/api.py +++ /dev/null @@ -1,9 +0,0 @@ -from api_v1.routes import chat, completions, engines, health -from fastapi import APIRouter - -router = APIRouter() - -router.include_router(chat.router) -router.include_router(completions.router) -router.include_router(engines.router) -router.include_router(health.router) diff --git a/gpt4all-api/gpt4all_api/app/api_v1/events.py b/gpt4all-api/gpt4all_api/app/api_v1/events.py deleted file mode 100644 index ba6f73fc..00000000 --- a/gpt4all-api/gpt4all_api/app/api_v1/events.py +++ /dev/null @@ -1,29 +0,0 @@ -import logging - -from api_v1.settings import settings -from fastapi import HTTPException -from fastapi.responses import JSONResponse -from starlette.requests import Request - -log = logging.getLogger(__name__) - - -startup_msg_fmt = """ - Starting up GPT4All API -""" - - -async def on_http_error(request: Request, exc: HTTPException): - return JSONResponse({'detail': exc.detail}, status_code=exc.status_code) - - -async def on_startup(app): - startup_msg = startup_msg_fmt.format(settings=settings) - log.info(startup_msg) - - -def startup_event_handler(app): - async def start_app() -> None: - await on_startup(app) - - return start_app diff --git a/gpt4all-api/gpt4all_api/app/api_v1/routes/chat.py b/gpt4all-api/gpt4all_api/app/api_v1/routes/chat.py deleted file mode 100644 index eec597bf..00000000 --- a/gpt4all-api/gpt4all_api/app/api_v1/routes/chat.py +++ /dev/null @@ -1,103 +0,0 @@ -import logging -import time -from typing import List -from uuid import uuid4 -from fastapi import APIRouter, HTTPException -from gpt4all import GPT4All -from pydantic import BaseModel, Field -from api_v1.settings import settings -from fastapi.responses import StreamingResponse - -logger = logging.getLogger(__name__) -logger.setLevel(logging.DEBUG) - -### This should follow https://github.com/openai/openai-openapi/blob/master/openapi.yaml -class ChatCompletionMessage(BaseModel): - role: str - content: str - -class ChatCompletionRequest(BaseModel): - model: str = Field(settings.model, description='The model to generate a completion from.') - messages: List[ChatCompletionMessage] = Field(..., description='Messages for the chat completion.') - temperature: float = Field(settings.temp, description='Model temperature') - -class ChatCompletionChoice(BaseModel): - message: ChatCompletionMessage - index: int - logprobs: float - finish_reason: str - -class ChatCompletionUsage(BaseModel): - prompt_tokens: int - completion_tokens: int - total_tokens: int - -class ChatCompletionResponse(BaseModel): - id: str - object: str = 'text_completion' - created: int - model: str - choices: List[ChatCompletionChoice] - usage: ChatCompletionUsage - -router = APIRouter(prefix="/chat", tags=["Completions Endpoints"]) - -@router.post("/completions", response_model=ChatCompletionResponse) -async def chat_completion(request: ChatCompletionRequest): - ''' - Completes a GPT4All model response based on the last message in the chat. 
- ''' - # GPU is not implemented yet - if settings.inference_mode == "gpu": - raise HTTPException(status_code=400, - detail=f"Not implemented yet: Can only infer in CPU mode.") - - # we only support the configured model - if request.model != settings.model: - raise HTTPException(status_code=400, - detail=f"The GPT4All inference server is booted to only infer: `{settings.model}`") - - # run only of we have a message - if request.messages: - model = GPT4All(model_name=settings.model, model_path=settings.gpt4all_path) - - # format system message and conversation history correctly - formatted_messages = "" - for message in request.messages: - formatted_messages += f"<|im_start|>{message.role}\n{message.content}<|im_end|>\n" - - # the LLM will complete the response of the assistant - formatted_messages += "<|im_start|>assistant\n" - response = model.generate( - prompt=formatted_messages, - temp=request.temperature - ) - - # the LLM may continue to hallucinate the conversation, but we want only the first response - # so, cut off everything after first <|im_end|> - index = response.find("<|im_end|>") - response_content = response[:index].strip() - else: - response_content = "No messages received." - - # Create a chat message for the response - response_message = ChatCompletionMessage(role="assistant", content=response_content) - - # Create a choice object with the response message - response_choice = ChatCompletionChoice( - message=response_message, - index=0, - logprobs=-1.0, # Placeholder value - finish_reason="length" # Placeholder value - ) - - # Create the response object - chat_response = ChatCompletionResponse( - id=str(uuid4()), - created=int(time.time()), - model=request.model, - choices=[response_choice], - usage=ChatCompletionUsage(prompt_tokens=0, completion_tokens=0, total_tokens=0), # Placeholder values - ) - - return chat_response diff --git a/gpt4all-api/gpt4all_api/app/api_v1/routes/completions.py b/gpt4all-api/gpt4all_api/app/api_v1/routes/completions.py deleted file mode 100644 index a403faac..00000000 --- a/gpt4all-api/gpt4all_api/app/api_v1/routes/completions.py +++ /dev/null @@ -1,215 +0,0 @@ -import json -from typing import List, Dict, Iterable, AsyncIterable -import logging -import time -from typing import Dict, List, Union, Optional -from uuid import uuid4 -import aiohttp -import asyncio -from api_v1.settings import settings -from fastapi import APIRouter, Depends, Response, Security, status, HTTPException -from fastapi.responses import StreamingResponse -from gpt4all import GPT4All -from pydantic import BaseModel, Field - -logger = logging.getLogger(__name__) -logger.setLevel(logging.DEBUG) - - -### This should follow https://github.com/openai/openai-openapi/blob/master/openapi.yaml - - -class CompletionRequest(BaseModel): - model: str = Field(settings.model, description='The model to generate a completion from.') - prompt: Union[List[str], str] = Field(..., description='The prompt to begin completing from.') - max_tokens: int = Field(None, description='Max tokens to generate') - temperature: float = Field(settings.temp, description='Model temperature') - top_p: Optional[float] = Field(settings.top_p, description='top_p') - top_k: Optional[int] = Field(settings.top_k, description='top_k') - n: int = Field(1, description='How many completions to generate for each prompt') - stream: bool = Field(False, description='Stream responses') - repeat_penalty: float = Field(settings.repeat_penalty, description='Repeat penalty') - - -class CompletionChoice(BaseModel): - text: str - 
index: int - logprobs: float - finish_reason: str - - -class CompletionUsage(BaseModel): - prompt_tokens: int - completion_tokens: int - total_tokens: int - - -class CompletionResponse(BaseModel): - id: str - object: str = 'text_completion' - created: int - model: str - choices: List[CompletionChoice] - usage: CompletionUsage - - -class CompletionStreamResponse(BaseModel): - id: str - object: str = 'text_completion' - created: int - model: str - choices: List[CompletionChoice] - - -router = APIRouter(prefix="/completions", tags=["Completion Endpoints"]) - -def stream_completion(output: Iterable, base_response: CompletionStreamResponse): - """ - Streams a GPT4All output to the client. - - Args: - output: The output of GPT4All.generate(), which is an iterable of tokens. - base_response: The base response object, which is cloned and modified for each token. - - Returns: - A Generator of CompletionStreamResponse objects, which are serialized to JSON Event Stream format. - """ - for token in output: - chunk = base_response.copy() - chunk.choices = [dict(CompletionChoice( - text=token, - index=0, - logprobs=-1, - finish_reason='' - ))] - yield f"data: {json.dumps(dict(chunk))}\n\n" - -async def gpu_infer(payload, header): - async with aiohttp.ClientSession() as session: - try: - async with session.post( - settings.hf_inference_server_host, headers=header, data=json.dumps(payload) - ) as response: - resp = await response.json() - return resp - - except aiohttp.ClientError as e: - # Handle client-side errors (e.g., connection error, invalid URL) - logger.error(f"Client error: {e}") - except aiohttp.ServerError as e: - # Handle server-side errors (e.g., internal server error) - logger.error(f"Server error: {e}") - except json.JSONDecodeError as e: - # Handle JSON decoding errors - logger.error(f"JSON decoding error: {e}") - except Exception as e: - # Handle other unexpected exceptions - logger.error(f"Unexpected error: {e}") - -@router.post("/", response_model=CompletionResponse) -async def completions(request: CompletionRequest): - ''' - Completes a GPT4All model response. 
- ''' - if settings.inference_mode == "gpu": - params = request.dict(exclude={'model', 'prompt', 'max_tokens', 'n'}) - params["max_new_tokens"] = request.max_tokens - params["num_return_sequences"] = request.n - - header = {"Content-Type": "application/json"} - if isinstance(request.prompt, list): - tasks = [] - for prompt in request.prompt: - payload = {"parameters": params} - payload["inputs"] = prompt - task = gpu_infer(payload, header) - tasks.append(task) - results = await asyncio.gather(*tasks) - - choices = [] - for response in results: - scores = response["scores"] if "scores" in response else -1.0 - choices.append( - dict( - CompletionChoice( - text=response["generated_text"], index=0, logprobs=scores, finish_reason='stop' - ) - ) - ) - - return CompletionResponse( - id=str(uuid4()), - created=time.time(), - model=request.model, - choices=choices, - usage={'prompt_tokens': 0, 'completion_tokens': 0, 'total_tokens': 0}, - ) - - else: - payload = {"parameters": params} - # If streaming, we need to return a StreamingResponse - payload["inputs"] = request.prompt - - resp = await gpu_infer(payload, header) - - output = resp["generated_text"] - # this returns all logprobs - scores = resp["scores"] if "scores" in resp else -1.0 - - return CompletionResponse( - id=str(uuid4()), - created=time.time(), - model=request.model, - choices=[dict(CompletionChoice(text=output, index=0, logprobs=scores, finish_reason='stop'))], - usage={'prompt_tokens': 0, 'completion_tokens': 0, 'total_tokens': 0}, - ) - - else: - - if request.model != settings.model: - raise HTTPException(status_code=400, - detail=f"The GPT4All inference server is booted to only infer: `{settings.model}`") - - if isinstance(request.prompt, list): - if len(request.prompt) > 1: - raise HTTPException(status_code=400, detail="Can only infer one inference per request in CPU mode.") - else: - request.prompt = request.prompt[0] - - model = GPT4All(model_name=settings.model, model_path=settings.gpt4all_path) - - output = model.generate(prompt=request.prompt, - max_tokens=request.max_tokens, - streaming=request.stream, - top_k=request.top_k, - top_p=request.top_p, - temp=request.temperature, - ) - - # If streaming, we need to return a StreamingResponse - if request.stream: - base_chunk = CompletionStreamResponse( - id=str(uuid4()), - created=time.time(), - model=request.model, - choices=[] - ) - return StreamingResponse((response for response in stream_completion(output, base_chunk)), - media_type="text/event-stream") - else: - return CompletionResponse( - id=str(uuid4()), - created=time.time(), - model=request.model, - choices=[dict(CompletionChoice( - text=output, - index=0, - logprobs=-1, - finish_reason='stop' - ))], - usage={ - 'prompt_tokens': 0, # TODO how to compute this? - 'completion_tokens': 0, - 'total_tokens': 0 - } - ) diff --git a/gpt4all-api/gpt4all_api/app/api_v1/routes/embeddings.py b/gpt4all-api/gpt4all_api/app/api_v1/routes/embeddings.py deleted file mode 100644 index 50a5590f..00000000 --- a/gpt4all-api/gpt4all_api/app/api_v1/routes/embeddings.py +++ /dev/null @@ -1,65 +0,0 @@ -from typing import List, Union -from fastapi import APIRouter -from api_v1.settings import settings -from gpt4all import Embed4All -from pydantic import BaseModel, Field - -### This should follow https://github.com/openai/openai-openapi/blob/master/openapi.yaml - - -class EmbeddingRequest(BaseModel): - model: str = Field( - settings.model, description="The model to generate an embedding from." 
- ) - input: Union[str, List[str], List[int], List[List[int]]] = Field( - ..., description="Input text to embed, encoded as a string or array of tokens." - ) - - -class EmbeddingUsage(BaseModel): - prompt_tokens: int = 0 - total_tokens: int = 0 - - -class Embedding(BaseModel): - index: int = 0 - object: str = "embedding" - embedding: List[float] - - -class EmbeddingResponse(BaseModel): - object: str = "list" - model: str - data: List[Embedding] - usage: EmbeddingUsage - - -router = APIRouter(prefix="/embeddings", tags=["Embedding Endpoints"]) - -embedder = Embed4All() - - -def get_embedding(data: EmbeddingRequest) -> EmbeddingResponse: - """ - Calculates the embedding for the given input using a specified model. - - Args: - data (EmbeddingRequest): An EmbeddingRequest object containing the input data - and model name. - - Returns: - EmbeddingResponse: An EmbeddingResponse object encapsulating the calculated embedding, - usage info, and the model name. - """ - embedding = embedder.embed(data.input) - return EmbeddingResponse( - data=[Embedding(embedding=embedding)], usage=EmbeddingUsage(), model=data.model - ) - - -@router.post("/", response_model=EmbeddingResponse) -def embeddings(data: EmbeddingRequest): - """ - Creates a GPT4All embedding - """ - return get_embedding(data) diff --git a/gpt4all-api/gpt4all_api/app/api_v1/routes/engines.py b/gpt4all-api/gpt4all_api/app/api_v1/routes/engines.py deleted file mode 100644 index 9b1e6785..00000000 --- a/gpt4all-api/gpt4all_api/app/api_v1/routes/engines.py +++ /dev/null @@ -1,39 +0,0 @@ -import requests -from fastapi import APIRouter, HTTPException -from pydantic import BaseModel, Field -from typing import List, Dict - -# Define the router for the engines module -router = APIRouter(prefix="/engines", tags=["Search Endpoints"]) - -# Define the models for the engines module -class ListEnginesResponse(BaseModel): - data: List[Dict] = Field(..., description="All available models.") - -class EngineResponse(BaseModel): - data: List[Dict] = Field(..., description="All available models.") - - -# Define the routes for the engines module -@router.get("/", response_model=ListEnginesResponse) -async def list_engines(): - try: - response = requests.get('https://raw.githubusercontent.com/nomic-ai/gpt4all/main/gpt4all-chat/metadata/models2.json') - response.raise_for_status() # This will raise an HTTPError if the HTTP request returned an unsuccessful status code - engines = response.json() - return ListEnginesResponse(data=engines) - except requests.RequestException as e: - logger.error(f"Error fetching engine list: {e}") - raise HTTPException(status_code=500, detail="Error fetching engine list") - -# Define the routes for the engines module -@router.get("/{engine_id}", response_model=EngineResponse) -async def retrieve_engine(engine_id: str): - try: - # Implement logic to fetch a specific engine's details - # This is a placeholder, replace with your actual data retrieval logic - engine_details = {"id": engine_id, "name": "Engine Name", "description": "Engine Description"} - return EngineResponse(data=[engine_details]) - except Exception as e: - logger.error(f"Error fetching engine details: {e}") - raise HTTPException(status_code=500, detail=f"Error fetching details for engine {engine_id}") \ No newline at end of file diff --git a/gpt4all-api/gpt4all_api/app/api_v1/routes/health.py b/gpt4all-api/gpt4all_api/app/api_v1/routes/health.py deleted file mode 100644 index 37f30728..00000000 --- a/gpt4all-api/gpt4all_api/app/api_v1/routes/health.py +++ /dev/null @@ 
-1,13 +0,0 @@ -import logging -from fastapi import APIRouter -from fastapi.responses import JSONResponse - -log = logging.getLogger(__name__) - -router = APIRouter(prefix="/health", tags=["Health"]) - - -@router.get('/', response_class=JSONResponse) -async def health_check(): - """Runs a health check on this instance of the API.""" - return JSONResponse({'status': 'ok'}, headers={'Access-Control-Allow-Origin': '*'}) diff --git a/gpt4all-api/gpt4all_api/app/api_v1/settings.py b/gpt4all-api/gpt4all_api/app/api_v1/settings.py deleted file mode 100644 index f15fd301..00000000 --- a/gpt4all-api/gpt4all_api/app/api_v1/settings.py +++ /dev/null @@ -1,19 +0,0 @@ -from pydantic import BaseSettings - - -class Settings(BaseSettings): - app_environment = 'dev' - model: str = 'ggml-mpt-7b-chat.bin' - gpt4all_path: str = '/models' - inference_mode: str = "cpu" - hf_inference_server_host: str = "http://gpt4all_gpu:80/generate" - sentry_dns: str = None - - temp: float = 0.18 - top_p: float = 1.0 - top_k: int = 50 - repeat_penalty: float = 1.18 - - - -settings = Settings() diff --git a/gpt4all-api/gpt4all_api/app/docs.py b/gpt4all-api/gpt4all_api/app/docs.py deleted file mode 100644 index 1a47b76a..00000000 --- a/gpt4all-api/gpt4all_api/app/docs.py +++ /dev/null @@ -1,3 +0,0 @@ -desc = 'GPT4All API' - -endpoint_paths = {'health': '/health'} diff --git a/gpt4all-api/gpt4all_api/app/main.py b/gpt4all-api/gpt4all_api/app/main.py deleted file mode 100644 index 25b794ff..00000000 --- a/gpt4all-api/gpt4all_api/app/main.py +++ /dev/null @@ -1,84 +0,0 @@ -import logging -import os - -import docs -from api_v1 import events -from api_v1.api import router as v1_router -from api_v1.settings import settings -from fastapi import FastAPI, HTTPException, Request -from fastapi.logger import logger as fastapi_logger -from starlette.middleware.cors import CORSMiddleware - -logger = logging.getLogger(__name__) - -app = FastAPI(title='GPT4All API', description=docs.desc) - -# CORS Configuration (in-case you want to deploy) -app.add_middleware( - CORSMiddleware, - allow_origins=["*"], - allow_credentials=True, - allow_methods=["GET", "POST", "OPTIONS"], - allow_headers=["*"], -) - -logger.info('Adding v1 endpoints..') - -# add v1 -app.include_router(v1_router, prefix='/v1') -app.add_event_handler('startup', events.startup_event_handler(app)) -app.add_exception_handler(HTTPException, events.on_http_error) - - -@app.on_event("startup") -async def startup(): - global model - if settings.inference_mode == "cpu": - logger.info(f"Downloading/fetching model: {os.path.join(settings.gpt4all_path, settings.model)}") - from gpt4all import GPT4All - - model = GPT4All(model_name=settings.model, model_path=settings.gpt4all_path) - - logger.info(f"GPT4All API is ready to infer from {settings.model} on CPU.") - - else: - # is it possible to do this once the server is up? - ## TODO block until HF inference server is up. 
- logger.info(f"GPT4All API is ready to infer from {settings.model} on CPU.") - - - -@app.on_event("shutdown") -async def shutdown(): - logger.info("Shutting down API") - - -if settings.sentry_dns is not None: - import sentry_sdk - - def traces_sampler(sampling_context): - if 'health' in sampling_context['transaction_context']['name']: - return False - - sentry_sdk.init( - dsn=settings.sentry_dns, traces_sample_rate=0.1, traces_sampler=traces_sampler, send_default_pii=False - ) - -# This is needed to get logs to show up in the app -if "gunicorn" in os.environ.get("SERVER_SOFTWARE", ""): - gunicorn_error_logger = logging.getLogger("gunicorn.error") - gunicorn_logger = logging.getLogger("gunicorn") - - root_logger = logging.getLogger() - fastapi_logger.setLevel(gunicorn_logger.level) - fastapi_logger.handlers = gunicorn_error_logger.handlers - root_logger.setLevel(gunicorn_logger.level) - - uvicorn_logger = logging.getLogger("uvicorn.access") - uvicorn_logger.handlers = gunicorn_error_logger.handlers -else: - # https://github.com/tiangolo/fastapi/issues/2019 - LOG_FORMAT2 = ( - "[%(asctime)s %(process)d:%(threadName)s] %(name)s - %(levelname)s - %(message)s | %(filename)s:%(lineno)d" - ) - logging.basicConfig(level=logging.INFO, format=LOG_FORMAT2) diff --git a/gpt4all-api/gpt4all_api/app/tests/test_endpoints.py b/gpt4all-api/gpt4all_api/app/tests/test_endpoints.py deleted file mode 100644 index c32b6220..00000000 --- a/gpt4all-api/gpt4all_api/app/tests/test_endpoints.py +++ /dev/null @@ -1,93 +0,0 @@ -""" -Use the OpenAI python API to test gpt4all models. -""" -from typing import List, get_args -import os -from dotenv import load_dotenv - -import openai - -openai.api_base = "http://localhost:4891/v1" -openai.api_key = "not needed for a local LLM" - -# Load the .env file -env_path = 'gpt4all-api/gpt4all_api/.env' -load_dotenv(dotenv_path=env_path) - -# Fetch MODEL_ID from .env file -model_id = os.getenv('MODEL_BIN', 'default_model_id') -embedding = os.getenv('EMBEDDING', 'default_embedding_model_id') -print (model_id) -print (embedding) - -def test_completion(): - model = model_id - prompt = "Who is Michael Jordan?" - response = openai.Completion.create( - model=model, prompt=prompt, max_tokens=50, temperature=0.28, top_p=0.95, n=1, echo=True, stream=False - ) - assert len(response['choices'][0]['text']) > len(prompt) - -def test_streaming_completion(): - model = model_id - prompt = "Who is Michael Jordan?" - tokens = [] - for resp in openai.Completion.create( - model=model, - prompt=prompt, - max_tokens=50, - temperature=0.28, - top_p=0.95, - n=1, - echo=True, - stream=True): - tokens.append(resp.choices[0].text) - - assert (len(tokens) > 0) - assert (len("".join(tokens)) > len(prompt)) - -# Modified test batch, problems with keyerror in response -def test_batched_completion(): - model = model_id # replace with your specific model ID - prompt = "Who is Michael Jordan?" - responses = [] - - # Loop to create completions one at a time - for _ in range(3): - response = openai.Completion.create( - model=model, prompt=prompt, max_tokens=50, temperature=0.28, top_p=0.95, n=1, echo=True, stream=False - ) - responses.append(response) - - # Assertions to check the responses - for response in responses: - assert len(response['choices'][0]['text']) > len(prompt) - - assert len(responses) == 3 - -def test_embedding(): - model = embedding - prompt = "Who is Michael Jordan?" 
- response = openai.Embedding.create(model=model, input=prompt) - output = response["data"][0]["embedding"] - args = get_args(List[float]) - - assert response["model"] == model - assert isinstance(output, list) - assert all(isinstance(x, args) for x in output) - -def test_chat_completion(): - model = model_id - - response = openai.ChatCompletion.create( - model=model, - messages=[ - {"role": "system", "content": "You are a helpful assistant."}, - {"role": "user", "content": "Knock knock."}, - {"role": "assistant", "content": "Who's there?"}, - {"role": "user", "content": "Orange."}, - ] - ) - - assert response.choices[0].message.role == "assistant" - assert len(response.choices[0].message.content) > 0 diff --git a/gpt4all-api/gpt4all_api/env b/gpt4all-api/gpt4all_api/env deleted file mode 100644 index 5b9d67a3..00000000 --- a/gpt4all-api/gpt4all_api/env +++ /dev/null @@ -1,3 +0,0 @@ -# Add your GGUF compatible model LLM here. ie: MODEL_BIN="mistral-7b-instruct-v0.1.Q4_0", rename file ".env" -# Make sure this LLM matches the model you placed inside the models folder -MODEL_BIN="" \ No newline at end of file diff --git a/gpt4all-api/gpt4all_api/models/README.md b/gpt4all-api/gpt4all_api/models/README.md deleted file mode 100644 index 425324f2..00000000 --- a/gpt4all-api/gpt4all_api/models/README.md +++ /dev/null @@ -1 +0,0 @@ -### Drop GGUF compatible models here, make sure it matches MODEL_BIN on your .env file \ No newline at end of file diff --git a/gpt4all-api/gpt4all_api/requirements.txt b/gpt4all-api/gpt4all_api/requirements.txt deleted file mode 100644 index 6bfe6ddd..00000000 --- a/gpt4all-api/gpt4all_api/requirements.txt +++ /dev/null @@ -1,13 +0,0 @@ -aiohttp>=3.6.2 -aiofiles -pydantic>=1.4.0,<2.0.0 -requests>=2.24.0 -ujson>=2.0.2 -fastapi>=0.95.0 -Jinja2>=3.0 -gpt4all>=1.0.0 -pytest -openai==0.28.0 -black -isort -python-dotenv \ No newline at end of file diff --git a/gpt4all-api/makefile b/gpt4all-api/makefile deleted file mode 100644 index 8c0e5ef2..00000000 --- a/gpt4all-api/makefile +++ /dev/null @@ -1,46 +0,0 @@ -ROOT_DIR:=$(shell dirname $(realpath $(lastword $(MAKEFILE_LIST)))) -APP_NAME:=gpt4all_api -PYTHON:=python3.8 -SHELL := /bin/bash - -all: dependencies - -fresh: clean dependencies - -testenv: clean_testenv test_build - docker compose -f docker-compose.yaml up --build - -testenv_gpu: clean_testenv test_build - docker compose -f docker-compose.yaml -f docker-compose.gpu.yaml up --build - -testenv_d: clean_testenv test_build - docker compose env up --build -d - -test: - docker compose exec $(APP_NAME) pytest -svv --disable-warnings -p no:cacheprovider /app/tests - -test_build: - DOCKER_BUILDKIT=1 docker build -t $(APP_NAME) --progress plain -f $(APP_NAME)/Dockerfile.buildkit . - -clean_testenv: - docker compose down -v - -fresh_testenv: clean_testenv testenv - -venv: - if [ ! 
-d $(ROOT_DIR)/venv ]; then $(PYTHON) -m venv $(ROOT_DIR)/venv; fi - -dependencies: venv - source $(ROOT_DIR)/venv/bin/activate; $(PYTHON) -m pip install -r $(ROOT_DIR)/$(APP_NAME)/requirements.txt - -clean: clean_testenv - # Remove existing environment - rm -rf $(ROOT_DIR)/venv; - rm -rf $(ROOT_DIR)/$(APP_NAME)/*.pyc; - - -black: - source $(ROOT_DIR)/venv/bin/activate; black -l 120 -S --target-version py38 $(APP_NAME) - -isort: - source $(ROOT_DIR)/venv/bin/activate; isort --ignore-whitespace --atomic -w 120 $(APP_NAME) \ No newline at end of file diff --git a/gpt4all-bindings/python/docs/index.md b/gpt4all-bindings/python/docs/index.md index b87ee14e..ed35c20f 100644 --- a/gpt4all-bindings/python/docs/index.md +++ b/gpt4all-bindings/python/docs/index.md @@ -26,7 +26,6 @@ is organized as a monorepo with the following structure: - **gpt4all-backend**: The GPT4All backend maintains and exposes a universal, performance optimized C API for running inference with multi-billion parameter Transformer Decoders. This C API is then bound to any higher level programming language such as C++, Python, Go, etc. - **gpt4all-bindings**: GPT4All bindings contain a variety of high-level programming languages that implement the C API. Each directory is a bound programming language. The [CLI](gpt4all_cli.md) is included here, as well. -- **gpt4all-api**: The GPT4All API (under initial development) exposes REST API endpoints for gathering completions and embeddings from large language models. - **gpt4all-chat**: GPT4All Chat is an OS native chat application that runs on macOS, Windows and Linux. It is the easiest way to run local, privacy aware chat assistants on everyday hardware. You can download it on the [GPT4All Website](https://gpt4all.io) and read its source code in the monorepo. Explore detailed documentation for the backend, bindings and chat client in the sidebar. diff --git a/gpt4all-docker/README.md b/gpt4all-docker/README.md deleted file mode 100644 index 8d7d97d8..00000000 --- a/gpt4all-docker/README.md +++ /dev/null @@ -1,8 +0,0 @@ -# GPT4All Docker -This directory will contain Dockerfiles to build out different gpt4all recipes. - -For example: -1. Docker container that builds out gpt4all RESTful API. -2. Docker container that builds out gpt4all model backends and Python bindings. -3. Docker container that builds out everything. -4. etc. \ No newline at end of file diff --git a/monorepo_plan.md b/monorepo_plan.md deleted file mode 100644 index deae3fc5..00000000 --- a/monorepo_plan.md +++ /dev/null @@ -1,36 +0,0 @@ -# Monorepo Plan (DRAFT) - -## Directory Structure -- gpt4all-api - - RESTful API -- gpt4all-backend - - C/C++ (ggml) model backends -- gpt4all-bindings - - Language bindings for model backends -- gpt4all-chat - - Chat GUI -- gpt4all-docker - - Dockerfile recipes for various gpt4all builds -- gpt4all-training - - Model training/inference/eval code - -## Transition Plan: -This is roughly based on what's feasible now and path of least resistance. - -1. Clean up gpt4all-training. - - Remove deprecated/unneeded files - - Organize into separate training, inference, eval, etc. directories - -2. Clean up gpt4all-chat so it roughly has same structures as above - - Separate into gpt4all-chat and gpt4all-backends - - Separate model backends into separate subdirectories (e.g. llama, gptj) - -3. 
Develop Python bindings (high priority and in-flight)
-   - Release Python bindings as a PyPI package
-   - Reimplement [Nomic GPT4All](https://github.com/nomic-ai/nomic/blob/main/nomic/gpt4all/gpt4all.py#L58-L190) to call the new Python bindings
-
-4. Develop Dockerfiles for different combinations of model backends and bindings
-   - Dockerfile for just the model backend
-   - Dockerfile for the model backend and Python bindings
-
-5. Develop RESTful API / FastAPI