maint: remove Docker API server and related references (#2314)

Signed-off-by: Jared Van Bortel <jared@nomic.ai>
Jared Van Bortel 2024-05-09 12:50:26 -04:00 committed by GitHub
parent 5fb9d17c00
commit 86560f3952
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
27 changed files with 2 additions and 1067 deletions

View File

@ -40,8 +40,9 @@ A GPT4All model is a 3GB - 8GB file that you can download and plug into the GPT4
  - Offline build support for running old versions of the GPT4All Local LLM Chat Client.
  - **September 18th, 2023**: [Nomic Vulkan](https://blog.nomic.ai/posts/gpt4all-gpu-inference-with-vulkan) launches supporting local LLM inference on NVIDIA and AMD GPUs.
  - **July 2023**: Stable support for LocalDocs, a feature that allows you to privately and locally chat with your data.
- - **June 28th, 2023**: Docker-based API server launches allowing inference of local LLMs from an OpenAI-compatible HTTP endpoint.
+ - **June 28th, 2023**: [Docker-based API server] launches allowing inference of local LLMs from an OpenAI-compatible HTTP endpoint.
+ [Docker-based API server]: https://github.com/nomic-ai/gpt4all/tree/cef74c2be20f5b697055d5b8b506861c7b997fab/gpt4all-api
  ### Building From Source

gpt4all-api/.gitignore (vendored, 112 lines deleted)
View File

@ -1,112 +0,0 @@
# Byte-compiled / optimized / DLL files
__pycache__/
app/__pycache__/
gpt4all_api/__pycache__/
gpt4all_api/app/api_v1/__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# VS Code
.vscode/
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# pyenv
.python-version
# celery beat schedule file
celerybeat-schedule
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
*.lock
*.cache

View File

@ -1,7 +0,0 @@
[settings]
known_third_party=geopy,nltk,np,numpy,pandas,pysbd,fire,torch
line_length=120
include_trailing_comma=True
multi_line_output=3
use_parentheses=True

View File

@ -1,13 +0,0 @@
Copyright 2023 Nomic, Inc.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

View File

@ -1,90 +0,0 @@
# GPT4All REST API
NOTICE: We are considering deprecating this API, as it has become challenging to maintain and test. If you are interested in maintaining it, would like to take it over, or want to discuss its future, please speak up in the Discord channel.
This directory contains the source code to run and build docker images that run a FastAPI app
for serving inference from GPT4All models. The API matches the OpenAI API spec.
## Tutorial
The following tutorial assumes that you have checked out this repo and cd'd into it.
### Starting the app
First change your working directory to `gpt4all/gpt4all-api`.
Now you can build the FastAPI Docker image. You only need to do this for the initial build or when you add new dependencies to the requirements.txt file:
```bash
DOCKER_BUILDKIT=1 docker build -t gpt4all_api --progress plain -f gpt4all_api/Dockerfile.buildkit .
```
Then, start the backend with:
```bash
docker compose up --build
```
This will run both the API and the locally hosted GPU inference server. If you want to run the API without the GPU inference server, you can run:
```bash
docker compose up --build gpt4all_api
```
To run the API with the GPU inference server, you will need to set environment variables (such as `MODEL_ID`). Edit the `.env` file and run:
```bash
docker compose --env-file .env up --build
```
#### Spinning up your app
Run `docker compose up` to spin up the backend. Monitor the logs for errors in case you forgot to set an environment variable above.
#### Development
Run
```bash
docker compose up --build
```
and edit files in the `app` directory. The API will hot-reload on changes.
You can run the unit tests with
```bash
make test
```
#### Viewing API documentation
Once the FastAPI app is started, you can access its documentation and test the search endpoint by going to:
```
localhost:80/docs
```
This documentation should match the OpenAI OpenAPI spec located at https://github.com/openai/openai-openapi/blob/master/openapi.yaml
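For a quick smoke test from Python, the health and engines routes can be queried directly. A minimal sketch, assuming the API is reachable on the host port 4891 published in `docker-compose.yaml`:
```python
import requests

BASE_URL = "http://localhost:4891/v1"  # host port mapped in docker-compose.yaml (assumption)

# /health returns {'status': 'ok'} when the API is up
print(requests.get(f"{BASE_URL}/health/").json())

# /engines lists the models the server knows about (fetched from models2.json)
print(requests.get(f"{BASE_URL}/engines/").json())
```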
#### Running inference
```python
import openai
openai.api_base = "http://localhost:4891/v1"
openai.api_key = "not needed for a local LLM"
def test_completion():
model = "gpt4all-j-v1.3-groovy"
prompt = "Who is Michael Jordan?"
response = openai.Completion.create(
model=model,
prompt=prompt,
max_tokens=50,
temperature=0.28,
top_p=0.95,
n=1,
echo=True,
stream=False
)
assert len(response['choices'][0]['text']) > len(prompt)
print(response)
```
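The completions endpoint can also stream tokens (see `stream_completion` in the completions route). A sketch with the same pinned `openai==0.28` client, reusing the model name from the example above:
```python
import openai

openai.api_base = "http://localhost:4891/v1"
openai.api_key = "not needed for a local LLM"

# Tokens arrive as Server-Sent Events and surface as incremental choices
for chunk in openai.Completion.create(
    model="gpt4all-j-v1.3-groovy",  # must match the model your server was booted with
    prompt="Who is Michael Jordan?",
    max_tokens=50,
    temperature=0.28,
    stream=True,
):
    print(chunk.choices[0].text, end="", flush=True)
```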

View File

@ -1,24 +0,0 @@
version: "3.8"
services:
gpt4all_gpu:
image: ghcr.io/huggingface/text-generation-inference:0.9.3
container_name: gpt4all_gpu
    restart: always # restart on error (e.g. the app entered a bad state after a code change was saved)
environment:
- HUGGING_FACE_HUB_TOKEN=token
- USE_FLASH_ATTENTION=false
- MODEL_ID=''
- NUM_SHARD=1
command: --model-id $MODEL_ID --num-shard $NUM_SHARD
volumes:
- ./:/data
ports:
- "8080:80"
shm_size: 1g
deploy:
resources:
reservations:
devices:
- driver: nvidia
capabilities: [gpu]

View File

@ -1,22 +0,0 @@
version: "3.8"
services:
gpt4all_api:
image: gpt4all_api
container_name: gpt4all_api
    restart: always # restart on error (e.g. the app entered a bad state after a code change was saved)
ports:
- "4891:4891"
env_file:
- .env
environment:
- APP_ENVIRONMENT=dev
- WEB_CONCURRENCY=2
- LOGLEVEL=debug
- PORT=4891
- model=${MODEL_BIN} # using variable from .env file
- inference_mode=cpu
volumes:
- './gpt4all_api/app:/app'
- './gpt4all_api/models:/models' # models are mounted in the container
command: ["/start-reload.sh"]

View File

@ -1,17 +0,0 @@
# syntax=docker/dockerfile:1.0.0-experimental
FROM tiangolo/uvicorn-gunicorn:python3.11
# Put first so anytime this file changes other cached layers are invalidated.
COPY gpt4all_api/requirements.txt /requirements.txt
RUN pip install --upgrade pip
# Run various pip install commands with ssh keys from host machine.
RUN --mount=type=ssh pip install -r /requirements.txt && \
rm -Rf /root/.cache && rm -Rf /tmp/pip-install*
# Finally, copy app and client.
COPY gpt4all_api/app /app
RUN mkdir -p /models

View File

@ -1 +0,0 @@
# FastAPI app for serving GPT4All models

View File

@ -1,9 +0,0 @@
from api_v1.routes import chat, completions, engines, health
from fastapi import APIRouter
router = APIRouter()
router.include_router(chat.router)
router.include_router(completions.router)
router.include_router(engines.router)
router.include_router(health.router)

View File

@ -1,29 +0,0 @@
import logging
from api_v1.settings import settings
from fastapi import HTTPException
from fastapi.responses import JSONResponse
from starlette.requests import Request
log = logging.getLogger(__name__)
startup_msg_fmt = """
Starting up GPT4All API
"""
async def on_http_error(request: Request, exc: HTTPException):
return JSONResponse({'detail': exc.detail}, status_code=exc.status_code)
async def on_startup(app):
startup_msg = startup_msg_fmt.format(settings=settings)
log.info(startup_msg)
def startup_event_handler(app):
async def start_app() -> None:
await on_startup(app)
return start_app

View File

@ -1,103 +0,0 @@
import logging
import time
from typing import List
from uuid import uuid4
from fastapi import APIRouter, HTTPException
from gpt4all import GPT4All
from pydantic import BaseModel, Field
from api_v1.settings import settings
from fastapi.responses import StreamingResponse
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
### This should follow https://github.com/openai/openai-openapi/blob/master/openapi.yaml
class ChatCompletionMessage(BaseModel):
role: str
content: str
class ChatCompletionRequest(BaseModel):
model: str = Field(settings.model, description='The model to generate a completion from.')
messages: List[ChatCompletionMessage] = Field(..., description='Messages for the chat completion.')
temperature: float = Field(settings.temp, description='Model temperature')
class ChatCompletionChoice(BaseModel):
message: ChatCompletionMessage
index: int
logprobs: float
finish_reason: str
class ChatCompletionUsage(BaseModel):
prompt_tokens: int
completion_tokens: int
total_tokens: int
class ChatCompletionResponse(BaseModel):
id: str
object: str = 'text_completion'
created: int
model: str
choices: List[ChatCompletionChoice]
usage: ChatCompletionUsage
router = APIRouter(prefix="/chat", tags=["Completions Endpoints"])
@router.post("/completions", response_model=ChatCompletionResponse)
async def chat_completion(request: ChatCompletionRequest):
'''
Completes a GPT4All model response based on the last message in the chat.
'''
# GPU is not implemented yet
if settings.inference_mode == "gpu":
raise HTTPException(status_code=400,
detail=f"Not implemented yet: Can only infer in CPU mode.")
# we only support the configured model
if request.model != settings.model:
raise HTTPException(status_code=400,
detail=f"The GPT4All inference server is booted to only infer: `{settings.model}`")
    # run only if we have a message
if request.messages:
model = GPT4All(model_name=settings.model, model_path=settings.gpt4all_path)
# format system message and conversation history correctly
formatted_messages = ""
for message in request.messages:
formatted_messages += f"<|im_start|>{message.role}\n{message.content}<|im_end|>\n"
# the LLM will complete the response of the assistant
formatted_messages += "<|im_start|>assistant\n"
response = model.generate(
prompt=formatted_messages,
temp=request.temperature
)
# the LLM may continue to hallucinate the conversation, but we want only the first response
# so, cut off everything after first <|im_end|>
index = response.find("<|im_end|>")
response_content = response[:index].strip()
else:
response_content = "No messages received."
# Create a chat message for the response
response_message = ChatCompletionMessage(role="assistant", content=response_content)
# Create a choice object with the response message
response_choice = ChatCompletionChoice(
message=response_message,
index=0,
logprobs=-1.0, # Placeholder value
finish_reason="length" # Placeholder value
)
# Create the response object
chat_response = ChatCompletionResponse(
id=str(uuid4()),
created=int(time.time()),
model=request.model,
choices=[response_choice],
usage=ChatCompletionUsage(prompt_tokens=0, completion_tokens=0, total_tokens=0), # Placeholder values
)
return chat_response
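A client-side sketch of exercising this route with the `openai==0.28` client pinned in `requirements.txt`, assuming the server from `docker-compose.yaml` is listening on localhost:4891:
```python
import openai

openai.api_base = "http://localhost:4891/v1"
openai.api_key = "not needed for a local LLM"

# The route rejects any model other than the one the server was configured with (settings.model)
response = openai.ChatCompletion.create(
    model="ggml-mpt-7b-chat.bin",  # replace with your configured model
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who is Michael Jordan?"},
    ],
)
print(response.choices[0].message.content)
```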

View File

@ -1,215 +0,0 @@
import json
import logging
import time
from typing import AsyncIterable, Dict, Iterable, List, Optional, Union
from uuid import uuid4
import aiohttp
import asyncio
from api_v1.settings import settings
from fastapi import APIRouter, Depends, Response, Security, status, HTTPException
from fastapi.responses import StreamingResponse
from gpt4all import GPT4All
from pydantic import BaseModel, Field
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
### This should follow https://github.com/openai/openai-openapi/blob/master/openapi.yaml
class CompletionRequest(BaseModel):
model: str = Field(settings.model, description='The model to generate a completion from.')
prompt: Union[List[str], str] = Field(..., description='The prompt to begin completing from.')
max_tokens: int = Field(None, description='Max tokens to generate')
temperature: float = Field(settings.temp, description='Model temperature')
top_p: Optional[float] = Field(settings.top_p, description='top_p')
top_k: Optional[int] = Field(settings.top_k, description='top_k')
n: int = Field(1, description='How many completions to generate for each prompt')
stream: bool = Field(False, description='Stream responses')
repeat_penalty: float = Field(settings.repeat_penalty, description='Repeat penalty')
class CompletionChoice(BaseModel):
text: str
index: int
logprobs: float
finish_reason: str
class CompletionUsage(BaseModel):
prompt_tokens: int
completion_tokens: int
total_tokens: int
class CompletionResponse(BaseModel):
id: str
object: str = 'text_completion'
created: int
model: str
choices: List[CompletionChoice]
usage: CompletionUsage
class CompletionStreamResponse(BaseModel):
id: str
object: str = 'text_completion'
created: int
model: str
choices: List[CompletionChoice]
router = APIRouter(prefix="/completions", tags=["Completion Endpoints"])
def stream_completion(output: Iterable, base_response: CompletionStreamResponse):
"""
Streams a GPT4All output to the client.
Args:
output: The output of GPT4All.generate(), which is an iterable of tokens.
base_response: The base response object, which is cloned and modified for each token.
Returns:
A Generator of CompletionStreamResponse objects, which are serialized to JSON Event Stream format.
"""
for token in output:
chunk = base_response.copy()
chunk.choices = [dict(CompletionChoice(
text=token,
index=0,
logprobs=-1,
finish_reason=''
))]
yield f"data: {json.dumps(dict(chunk))}\n\n"
async def gpu_infer(payload, header):
async with aiohttp.ClientSession() as session:
try:
async with session.post(
settings.hf_inference_server_host, headers=header, data=json.dumps(payload)
) as response:
resp = await response.json()
return resp
        except aiohttp.ServerConnectionError as e:
            # Handle server-side connection errors (e.g., disconnects, timeouts);
            # checked before ClientError, which it subclasses
            logger.error(f"Server error: {e}")
        except aiohttp.ClientError as e:
            # Handle other client-side errors (e.g., connection error, invalid URL)
            logger.error(f"Client error: {e}")
except json.JSONDecodeError as e:
# Handle JSON decoding errors
logger.error(f"JSON decoding error: {e}")
except Exception as e:
# Handle other unexpected exceptions
logger.error(f"Unexpected error: {e}")
@router.post("/", response_model=CompletionResponse)
async def completions(request: CompletionRequest):
'''
Completes a GPT4All model response.
'''
if settings.inference_mode == "gpu":
params = request.dict(exclude={'model', 'prompt', 'max_tokens', 'n'})
params["max_new_tokens"] = request.max_tokens
params["num_return_sequences"] = request.n
header = {"Content-Type": "application/json"}
if isinstance(request.prompt, list):
tasks = []
for prompt in request.prompt:
payload = {"parameters": params}
payload["inputs"] = prompt
task = gpu_infer(payload, header)
tasks.append(task)
results = await asyncio.gather(*tasks)
choices = []
for response in results:
scores = response["scores"] if "scores" in response else -1.0
choices.append(
dict(
CompletionChoice(
text=response["generated_text"], index=0, logprobs=scores, finish_reason='stop'
)
)
)
return CompletionResponse(
id=str(uuid4()),
created=time.time(),
model=request.model,
choices=choices,
usage={'prompt_tokens': 0, 'completion_tokens': 0, 'total_tokens': 0},
)
else:
payload = {"parameters": params}
# If streaming, we need to return a StreamingResponse
payload["inputs"] = request.prompt
resp = await gpu_infer(payload, header)
output = resp["generated_text"]
# this returns all logprobs
scores = resp["scores"] if "scores" in resp else -1.0
return CompletionResponse(
id=str(uuid4()),
created=time.time(),
model=request.model,
choices=[dict(CompletionChoice(text=output, index=0, logprobs=scores, finish_reason='stop'))],
usage={'prompt_tokens': 0, 'completion_tokens': 0, 'total_tokens': 0},
)
else:
if request.model != settings.model:
raise HTTPException(status_code=400,
detail=f"The GPT4All inference server is booted to only infer: `{settings.model}`")
if isinstance(request.prompt, list):
if len(request.prompt) > 1:
                raise HTTPException(status_code=400, detail="Can only process one prompt per request in CPU mode.")
else:
request.prompt = request.prompt[0]
model = GPT4All(model_name=settings.model, model_path=settings.gpt4all_path)
output = model.generate(prompt=request.prompt,
max_tokens=request.max_tokens,
streaming=request.stream,
top_k=request.top_k,
top_p=request.top_p,
temp=request.temperature,
)
# If streaming, we need to return a StreamingResponse
if request.stream:
base_chunk = CompletionStreamResponse(
id=str(uuid4()),
created=time.time(),
model=request.model,
choices=[]
)
return StreamingResponse((response for response in stream_completion(output, base_chunk)),
media_type="text/event-stream")
else:
return CompletionResponse(
id=str(uuid4()),
created=time.time(),
model=request.model,
choices=[dict(CompletionChoice(
text=output,
index=0,
logprobs=-1,
finish_reason='stop'
))],
usage={
'prompt_tokens': 0, # TODO how to compute this?
'completion_tokens': 0,
'total_tokens': 0
}
)
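Because the streaming branch serializes each chunk as a `data: {...}` Server-Sent Event via `stream_completion`, the stream can also be consumed without the OpenAI client. A rough sketch with `requests`, under the same localhost:4891 assumption:
```python
import json
import requests

payload = {
    "model": "ggml-mpt-7b-chat.bin",  # must match the model the server was booted with
    "prompt": "Who is Michael Jordan?",
    "max_tokens": 50,
    "stream": True,
}
with requests.post("http://localhost:4891/v1/completions/", json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if line.startswith(b"data: "):
            chunk = json.loads(line[len(b"data: "):])
            print(chunk["choices"][0]["text"], end="", flush=True)
```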

View File

@ -1,65 +0,0 @@
from typing import List, Union
from fastapi import APIRouter
from api_v1.settings import settings
from gpt4all import Embed4All
from pydantic import BaseModel, Field
### This should follow https://github.com/openai/openai-openapi/blob/master/openapi.yaml
class EmbeddingRequest(BaseModel):
model: str = Field(
settings.model, description="The model to generate an embedding from."
)
input: Union[str, List[str], List[int], List[List[int]]] = Field(
..., description="Input text to embed, encoded as a string or array of tokens."
)
class EmbeddingUsage(BaseModel):
prompt_tokens: int = 0
total_tokens: int = 0
class Embedding(BaseModel):
index: int = 0
object: str = "embedding"
embedding: List[float]
class EmbeddingResponse(BaseModel):
object: str = "list"
model: str
data: List[Embedding]
usage: EmbeddingUsage
router = APIRouter(prefix="/embeddings", tags=["Embedding Endpoints"])
embedder = Embed4All()
def get_embedding(data: EmbeddingRequest) -> EmbeddingResponse:
"""
Calculates the embedding for the given input using a specified model.
Args:
data (EmbeddingRequest): An EmbeddingRequest object containing the input data
and model name.
Returns:
EmbeddingResponse: An EmbeddingResponse object encapsulating the calculated embedding,
usage info, and the model name.
"""
embedding = embedder.embed(data.input)
return EmbeddingResponse(
data=[Embedding(embedding=embedding)], usage=EmbeddingUsage(), model=data.model
)
@router.post("/", response_model=EmbeddingResponse)
def embeddings(data: EmbeddingRequest):
"""
Creates a GPT4All embedding
"""
return get_embedding(data)
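A short in-process sketch of how `get_embedding` shapes its response. The `api_v1.routes.embeddings` module path is assumed from the layout used in `api.py`, and since `api.py` above does not wire this router into the v1 API, the function is exercised directly here; the model name is only echoed back:
```python
from api_v1.routes.embeddings import EmbeddingRequest, get_embedding  # module path assumed

# Embed a single string and inspect the response shape
request = EmbeddingRequest(model="embed4all-default", input="Who is Michael Jordan?")
response = get_embedding(request)
print(response.model)                   # echoes the requested model label
print(len(response.data[0].embedding))  # embedding dimensionality
```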

View File

@ -1,39 +0,0 @@
import logging
import requests
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel, Field
from typing import Dict, List
logger = logging.getLogger(__name__)
# Define the router for the engines module
router = APIRouter(prefix="/engines", tags=["Search Endpoints"])
# Define the models for the engines module
class ListEnginesResponse(BaseModel):
data: List[Dict] = Field(..., description="All available models.")
class EngineResponse(BaseModel):
data: List[Dict] = Field(..., description="All available models.")
# Define the routes for the engines module
@router.get("/", response_model=ListEnginesResponse)
async def list_engines():
try:
response = requests.get('https://raw.githubusercontent.com/nomic-ai/gpt4all/main/gpt4all-chat/metadata/models2.json')
response.raise_for_status() # This will raise an HTTPError if the HTTP request returned an unsuccessful status code
engines = response.json()
return ListEnginesResponse(data=engines)
except requests.RequestException as e:
logger.error(f"Error fetching engine list: {e}")
raise HTTPException(status_code=500, detail="Error fetching engine list")
# Define the routes for the engines module
@router.get("/{engine_id}", response_model=EngineResponse)
async def retrieve_engine(engine_id: str):
try:
# Implement logic to fetch a specific engine's details
# This is a placeholder, replace with your actual data retrieval logic
engine_details = {"id": engine_id, "name": "Engine Name", "description": "Engine Description"}
return EngineResponse(data=[engine_details])
except Exception as e:
logger.error(f"Error fetching engine details: {e}")
raise HTTPException(status_code=500, detail=f"Error fetching details for engine {engine_id}")

View File

@ -1,13 +0,0 @@
import logging
from fastapi import APIRouter
from fastapi.responses import JSONResponse
log = logging.getLogger(__name__)
router = APIRouter(prefix="/health", tags=["Health"])
@router.get('/', response_class=JSONResponse)
async def health_check():
"""Runs a health check on this instance of the API."""
return JSONResponse({'status': 'ok'}, headers={'Access-Control-Allow-Origin': '*'})

View File

@ -1,19 +0,0 @@
from pydantic import BaseSettings
class Settings(BaseSettings):
app_environment = 'dev'
model: str = 'ggml-mpt-7b-chat.bin'
gpt4all_path: str = '/models'
inference_mode: str = "cpu"
hf_inference_server_host: str = "http://gpt4all_gpu:80/generate"
sentry_dns: str = None
temp: float = 0.18
top_p: float = 1.0
top_k: int = 50
repeat_penalty: float = 1.18
settings = Settings()
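Since `Settings` extends pydantic's `BaseSettings`, every field can be overridden by an environment variable of the same name, which is how `docker-compose.yaml` injects `model` and `inference_mode`. A small sketch:
```python
import os

from api_v1.settings import Settings  # the module also builds a global `settings` at import time

# Environment variables override the class defaults; pydantic BaseSettings matches them case-insensitively
os.environ["model"] = "mistral-7b-instruct-v0.1.Q4_0"
os.environ["inference_mode"] = "cpu"

overridden = Settings()
print(overridden.model, overridden.inference_mode, overridden.temp)
```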

View File

@ -1,3 +0,0 @@
desc = 'GPT4All API'
endpoint_paths = {'health': '/health'}

View File

@ -1,84 +0,0 @@
import logging
import os
import docs
from api_v1 import events
from api_v1.api import router as v1_router
from api_v1.settings import settings
from fastapi import FastAPI, HTTPException, Request
from fastapi.logger import logger as fastapi_logger
from starlette.middleware.cors import CORSMiddleware
logger = logging.getLogger(__name__)
app = FastAPI(title='GPT4All API', description=docs.desc)
# CORS Configuration (in case you want to deploy)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["GET", "POST", "OPTIONS"],
allow_headers=["*"],
)
logger.info('Adding v1 endpoints..')
# add v1
app.include_router(v1_router, prefix='/v1')
app.add_event_handler('startup', events.startup_event_handler(app))
app.add_exception_handler(HTTPException, events.on_http_error)
@app.on_event("startup")
async def startup():
global model
if settings.inference_mode == "cpu":
logger.info(f"Downloading/fetching model: {os.path.join(settings.gpt4all_path, settings.model)}")
from gpt4all import GPT4All
model = GPT4All(model_name=settings.model, model_path=settings.gpt4all_path)
logger.info(f"GPT4All API is ready to infer from {settings.model} on CPU.")
else:
        # TODO: block here until the HF inference server is up.
        logger.info(f"GPT4All API is ready to infer from {settings.model} via the HF inference server.")
@app.on_event("shutdown")
async def shutdown():
logger.info("Shutting down API")
if settings.sentry_dns is not None:
import sentry_sdk
def traces_sampler(sampling_context):
if 'health' in sampling_context['transaction_context']['name']:
return False
sentry_sdk.init(
dsn=settings.sentry_dns, traces_sample_rate=0.1, traces_sampler=traces_sampler, send_default_pii=False
)
# This is needed to get logs to show up in the app
if "gunicorn" in os.environ.get("SERVER_SOFTWARE", ""):
gunicorn_error_logger = logging.getLogger("gunicorn.error")
gunicorn_logger = logging.getLogger("gunicorn")
root_logger = logging.getLogger()
fastapi_logger.setLevel(gunicorn_logger.level)
fastapi_logger.handlers = gunicorn_error_logger.handlers
root_logger.setLevel(gunicorn_logger.level)
uvicorn_logger = logging.getLogger("uvicorn.access")
uvicorn_logger.handlers = gunicorn_error_logger.handlers
else:
# https://github.com/tiangolo/fastapi/issues/2019
LOG_FORMAT2 = (
"[%(asctime)s %(process)d:%(threadName)s] %(name)s - %(levelname)s - %(message)s | %(filename)s:%(lineno)d"
)
logging.basicConfig(level=logging.INFO, format=LOG_FORMAT2)

View File

@ -1,93 +0,0 @@
"""
Use the OpenAI python API to test gpt4all models.
"""
from typing import List, get_args
import os
from dotenv import load_dotenv
import openai
openai.api_base = "http://localhost:4891/v1"
openai.api_key = "not needed for a local LLM"
# Load the .env file
env_path = 'gpt4all-api/gpt4all_api/.env'
load_dotenv(dotenv_path=env_path)
# Fetch MODEL_BIN and EMBEDDING from the .env file
model_id = os.getenv('MODEL_BIN', 'default_model_id')
embedding = os.getenv('EMBEDDING', 'default_embedding_model_id')
print(model_id)
print(embedding)
def test_completion():
model = model_id
prompt = "Who is Michael Jordan?"
response = openai.Completion.create(
model=model, prompt=prompt, max_tokens=50, temperature=0.28, top_p=0.95, n=1, echo=True, stream=False
)
assert len(response['choices'][0]['text']) > len(prompt)
def test_streaming_completion():
model = model_id
prompt = "Who is Michael Jordan?"
tokens = []
for resp in openai.Completion.create(
model=model,
prompt=prompt,
max_tokens=50,
temperature=0.28,
top_p=0.95,
n=1,
echo=True,
stream=True):
tokens.append(resp.choices[0].text)
assert (len(tokens) > 0)
assert (len("".join(tokens)) > len(prompt))
# Modified batch test: completions are requested one at a time due to a KeyError in batched responses
def test_batched_completion():
model = model_id # replace with your specific model ID
prompt = "Who is Michael Jordan?"
responses = []
# Loop to create completions one at a time
for _ in range(3):
response = openai.Completion.create(
model=model, prompt=prompt, max_tokens=50, temperature=0.28, top_p=0.95, n=1, echo=True, stream=False
)
responses.append(response)
# Assertions to check the responses
for response in responses:
assert len(response['choices'][0]['text']) > len(prompt)
assert len(responses) == 3
def test_embedding():
model = embedding
prompt = "Who is Michael Jordan?"
response = openai.Embedding.create(model=model, input=prompt)
output = response["data"][0]["embedding"]
args = get_args(List[float])
assert response["model"] == model
assert isinstance(output, list)
assert all(isinstance(x, args) for x in output)
def test_chat_completion():
model = model_id
response = openai.ChatCompletion.create(
model=model,
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Knock knock."},
{"role": "assistant", "content": "Who's there?"},
{"role": "user", "content": "Orange."},
]
)
assert response.choices[0].message.role == "assistant"
assert len(response.choices[0].message.content) > 0

View File

@ -1,3 +0,0 @@
# Add your GGUF-compatible LLM here, e.g. MODEL_BIN="mistral-7b-instruct-v0.1.Q4_0", then rename this file to ".env"
# Make sure this matches the model you placed inside the models folder
MODEL_BIN=""

View File

@ -1 +0,0 @@
### Drop GGUF-compatible models here; make sure the filename matches MODEL_BIN in your .env file

View File

@ -1,13 +0,0 @@
aiohttp>=3.6.2
aiofiles
pydantic>=1.4.0,<2.0.0
requests>=2.24.0
ujson>=2.0.2
fastapi>=0.95.0
Jinja2>=3.0
gpt4all>=1.0.0
pytest
openai==0.28.0
black
isort
python-dotenv

View File

@ -1,46 +0,0 @@
ROOT_DIR:=$(shell dirname $(realpath $(lastword $(MAKEFILE_LIST))))
APP_NAME:=gpt4all_api
PYTHON:=python3.8
SHELL := /bin/bash
all: dependencies
fresh: clean dependencies
testenv: clean_testenv test_build
docker compose -f docker-compose.yaml up --build
testenv_gpu: clean_testenv test_build
docker compose -f docker-compose.yaml -f docker-compose.gpu.yaml up --build
testenv_d: clean_testenv test_build
	docker compose up --build -d
test:
docker compose exec $(APP_NAME) pytest -svv --disable-warnings -p no:cacheprovider /app/tests
test_build:
DOCKER_BUILDKIT=1 docker build -t $(APP_NAME) --progress plain -f $(APP_NAME)/Dockerfile.buildkit .
clean_testenv:
docker compose down -v
fresh_testenv: clean_testenv testenv
venv:
if [ ! -d $(ROOT_DIR)/venv ]; then $(PYTHON) -m venv $(ROOT_DIR)/venv; fi
dependencies: venv
source $(ROOT_DIR)/venv/bin/activate; $(PYTHON) -m pip install -r $(ROOT_DIR)/$(APP_NAME)/requirements.txt
clean: clean_testenv
# Remove existing environment
rm -rf $(ROOT_DIR)/venv;
rm -rf $(ROOT_DIR)/$(APP_NAME)/*.pyc;
black:
source $(ROOT_DIR)/venv/bin/activate; black -l 120 -S --target-version py38 $(APP_NAME)
isort:
source $(ROOT_DIR)/venv/bin/activate; isort --ignore-whitespace --atomic -w 120 $(APP_NAME)

View File

@ -26,7 +26,6 @@ is organized as a monorepo with the following structure:
  - **gpt4all-backend**: The GPT4All backend maintains and exposes a universal, performance optimized C API for running inference with multi-billion parameter Transformer Decoders.
  This C API is then bound to any higher level programming language such as C++, Python, Go, etc.
  - **gpt4all-bindings**: GPT4All bindings contain a variety of high-level programming languages that implement the C API. Each directory is a bound programming language. The [CLI](gpt4all_cli.md) is included here, as well.
- - **gpt4all-api**: The GPT4All API (under initial development) exposes REST API endpoints for gathering completions and embeddings from large language models.
  - **gpt4all-chat**: GPT4All Chat is an OS native chat application that runs on macOS, Windows and Linux. It is the easiest way to run local, privacy aware chat assistants on everyday hardware. You can download it on the [GPT4All Website](https://gpt4all.io) and read its source code in the monorepo.
  Explore detailed documentation for the backend, bindings and chat client in the sidebar.
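For instance, the Python binding wraps that C API behind a small interface. A minimal sketch of local inference with the `gpt4all` package (the model name is illustrative):
```python
from gpt4all import GPT4All

# Downloads the model on first use, then runs inference locally through the C backend
model = GPT4All("mistral-7b-instruct-v0.1.Q4_0.gguf")
print(model.generate("Who is Michael Jordan?", max_tokens=64))
```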

View File

@ -1,8 +0,0 @@
# GPT4All Docker
This directory will contain Dockerfiles to build out different gpt4all recipes.
For example:
1. Docker container that builds out gpt4all RESTful API.
2. Docker container that builds out gpt4all model backends and Python bindings.
3. Docker container that builds out everything.
4. etc.

View File

@ -1,36 +0,0 @@
# Monorepo Plan (DRAFT)
## Directory Structure
- gpt4all-api
- RESTful API
- gpt4all-backend
- C/C++ (ggml) model backends
- gpt4all-bindings
- Language bindings for model backends
- gpt4all-chat
- Chat GUI
- gpt4all-docker
- Dockerfile recipes for various gpt4all builds
- gpt4all-training
- Model training/inference/eval code
## Transition Plan:
This is roughly based on what's feasible now and the path of least resistance.
1. Clean up gpt4all-training.
- Remove deprecated/unneeded files
- Organize into separate training, inference, eval, etc. directories
2. Clean up gpt4all-chat so it roughly has the same structure as above
- Separate into gpt4all-chat and gpt4all-backends
- Separate model backends into separate subdirectories (e.g. llama, gptj)
3. Develop Python bindings (high priority and in-flight)
- Release Python binding as PyPi package
- Reimplement [Nomic GPT4All](https://github.com/nomic-ai/nomic/blob/main/nomic/gpt4all/gpt4all.py#L58-L190) to call new Python bindings
4. Develop Dockerfiles for different combinations of model backends and bindings
- Dockerfile for just model backend
- Dockerfile for model backend and Python bindings
5. Develop RESTful API / FastAPI