# OpenAI Compatible Server
`llama-cpp-python` offers an OpenAI API compatible web server.
This web server can be used to serve local models and easily connect them to existing clients.
## Setup
### Installation
The server can be installed by running the following command:
``` bash
pip install llama-cpp-python[server]
```
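Note that some shells (zsh, for example) treat square brackets specially, in which case the extra should be quoted:
``` bash
pip install 'llama-cpp-python[server]'
```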
### Running the server
The server can then be started by running the following command:
``` bash
python3 -m llama_cpp.server --model <model_path>
```
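Once the server is running, you can sanity-check it by listing the loaded models through the OpenAI-compatible endpoint; this sketch assumes the default host and port of `localhost:8000`:
``` bash
curl http://localhost:8000/v1/models
```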
### Server options
For a full list of options, run:
``` bash
python3 -m llama_cpp.server --help
```
NOTE: All server options are also available as environment variables. For example, `--model` can be set by setting the `MODEL` environment variable.
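For example, the following starts the server using the `MODEL` environment variable in place of the `--model` flag (the model path is a placeholder):
``` bash
MODEL=<model_path> python3 -m llama_cpp.server
```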
Check out the server options reference below for more information on the available settings.
CLI arguments and environment variables are available for all of the fields defined in [`ServerSettings`](#llama_cpp.server.settings.ServerSettings) and [`ModelSettings`](#llama_cpp.server.settings.ModelSettings).
Additionally, the server supports configuration via a JSON config file; check out the [configuration section](#configuration-and-multi-model-support) for more information and examples.
## Guides
### Code Completion
`llama-cpp-python` supports code completion via GitHub Copilot.
*NOTE*: Without GPU acceleration this is unlikely to be fast enough to be usable.
You'll first need to download one of the available code completion models in GGUF format:
- [replit-code-v1_5-GGUF](https://huggingface.co/abetlen/replit-code-v1_5-3b-GGUF)
Then you'll need to run the OpenAI compatible web server with a substantially increased context size to handle GitHub Copilot requests:
``` bash
python3 -m llama_cpp.server --model <model_path> --n_ctx 16192
```
Then just update your settings in `.vscode/settings.json` to point to your code completion server:
``` json
{
    // ...
    "github.copilot.advanced": {
        "debug.testOverrideProxyUrl": "http://<host>:<port>",
        "debug.overrideProxyUrl": "http://<host>:<port>"
    }
    // ...
}
```
### Function Calling
`llama-cpp-python` supports structured function calling based on a JSON schema.
Function calling is completely compatible with the OpenAI function calling API and can be used by connecting with the official OpenAI Python client.
You'll first need to download one of the available function calling models in GGUF format:
- [functionary](https://huggingface.co/meetkai)
Then when you run the server you'll need to also specify either `functionary-v1` or `functionary-v2` as the `chat_format`.
Note that since functionary requires a HuggingFace tokenizer (due to discrepancies between llama.cpp and HuggingFace's tokenizers, as mentioned [here](https://github.com/abetlen/llama-cpp-python/blob/main?tab=readme-ov-file#function-calling)), you will need to pass in the path to the tokenizer as well. The tokenizer files are already included in the respective HF repositories hosting the GGUF files.
``` bash
python3 -m llama_cpp.server --model <model_path_to_functionary_v2_model> --chat_format functionary-v2 --hf_pretrained_model_name_or_path <model_path_to_functionary_v2_tokenizer>
```
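As a quick sketch of what a request looks like from the official OpenAI Python client: the host, port, and the `get_weather` tool below are placeholders invented for this example.
``` python
from openai import OpenAI

client = OpenAI(base_url="http://<host>:<port>/v1", api_key="sk-xxx")

# A hypothetical tool schema, used only for illustration
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "The city name"}
                },
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="functionary",  # placeholder; matches whatever model the server is serving
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)
# The model's proposed call(s), e.g. get_weather(city="Paris")
print(response.choices[0].message.tool_calls)
```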
Check out this [example notebook](https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/Functions.ipynb) for a walkthrough of some interesting use cases for function calling.
### Multimodal Models
`llama-cpp-python` supports the llava1.5 family of multi-modal models, which allow the language model to read information from both text and images.
You'll first need to download one of the available multi-modal models in GGUF format:
- [llava-v1.5-7b](https://huggingface.co/mys/ggml_llava-v1.5-7b)
- [llava-v1.5-13b](https://huggingface.co/mys/ggml_llava-v1.5-13b)
- [bakllava-1-7b](https://huggingface.co/mys/ggml_bakllava-1)
- [llava-v1.6-34b](https://huggingface.co/cjpais/llava-v1.6-34B-gguf)
- [moondream2](https://huggingface.co/vikhyatk/moondream2)
Then when you run the server you'll need to also specify the path to the CLIP model used for image embedding and the `llava-1-5` chat format:
``` bash
python3 -m llama_cpp.server --model <model_path> --clip_model_path <clip_model_path> --chat_format llava-1-5
```
Then you can just use the OpenAI API as normal:
``` python3
from openai import OpenAI

client = OpenAI(base_url="http://<host>:<port>/v1", api_key="sk-xxx")
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "<image_url>"},
                },
                {"type": "text", "text": "What does the image say?"},
            ],
        }
    ],
)
print(response)
```
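If your image is a local file rather than a URL, a common approach is to embed it as a base64 data URI in the `image_url` field. A minimal sketch (the file path and MIME type are assumptions for the example):
``` python
import base64

def image_to_data_uri(path: str, mime: str = "image/png") -> str:
    # Read a local image and encode it as a base64 data URI
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"

# Use in place of "<image_url>" above, e.g.:
# {"type": "image_url", "image_url": {"url": image_to_data_uri("image.png")}}
```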
## Configuration and Multi-Model Support
The server supports configuration via a JSON config file that can be passed using the `--config_file` parameter or the `CONFIG_FILE` environment variable.
``` bash
python3 -m llama_cpp.server --config_file <config_file>
```
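Equivalently, using the environment variable:
``` bash
CONFIG_FILE=<config_file> python3 -m llama_cpp.server
```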
Config files support all of the server and model options supported by the CLI and environment variables; however, instead of only a single model, the config file can specify multiple models.
The server supports routing requests to multiple models based on the `model` parameter in the request, which matches against the `model_alias` in the config file.
At the moment only a single model is loaded into memory at a time; the server will automatically load and unload models as needed.
``` json
{
    "host": "0.0.0.0",
    "port": 8080,
    "models": [
        {
            "model": "models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf",
            "model_alias": "gpt-3.5-turbo",
            "chat_format": "chatml",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 2048
        },
        {
            "model": "models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf",
            "model_alias": "gpt-4",
            "chat_format": "chatml",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 2048
        },
        {
            "model": "models/ggml_llava-v1.5-7b/ggml-model-q4_k.gguf",
            "model_alias": "gpt-4-vision-preview",
            "chat_format": "llava-1-5",
            "clip_model_path": "models/ggml_llava-v1.5-7b/mmproj-model-f16.gguf",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 2048
        },
        {
            "model": "models/mistral-7b-v0.1-GGUF/ggml-model-Q4_K.gguf",
            "model_alias": "text-davinci-003",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 2048
        },
        {
            "model": "models/replit-code-v1_5-3b-GGUF/replit-code-v1_5-3b.Q4_0.gguf",
            "model_alias": "copilot-codex",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 1024,
            "n_ctx": 9216
        }
    ]
}
```
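With a config like the one above, a client selects a model by its alias through the standard `model` parameter. A minimal sketch, assuming the server above is reachable locally on port 8080:
``` python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-xxx")

# Routed to the OpenHermes model registered under the "gpt-3.5-turbo" alias
chat = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(chat.choices[0].message.content)
```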
The config file format is defined by the [`ConfigFileSettings`](#llama_cpp.server.settings.ConfigFileSettings) class.
## Server Options Reference
::: llama_cpp.server.settings.ConfigFileSettings
options:
show_if_no_docstring: true
::: llama_cpp.server.settings.ServerSettings
options:
show_if_no_docstring: true
::: llama_cpp.server.settings.ModelSettings
options:
show_if_no_docstring: true