# OpenAI Compatible Server
`llama-cpp-python` offers an OpenAI API compatible web server.
This web server can be used to serve local models and easily connect them to existing clients.
## Setup
### Installation
The server can be installed by running the following command:
``` bash
pip install llama-cpp-python[server]
```
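Note that some shells (zsh, for example) treat square brackets specially, in which case the extra should be quoted:
``` bash
pip install 'llama-cpp-python[server]'
```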
### Running the server
The server can then be started by running the following command:
``` bash
python3 -m llama_cpp.server --model <model_path>
```
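Once the server is running, you can sanity-check it by listing the loaded models through the OpenAI-compatible endpoint; this sketch assumes the default host and port of `localhost:8000`:
``` bash
curl http://localhost:8000/v1/models
```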
### Server options
For a full list of options, run:
``` bash
python3 -m llama_cpp.server --help
```
NOTE: All server options are also available as environment variables. For example, `--model` can be set by setting the `MODEL` environment variable.
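For example, the following starts the server using the `MODEL` environment variable in place of the `--model` flag (the model path is a placeholder):
``` bash
MODEL=<model_path> python3 -m llama_cpp.server
```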
Check out the server options reference below for more information on the available settings.
CLI arguments and environment variables are available for all of the fields defined in [`ServerSettings`](#llama_cpp.server.settings.ServerSettings) and [`ModelSettings`](#llama_cpp.server.settings.ModelSettings).
Additionally, the server supports configuration via a JSON config file; check out the [configuration section](#configuration-and-multi-model-support) for more information and examples.
## Guides
### Code Completion
`llama-cpp-python` supports code completion via GitHub Copilot.
*NOTE*: Without GPU acceleration this is unlikely to be fast enough to be usable.
You'll first need to download one of the available code completion models in GGUF format:
- [replit-code-v1_5-GGUF](https://huggingface.co/abetlen/replit-code-v1_5-3b-GGUF)
Then you'll need to run the OpenAI compatible web server with a substantially increased context size to handle GitHub Copilot requests:
``` bash
python3 -m llama_cpp.server --model <model_path> --n_ctx 16192
```
Then just update your settings in `.vscode/settings.json` to point to your code completion server:
``` json
{
    // ...
    "github.copilot.advanced": {
        "debug.testOverrideProxyUrl": "http://<host>:<port>",
        "debug.overrideProxyUrl": "http://<host>:<port>"
    }
    // ...
}
```
### Function Calling
`llama-cpp-python` supports structured function calling based on a JSON schema.
Function calling is completely compatible with the OpenAI function calling API and can be used by connecting with the official OpenAI Python client.
You'll first need to download one of the available function calling models in GGUF format:
- [functionary](https://huggingface.co/meetkai)
Then when you run the server you'll need to also specify either `functionary-v1` or `functionary-v2` as the `chat_format`.
Note that since functionary requires a HuggingFace tokenizer (due to discrepancies between llama.cpp and HuggingFace's tokenizers, as mentioned [here](https://github.com/abetlen/llama-cpp-python/blob/main?tab=readme-ov-file#function-calling)), you will need to pass in the path to the tokenizer as well. The tokenizer files are already included in the respective HF repositories hosting the GGUF files.
``` bash
python3 -m llama_cpp.server --model <model_path_to_functionary_v2_model> --chat_format functionary-v2 --hf_pretrained_model_name_or_path <model_path_to_functionary_v2_tokenizer>
```
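As a quick sketch of what a request looks like from the official OpenAI Python client: the host, port, and the `get_weather` tool below are placeholders invented for this example.
``` python
from openai import OpenAI

client = OpenAI(base_url="http://<host>:<port>/v1", api_key="sk-xxx")

# A hypothetical tool schema, used only for illustration
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "The city name"}
                },
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="functionary",  # placeholder; matches whatever model the server is serving
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)
# The model's proposed call(s), e.g. get_weather(city="Paris")
print(response.choices[0].message.tool_calls)
```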
Check out this [example notebook](https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/Functions.ipynb) for a walkthrough of some interesting use cases for function calling.
### Multimodal Models
`llama-cpp-python` supports the llava1.5 family of multi-modal models, which allow the language model to read information from both text and images.
You'll first need to download one of the available multi-modal models in GGUF format:
- [llava-v1.5-7b](https://huggingface.co/mys/ggml_llava-v1.5-7b)
- [llava-v1.5-13b](https://huggingface.co/mys/ggml_llava-v1.5-13b)
- [bakllava-1-7b](https://huggingface.co/mys/ggml_bakllava-1)
- [llava-v1.6-34b](https://huggingface.co/cjpais/llava-v1.6-34B-gguf)
- [moondream2](https://huggingface.co/vikhyatk/moondream2)
Then when you run the server you'll need to also specify the path to the CLIP model used for image embedding and the `llava-1-5` chat format:
``` bash
python3 -m llama_cpp.server --model <model_path> --clip_model_path <clip_model_path> --chat_format llava-1-5
```
Then you can just use the OpenAI API as normal:
``` python3
from openai import OpenAI

client = OpenAI(base_url="http://<host>:<port>/v1", api_key="sk-xxx")
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "<image_url>"},
                },
                {"type": "text", "text": "What does the image say?"},
            ],
        }
    ],
)
print(response)
```
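If your image is a local file rather than a URL, a common approach is to embed it as a base64 data URI in the `image_url` field. A minimal sketch (the file path and MIME type are assumptions for the example):
``` python
import base64

def image_to_data_uri(path: str, mime: str = "image/png") -> str:
    # Read a local image and encode it as a base64 data URI
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"

# Use in place of "<image_url>" above, e.g.:
# {"type": "image_url", "image_url": {"url": image_to_data_uri("image.png")}}
```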
## Configuration and Multi-Model Support
The server supports configuration via a JSON config file that can be passed using the `--config_file` parameter or the `CONFIG_FILE` environment variable.
``` bash
python3 -m llama_cpp.server --config_file <config_file>
```
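Equivalently, using the environment variable:
``` bash
CONFIG_FILE=<config_file> python3 -m llama_cpp.server
```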
Config files support all of the server and model options supported by the CLI and environment variables; however, instead of only a single model, the config file can specify multiple models.
The server supports routing requests to multiple models based on the `model` parameter in the request, which matches against the `model_alias` in the config file.
At the moment only a single model is loaded into memory at a time; the server will automatically load and unload models as needed.
``` json
{
    "host": "0.0.0.0",
    "port": 8080,
    "models": [
        {
            "model": "models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf",
            "model_alias": "gpt-3.5-turbo",
            "chat_format": "chatml",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 2048
        },
        {
            "model": "models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf",
            "model_alias": "gpt-4",
            "chat_format": "chatml",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 2048
        },
        {
            "model": "models/ggml_llava-v1.5-7b/ggml-model-q4_k.gguf",
            "model_alias": "gpt-4-vision-preview",
            "chat_format": "llava-1-5",
            "clip_model_path": "models/ggml_llava-v1.5-7b/mmproj-model-f16.gguf",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 2048
        },
        {
            "model": "models/mistral-7b-v0.1-GGUF/ggml-model-Q4_K.gguf",
            "model_alias": "text-davinci-003",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 2048
        },
        {
            "model": "models/replit-code-v1_5-3b-GGUF/replit-code-v1_5-3b.Q4_0.gguf",
            "model_alias": "copilot-codex",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 1024,
            "n_ctx": 9216
        }
    ]
}
```
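With a config like the one above, a client selects a model by its alias through the standard `model` parameter. A minimal sketch, assuming the server above is reachable locally on port 8080:
``` python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-xxx")

# Routed to the OpenHermes model registered under the "gpt-3.5-turbo" alias
chat = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(chat.choices[0].message.content)
```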
The config file format is defined by the [`ConfigFileSettings`](#llama_cpp.server.settings.ConfigFileSettings) class.
## Server Options Reference
::: llama_cpp.server.settings.ConfigFileSettings
options:
show_if_no_docstring: true
::: llama_cpp.server.settings.ServerSettings
options:
show_if_no_docstring: true
::: llama_cpp.server.settings.ModelSettings
options:
show_if_no_docstring: true