Blame: docs/content/features/quantization.md - mudler/LocalAI

mudler / LocalAI UNCLAIMED

:robot: The free, Open Source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement, running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more. Features: Generate Text, MCP, Audio, Video, Images, Voice Cloning, Distributed, P2P and decentralized inference

44528 0 0 Go

Normal View History Raw

feat(quantization): add quantization backend (#9096) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> 2026-03-22 00:56:34 +01:00			`+++`
			`disableToc = false`
			`title = "Model Quantization"`
			`weight = 19`
			`url = '/features/quantization/'`
			`+++`

			`LocalAI supports model quantization directly through the API and Web UI. Quantization converts HuggingFace models to GGUF format and compresses them to smaller sizes for efficient inference with llama.cpp.`

fix(docs): Use notice instead of alert (#9134) Signed-off-by: Richard Palethorpe <io@richiejp.com> 2026-03-25 12:55:48 +00:00			`{{% notice note %}}`
feat(quantization): add quantization backend (#9096) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> 2026-03-22 00:56:34 +01:00			`This feature is experimental and may change in future releases.`
fix(docs): Use notice instead of alert (#9134) Signed-off-by: Richard Palethorpe <io@richiejp.com> 2026-03-25 12:55:48 +00:00			`{{% /notice %}}`
feat(quantization): add quantization backend (#9096) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> 2026-03-22 00:56:34 +01:00
			`## Supported Backends`

			`\| Backend \| Description \| Quantization Types \| Platforms \|`
			`\|---------\|-------------\|-------------------\|-----------\|`
			`\| llama-cpp-quantization \| Downloads HF models, converts to GGUF, and quantizes using llama.cpp \| q2_k, q3_k_s, q3_k_m, q3_k_l, q4_0, q4_k_s, q4_k_m, q5_0, q5_k_s, q5_k_m, q6_k, q8_0, f16 \| CPU (Linux, macOS) \|`

			`## Quick Start`

			`### 1. Start a quantization job`

			```bash
			`curl -X POST http://localhost:8080/api/quantization/jobs \`
			`-H "Content-Type: application/json" \`
			`-d '{`
			`"model": "unsloth/functiongemma-270m-it",`
			`"quantization_type": "q4_k_m"`
			`}'`
			```

			`### 2. Monitor progress (SSE stream)`

			```bash
			`curl -N http://localhost:8080/api/quantization/jobs/{job_id}/progress`
			```

			`### 3. Download the quantized model`

			```bash
			`curl -o model.gguf http://localhost:8080/api/quantization/jobs/{job_id}/download`
			```

			`### 4. Or import it directly into LocalAI`

			```bash
			`curl -X POST http://localhost:8080/api/quantization/jobs/{job_id}/import \`
			`-H "Content-Type: application/json" \`
			`-d '{`
			`"name": "my-quantized-model"`
			`}'`
			```

			`## API Reference`

			`### Endpoints`

			`\| Method \| Path \| Description \|`
			`\|--------\|------\|-------------\|`
			\| `POST` \| `/api/quantization/jobs` \| Start a quantization job \|
			\| `GET` \| `/api/quantization/jobs` \| List all jobs \|
			\| `GET` \| `/api/quantization/jobs/:id` \| Get job details \|
			\| `POST` \| `/api/quantization/jobs/:id/stop` \| Stop a running job \|
			\| `DELETE` \| `/api/quantization/jobs/:id` \| Delete a job and its data \|
			\| `GET` \| `/api/quantization/jobs/:id/progress` \| SSE progress stream \|
			\| `POST` \| `/api/quantization/jobs/:id/import` \| Import quantized model into LocalAI \|
			\| `GET` \| `/api/quantization/jobs/:id/download` \| Download quantized GGUF file \|
			\| `GET` \| `/api/quantization/backends` \| List available quantization backends \|

			`### Job Request Fields`

			`\| Field \| Type \| Description \|`
			`\|-------\|------\|-------------\|`
			\| `model` \| string \| HuggingFace model ID or local path (required) \|
			\| `backend` \| string \| Backend name (default: `llama-cpp-quantization`) \|
			\| `quantization_type` \| string \| Quantization format (default: `q4_k_m`) \|
			\| `extra_options` \| map \| Backend-specific options (see below) \|

			`### Extra Options`

			`\| Key \| Description \|`
			`\|-----\|-------------\|`
			\| `hf_token` \| HuggingFace token for gated models \|

			`### Import Request Fields`

			`\| Field \| Type \| Description \|`
			`\|-------\|------\|-------------\|`
			\| `name` \| string \| Model name for LocalAI (auto-generated if empty) \|

			`### Job Status Values`

			`\| Status \| Description \|`
			`\|--------\|-------------\|`
			\| `queued` \| Job created, waiting to start \|
			\| `downloading` \| Downloading model from HuggingFace \|
			\| `converting` \| Converting model to f16 GGUF \|
			\| `quantizing` \| Running quantization \|
			\| `completed` \| Quantization finished successfully \|
			\| `failed` \| Job failed (check message for details) \|
			\| `stopped` \| Job stopped by user \|

			`### Progress Stream`

			The `GET /api/quantization/jobs/:id/progress` endpoint returns Server-Sent Events (SSE) with JSON payloads:

			```json
			`{`
			`"job_id": "abc-123",`
			`"progress_percent": 65.0,`
			`"status": "quantizing",`
			`"message": "[ 234/ 567] quantizing blk.5.attn_k.weight ...",`
			`"output_file": "",`
			`"extra_metrics": {}`
			`}`
			```

			When the job completes, `output_file` contains the path to the quantized GGUF file and `extra_metrics` includes `file_size_mb`.

			`## Quantization Types`

			`\| Type \| Size \| Quality \| Description \|`
			`\|------\|------\|---------\|-------------\|`
			\| `q2_k` \| Smallest \| Lowest \| 2-bit quantization \|
			\| `q3_k_s` \| Very small \| Low \| 3-bit small \|
			\| `q3_k_m` \| Very small \| Low \| 3-bit medium \|
			\| `q3_k_l` \| Small \| Low-medium \| 3-bit large \|
			\| `q4_0` \| Small \| Medium \| 4-bit legacy \|
			\| `q4_k_s` \| Small \| Medium \| 4-bit small \|
			\| `q4_k_m` \| Small \| Good \| 4-bit medium (recommended) \|
			\| `q5_0` \| Medium \| Good \| 5-bit legacy \|
			\| `q5_k_s` \| Medium \| Good \| 5-bit small \|
			\| `q5_k_m` \| Medium \| Very good \| 5-bit medium \|
			\| `q6_k` \| Large \| Excellent \| 6-bit \|
			\| `q8_0` \| Large \| Near-lossless \| 8-bit \|
			\| `f16` \| Largest \| Lossless \| 16-bit (no quantization, GGUF conversion only) \|

			The UI also supports entering a custom quantization type string for any format supported by `llama-quantize`.

			`## Web UI`

			`A "Quantize" page appears in the sidebar under the Tools section. The UI provides:`

			`1. Job Configuration — Select model, quantization type (dropdown with presets or custom input), backend, and HuggingFace token`
			`2. Progress Monitor — Real-time progress bar and log output via SSE`
			`3. Jobs List — View all quantization jobs with status, stop/delete actions`
			`4. Output — Download the quantized GGUF file or import it directly into LocalAI for immediate use`

			`## Architecture`

			`Quantization uses the same gRPC backend architecture as fine-tuning:`

			1. Proto layer: `QuantizationRequest`, `QuantizationProgress` (streaming), `StopQuantization`
			2. Python backend: Downloads model, runs `convert_hf_to_gguf.py` and `llama-quantize`
			`3. Go service: Manages job lifecycle, state persistence, async import`
			`4. REST API: HTTP endpoints with SSE progress streaming`
			`5. React UI: Configuration form, real-time progress monitor, download/import panel`