Blame: README.md - microsoft/BitNet

Official inference framework for 1-bit LLMs

			`# bitnet.cpp`
			`[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)`
			`![version](https://img.shields.io/badge/version-1.0-blue)`

			`[<img src="./assets/header_model_release.png" alt="BitNet Model on Hugging Face" width="800"/>](https://huggingface.co/microsoft/BitNet-b1.58-2B-4T)`

			`Try it out via this [demo](https://demo-bitnet-h0h8hcfqeqhrf5gf.canadacentral-01.azurewebsites.net/), or build and run it on your own [CPU](https://github.com/microsoft/BitNet?tab=readme-ov-file#build-from-source) or [GPU](https://github.com/microsoft/BitNet/blob/main/gpu/README.md).`

			`bitnet.cpp is the official inference framework for 1-bit LLMs (e.g., BitNet b1.58). It offers a suite of optimized kernels that support fast and lossless inference of 1.58-bit models on CPU and GPU (NPU support is coming next).`

			The first release of bitnet.cpp supports inference on CPUs. bitnet.cpp achieves speedups of 1.37x to 5.07x on ARM CPUs, with larger models experiencing greater performance gains. Additionally, it reduces energy consumption by 55.4% to 70.0%, further boosting overall efficiency. On x86 CPUs, speedups range from 2.37x to 6.17x, with energy reductions between 71.9% and 82.2%. Furthermore, bitnet.cpp can run a 100B BitNet b1.58 model on a single CPU, achieving speeds comparable to human reading (5-7 tokens per second), significantly enhancing the potential for running LLMs on local devices. Please refer to the [technical report](https://arxiv.org/abs/2410.16144) for more details.

			`The latest optimization introduces parallel kernel implementations with configurable tiling and embedding quantization support, achieving 1.15x to 2.1x additional speedup over the original implementation across different hardware platforms and workloads. For detailed technical information, see the [optimization guide](src/README.md).`

			`<img src="./assets/performance.png" alt="performance_comparison" width="800"/>`

			`## Demo`

			`A demo of bitnet.cpp running a BitNet b1.58 3B model on Apple M2:`

			`https://github.com/user-attachments/assets/7f46b736-edec-4828-b809-4be780a3e5b1`

			`## What's New:`
			`- 01/15/2026 [BitNet CPU Inference Optimization](https://github.com/microsoft/BitNet/blob/main/src/README.md) ![NEW](https://img.shields.io/badge/NEW-red)`
			`- 05/20/2025 [BitNet Official GPU inference kernel](https://github.com/microsoft/BitNet/blob/main/gpu/README.md)`
			`- 04/14/2025 [BitNet Official 2B Parameter Model on Hugging Face](https://huggingface.co/microsoft/BitNet-b1.58-2B-4T)`
			`- 02/18/2025 [Bitnet.cpp: Efficient Edge Inference for Ternary LLMs](https://arxiv.org/abs/2502.11880)`
			`- 11/08/2024 [BitNet a4.8: 4-bit Activations for 1-bit LLMs](https://arxiv.org/abs/2411.04965)`
			`- 10/21/2024 [1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs](https://arxiv.org/abs/2410.16144)`
			`- 10/17/2024 bitnet.cpp 1.0 released.`
			`- 03/21/2024 [The-Era-of-1-bit-LLMs__Training_Tips_Code_FAQ](https://github.com/microsoft/unilm/blob/master/bitnet/The-Era-of-1-bit-LLMs__Training_Tips_Code_FAQ.pdf)`
			`- 02/27/2024 [The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits](https://arxiv.org/abs/2402.17764)`
			`- 10/17/2023 [BitNet: Scaling 1-bit Transformers for Large Language Models](https://arxiv.org/abs/2310.11453)`

			`## Acknowledgements`

			`This project is based on the [llama.cpp](https://github.com/ggerganov/llama.cpp) framework. We would like to thank all the authors for their contributions to the open-source community. Also, bitnet.cpp's kernels are built on top of the Lookup Table methodologies pioneered in [T-MAC](https://github.com/microsoft/T-MAC/). For inference of general low-bit LLMs beyond ternary models, we recommend using T-MAC.`

			`## Official Models`
			`<table>`
			`<tr>`
			`<th rowspan="2">Model</th>`
			`<th rowspan="2">Parameters</th>`
			`<th rowspan="2">CPU</th>`
			`<th colspan="3">Kernel</th>`
			`</tr>`
			`<tr>`
			`<th>I2_S</th>`
			`<th>TL1</th>`
			`<th>TL2</th>`
			`</tr>`
			`<tr>`
			`<td rowspan="2"><a href="https://huggingface.co/microsoft/BitNet-b1.58-2B-4T">BitNet-b1.58-2B-4T</a></td>`
			`<td rowspan="2">2.4B</td>`
			`<td>x86</td>`
			`<td>✅</td>`
			`<td>❌</td>`
			`<td>✅</td>`
			`</tr>`
			`<tr>`
			`<td>ARM</td>`
			`<td>✅</td>`
			`<td>✅</td>`
			`<td>❌</td>`
			`</tr>`
			`</table>`

			`## Supported Models`
			`❗️We use existing 1-bit LLMs available on [Hugging Face](https://huggingface.co/) to demonstrate the inference capabilities of bitnet.cpp. We hope the release of bitnet.cpp will inspire the development of 1-bit LLMs in large-scale settings in terms of model size and training tokens.`

			`<table>`
			`<tr>`
			`<th rowspan="2">Model</th>`
			`<th rowspan="2">Parameters</th>`
			`<th rowspan="2">CPU</th>`
			`<th colspan="3">Kernel</th>`
			`</tr>`
			`<tr>`
			`<th>I2_S</th>`
			`<th>TL1</th>`
			`<th>TL2</th>`
			`</tr>`
			`<tr>`
			`<td rowspan="2"><a href="https://huggingface.co/1bitLLM/bitnet_b1_58-large">bitnet_b1_58-large</a></td>`
			`<td rowspan="2">0.7B</td>`
			`<td>x86</td>`
			`<td>✅</td>`
			`<td>❌</td>`
			`<td>✅</td>`
			`</tr>`
			`<tr>`
			`<td>ARM</td>`
			`<td>✅</td>`
			`<td>✅</td>`
			`<td>❌</td>`
			`</tr>`
			`<tr>`
			`<td rowspan="2"><a href="https://huggingface.co/1bitLLM/bitnet_b1_58-3B">bitnet_b1_58-3B</a></td>`
			`<td rowspan="2">3.3B</td>`
			`<td>x86</td>`
			`<td>❌</td>`
			`<td>❌</td>`
			`<td>✅</td>`
			`</tr>`
			`<tr>`
			`<td>ARM</td>`
			`<td>❌</td>`
			`<td>✅</td>`
			`<td>❌</td>`
			`</tr>`
			`<tr>`
			`<td rowspan="2"><a href="https://huggingface.co/HF1BitLLM/Llama3-8B-1.58-100B-tokens">Llama3-8B-1.58-100B-tokens</a></td>`
			`<td rowspan="2">8.0B</td>`
			`<td>x86</td>`
			`<td>✅</td>`
			`<td>❌</td>`
			`<td>✅</td>`
			`</tr>`
			`<tr>`
			`<td>ARM</td>`
			`<td>✅</td>`
			`<td>✅</td>`
			`<td>❌</td>`
			`</tr>`
			`<tr>`
			`<td rowspan="2"><a href="https://huggingface.co/collections/tiiuae/falcon3-67605ae03578be86e4e87026">Falcon3 Family</a></td>`
			`<td rowspan="2">1B-10B</td>`
			`<td>x86</td>`
			`<td>✅</td>`
			`<td>❌</td>`
			`<td>✅</td>`
			`</tr>`
			`<tr>`
			`<td>ARM</td>`
			`<td>✅</td>`
			`<td>✅</td>`
			`<td>❌</td>`
			`</tr>`
			`<tr>`
			`<td rowspan="2"><a href="https://huggingface.co/collections/tiiuae/falcon-edge-series-6804fd13344d6d8a8fa71130">Falcon-E Family</a></td>`
			`<td rowspan="2">1B-3B</td>`
			`<td>x86</td>`
			`<td>✅</td>`
			`<td>❌</td>`
			`<td>✅</td>`
			`</tr>`
			`<tr>`
			`<td>ARM</td>`
			`<td>✅</td>`
			`<td>✅</td>`
			`<td>❌</td>`
			`</tr>`
			`</table>`
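
When scripting against several models, it can help to have the kernel columns of the tables above in machine-readable form. The following is only a sketch: the `SUPPORT` dictionary and the `supported_kernels` helper are hypothetical names, not part of bitnet.cpp, and the entries simply transcribe the ✅ cells from the tables.

```python
# Hypothetical lookup table transcribed from the kernel support tables above.
# Keys are (model, CPU architecture); values are the kernels marked as supported.
X86_DEFAULT = {"I2_S", "TL2"}
ARM_DEFAULT = {"I2_S", "TL1"}

SUPPORT = {
    ("BitNet-b1.58-2B-4T", "x86"): X86_DEFAULT,
    ("BitNet-b1.58-2B-4T", "ARM"): ARM_DEFAULT,
    ("bitnet_b1_58-large", "x86"): X86_DEFAULT,
    ("bitnet_b1_58-large", "ARM"): ARM_DEFAULT,
    ("bitnet_b1_58-3B", "x86"): {"TL2"},
    ("bitnet_b1_58-3B", "ARM"): {"TL1"},
    ("Llama3-8B-1.58-100B-tokens", "x86"): X86_DEFAULT,
    ("Llama3-8B-1.58-100B-tokens", "ARM"): ARM_DEFAULT,
    ("Falcon3 Family", "x86"): X86_DEFAULT,
    ("Falcon3 Family", "ARM"): ARM_DEFAULT,
    ("Falcon-E Family", "x86"): X86_DEFAULT,
    ("Falcon-E Family", "ARM"): ARM_DEFAULT,
}


def supported_kernels(model, cpu):
    """Return the sorted kernel names listed for a model/CPU pair, or [] if unknown."""
    return sorted(SUPPORT.get((model, cpu), set()))


if __name__ == "__main__":
    print(supported_kernels("bitnet_b1_58-3B", "x86"))
```

The lookup returns an empty list for combinations that do not appear in the tables, so callers can treat "unknown" and "unsupported" the same way.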



			`## Installation`

			`### Requirements`
			`- python>=3.9`
			`- cmake>=3.22`
			`- clang>=18`
			`- For Windows users, install [Visual Studio 2022](https://visualstudio.microsoft.com/downloads/). In the installer, toggle on at least the following options (this also automatically installs the required additional tools like CMake):`
			`    - Desktop-development with C++`
			`    - C++-CMake Tools for Windows`
			`    - Git for Windows`
			`    - C++-Clang Compiler for Windows`
			`    - MS-Build Support for LLVM-Toolset (clang)`
			`- For Debian/Ubuntu users, you can install clang with the [automatic installation script](https://apt.llvm.org/):`

			`bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"`
			`- conda (highly recommended)`
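
Before building, it may be worth confirming that the installed tools meet the minimum versions listed above. The script below is a rough sketch under a few assumptions: `parse_version`, `meets_minimum`, and `installed_version` are hypothetical helpers, and the script assumes `cmake --version` and `clang --version` print a dotted version number in their output.

```python
import re
import shutil
import subprocess
import sys

# Minimum versions taken from the Requirements list above.
MINIMUMS = {"cmake": (3, 22), "clang": (18,)}


def parse_version(text):
    """Return the first dotted version number in text as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(version, minimum):
    """Compare only as many components as the minimum specifies."""
    return version[:len(minimum)] >= minimum


def installed_version(tool):
    """Return the version reported by `tool --version`, or None if unavailable."""
    if shutil.which(tool) is None:
        return None
    output = subprocess.run([tool, "--version"], capture_output=True, text=True).stdout
    try:
        return parse_version(output)
    except ValueError:
        return None


if __name__ == "__main__":
    print("python", "ok" if sys.version_info >= (3, 9) else "too old")
    for tool, minimum in MINIMUMS.items():
        version = installed_version(tool)
        if version is None:
            print(tool, "not found or version unknown")
        else:
            print(tool, "ok" if meets_minimum(version, minimum) else "too old")
```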

			`### Build from source`

			`> [!IMPORTANT]`
			`> If you are using Windows, please remember to always use a Developer Command Prompt / PowerShell for VS2022 for the following commands. Please refer to the FAQs below if you see any issues.`

			`1. Clone the repo`
			```bash
			`git clone --recursive https://github.com/microsoft/BitNet.git`
			`cd BitNet`
			```
			`2. Install the dependencies`
			```bash
			`# (Recommended) Create a new conda environment`
			`conda create -n bitnet-cpp python=3.9`
			`conda activate bitnet-cpp`

			`pip install -r requirements.txt`
			```
			`3. Build the project`
			```bash
			`# Manually download the model and run with local path`
			`huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir models/BitNet-b1.58-2B-4T`
			`python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s`

			```
			`<pre>`
			`usage: setup_env.py [-h] [--hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}] [--model-dir MODEL_DIR] [--log-dir LOG_DIR] [--quant-type {i2_s,tl1}] [--quant-embd]`
			`[--use-pretuned]`

			`Setup the environment for running inference`

			`optional arguments:`
			`-h, --help show this help message and exit`
			`--hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}, -hr {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}`
			`Model used for inference`
			`--model-dir MODEL_DIR, -md MODEL_DIR`
			`Directory to save/load the model`
			`--log-dir LOG_DIR, -ld LOG_DIR`
			`Directory to save the logging info`
			`--quant-type {i2_s,tl1}, -q {i2_s,tl1}`
			`Quantization type`
			`--quant-embd Quantize the embeddings to f16`
			`--use-pretuned, -p Use the pretuned kernel parameters`
			`</pre>`
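
The setup step can also be driven from a script. The following is a minimal sketch that assembles the `setup_env.py` command line from the options listed in the help text above. `setup_env_command` is a hypothetical helper, not part of the repository, and it assumes the command is run from the repository root.

```python
import sys

# Quantization choices as listed in the help text above.
QUANT_TYPES = ("i2_s", "tl1")


def setup_env_command(model_dir, quant_type="i2_s", use_pretuned=False, quant_embd=False):
    """Build the argument list for setup_env.py (hypothetical helper)."""
    if quant_type not in QUANT_TYPES:
        raise ValueError(f"unsupported quant type {quant_type!r}; expected one of {QUANT_TYPES}")
    cmd = [sys.executable, "setup_env.py", "--model-dir", model_dir, "--quant-type", quant_type]
    if use_pretuned:
        cmd.append("--use-pretuned")
    if quant_embd:
        cmd.append("--quant-embd")
    return cmd


if __name__ == "__main__":
    # Print the command rather than running it; pass the list to subprocess.run to execute.
    print(" ".join(setup_env_command("models/BitNet-b1.58-2B-4T")))
```

Validating the quantization type before launching the script surfaces a typo immediately instead of after the environment setup has started.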
			`## Usage`
			`### Basic usage`
			```bash
			`# Run inference with the quantized model`
			`python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "You are a helpful assistant" -cnv`
			```
			`<pre>`
			`usage: run_inference.py [-h] [-m MODEL] [-n N_PREDICT] -p PROMPT [-t THREADS] [-c CTX_SIZE] [-temp TEMPERATURE] [-cnv]`

			`Run inference`

			`optional arguments:`
			`-h, --help show this help message and exit`
			`-m MODEL, --model MODEL`
			`Path to model file`
			`-n N_PREDICT, --n-predict N_PREDICT`
			`Number of tokens to predict when generating text`
			`-p PROMPT, --prompt PROMPT`
			`Prompt to generate text from`
			`-t THREADS, --threads THREADS`
			`Number of threads to use`
			`-c CTX_SIZE, --ctx-size CTX_SIZE`
			`Size of the prompt context`
			`-temp TEMPERATURE, --temperature TEMPERATURE`
			`Temperature, a hyperparameter that controls the randomness of the generated text`
			`-cnv, --conversation Whether to enable chat mode or not (for instruct models.)`
			`(When this option is turned on, the prompt specified by -p will be used as the system prompt.)`
			`</pre>`
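
To make the option list above concrete, here is an `argparse` sketch that accepts the same flags. It is an illustration only and not the actual source of `run_inference.py`: the help text does not state defaults or argument types, so the `int` and `float` types below are assumptions and no defaults are set.

```python
import argparse


def build_parser():
    """Parser mirroring the documented options; types are assumed, defaults omitted."""
    parser = argparse.ArgumentParser(description="Run inference")
    parser.add_argument("-m", "--model", help="Path to model file")
    parser.add_argument("-n", "--n-predict", type=int,
                        help="Number of tokens to predict when generating text")
    parser.add_argument("-p", "--prompt", required=True,
                        help="Prompt to generate text from")
    parser.add_argument("-t", "--threads", type=int, help="Number of threads to use")
    parser.add_argument("-c", "--ctx-size", type=int, help="Size of the prompt context")
    parser.add_argument("-temp", "--temperature", type=float,
                        help="Controls the randomness of the generated text")
    parser.add_argument("-cnv", "--conversation", action="store_true",
                        help="Enable chat mode; -p is then used as the system prompt")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args([
        "-m", "models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf",
        "-p", "You are a helpful assistant",
        "-cnv",
    ])
    print(args.model, args.conversation)
```

Only `-p` is marked as required, matching the usage line, where it is the one option shown without square brackets.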

			`### Benchmark`
			`We provide scripts to run the inference benchmark on a given model.`

			```
			`usage: e2e_benchmark.py -m MODEL [-n N_TOKEN] [-p N_PROMPT] [-t THREADS]`

			`Setup the environment for running the inference`

			`required arguments:`
			`-m MODEL, --model MODEL`
			`Path to the model file.`

			`optional arguments:`
			`-h, --help`
			`Show this help message and exit.`
			`-n N_TOKEN, --n-token N_TOKEN`
			`Number of generated tokens.`
			`-p N_PROMPT, --n-prompt N_PROMPT`
			`Prompt to generate text from.`
			`-t THREADS, --threads THREADS`
			`Number of threads to use.`
			```

			`Here's a brief explanation of each argument:`

			- `-m`, `--model`: The path to the model file. This is a required argument that must be provided when running the script.
			- `-n`, `--n-token`: The number of tokens to generate during the inference. It is an optional argument with a default value of 128.
			- `-p`, `--n-prompt`: The number of prompt tokens to use for generating text. This is an optional argument with a default value of 512.
			- `-t`, `--threads`: The number of threads to use for running the inference. It is an optional argument with a default value of 2.
			- `-h`, `--help`: Show the help message and exit. Use this argument to display usage information.

			`For example:`

			```sh
			`python utils/e2e_benchmark.py -m /path/to/model -n 200 -p 256 -t 4`
			```

			This command would run the inference benchmark using the model located at `/path/to/model`, generating 200 tokens from a 256 token prompt, utilizing 4 threads.
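
To compare thread counts, the same command can be generated once per setting. The sketch below only builds the argument lists, using the default values described above (128 generated tokens, 512 prompt tokens, 2 threads). `benchmark_command` and `thread_sweep` are hypothetical helpers, and the script path assumes the repository root as the working directory.

```python
import sys


def benchmark_command(model, n_token=128, n_prompt=512, threads=2):
    """Build the argument list for utils/e2e_benchmark.py with the documented defaults."""
    return [
        sys.executable, "utils/e2e_benchmark.py",
        "-m", model,
        "-n", str(n_token),
        "-p", str(n_prompt),
        "-t", str(threads),
    ]


def thread_sweep(model, thread_counts, **kwargs):
    """Return one benchmark command per thread count (hypothetical helper)."""
    return [benchmark_command(model, threads=t, **kwargs) for t in thread_counts]


if __name__ == "__main__":
    # Print the commands; pass each list to subprocess.run to execute it.
    for cmd in thread_sweep("/path/to/model", [1, 2, 4], n_token=200, n_prompt=256):
        print(" ".join(cmd))
```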

			`For model layouts that are not supported by any public model, we provide scripts to generate a dummy model with the given layout and run the benchmark on your machine:`

			```bash
			`python utils/generate-dummy-bitnet-model.py models/bitnet_b1_58-large --outfile models/dummy-bitnet-125m.tl1.gguf --outtype tl1 --model-size 125M`

			`# Run the benchmark with the generated model: -m specifies the model path, -p the number of prompt tokens to process, and -n the number of tokens to generate`
			`python utils/e2e_benchmark.py -m models/dummy-bitnet-125m.tl1.gguf -p 512 -n 128`
			```

			### Convert from `.safetensors` Checkpoints

			```sh
			`# Prepare the .safetensors model file`
			`huggingface-cli download microsoft/bitnet-b1.58-2B-4T-bf16 --local-dir ./models/bitnet-b1.58-2B-4T-bf16`

			`# Convert to gguf model`
			`python ./utils/convert-helper-bitnet.py ./models/bitnet-b1.58-2B-4T-bf16`
			```

			`### FAQ (Frequently Asked Questions)📌`

			`#### Q1: The build dies with errors building llama.cpp due to issues with std::chrono in log.cpp?`

			`A:`
			`This issue was introduced in a recent version of llama.cpp. Please refer to this [commit](https://github.com/tinglou/llama.cpp/commit/4e3db1e3d78cc1bcd22bcb3af54bd2a4628dd323) in the [discussion](https://github.com/abetlen/llama-cpp-python/issues/1942) to fix it.`

			`#### Q2: How to build with clang in a conda environment on Windows?`

			`A:`
			`Before building the project, verify your clang installation and access to Visual Studio tools by running:`
			```
			`clang -v`
			```

			`This command checks that you are using the correct version of clang and that the Visual Studio tools are available. If you see an error message such as:`
			```
			`'clang' is not recognized as an internal or external command, operable program or batch file.`
			```

			`It indicates that your command line window is not properly initialized for Visual Studio tools.`

			`• If you are using Command Prompt, run:`
			```
			`"C:\Program Files\Microsoft Visual Studio\2022\Professional\Common7\Tools\VsDevCmd.bat" -startdir=none -arch=x64 -host_arch=x64`
			```

			`• If you are using Windows PowerShell, run the following commands:`
			```
			`Import-Module "C:\Program Files\Microsoft Visual Studio\2022\Professional\Common7\Tools\Microsoft.VisualStudio.DevShell.dll"`
			`Enter-VsDevShell 3f0e31ad -SkipAutomaticLocation -DevCmdArguments "-arch=x64 -host_arch=x64"`
			```

			`These steps will initialize your environment and allow you to use the correct Visual Studio tools.`