<p align="center">
<br>
<img src="https://huggingface.co/landing/assets/tokenizers/tokenizers-logo.png" width="600"/>
<br>
</p>
<p align="center">
<a href="https://badge.fury.io/py/tokenizers">
<img alt="PyPI" src="https://badge.fury.io/py/tokenizers.svg">
</a>
<a href="https://github.com/huggingface/tokenizers/blob/master/LICENSE">
<img alt="GitHub" src="https://img.shields.io/github/license/huggingface/tokenizers.svg?color=blue">
</a>
</p>
<br>
# Tokenizers
Provides an implementation of today's most used tokenizers, with a focus on performance and
versatility.

These are bindings over the [Rust](https://github.com/huggingface/tokenizers/tree/master/tokenizers) implementation.
If you are interested in the high-level design, you can check it out there.

Otherwise, let's dive in!
## Main features:

- Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3
  most common BPE versions).
- Extremely fast (both training and tokenization), thanks to the Rust implementation. It takes
  less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for both research and production.
- Normalization comes with alignment tracking: it's always possible to get the part of the
  original sentence that corresponds to a given token.
- Does all the pre-processing: truncation, padding, and adding the special tokens your model needs.
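As a quick illustration of the alignment tracking, here is a minimal, self-contained sketch (the toy corpus, temporary file handling, and tiny `vocab_size` are our own choices, made only so the snippet runs end to end):

```python
import os
import tempfile

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Write a toy corpus to a temporary file so the sketch runs end to end
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("I can feel the magic, can you?\n" * 100)
    corpus = f.name

# Train a small BPE tokenizer on it
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train([corpus], trainer=trainer)

# Every token carries (start, end) offsets into the original sentence
sentence = "I can feel the magic, can you?"
encoded = tokenizer.encode(sentence)
for token, (start, end) in zip(encoded.tokens, encoded.offsets):
    assert sentence[start:end] == token

os.unlink(corpus)
```

The offsets always refer back to the original input, which is what makes the alignment tracking useful in practice.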
### Installation
#### With pip:
```bash
pip install tokenizers
```
#### From sources:
To use this method, you need to have Rust installed:

```bash
# Install Rust with rustup
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"
```
Once Rust is installed, you can compile the bindings by doing the following:

```bash
git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install -e .
```
### Load a pretrained tokenizer from the Hub
```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")
```
### Using the provided Tokenizers
We provide some pre-built tokenizers to cover the most common cases. You can easily load one of
these using some `vocab.json` and `merges.txt` files:

```python
from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
```
And you can train them just as simply:
```python
from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train(["./path/to/files/1.txt", "./path/to/files/2.txt"])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")
```
#### Provided Tokenizers
- `CharBPETokenizer`: The original BPE
- `ByteLevelBPETokenizer`: The byte level version of the BPE
- `SentencePieceBPETokenizer`: A BPE implementation compatible with the one used by SentencePiece
- `BertWordPieceTokenizer`: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
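For example, a `ByteLevelBPETokenizer` trains and encodes the same way as the `CharBPETokenizer` above. A minimal sketch (the temporary toy corpus and the small `vocab_size` are our own choices, made only so the snippet runs end to end):

```python
import os
import tempfile

from tokenizers import ByteLevelBPETokenizer

# Write a toy corpus to a temporary file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("I can feel the magic, can you?\n" * 100)
    corpus = f.name

# Train: the 256-symbol byte-level alphabet is always included,
# so vocab_size must leave room for it
tokenizer = ByteLevelBPETokenizer()
tokenizer.train([corpus], vocab_size=300, min_frequency=2)

encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.tokens)  # byte-level tokens; spaces appear as 'Ġ'

os.unlink(corpus)
```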
### Build your own
Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer
by putting together all the different parts you need.
You can check how we implemented the [provided tokenizers](https://github.com/huggingface/tokenizers/tree/master/bindings/python/py_src/tokenizers/implementations) and easily adapt them to your own needs.
#### Building a byte-level BPE
Here is an example showing how to build your own byte-level BPE by putting all the different pieces
together, and then saving it to a single file:
```python
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
], trainer=trainer)

# And save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)
```
Now, when you want to use this tokenizer, this is as simple as:
```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
```
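Since the decoder is configured as part of the pipeline, `decode` reconstructs readable text from the ids. A self-contained sketch (with our own toy corpus, a small `vocab_size`, and `add_prefix_space=False` so the round trip is exact):

```python
import os
import tempfile

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Write a toy corpus to a temporary file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("I can feel the magic, can you?\n" * 100)
    corpus = f.name

# Build and train a byte-level BPE pipeline
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(
    vocab_size=300,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([corpus], trainer=trainer)

# Byte-level decoding round-trips the input exactly
text = "I can feel the magic, can you?"
decoded = tokenizer.decode(tokenizer.encode(text).ids)
assert decoded == text

os.unlink(corpus)
```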