<p align="center">
<br>
<img src="https://huggingface.co/landing/assets/tokenizers/tokenizers-logo.png" width="600"/>
<br>
</p>
<p align="center">
<a href="https://badge.fury.io/py/tokenizers">
<img alt="PyPI" src="https://badge.fury.io/py/tokenizers.svg">
</a>
<a href="https://github.com/huggingface/tokenizers/blob/master/LICENSE">
<img alt="GitHub" src="https://img.shields.io/github/license/huggingface/tokenizers.svg?color=blue">
</a>
</p>
<br>
# Tokenizers
Provides an implementation of today's most used tokenizers, with a focus on performance and
versatility.

These are bindings over the [Rust](https://github.com/huggingface/tokenizers/tree/master/tokenizers) implementation.
If you are interested in the high-level design, you can check it out there.

Otherwise, let's dive in!
## Main features:

- Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3
  most common BPE versions).
- Extremely fast (both training and tokenization), thanks to the Rust implementation. It takes
  less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for both research and production.
- Normalization comes with alignment tracking: it's always possible to get the part of the
  original sentence that corresponds to a given token.
- Does all the pre-processing: truncation, padding, and adding the special tokens your model needs.
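As a quick illustration of the alignment tracking, here is a minimal, self-contained sketch (the toy corpus, temporary file handling, and tiny `vocab_size` are our own choices, made only so the snippet runs end to end):

```python
import os
import tempfile

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Write a toy corpus to a temporary file so the sketch runs end to end
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("I can feel the magic, can you?\n" * 100)
    corpus = f.name

# Train a small BPE tokenizer on it
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train([corpus], trainer=trainer)

# Every token carries (start, end) offsets into the original sentence
sentence = "I can feel the magic, can you?"
encoded = tokenizer.encode(sentence)
for token, (start, end) in zip(encoded.tokens, encoded.offsets):
    assert sentence[start:end] == token

os.unlink(corpus)
```

The offsets always refer back to the original input, which is what makes the alignment tracking useful in practice.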
### Installation
#### With pip:
```bash
pip install tokenizers
```
#### From sources:
To use this method, you need to have Rust installed:

```bash
# Install Rust with rustup
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"
```
Once Rust is installed, you can compile the bindings by doing the following:

```bash
git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install -e .
```
### Load a pretrained tokenizer from the Hub
```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")
```
### Using the provided Tokenizers
We provide some pre-built tokenizers to cover the most common cases. You can easily load one of
these using some `vocab.json` and `merges.txt` files:

```python
from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
```
And you can train them just as simply:
```python
from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train(["./path/to/files/1.txt", "./path/to/files/2.txt"])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")
```
#### Provided Tokenizers
- `CharBPETokenizer`: The original BPE
- `ByteLevelBPETokenizer`: The byte level version of the BPE
- `SentencePieceBPETokenizer`: A BPE implementation compatible with the one used by SentencePiece
- `BertWordPieceTokenizer`: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
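For example, a `ByteLevelBPETokenizer` trains and encodes the same way as the `CharBPETokenizer` above. A minimal sketch (the temporary toy corpus and the small `vocab_size` are our own choices, made only so the snippet runs end to end):

```python
import os
import tempfile

from tokenizers import ByteLevelBPETokenizer

# Write a toy corpus to a temporary file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("I can feel the magic, can you?\n" * 100)
    corpus = f.name

# Train: the 256-symbol byte-level alphabet is always included,
# so vocab_size must leave room for it
tokenizer = ByteLevelBPETokenizer()
tokenizer.train([corpus], vocab_size=300, min_frequency=2)

encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.tokens)  # byte-level tokens; spaces appear as 'Ġ'

os.unlink(corpus)
```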
### Build your own
Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer
by putting together all the different parts you need.
You can check how we implemented the [provided tokenizers](https://github.com/huggingface/tokenizers/tree/master/bindings/python/py_src/tokenizers/implementations) and easily adapt them to your own needs.
#### Building a byte-level BPE
Here is an example showing how to build your own byte-level BPE by putting all the different pieces
together, and then saving it to a single file:
```python
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
], trainer=trainer)

# And save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)
```
Now, when you want to use this tokenizer, this is as simple as:
```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
```
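Since the decoder is configured as part of the pipeline, `decode` reconstructs readable text from the ids. A self-contained sketch (with our own toy corpus, a small `vocab_size`, and `add_prefix_space=False` so the round trip is exact):

```python
import os
import tempfile

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Write a toy corpus to a temporary file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("I can feel the magic, can you?\n" * 100)
    corpus = f.name

# Build and train a byte-level BPE pipeline
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(
    vocab_size=300,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([corpus], trainer=trainer)

# Byte-level decoding round-trips the input exactly
text = "I can feel the magic, can you?"
decoded = tokenizer.decode(tokenizer.encode(text).ids)
assert decoded == text

os.unlink(corpus)
```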