tokenizers

mirror of https://github.com/huggingface/tokenizers.git synced 2026-03-27 06:01:18 +00:00

Author	SHA1	Message	Date
Arthur	d1ac6e9654	Final doc fix (#1989 ) * update * there was an error * update	2026-03-26 17:51:09 +01:00
Arthur	22823c7803	update (#1988 )	2026-03-26 17:35:30 +01:00
Arthur	88f4f79f36	Fix doc builds (#1987 ) * fix binding test post merge * update * update * something * simple * simplify * up * update * remove * update workflow * update * add pip * fix	2026-03-26 17:25:05 +01:00
Arthur	bd19f4b997	Fix doc builds (#1986 ) * fix binding test post merge * update * update * something * simple * simplify * up * update * remove * update workflow * update	2026-03-26 16:56:50 +01:00
Arthur	44ccca5abd	Fix doc builds (#1984 ) * fix binding test post merge * update * update * something * simple * simplify * up * update * remove * update workflow	2026-03-26 16:15:26 +01:00
Arthur	a5a221069d	Fix doc builds (#1983 ) * fix binding test post merge * update * update * something * simple * simplify * up * update * remove	2026-03-26 16:06:08 +01:00
Arthur	d36499e3aa	Fix doc builds (#1982 ) * fix binding test post merge * update * update * something * simple * simplify	2026-03-26 15:40:54 +01:00
Trevor Gamblin	44a84169fd	Add riscv64 build, make Linux wheel build matrix more explicit (#1951 ) * workflows/CI: make rustc targets more explicit For Linux builds, distinguish between 'target' and 'arch', since the two are not always the same (e.g. the target for ppc64le is actually powerpc64le-unknown-linux-gnu). This allows more explicit support for other platforms when needed. Signed-off-by: Trevor Gamblin <tgamblin@baylibre.com> * workflows/CI: add riscv64 build Note that the 'target' and 'arch' values here are different - arch is riscv64, but the actual rustc target is riscv64gc-unknown-linux-gnu, hence the previous change. Signed-off-by: Trevor Gamblin <tgamblin@baylibre.com> --------- Signed-off-by: Trevor Gamblin <tgamblin@baylibre.com>	2026-03-26 10:46:45 +01:00
Shivam	e0502118e6	Fix broken source links in documentation (#1934 ) The documentation source links were pointing to `src/tokenizers/...` which doesn't exist. The Python source files are located at `bindings/python/py_src/tokenizers/...`. Add `version_tag_suffix` parameter to documentation build workflows to generate correct GitHub source links. Fixes #1910	2026-03-25 18:22:53 +01:00
Arthur	8ec1976d19	fix ci (#1978 ) * fix ci * fix stubs * nit * exclude * full fix * update * up * revert * workflow up * thius? * up * add logs I suspect its just maturin missing * marutin not installed but not needed * update * check style after running tests since I mess up the .pyi * nit?	2026-03-25 18:13:04 +01:00
Arthur	50352f73a5	Add type hint, update to pyo3 0.27, add automatic type hint generator (#1928 ) * something that is supposed to work but my env does not allow it, seems to be uv related * ? * up * nits * let' s try * part of tthe update for pyo3 0.27 * more pyo3 fixes * update * does this help? * help * finally * update stub accordingly * export more of the submodules * moooore * add individual .pypi * cleanup * update pyo3 signatures and fix warning * style * update * more updates * sytle * clippy happy * does this help? * fix * fix * ? * what? * add dwarwub case co * up? * update * clippy and fmt * this time it works * remove offending one * update * remove shit * remove more shit that was unwanted * ? * simplify a bit * more verbose? * more simplification * fmt * fix some of the typing in rust directly to please TY (but also just fix some typing.Any * fix script running * fix , ignore and exclude * style * update * fmt + add it to style? * cleanup * Simplify stub.py docstring injection - Replace complex modifications dict with simple insertions list - Remove nested process_function_or_method function - Use bottom-to-top line replacement for cleaner logic - Remove unused importlib import * isolate stub generation into separate tools/stub-gen crate - Move stub_generation.rs to tools/stub-gen/ as standalone crate - Remove stub-gen feature and pyo3-introspection from main crate - Auto-detect PYTHONHOME for uv/venv environments - Update Makefile and README with new instructions Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-11 14:26:43 +01:00
Nathan Goldbaum	6eba494a37	Add a multithreaded tokenizer test and as well as 3.14t CI (#1864 ) * Add multithreaded tokenizer test * Add 3.13t CI * update to use 3.14t * fix ty check * Run ruff --------- Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>	2026-02-11 11:42:28 +01:00
Finn Womack	d3b76e2e5d	Add windows arm64 wheel build to python release (#1907 ) * Add windows arm64 support to python release workflow * Run on fork Updated workflow to include 'arm64-runner' branch and commented out conditions. * fix typo * add arm64 python install for all versions * use python-install option * clean up fork changes * Update .github/workflows/python-release.yml * revert 3.14 addition Waiting to add in a different PR that adds all 3.14 builds at the same time --------- Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>	2026-02-11 11:30:03 +01:00
Nathan Goldbaum	995a477f25	Add Python 3.14 CI (#1925 )	2026-02-06 09:21:02 +01:00
Arthur	a2fe1cc0a9	use macos-15-intel (#1923 )	2026-01-05 13:28:34 +01:00
Finn Womack	e60793ef1c	Python release fix (#1905 ) * add interpretor install, enable workflow run in fork * add additional python versions * Refactor python version setup for x86 windows * try splitting interpreter into an array * revert to hard coded list * try using extra argument * Fix quotes * Clean up python install * revert workflow conditions	2026-01-05 11:09:46 +01:00
Arthur	8604740782	update stub for typing (#1896 ) * update stub for typing * up * add ty type checker * update stub * up * some update * add owner to stub? * update * no print * uptime funk * mm * wtf * fix * fix more * some fixses are manual but come on * up * # type: ignore[import] * reduce the scope of ty for less changes * ups * up?	2025-12-02 12:48:56 +01:00
Arthur	d6a4acc0d2	Update serialization (#1891 ) * Add benchmark for deserializing large added vocab * revert dumb stuff, isolate changes * try to only normalize once * small improvement? * some updates * nit * fmt * normalized string are a fucking waste of time when you just want to add tokens to the vocab man.... * more attempts * works * let's fucking go, parity * update * hahahhahaha * revert changes that are not actually even needed * add a python test! * use normalizer before come on * nit * update to a more concrete usecase * fix build * style * reduce sample size * --allow unmaintained * clippy happy * up * up * derive impl * revert unrelated * fmt * ignore * remove stupid file	2025-11-27 23:07:18 +01:00
Haixuan Xavier Tao	007fc767ac	Add cargo-semver-checks to Rust CI workflow (#1875 ) This adds semver validation to catch breaking changes before release. The check runs on Ubuntu during CI and compares against the published crate on crates.io. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude <noreply@anthropic.com>	2025-10-16 11:22:48 +02:00
MUGUNDAN	386f3d8267	ci: add support for building Win-ARM64 wheels (#1869 ) * ci: add support for building Win-ARM64 wheels * ci: add support for building Win-ARM64 wheels	2025-10-16 10:49:18 +02:00
Arthur	01f8bc834c	clippy (#1781 ) * clippy * fmtr * rutc? * fix onig issue * up * decode stream default * jump a release for cargo audit ... * more cliippy stuff * clippy? * proper style * fmt	2025-05-27 11:30:32 +02:00
Nicolas Patry	4383a25787	Update the release builds following 0.21.1. (#1746 ) * Update the release builds following 0.21.1. * Clippy fix.	2025-03-13 13:01:41 +01:00
Arthur	c45aebd102	🚨 Support updating template processors (#1652 ) * current updates * simplify * set_item works, but `tokenizer._tokenizer.post_processor[1].single = ["$0", "</s>"]` does not ! * fix: `normalizers` deserialization and other refactoring * fix: `pre_tokenizer` deserialization * feat: add `__len__` implementation for `normalizer::PySequence` * feat: add `__setitem__` impl for `normalizers::PySequence` * feat: add `__setitem__` impl to `pre_tokenizer::PySequence` * feat: add `__setitem__` impl to `post_processor::PySequence` * test: add normalizer sequence setter check * refactor: allow unused `processors::setter` macro * test: add `__setitem__` test for processors & pretok * refactor: `unwrap` -> `PyException::new_err()?` * refactor: fmt * refactor: remove unnecessary `pub` * feat(bindings): add missing getters & setters for pretoks * feat(bindings): add missing getters & setters for processors * refactor(bindings): rewrite RwLock poison error msg * refactor: remove debug print * feat(bindings): add description as to why custom deser is needed * feat: make post proc sequence elements mutable * fix(binding): serialization --------- Co-authored-by: Luc Georges <luc.sydney.georges@gmail.com>	2025-01-28 14:58:35 +01:00
Nicolas Patry	3a6504d274	Upgrade to PyO3 0.23 (#1708 ) * Upgrade to PyO3 0.23 * Macos-12 deprecated? * Clippy. * Clippy auto ellision.	2024-12-31 18:36:01 +01:00
Arthur Zucker	1bf2a66b80	v0.20.4-dev0	2024-11-27 10:07:49 +01:00
Nicolas Patry	f4c9fd7f40	Testing ABI3 wheels to reduce number of wheels (#1674 ) * Testing ABI3 wheels to reduce number of wheels * No need for py-clone anymore. * Upgrade python versions. * Remove those flakes. * Promoting new CI + Fixing secret.	2024-11-15 06:02:22 +01:00
Nicolas Patry	1740bff7a6	Revert "Upgrade python versions." This reverts commit `b81ec467a6`.	2024-11-06 13:18:03 +08:00
Nicolas Patry	b81ec467a6	Upgrade python versions.	2024-11-06 13:17:22 +08:00
Arthur Zucker	0f3a3f957e	update workflow	2024-11-04 18:38:32 +01:00
tinyboxvk	6c15458868	Bump actions versions (#1669 ) * Update docs-check.yml Bump actions/setup-python to v5 Bump python-version to 3.12 (default on ubuntu-latest) Switch actions-rs/toolchain to dtolnay/rust-toolchain as the former one is no longer maintained * Update node-release.yml Bump actions/setup-python to v5 Switch actions-rs/toolchain to dtolnay/rust-toolchain as the former one is no longer maintained Bump actions/cache to v4 Bump actions/setup-node to v4 Bump actions/upload-artifact to v4 Bump actions/download-artifact to v4 * Update node.yml Switch actions-rs/toolchain to dtolnay/rust-toolchain as the former one is no longer maintained Bump actions/cache to v4 Bump actions/setup-node to v4 * Update python-release-conda.yml Switch actions-rs/toolchain to dtolnay/rust-toolchain as the former one is no longer maintained Bump conda-incubator/setup-miniconda to v3 * Update python-release.yml Bump actions/setup-python to v5 Bump actions/download-artifact to v4 * Update rust-release.yml Switch actions-rs/toolchain to dtolnay/rust-toolchain as the former one is no longer maintained Bump actions/cache to v4 * Update stale.yml Bump actions/stale to v9 * Update python.yml Bump actions/setup-python to v5	2024-11-01 10:19:35 +01:00
tinyboxvk	41e0eaa561	Bump actions/checkout to v4 (#1667 ) Signed-off-by: tinyboxvk <tinyboxvk@users.noreply.github.com>	2024-10-29 14:32:07 +01:00
Arthur	3d51a1695f	Fix documentation build (#1642 ) * use v4 * fix ruff * style	2024-10-01 14:48:02 +02:00
dependabot[bot]	b4a38c4f63	Bump actions/download-artifact from 3 to 4.1.7 in /.github/workflows (#1626 ) Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 3 to 4.1.7. - [Release notes](https://github.com/actions/download-artifact/releases) - [Commits](https://github.com/actions/download-artifact/compare/v3...v4.1.7) --- updated-dependencies: - dependency-name: actions/download-artifact dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-09-30 16:38:28 +02:00
Nicolas Patry	85cc05a32f	Fix CI (#1607 )	2024-08-08 17:09:30 +02:00
Nicolas Patry	7b80359dd2	Fixing release CI strict (taken from safetensors).	2024-08-06 09:11:30 +02:00
Luc Georges	418c35c09e	feat(ci): add trufflehog secrets detection (#1551 ) * feat(ci): add trufflehog secrets detection * fix(ci): remove unnecessary permissions	2024-06-10 16:10:23 +02:00
Nicolas Patry	25aee8b88c	[BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder (#1513 ) * [BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder Causes issues with `ByteLevel` messing up some `AddedTokens` with some utf-8 range used in the bytelevel mapping. This commit tests the extend of the damage of ignoring the decoder for those tokens. * Format. * Installing cargo audit. * Minor fix. * Fixing "bug" in node/python. * Autoformat. * Clippy. * Only prefix space when there's no decoder.	2024-05-06 11:49:38 +02:00
Arthur	f2ec3b239b	remove enforcement of non special when adding tokens (#1521 ) * remove enforcement of non special when adding tokens * mut no longer needed * add a small test * nit * style * audit * ignore cargo audit's own vulnerability * update * revert * remove CVE	2024-04-30 15:53:47 +02:00
Nicolas Patry	e0defa7355	Remove 3.13 (potential undefined behavior.) (#1497 )	2024-04-16 15:56:47 +02:00
Arthur	accd0650b8	Update release for python3.12 windows (#1438 ) Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-01-19 15:56:47 +01:00
Nicolas Patry	8f9b945c75	Stale bot. (#1404 )	2023-12-05 14:11:37 +01:00
Remy	985d49ae64	fix: remove useless token (#1371 )	2023-10-19 14:29:01 +02:00
Nicolas Patry	aed491df8c	Fixing the progressbar. (#1353 ) * Fixing the progressbar. * Upgrade deps. * Update cargo audit * Ssh this action. * Fixing esaxx by using slower rust version. * Trying the new esaxx version. * Publish. * Get cache again.	2023-10-05 15:33:58 +02:00
Nicolas Patry	d2010d5165	Move to maturing mimicking move for `safetensors`. + Rewritten node bindings. (#1331 ) * Move to maturing mimicking move for `safetensors`. * Tmp. * Fix sdist. * Wat? * Clippy 1.72 * Remove if. * Conda sed. * Fix doc check workflow. * Moving to maturin AND removing http + openssl mess (smoothing transition moving to `huggingface_hub`) * Fix dep * Black. * New node bindings. * Fix docs + node cache ? * Yarn. * Working dir. * Extension module. * Put back interpreter. * Remove cache. * New attempt * Multi python. * Remove FromPretrained. * Remove traces of `fromPretrained`. * Drop 3.12 for windows? * Typo. * Put back the default feature for ignoring links during simple test. * Fix ? * x86_64 -> x64. * Remove warning for windows bindings. * Excluse aarch. * Include/exclude. * Put back workflows in correct states.	2023-08-28 16:24:14 +02:00
Nicolas Patry	f2952020d5	Python 38 arm (#1330 )	2023-08-23 16:29:16 +02:00
Nicolas Patry	6c350d88fe	Re-using scritpts from safetensors. (#1328 )	2023-08-23 15:37:38 +02:00
Nicolas Patry	b35d33f981	Release all at once for simplicity. (#1320 )	2023-08-14 13:49:45 +02:00
Chris Ha	862046ac94	CD backports (#1318 ) * CD backports follow huggingface/safetensors#317 * fix node bindings? `cargo check` doesnt work on my local configuration from `tokenizers/bindings/node/native` i don't think it will be a problem but i have difficulty telling * backport #315 * safetensors#317 back ports	2023-08-10 18:52:22 +02:00
Mishig	348ed70e58	[doc build] Use secrets (#1273 )	2023-06-09 12:58:27 +02:00
Funtowicz Morgan	a03330607b	Update all GH Actions with dependency on actions/checkout from v[1,2] to v3 to notably improve performance (retrieve only the commit being checked-out) (#1256 )	2023-05-22 14:50:00 +02:00

175 Commits