Releases: huggingface/tokenizers
Release v0.20.1
What's Changed
The most awaited offset issue with Llama is fixed 🥳
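As a quick sanity check for offsets, each token returned by `encode` should map back to its exact character span in the input. A minimal, self-contained sketch (using a toy WordLevel vocabulary instead of the Llama tokenizer, so no model download is needed; the same `offsets` API applies to BPE tokenizers such as Llama's):

```python
from tokenizers import Tokenizer, models, pre_tokenizers

# Toy vocabulary for illustration only -- not the Llama vocab.
vocab = {"hello": 0, "world": 1, "[UNK]": 2}
tok = Tokenizer(models.WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()

enc = tok.encode("hello world")
print(enc.tokens)   # ['hello', 'world']
print(enc.offsets)  # [(0, 5), (6, 11)] -- character spans in the input
```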
- Update README.md by @ArthurZucker in #1608
- fix benchmark file link by @152334H in #1610
- Bump actions/download-artifact from 3 to 4.1.7 in /.github/workflows by @dependabot in #1626
- [`ignore_merges`] Fix offsets by @ArthurZucker in #1640
- Bump body-parser and express in /tokenizers/examples/unstable_wasm/www by @dependabot in #1629
- Bump serve-static and express in /tokenizers/examples/unstable_wasm/www by @dependabot in #1630
- Bump send and express in /tokenizers/examples/unstable_wasm/www by @dependabot in #1631
- Bump webpack from 5.76.0 to 5.95.0 in /tokenizers/examples/unstable_wasm/www by @dependabot in #1641
- Fix documentation build by @ArthurZucker in #1642
- style: simplify string formatting for readability by @hamirmahal in #1632
New Contributors
- @152334H made their first contribution in #1610
- @hamirmahal made their first contribution in #1632
Full Changelog: v0.20.0...v0.20.1
Release v0.20.0: faster encode, better python support
This release is focused on performance and user experience.
Performance:
First off, we did a bit of benchmarking and found some room for improvement. With a few minor changes (mostly #1587), here is what we get on Llama3, running on an AWS g6 instance (benchmark script: https://github.com/huggingface/tokenizers/blob/main/bindings/python/benches/test_tiktoken.py):
Python API
We shipped better deserialization errors in general, and support for `__str__` and `__repr__` on all objects. This makes debugging a lot easier:
>>> from tokenizers import Tokenizer
>>> tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
>>> print(tokenizer)
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, ...}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, "[unused2]":3, "[unused3]":4, ...}))
>>> tokenizer
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, ...}))
The `pre_tokenizers.Sequence` and `normalizers.Sequence` are also more accessible now:
from tokenizers import normalizers
norm = normalizers.Sequence([normalizers.Strip(), normalizers.BertNormalizer()])
norm[0]  # index into the sequence to inspect a normalizer
norm[1].lowercase = False  # and mutate its attributes in place
What's Changed
- remove enforcement of non special when adding tokens by @ArthurZucker in #1521
- [BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder by @Narsil in #1513
- Make `USED_PARALLELISM` atomic by @nathaniel-daniel in #1532
- Fixing for clippy 1.78 by @Narsil in #1548
- feat(ci): add trufflehog secrets detection by @McPatate in #1551
- Switch from `cached_download` to `hf_hub_download` in tests by @Wauplin in #1547
- Fix "dictionnary" typo by @nprisbrey in #1511
- make sure we don't warn on empty tokens by @ArthurZucker in #1554
- Enable `dropout = 0.0` as an equivalent to `None` in BPE by @mcognetta in #1550
- Revert "[BREAKING CHANGE] Ignore added_tokens (both special and not) … by @ArthurZucker in #1569
- Add bytelevel normalizer to fix decode when adding tokens to BPE by @ArthurZucker in #1555
- Fix clippy + feature test management. by @Narsil in #1580
- Bump spm_precompiled to 0.1.3 by @MikeIvanichev in #1571
- Add benchmark vs tiktoken by @Narsil in #1582
- Fixing the benchmark. by @Narsil in #1583
- Tiny improvement by @Narsil in #1585
- Enable fancy regex by @Narsil in #1586
- Fixing release CI strict (taken from safetensors). by @Narsil in #1593
- Adding some serialization testing around the wrapper. by @Narsil in #1594
- Add-legacy-tests by @ArthurZucker in #1597
- Adding a few tests for decoder deserialization. by @Narsil in #1598
- Better serialization error by @Narsil in #1595
- Add test normalizers by @ArthurZucker in #1600
- Improve decoder deserialization by @Narsil in #1599
- Using serde (serde_pyo3) to get str and repr easily. by @Narsil in #1588
- Merges cannot handle tokens containing spaces. by @Narsil in #909
- Fix doc about split by @ArthurZucker in #1591
- Support `None` to reset pre_tokenizers and normalizers, and index sequences by @ArthurZucker in #1590
- Fix strip python type by @ArthurZucker in #1602
- Tests + Deserialization improvement for normalizers. by @Narsil in #1604
- add deserialize for pre tokenizers by @ArthurZucker in #1603
- Perf improvement 16% by removing offsets. by @Narsil in #1587
New Contributors
- @nathaniel-daniel made their first contribution in #1532
- @nprisbrey made their first contribution in #1511
- @mcognetta made their first contribution in #1550
- @MikeIvanichev made their first contribution in #1571
Full Changelog: v0.19.1...v0.20.0rc1
v0.19.1
What's Changed
- add serialization for `ignore_merges` by @ArthurZucker in #1504
Full Changelog: v0.19.0...v0.19.1
v0.19.0
What's Changed
- chore: Remove CLI - this was originally intended for local development by @bryantbiggs in #1442
- [`remove black`] And use ruff by @ArthurZucker in #1436
- Bump ip from 2.0.0 to 2.0.1 in /bindings/node by @dependabot in #1456
- Added ability to inspect a 'Sequence' decoder and the `AddedVocabulary`. by @eaplatanios in #1443
- 🚨🚨 BREAKING CHANGE 🚨🚨: (add_prefix_space dropped everything is using prepend_scheme enum instead) Refactor metaspace by @ArthurZucker in #1476
- Add more support for tiktoken based tokenizers by @ArthurZucker in #1493
- PyO3 0.21. by @Narsil in #1494
- Remove 3.13 (potential undefined behavior.) by @Narsil in #1497
- Bumping all versions 3 times (ty transformers :) ) by @Narsil in #1498
- Fixing doc. by @Narsil in #1499
Full Changelog: v0.15.2...v0.19.0
v0.19.0rc0
Bumping 3 versions because of this: https://github.com/huggingface/transformers/blob/60dea593edd0b94ee15dc3917900b26e3acfbbee/setup.py#L177
What's Changed
- chore: Remove CLI - this was originally intended for local development by @bryantbiggs in #1442
- [`remove black`] And use ruff by @ArthurZucker in #1436
- Bump ip from 2.0.0 to 2.0.1 in /bindings/node by @dependabot in #1456
- Added ability to inspect a 'Sequence' decoder and the `AddedVocabulary`. by @eaplatanios in #1443
- 🚨🚨 BREAKING CHANGE 🚨🚨: (add_prefix_space dropped everything is using prepend_scheme enum instead) Refactor metaspace by @ArthurZucker in #1476
- Add more support for tiktoken based tokenizers by @ArthurZucker in #1493
- PyO3 0.21. by @Narsil in #1494
- Remove 3.13 (potential undefined behavior.) by @Narsil in #1497
- Bumping all versions 3 times (ty transformers :) ) by @Narsil in #1498
Full Changelog: v0.15.2...v0.19.0rc0
v0.15.2
What's Changed
Big shoutout to @rlrs for the fast Replace normalizer PR. This boosts the performance of the tokenizers:
- chore: Update dependencies to latest supported versions by @bryantbiggs in #1441
- Convert word counts to u64 by @stephenroller in #1433
- Efficient Replace normalizer by @rlrs in #1413
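The Replace normalizer sped up in #1413 substitutes a pattern in the input text before tokenization; a quick illustration (the pattern and replacement here are arbitrary):

```python
from tokenizers import normalizers

# Replace every occurrence of `` with a plain double quote.
norm = normalizers.Replace("``", '"')
print(norm.normalize_str("``quoted``"))  # "quoted"
```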
New Contributors
- @bryantbiggs made their first contribution in #1441
- @stephenroller made their first contribution in #1433
- @rlrs made their first contribution in #1413
Full Changelog: v0.15.1...v0.15.2rc1
v0.15.1
What's Changed
- update to version = "0.15.1-dev0" by @ArthurZucker in #1390
- Derive `Clone` on `Tokenizer`, add `Encoding.into_tokens()` method by @epwalsh in #1381
- Stale bot. by @Narsil in #1404
- Fix doc links in readme by @Pierrci in #1367
- Faster HF dataset iteration in docs by @mariosasko in #1414
- Add quick doc to byte_level.rs by @steventrouble in #1420
- Fix make bench. by @Narsil in #1428
- Bump follow-redirects from 1.15.1 to 1.15.4 in /tokenizers/examples/unstable_wasm/www by @dependabot in #1430
- pyo3: update to 0.20 by @mikelui in #1386
- Encode special tokens by @ArthurZucker in #1437
- Update release for python3.12 windows by @ArthurZucker in #1438
New Contributors
- @steventrouble made their first contribution in #1420
Full Changelog: v0.15.0...v0.15.1
v0.15.1.rc0
What's Changed
- pyo3: update to 0.19 by @mikelui in #1322
- Add `expect()` for disabling truncation by @boyleconnor in #1316
- Re-using scripts from safetensors. by @Narsil in #1328
- Reduce number of different revisions by 1 by @Narsil in #1329
- Python 38 arm by @Narsil in #1330
- Move to maturin mimicking move for `safetensors`. + Rewritten node bindings. by @Narsil in #1331
- Updating the docs with the new command. by @Narsil in #1333
- Update added tokens by @ArthurZucker in #1335
- update package version for dev by @ArthurZucker in #1339
- Added ability to inspect a 'Sequence' pre-tokenizer. by @eaplatanios in #1341
- Let's allow hf_hub < 1.0 by @ArthurZucker in #1344
- Fixing the progressbar. by @Narsil in #1353
- Preparing release. by @Narsil in #1355
- fix a clerical error in the comment by @tiandiweizun in #1356
- fix: remove useless token by @rtrompier in #1371
- Bump @babel/traverse from 7.22.11 to 7.23.2 in /bindings/node by @dependabot in #1370
- Allow hf_hub 0.18 by @mariosasko in #1383
- Allow `huggingface_hub<1.0` by @Wauplin in #1385
- [`pre_tokenizers`] Fix sentencepiece based Metaspace by @ArthurZucker in #1357
- update to version = "0.15.1-dev0" by @ArthurZucker in #1390
- Derive `Clone` on `Tokenizer`, add `Encoding.into_tokens()` method by @epwalsh in #1381
- Stale bot. by @Narsil in #1404
- Fix doc links in readme by @Pierrci in #1367
- Faster HF dataset iteration in docs by @mariosasko in #1414
- Add quick doc to byte_level.rs by @steventrouble in #1420
- Fix make bench. by @Narsil in #1428
- Bump follow-redirects from 1.15.1 to 1.15.4 in /tokenizers/examples/unstable_wasm/www by @dependabot in #1430
- pyo3: update to 0.20 by @mikelui in #1386
New Contributors
- @mikelui made their first contribution in #1322
- @eaplatanios made their first contribution in #1341
- @tiandiweizun made their first contribution in #1356
- @rtrompier made their first contribution in #1371
- @mariosasko made their first contribution in #1383
- @Wauplin made their first contribution in #1385
- @steventrouble made their first contribution in #1420
Full Changelog: v0.13.4.rc2...v0.15.1.rc0
v0.15.0
What's Changed
- fix a clerical error in the comment by @tiandiweizun in #1356
- fix: remove useless token by @rtrompier in #1371
- Bump @babel/traverse from 7.22.11 to 7.23.2 in /bindings/node by @dependabot in #1370
- Allow hf_hub 0.18 by @mariosasko in #1383
- Allow `huggingface_hub<1.0` by @Wauplin in #1385
- [`pre_tokenizers`] Fix sentencepiece based Metaspace by @ArthurZucker in #1357
New Contributors
- @tiandiweizun made their first contribution in #1356
- @rtrompier made their first contribution in #1371
- @mariosasko made their first contribution in #1383
- @Wauplin made their first contribution in #1385
Full Changelog: v0.14.1...v0.15.0
v0.14.1
What's Changed
- Fix conda release by @ArthurZucker in #1211
- Fix node release by @ArthurZucker in #1212
- Printing warning to stderr. by @Narsil in #1222
- Fixing padding_left sequence_ids. by @Narsil in #1233
- Use LTO for release and benchmark builds by @csko in #1157
- fix unigram.rs test_sample() by @chris-ha458 in #1244
- implement a simple max_sentencepiece_length into BPE by @chris-ha458 in #1228
- Makes `decode` and `decode_batch` work on borrowed content. by @mfuntowicz in #1251
- Update all GH Actions with dependency on actions/checkout by @mfuntowicz in #1256
- Parallelize unigram trainer by @mishig25 in #976
- Update unigram/trainer.rs by @chris-ha458 in #1257
- Fixing broken link. by @Narsil in #1268
- fix documentation regarding regex by @chris-ha458 in #1264
- Update Cargo.toml by @chris-ha458 in #1266
- Update README.md - Broken link by @sbhavani in #1272
- [doc build] Use secrets by @mishig25 in #1273
- Improve error for truncation with too high stride by @boyleconnor in #1275
- Add unigram bytefallback by @ArthurZucker in #1217
- revise type specification by @hiroshi-matsuda-rit in #1289
- Bump tough-cookie from 4.0.0 to 4.1.3 in /bindings/node by @dependabot in #1291
- Update path name: master -> main by @bact in #1292
- import Tuple from typing by @kellymarchisio in #1295
- Fixing clippy warnings on 1.71. by @Narsil in #1296
- Bump word-wrap from 1.2.3 to 1.2.4 in /bindings/node by @dependabot in #1299
- feat: Added CITATION.cff. by @SamuelLarkin in #1302
- Single warning for holes. by @Narsil in #1303
- Give error when initializing tokenizer with too high stride by @boyleconnor in #1306
- Handle when precompiled charsmap is empty by @kellymarchisio in #1308
- Derive clone for TrainerWrapper by @jonatanklosko in #1317
- CD backports by @chris-ha458 in #1318
- 0.13.4.rc1 by @Narsil in #1319
- Release all at once for simplicity. by @Narsil in #1320
- Fix stride condition. by @Narsil in #1321
- pyo3: update to 0.19 by @mikelui in #1322
- Add `expect()` for disabling truncation by @boyleconnor in #1316
- Re-using scripts from safetensors. by @Narsil in #1328
- Reduce number of different revisions by 1 by @Narsil in #1329
- Python 38 arm by @Narsil in #1330
- Move to maturin mimicking move for `safetensors`. + Rewritten node bindings. by @Narsil in #1331
- Updating the docs with the new command. by @Narsil in #1333
- Update added tokens by @ArthurZucker in #1335
- update package version for dev by @ArthurZucker in #1339
- Added ability to inspect a 'Sequence' pre-tokenizer. by @eaplatanios in #1341
- Let's allow hf_hub < 1.0 by @ArthurZucker in #1344
- Fixing the progressbar. by @Narsil in #1353
- Preparing release. by @Narsil in #1355
New Contributors
- @csko made their first contribution in #1157
- @chris-ha458 made their first contribution in #1244
- @sbhavani made their first contribution in #1272
- @boyleconnor made their first contribution in #1275
- @hiroshi-matsuda-rit made their first contribution in #1289
- @bact made their first contribution in #1292
- @kellymarchisio made their first contribution in #1295
- @SamuelLarkin made their first contribution in #1302
- @jonatanklosko made their first contribution in #1317
- @mikelui made their first contribution in #1322
- @eaplatanios made their first contribution in #1341
Full Changelog: v0.13.3...v0.14.1