Releases: huggingface/tokenizers
Release v0.20.1
What's Changed
The most awaited offset issue with Llama is fixed 🥳
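As a quick sanity check for offsets, each token returned by `encode` should map back to its exact character span in the input. A minimal, self-contained sketch (using a toy WordLevel vocabulary instead of the Llama tokenizer, so no model download is needed; the same `offsets` API applies to BPE tokenizers such as Llama's):

```python
from tokenizers import Tokenizer, models, pre_tokenizers

# Toy vocabulary for illustration only -- not the Llama vocab.
vocab = {"hello": 0, "world": 1, "[UNK]": 2}
tok = Tokenizer(models.WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()

enc = tok.encode("hello world")
print(enc.tokens)   # ['hello', 'world']
print(enc.offsets)  # [(0, 5), (6, 11)] -- character spans in the input
```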
- Update README.md by @ArthurZucker in #1608
- fix benchmark file link by @152334H in #1610
- Bump actions/download-artifact from 3 to 4.1.7 in /.github/workflows by @dependabot in #1626
- [`ignore_merges`] Fix offsets by @ArthurZucker in #1640
- Bump body-parser and express in /tokenizers/examples/unstable_wasm/www by @dependabot in #1629
- Bump serve-static and express in /tokenizers/examples/unstable_wasm/www by @dependabot in #1630
- Bump send and express in /tokenizers/examples/unstable_wasm/www by @dependabot in #1631
- Bump webpack from 5.76.0 to 5.95.0 in /tokenizers/examples/unstable_wasm/www by @dependabot in #1641
- Fix documentation build by @ArthurZucker in #1642
- style: simplify string formatting for readability by @hamirmahal in #1632
New Contributors
- @152334H made their first contribution in #1610
- @hamirmahal made their first contribution in #1632
Full Changelog: v0.20.0...v0.20.1
Release v0.20.0: faster encode, better python support
This release is focused on performance and user experience.
Performance:
First off, we did a bit of benchmarking and found some room for improvement. With a few minor changes (mostly #1587), here is what we get on Llama3, running on an AWS g6 instance (benchmark script: https://github.com/huggingface/tokenizers/blob/main/bindings/python/benches/test_tiktoken.py):
Python API
We shipped better deserialization errors in general, and support for `__str__` and `__repr__` on all objects. This makes debugging a lot easier:
>>> from tokenizers import Tokenizer
>>> tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
>>> print(tokenizer)
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, ...}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, "[unused2]":3, "[unused3]":4, ...}))
>>> tokenizer
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, ...}))
The `pre_tokenizers.Sequence` and `normalizers.Sequence` are also more accessible now:
from tokenizers import normalizers
norm = normalizers.Sequence([normalizers.Strip(), normalizers.BertNormalizer()])
norm[0]  # index into the sequence to inspect a normalizer
norm[1].lowercase = False  # and mutate its attributes in place
What's Changed
- remove enforcement of non special when adding tokens by @ArthurZucker in #1521
- [BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder by @Narsil in #1513
- Make `USED_PARALLELISM` atomic by @nathaniel-daniel in #1532
- Fixing for clippy 1.78 by @Narsil in #1548
- feat(ci): add trufflehog secrets detection by @McPatate in #1551
- Switch from `cached_download` to `hf_hub_download` in tests by @Wauplin in #1547
- Fix "dictionnary" typo by @nprisbrey in #1511
- make sure we don't warn on empty tokens by @ArthurZucker in #1554
- Enable `dropout = 0.0` as an equivalent to `None` in BPE by @mcognetta in #1550
- Revert "[BREAKING CHANGE] Ignore added_tokens (both special and not) … by @ArthurZucker in #1569
- Add bytelevel normalizer to fix decode when adding tokens to BPE by @ArthurZucker in #1555
- Fix clippy + feature test management. by @Narsil in #1580
- Bump spm_precompiled to 0.1.3 by @MikeIvanichev in #1571
- Add benchmark vs tiktoken by @Narsil in #1582
- Fixing the benchmark. by @Narsil in #1583
- Tiny improvement by @Narsil in #1585
- Enable fancy regex by @Narsil in #1586
- Fixing release CI strict (taken from safetensors). by @Narsil in #1593
- Adding some serialization testing around the wrapper. by @Narsil in #1594
- Add-legacy-tests by @ArthurZucker in #1597
- Adding a few tests for decoder deserialization. by @Narsil in #1598
- Better serialization error by @Narsil in #1595
- Add test normalizers by @ArthurZucker in #1600
- Improve decoder deserialization by @Narsil in #1599
- Using serde (serde_pyo3) to get str and repr easily. by @Narsil in #1588
- Merges cannot handle tokens containing spaces. by @Narsil in #909
- Fix doc about split by @ArthurZucker in #1591
- Support `None` to reset pre_tokenizers and normalizers, and index sequences by @ArthurZucker in #1590
- Fix strip python type by @ArthurZucker in #1602
- Tests + Deserialization improvement for normalizers. by @Narsil in #1604
- add deserialize for pre tokenizers by @ArthurZucker in #1603
- Perf improvement 16% by removing offsets. by @Narsil in #1587
New Contributors
- @nathaniel-daniel made their first contribution in #1532
- @nprisbrey made their first contribution in #1511
- @mcognetta made their first contribution in #1550
- @MikeIvanichev made their first contribution in #1571
Full Changelog: v0.19.1...v0.20.0rc1
v0.19.1
What's Changed
- add serialization for `ignore_merges` by @ArthurZucker in #1504
Full Changelog: v0.19.0...v0.19.1
v0.19.0
What's Changed
- chore: Remove CLI - this was originally intended for local development by @bryantbiggs in #1442
- [`remove black`] And use ruff by @ArthurZucker in #1436
- Bump ip from 2.0.0 to 2.0.1 in /bindings/node by @dependabot in #1456
- Added ability to inspect a 'Sequence' decoder and the `AddedVocabulary`. by @eaplatanios in #1443
- 🚨🚨 BREAKING CHANGE 🚨🚨: (add_prefix_space dropped everything is using prepend_scheme enum instead) Refactor metaspace by @ArthurZucker in #1476
- Add more support for tiktoken based tokenizers by @ArthurZucker in #1493
- PyO3 0.21. by @Narsil in #1494
- Remove 3.13 (potential undefined behavior.) by @Narsil in #1497
- Bumping all versions 3 times (ty transformers :) ) by @Narsil in #1498
- Fixing doc. by @Narsil in #1499
Full Changelog: v0.15.2...v0.19.0
v0.19.0rc0
Bumping 3 versions because of this: https://github.com/huggingface/transformers/blob/60dea593edd0b94ee15dc3917900b26e3acfbbee/setup.py#L177
What's Changed
- chore: Remove CLI - this was originally intended for local development by @bryantbiggs in #1442
- [`remove black`] And use ruff by @ArthurZucker in #1436
- Bump ip from 2.0.0 to 2.0.1 in /bindings/node by @dependabot in #1456
- Added ability to inspect a 'Sequence' decoder and the `AddedVocabulary`. by @eaplatanios in #1443
- 🚨🚨 BREAKING CHANGE 🚨🚨: (add_prefix_space dropped everything is using prepend_scheme enum instead) Refactor metaspace by @ArthurZucker in #1476
- Add more support for tiktoken based tokenizers by @ArthurZucker in #1493
- PyO3 0.21. by @Narsil in #1494
- Remove 3.13 (potential undefined behavior.) by @Narsil in #1497
- Bumping all versions 3 times (ty transformers :) ) by @Narsil in #1498
Full Changelog: v0.15.2...v0.19.0rc0
v0.15.2
What's Changed
Big shoutout to @rlrs for the fast Replace normalizer PR. This boosts the performance of the tokenizers:
- chore: Update dependencies to latest supported versions by @bryantbiggs in #1441
- Convert word counts to u64 by @stephenroller in #1433
- Efficient Replace normalizer by @rlrs in #1413
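The Replace normalizer sped up in #1413 substitutes a pattern in the input text before tokenization; a quick illustration (the pattern and replacement here are arbitrary):

```python
from tokenizers import normalizers

# Replace every occurrence of `` with a plain double quote.
norm = normalizers.Replace("``", '"')
print(norm.normalize_str("``quoted``"))  # "quoted"
```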
New Contributors
- @bryantbiggs made their first contribution in #1441
- @stephenroller made their first contribution in #1433
- @rlrs made their first contribution in #1413
Full Changelog: v0.15.1...v0.15.2rc1
v0.15.1
What's Changed
- update to version = "0.15.1-dev0" by @ArthurZucker in #1390
- Derive `Clone` on `Tokenizer`, add `Encoding.into_tokens()` method by @epwalsh in #1381
- Stale bot. by @Narsil in #1404
- Fix doc links in readme by @Pierrci in #1367
- Faster HF dataset iteration in docs by @mariosasko in #1414
- Add quick doc to byte_level.rs by @steventrouble in #1420
- Fix make bench. by @Narsil in #1428
- Bump follow-redirects from 1.15.1 to 1.15.4 in /tokenizers/examples/unstable_wasm/www by @dependabot in #1430
- pyo3: update to 0.20 by @mikelui in #1386
- Encode special tokens by @ArthurZucker in #1437
- Update release for python3.12 windows by @ArthurZucker in #1438
New Contributors
- @steventrouble made their first contribution in #1420
Full Changelog: v0.15.0...v0.15.1
v0.15.1.rc0
What's Changed
- pyo3: update to 0.19 by @mikelui in #1322
- Add `expect()` for disabling truncation by @boyleconnor in #1316
- Re-using scripts from safetensors. by @Narsil in #1328
- Reduce number of different revisions by 1 by @Narsil in #1329
- Python 38 arm by @Narsil in #1330
- Move to maturin mimicking move for `safetensors`. + Rewritten node bindings. by @Narsil in #1331
- Updating the docs with the new command. by @Narsil in #1333
- Update added tokens by @ArthurZucker in #1335
- update package version for dev by @ArthurZucker in #1339
- Added ability to inspect a 'Sequence' pre-tokenizer. by @eaplatanios in #1341
- Let's allow hf_hub < 1.0 by @ArthurZucker in #1344
- Fixing the progressbar. by @Narsil in #1353
- Preparing release. by @Narsil in #1355
- fix a clerical error in the comment by @tiandiweizun in #1356
- fix: remove useless token by @rtrompier in #1371
- Bump @babel/traverse from 7.22.11 to 7.23.2 in /bindings/node by @dependabot in #1370
- Allow hf_hub 0.18 by @mariosasko in #1383
- Allow `huggingface_hub<1.0` by @Wauplin in #1385
- [`pre_tokenizers`] Fix sentencepiece based Metaspace by @ArthurZucker in #1357
- update to version = "0.15.1-dev0" by @ArthurZucker in #1390
- Derive `Clone` on `Tokenizer`, add `Encoding.into_tokens()` method by @epwalsh in #1381
- Stale bot. by @Narsil in #1404
- Fix doc links in readme by @Pierrci in #1367
- Faster HF dataset iteration in docs by @mariosasko in #1414
- Add quick doc to byte_level.rs by @steventrouble in #1420
- Fix make bench. by @Narsil in #1428
- Bump follow-redirects from 1.15.1 to 1.15.4 in /tokenizers/examples/unstable_wasm/www by @dependabot in #1430
- pyo3: update to 0.20 by @mikelui in #1386
New Contributors
- @mikelui made their first contribution in #1322
- @eaplatanios made their first contribution in #1341
- @tiandiweizun made their first contribution in #1356
- @rtrompier made their first contribution in #1371
- @mariosasko made their first contribution in #1383
- @Wauplin made their first contribution in #1385
- @steventrouble made their first contribution in #1420
Full Changelog: v0.13.4.rc2...v0.15.1.rc0
v0.15.0
What's Changed
- fix a clerical error in the comment by @tiandiweizun in #1356
- fix: remove useless token by @rtrompier in #1371
- Bump @babel/traverse from 7.22.11 to 7.23.2 in /bindings/node by @dependabot in #1370
- Allow hf_hub 0.18 by @mariosasko in #1383
- Allow `huggingface_hub<1.0` by @Wauplin in #1385
- [`pre_tokenizers`] Fix sentencepiece based Metaspace by @ArthurZucker in #1357
New Contributors
- @tiandiweizun made their first contribution in #1356
- @rtrompier made their first contribution in #1371
- @mariosasko made their first contribution in #1383
- @Wauplin made their first contribution in #1385
Full Changelog: v0.14.1...v0.15.0
v0.14.1
What's Changed
- Fix conda release by @ArthurZucker in #1211
- Fix node release by @ArthurZucker in #1212
- Printing warning to stderr. by @Narsil in #1222
- Fixing padding_left sequence_ids. by @Narsil in #1233
- Use LTO for release and benchmark builds by @csko in #1157
- fix unigram.rs test_sample() by @chris-ha458 in #1244
- implement a simple max_sentencepiece_length into BPE by @chris-ha458 in #1228
- Makes `decode` and `decode_batch` work on borrowed content. by @mfuntowicz in #1251
- Update all GH Actions with dependency on actions/checkout by @mfuntowicz in #1256
- Parallelize unigram trainer by @mishig25 in #976
- Update unigram/trainer.rs by @chris-ha458 in #1257
- Fixing broken link. by @Narsil in #1268
- fix documentation regarding regex by @chris-ha458 in #1264
- Update Cargo.toml by @chris-ha458 in #1266
- Update README.md - Broken link by @sbhavani in #1272
- [doc build] Use secrets by @mishig25 in #1273
- Improve error for truncation with too high stride by @boyleconnor in #1275
- Add unigram bytefallback by @ArthurZucker in #1217
- revise type specification by @hiroshi-matsuda-rit in #1289
- Bump tough-cookie from 4.0.0 to 4.1.3 in /bindings/node by @dependabot in #1291
- Update path name: master -> main by @bact in #1292
- import Tuple from typing by @kellymarchisio in #1295
- Fixing clippy warnings on 1.71. by @Narsil in #1296
- Bump word-wrap from 1.2.3 to 1.2.4 in /bindings/node by @dependabot in #1299
- feat: Added CITATION.cff. by @SamuelLarkin in #1302
- Single warning for holes. by @Narsil in #1303
- Give error when initializing tokenizer with too high stride by @boyleconnor in #1306
- Handle when precompiled charsmap is empty by @kellymarchisio in #1308
- Derive clone for TrainerWrapper by @jonatanklosko in #1317
- CD backports by @chris-ha458 in #1318
- 0.13.4.rc1 by @Narsil in #1319
- Release all at once for simplicity. by @Narsil in #1320
- Fix stride condition. by @Narsil in #1321
- pyo3: update to 0.19 by @mikelui in #1322
- Add `expect()` for disabling truncation by @boyleconnor in #1316
- Re-using scripts from safetensors. by @Narsil in #1328
- Reduce number of different revisions by 1 by @Narsil in #1329
- Python 38 arm by @Narsil in #1330
- Move to maturin mimicking move for `safetensors`. + Rewritten node bindings. by @Narsil in #1331
- Updating the docs with the new command. by @Narsil in #1333
- Update added tokens by @ArthurZucker in #1335
- update package version for dev by @ArthurZucker in #1339
- Added ability to inspect a 'Sequence' pre-tokenizer. by @eaplatanios in #1341
- Let's allow hf_hub < 1.0 by @ArthurZucker in #1344
- Fixing the progressbar. by @Narsil in #1353
- Preparing release. by @Narsil in #1355
New Contributors
- @csko made their first contribution in #1157
- @chris-ha458 made their first contribution in #1244
- @sbhavani made their first contribution in #1272
- @boyleconnor made their first contribution in #1275
- @hiroshi-matsuda-rit made their first contribution in #1289
- @bact made their first contribution in #1292
- @kellymarchisio made their first contribution in #1295
- @SamuelLarkin made their first contribution in #1302
- @jonatanklosko made their first contribution in #1317
- @mikelui made their first contribution in #1322
- @eaplatanios made their first contribution in #1341
Full Changelog: v0.13.3...v0.14.1