- Fix ModuleNotFoundError and test optional dependencies (#142)
- Simplify code and add missing type annotations (#144)
- Add a memory-efficient dictionary factory backed by MARISA-tries by @Dunedan in #133
- Drop support for Python 3.6 & 3.7 by @Dunedan in #134
- Update setup files (#138)

Extensive refactoring by @juanjoDiaz:
- Series of modular classes
- Different lemmatization strategies available
- Customization of dictionary loading and handling (`DictionaryFactory`)
- `LanguageDetector` class with extended options
- See readme and [detailed documentation](https://adbar.github.io/simplemma/)

Breaking changes:
- The `extensive` argument is now `greedy`
- The `langdetect` submodule is now `language_detector`: `from simplemma.langdetect import ...` → `from simplemma.language_detector import ...`

Fixes and improvements:
- `is_known()` function now restored to its state in v0.9.0 (full dictionary)
- More languages and better rules (with @juanjoDiaz)
- Use binary strings in dictionaries to save memory
- Dictionary sort before compression by @1over137

Documentation:
- Classes and general doc pages by @juanjoDiaz
- Section on classes in the readme by @osma
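
Code that has to run against releases on both sides of the `langdetect` → `language_detector` rename can probe for whichever submodule is present. The following is only a minimal sketch: the two module paths come from the breaking-change note above, while the helper names `find_detector_module` and `load_detector_module` are hypothetical and not part of simplemma.

```python
import importlib
import importlib.util

# Module paths taken from the changelog entry: new name first, old name as fallback.
CANDIDATES = ("simplemma.language_detector", "simplemma.langdetect")


def find_detector_module(candidates=CANDIDATES):
    """Return the first importable module name, or None (hypothetical helper)."""
    for name in candidates:
        try:
            if importlib.util.find_spec(name) is not None:
                return name
        except ImportError:
            # Raised when the parent package itself is not installed.
            continue
    return None


def load_detector_module():
    """Import and return the detected module, or None (hypothetical helper)."""
    name = find_detector_module()
    return importlib.import_module(name) if name else None
```

Probing with `importlib.util.find_spec` imports the parent package but not the submodule itself, so the check stays cheap; callers still need to handle the `None` case when simplemma is not installed.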
- smaller language data footprint with smallest possible impact on performance, using a combination of rules, upper limit on word length, and better data cleaning (#31)
- unsupervised approach to affixes activated by default for some languages
- reviewed rules for English and German (less greedy)
- added rules for Dutch, Finnish, Polish and Russian
- improved Russian and Ukrainian language data (#3)
- improved tokenizer
- smaller data files (especially for fi, la, pl, pt, sk & tr, #19)
- added support for Asturian (`ast`, #20)
- bug fixes (#18, #26)
- languages added: Albanian, Hindi, Icelandic, Malay, Middle English, Northern Sámi, Nynorsk, Serbo-Croatian, Swahili, Tagalog
- fix for slow language detection introduced in 0.7.0
- better rules for English and German
- inconsistencies fixed for cy, de, en, ga, sv (#16)
- docs: added language detection and citation info
- code fully type checked, optional pre-compilation with `mypyc`
- fixes: logging error (#11), input type (#12)
- code style: black
- breaking change: language data pre-loading now occurs internally, language codes are now provided directly in the `lemmatize()` call, e.g. `simplemma.lemmatize("test", lang="en")`
- faster lemmatization, result cache
- sentence-aware `text_lemmatizer()`
- optional iterators for tokenization and lemmatization
- improved language models
- improved tokenizer
- maintenance and code efficiency
- added basic language detection (undocumented)
- faster, more efficient code
- dropped support for Python 3.5
- new languages: Armenian, Greek, Macedonian, Norwegian (Bokmål), and Polish
- language data reviewed for: Dutch, Finnish, German, Hungarian, Latin, Russian, and Swedish
- Urdu removed from the language list due to issues with the data
- add support for Python 3.10 and drop support for Python 3.4
- improved decomposition and tokenization algorithms
- improved models and disambiguation
- improved tokenization
- extended rules for German
- Work on decomposition rules
- Reviewed language data
- Cleaner code
- Better decomposition into subwords by greedy algorithm
- First benchmarks and data-based corrections: German, French, English, Spanish
- Languages added: Danish, Dutch, Finnish, Georgian, Indonesian, Latin, Latvian, Lithuanian, Luxembourgish, Turkish, Urdu
- Improved word pair coverage
- Tokenization functions added
- Limit greediness and range of potential candidates
- First release on PyPI