updated readme
Commit 9e99770 by adbar, Oct 19, 2021 (parent f24bd9a) — README.rst, 66 additions and 53 deletions
Simplemma: a simple multilingual lemmatizer for Python
======================================================

.. image:: https://img.shields.io/pypi/pyversions/simplemma.svg
    :target: https://pypi.python.org/pypi/simplemma
    :alt: Python versions

.. image:: https://img.shields.io/travis/adbar/simplemma.svg
:target: https://travis-ci.org/adbar/simplemma
:alt: Travis build status
.. image:: https://img.shields.io/codecov/c/github/adbar/simplemma.svg
:target: https://codecov.io/gh/adbar/simplemma
:alt: Code Coverage


Purpose
-------

`Lemmatization <https://en.wikipedia.org/wiki/Lemmatisation>`_ is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. Unlike stemming, lemmatization outputs word units that are still valid linguistic forms.

In modern natural language processing (NLP), this task is often tackled indirectly by more complex systems encompassing a whole processing pipeline. However, there is no straightforward way to address lemmatization on its own in Python, although the task is useful in information retrieval and natural language processing.

*Simplemma* provides a simple and multilingual approach to look for base forms or lemmata. It may not be as powerful as full-fledged solutions but it is generic, easy to install and straightforward to use. In particular, it doesn't need morphosyntactic information and can process a raw series of tokens or even a text with its built-in (simple) tokenizer. By design it should be reasonably fast and work in a large majority of cases, without being perfect.

With its comparatively small footprint it is especially useful when speed and simplicity matter, for educational purposes or as a baseline system for lemmatization and morphological analysis.

Currently, 38 languages are partly or fully supported (see table below).
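The contrast with stemming can be illustrated with a tiny sketch in plain Python. This is a toy lookup table invented for illustration, not *simplemma*'s actual API or data:

```python
# Toy illustration of dictionary-based lemmatization versus naive suffix
# stemming. The word list below is a made-up sample for demonstration only.

LEMMA_TABLE = {
    "masks": "mask",
    "corpora": "corpus",
    "studies": "study",
    "better": "good",
}

def lemmatize(token: str) -> str:
    """Return the dictionary form if known, else the token unchanged."""
    return LEMMA_TABLE.get(token.lower(), token)

def naive_stem(token: str) -> str:
    """Crude suffix stripping: may output units that are not valid words."""
    for suffix in ("ies", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

print(lemmatize("studies"))   # "study" — a valid dictionary form
print(naive_stem("studies"))  # "stud" — not a valid word
```

Unknown tokens pass through unchanged, which is also the fallback behavior described in the caveats below.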


Installation
------------

The current library is written in pure Python with no dependencies:

``pip install simplemma``

- ``pip3`` where applicable
- ``pip install -U simplemma`` for updates


Usage
-----

…

Caveats
-------
As the focus lies on overall coverage, some short frequent words (typically pronouns) may need post-processing; this generally concerns 10 to 20 tokens per language.

Additionally, the current absence of morphosyntactic information is both an advantage in terms of simplicity and an impassable frontier with respect to lemmatization accuracy, e.g. when disambiguating between past participles and adjectives derived from verbs in Germanic and Romance languages. In such cases, ``simplemma`` often leaves the input unchanged.

The greedy algorithm rarely produces forms that are not valid. It is designed to work best in the low-frequency range, notably for compound words and neologisms. Aggressive decomposition is only useful as a general approach in the case of morphologically-rich languages. It can also act as a linguistically motivated stemmer.
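The idea behind greedy decomposition can be sketched as follows: when a compound token is unknown, look for a known tail and lemmatize that element. This is a simplified illustration with toy German-like data, not *simplemma*'s actual algorithm or word list:

```python
# Sketch of greedy compound decomposition for a morphologically rich
# language. KNOWN is a made-up inflected-form -> lemma table.

KNOWN = {
    "autos": "auto",
    "bahnen": "bahn",
    "häuser": "haus",
}

def lemmatize_greedy(token: str, min_part: int = 3) -> str:
    low = token.lower()
    if low in KNOWN:
        return KNOWN[low]
    # Greedy step: scan splits from the left, so the longest known tail
    # is found first; the compound head is kept as-is.
    for i in range(min_part, len(low) - min_part + 1):
        head, tail = low[:i], low[i:]
        if tail in KNOWN:
            return head + KNOWN[tail]
    return token

print(lemmatize_greedy("Autobahnen"))  # "autobahn"
```

Because the head is left untouched, the output stays a plausible word form, which is why this mode rarely produces invalid forms.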

Bug reports on the `issues page <https://github.com/adbar/simplemma/issues>`_ are welcome.

Supported languages
-------------------
The following languages are available using their `ISO 639-1 code <https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes>`_:


====== ================== ========== ===== =========================================================================
Available languages (2021-10-19)
--------------------------------------------------------------------------------------------------------------------
Code Language Word pairs Acc. Comments
====== ================== ========== ===== =========================================================================
``bg`` Bulgarian 73,847 low coverage
``ca`` Catalan 579,507
``cs`` Czech 34,674 low coverage
``cy`` Welsh 360,412
``da`` Danish 554,238 alternative: `lemmy <https://github.com/sorenlind/lemmy>`_
``de`` German 683,207 0.95 on UD DE-GSD, see also `German-NLP list <https://github.com/adbar/German-NLP#Lemmatization>`_
``el`` Greek 76,388 low coverage
``en`` English 136,162 0.94 on UD EN-GUM, alternative: `LemmInflect <https://github.com/bjascob/LemmInflect>`_
``es`` Spanish 720,623 0.94 on UD ES-GSD
``et`` Estonian 133,104 low coverage
``fa`` Persian 10,967 low coverage
``fi`` Finnish 2,106,359 alternatives: `voikko <https://voikko.puimula.org/python.html>`_ or `NLP list <https://blogs.helsinki.fi/language-technology/hi-nlp/morphology/>`_
``fr`` French 217,213 0.94 on UD FR-GSD
``ga`` Irish 383,448
``gd`` Gaelic 48,661
``gl`` Galician 384,183
``gv`` Manx 62,765
``hu`` Hungarian 458,847
``hy`` Armenian 323,820
``id`` Indonesian 17,419 0.91 on UD ID-CSUI
``it`` Italian 333,680 0.92 on UD IT-ISDT
``ka`` Georgian 65,936
``la`` Latin 850,283
``lb`` Luxembourgish 305,367
``lt`` Lithuanian 247,337
``lv`` Latvian 57,153
``mk`` Macedonian 57,063
``nb`` Norwegian (Bokmål) 617,940
``nl`` Dutch 254,073 0.91 on UD-NL-Alpino
``pl`` Polish 3,723,580
``pt`` Portuguese 933,730 0.92 on UD-PT-GSD
``ro`` Romanian 311,411
``ru`` Russian 607,416 alternative: `pymorphy2 <https://github.com/kmike/pymorphy2/>`_
``sk`` Slovak 846,453 0.87 on UD SK-SNK
``sl`` Slovene 97,050 low coverage
``sv`` Swedish 658,606 alternative: `lemmy <https://github.com/sorenlind/lemmy>`_
``tr`` Turkish 1,333,137 0.88 on UD-TR-Boun
``uk`` Ukrainian          190,472          alternative: `pymorphy2 <https://github.com/kmike/pymorphy2/>`_
====== ================== ========== ===== =========================================================================


A *low coverage* mention means you'd probably be better off with a language-specific library, but *simplemma* will work to a limited extent. Open-source alternatives for Python are referenced where available.

The scores are calculated on `Universal Dependencies <https://universaldependencies.org/>`_ treebanks on single word tokens (including some contractions but not merged prepositions); they describe to what extent *simplemma* can accurately map tokens to their lemma form. They can be reproduced using the script ``udscore.py`` in the ``tests/`` folder.
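The evaluation amounts to comparing a lemmatizer's output against gold (token, lemma) pairs. A minimal sketch of that computation, using made-up stand-in data rather than an actual treebank (``udscore.py`` in the repository does the real work):

```python
# Sketch of single-token accuracy scoring against gold (token, lemma) pairs.
# TABLE and gold are invented sample data, not Universal Dependencies content.

def evaluate(lemmatize, gold_pairs):
    """Return the share of tokens mapped to their gold lemma."""
    hits = sum(1 for token, lemma in gold_pairs if lemmatize(token) == lemma)
    return hits / len(gold_pairs)

TABLE = {"masks": "mask", "ran": "run"}
gold = [("masks", "mask"), ("ran", "run"), ("better", "good"), ("cat", "cat")]

score = evaluate(lambda t: TABLE.get(t, t), gold)
print(f"accuracy: {score:.2f}")  # 3 of 4 tokens correct -> 0.75
```

Note that unknown tokens returned unchanged still count as hits when the gold lemma equals the surface form, which is why dictionary coverage and accuracy are related but distinct measures.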

Expand All @@ -216,13 +228,14 @@ Credits

Software under MIT license; for the linguistic information databases see the ``licenses`` folder.

The surface lookups (non-greedy mode) use lemmatization lists taken from various sources:

- `Lemmatization lists <https://github.com/michmech/lemmatization-lists>`_ by Michal Měchura (Open Database License)
- `FreeLing project <https://github.com/TALP-UPC/FreeLing>`_
- `spaCy lookups data <https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data>`_
- Wiktionary entries parsed by the `Kaikki project <https://kaikki.org/>`_
- `Wikinflection corpus <https://github.com/lenakmeth/Wikinflection-Corpus>`_ by Eleni Metheniti (CC BY 4.0 License)
- `Unimorph Project <http://unimorph.ethz.ch/languages>`_
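Such lemmatization lists are commonly distributed as tab-separated files, one ``lemma<TAB>inflected form`` pair per line (the column order varies by source, so this layout is an assumption to check per file). A minimal parsing sketch, not *simplemma*'s actual loading code:

```python
# Sketch: build an inflected-form -> lemma lookup table from TSV input.
# Assumes "lemma<TAB>form" column order, which varies between sources.
import io

def load_pairs(fileobj):
    """Parse TSV lines into an inflected-form -> lemma mapping."""
    table = {}
    for line in fileobj:
        line = line.strip()
        if not line or "\t" not in line:
            continue  # skip blank or malformed lines
        lemma, form = line.split("\t")[:2]
        table[form] = lemma
    return table

# io.StringIO stands in for an open list file; the data is invented.
sample = io.StringIO("mask\tmasks\nrun\tran\nrun\trunning\n")
table = load_pairs(sample)
print(table["ran"])  # "run"
```

Merging several such tables per language yields the kind of surface lookup dictionary the non-greedy mode relies on.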

This rule-based approach based on flexion and lemmatization dictionaries is still in use today in popular libraries such as `spaCy <https://spacy.io/usage/adding-languages#lemmatizer>`_.
