updated readme
Commit 9e99770 by adbar, Oct 19, 2021 (parent f24bd9a) — README.rst, 66 additions and 53 deletions
Simplemma: a simple multilingual lemmatizer for Python
======================================================

.. image:: https://img.shields.io/pypi/pyversions/simplemma.svg
    :target: https://pypi.python.org/pypi/simplemma
    :alt: Python versions

.. image:: https://img.shields.io/travis/adbar/simplemma.svg
:target: https://travis-ci.org/adbar/simplemma
:alt: Travis build status
.. image:: https://img.shields.io/codecov/c/github/adbar/simplemma.svg
:target: https://codecov.io/gh/adbar/simplemma
:alt: Code Coverage


Purpose
-------

`Lemmatization <https://en.wikipedia.org/wiki/Lemmatisation>`_ is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. Unlike stemming, lemmatization outputs word units that are still valid linguistic forms.

In modern natural language processing (NLP), this task is often tackled indirectly by more complex systems encompassing a whole processing pipeline. However, there is no straightforward way to address lemmatization on its own in Python, although the task is useful in information retrieval and natural language processing.

*Simplemma* provides a simple and multilingual approach to look for base forms or lemmata. It may not be as powerful as full-fledged solutions but it is generic, easy to install and straightforward to use. In particular, it doesn't need morphosyntactic information and can process a raw series of tokens or even a text with its built-in (simple) tokenizer. By design it should be reasonably fast and work in a large majority of cases, without being perfect.

With its comparatively small footprint it is especially useful when speed and simplicity matter, for educational purposes or as a baseline system for lemmatization and morphological analysis.

Currently, 38 languages are partly or fully supported (see table below).
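The contrast with stemming can be illustrated with a tiny sketch in plain Python. This is a toy lookup table invented for illustration, not *simplemma*'s actual API or data:

```python
# Toy illustration of dictionary-based lemmatization versus naive suffix
# stemming. The word list below is a made-up sample for demonstration only.

LEMMA_TABLE = {
    "masks": "mask",
    "corpora": "corpus",
    "studies": "study",
    "better": "good",
}

def lemmatize(token: str) -> str:
    """Return the dictionary form if known, else the token unchanged."""
    return LEMMA_TABLE.get(token.lower(), token)

def naive_stem(token: str) -> str:
    """Crude suffix stripping: may output units that are not valid words."""
    for suffix in ("ies", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

print(lemmatize("studies"))   # "study" — a valid dictionary form
print(naive_stem("studies"))  # "stud" — not a valid word
```

Unknown tokens pass through unchanged, which is also the fallback behavior described in the caveats below.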


Installation
------------

The current library is written in pure Python with no dependencies:

``pip install simplemma``

- ``pip3`` where applicable
- ``pip install -U simplemma`` for updates


Usage
-----

…

Caveats
-------
As the focus lies on overall coverage, some short frequent words (typically pronouns) may need post-processing; this generally concerns 10 to 20 tokens per language.

Additionally, the current absence of morphosyntactic information is both an advantage in terms of simplicity and an impassable frontier with respect to lemmatization accuracy, e.g. when disambiguating between past participles and adjectives derived from verbs in Germanic and Romance languages. In such cases, ``simplemma`` often leaves the input unchanged.

The greedy algorithm rarely produces forms that are not valid. It is designed to work best in the low-frequency range, notably for compound words and neologisms. Aggressive decomposition is only useful as a general approach in the case of morphologically-rich languages. It can also act as a linguistically motivated stemmer.
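The idea behind greedy decomposition can be sketched as follows: when a compound token is unknown, look for a known tail and lemmatize that element. This is a simplified illustration with toy German-like data, not *simplemma*'s actual algorithm or word list:

```python
# Sketch of greedy compound decomposition for a morphologically rich
# language. KNOWN is a made-up inflected-form -> lemma table.

KNOWN = {
    "autos": "auto",
    "bahnen": "bahn",
    "häuser": "haus",
}

def lemmatize_greedy(token: str, min_part: int = 3) -> str:
    low = token.lower()
    if low in KNOWN:
        return KNOWN[low]
    # Greedy step: scan splits from the left, so the longest known tail
    # is found first; the compound head is kept as-is.
    for i in range(min_part, len(low) - min_part + 1):
        head, tail = low[:i], low[i:]
        if tail in KNOWN:
            return head + KNOWN[tail]
    return token

print(lemmatize_greedy("Autobahnen"))  # "autobahn"
```

Because the head is left untouched, the output stays a plausible word form, which is why this mode rarely produces invalid forms.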

Bug reports on the `issues page <https://github.com/adbar/simplemma/issues>`_ are welcome.

Supported languages
-------------------
The following languages are available using their `ISO 639-1 code <https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes>`_:


====== ================== ========== ===== =========================================================================
Available languages (2021-10-19)
--------------------------------------------------------------------------------------------------------------------
Code Language Word pairs Acc. Comments
====== ================== ========== ===== =========================================================================
``bg`` Bulgarian 73,847 low coverage
``ca`` Catalan 579,507
``cs`` Czech 34,674 low coverage
``cy`` Welsh 360,412
``da`` Danish 554,238 alternative: `lemmy <https://github.com/sorenlind/lemmy>`_
``de`` German 683,207 0.95 on UD DE-GSD, see also `German-NLP list <https://github.com/adbar/German-NLP#Lemmatization>`_
``el`` Greek 76,388 low coverage
``en`` English 136,162 0.94 on UD EN-GUM, alternative: `LemmInflect <https://github.com/bjascob/LemmInflect>`_
``es`` Spanish 720,623 0.94 on UD ES-GSD
``et`` Estonian 133,104 low coverage
``fa`` Persian 10,967 low coverage
``fi`` Finnish 2,106,359 alternatives: `voikko <https://voikko.puimula.org/python.html>`_ or `NLP list <https://blogs.helsinki.fi/language-technology/hi-nlp/morphology/>`_
``fr`` French 217,213 0.94 on UD FR-GSD
``ga`` Irish 383,448
``gd`` Gaelic 48,661
``gl`` Galician 384,183
``gv`` Manx 62,765
``hu`` Hungarian 458,847
``hy`` Armenian 323,820
``id`` Indonesian 17,419 0.91 on UD ID-CSUI
``it`` Italian 333,680 0.92 on UD IT-ISDT
``ka`` Georgian 65,936
``la`` Latin 850,283
``lb`` Luxembourgish 305,367
``lt`` Lithuanian 247,337
``lv`` Latvian 57,153
``mk`` Macedonian 57,063
``nb`` Norwegian (Bokmål) 617,940
``nl`` Dutch 254,073 0.91 on UD-NL-Alpino
``pl`` Polish 3,723,580
``pt`` Portuguese 933,730 0.92 on UD-PT-GSD
``ro`` Romanian 311,411
``ru`` Russian 607,416 alternative: `pymorphy2 <https://github.com/kmike/pymorphy2/>`_
``sk`` Slovak 846,453 0.87 on UD SK-SNK
``sl`` Slovene 97,050 low coverage
``sv`` Swedish 658,606 alternative: `lemmy <https://github.com/sorenlind/lemmy>`_
``tr`` Turkish 1,333,137 0.88 on UD-TR-Boun
``uk`` Ukrainian          190,472          alternative: `pymorphy2 <https://github.com/kmike/pymorphy2/>`_
====== ================== ========== ===== =========================================================================


A *low coverage* mention means you'd probably be better off with a language-specific library, but *simplemma* will work to a limited extent. Open-source alternatives for Python are referenced where available.

The scores are calculated on `Universal Dependencies <https://universaldependencies.org/>`_ treebanks on single word tokens (including some contractions but not merged prepositions); they describe to what extent *simplemma* can accurately map tokens to their lemma form. They can be reproduced using the script ``udscore.py`` in the ``tests/`` folder.
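The evaluation amounts to comparing a lemmatizer's output against gold (token, lemma) pairs. A minimal sketch of that computation, using made-up stand-in data rather than an actual treebank (``udscore.py`` in the repository does the real work):

```python
# Sketch of single-token accuracy scoring against gold (token, lemma) pairs.
# TABLE and gold are invented sample data, not Universal Dependencies content.

def evaluate(lemmatize, gold_pairs):
    """Return the share of tokens mapped to their gold lemma."""
    hits = sum(1 for token, lemma in gold_pairs if lemmatize(token) == lemma)
    return hits / len(gold_pairs)

TABLE = {"masks": "mask", "ran": "run"}
gold = [("masks", "mask"), ("ran", "run"), ("better", "good"), ("cat", "cat")]

score = evaluate(lambda t: TABLE.get(t, t), gold)
print(f"accuracy: {score:.2f}")  # 3 of 4 tokens correct -> 0.75
```

Note that unknown tokens returned unchanged still count as hits when the gold lemma equals the surface form, which is why dictionary coverage and accuracy are related but distinct measures.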

Expand All @@ -216,13 +228,14 @@ Credits

Software under MIT license; for the linguistic information databases see the ``licenses`` folder.

The surface lookups (non-greedy mode) use lemmatization lists taken from various sources:

- `Lemmatization lists <https://github.com/michmech/lemmatization-lists>`_ by Michal Měchura (Open Database License)
- `FreeLing project <https://github.com/TALP-UPC/FreeLing>`_
- `spaCy lookups data <https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data>`_
- Wiktionary entries parsed by the `Kaikki project <https://kaikki.org/>`_
- `Wikinflection corpus <https://github.com/lenakmeth/Wikinflection-Corpus>`_ by Eleni Metheniti (CC BY 4.0 License)
- `Unimorph Project <http://unimorph.ethz.ch/languages>`_
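Such lemmatization lists are commonly distributed as tab-separated files, one ``lemma<TAB>inflected form`` pair per line (the column order varies by source, so this layout is an assumption to check per file). A minimal parsing sketch, not *simplemma*'s actual loading code:

```python
# Sketch: build an inflected-form -> lemma lookup table from TSV input.
# Assumes "lemma<TAB>form" column order, which varies between sources.
import io

def load_pairs(fileobj):
    """Parse TSV lines into an inflected-form -> lemma mapping."""
    table = {}
    for line in fileobj:
        line = line.strip()
        if not line or "\t" not in line:
            continue  # skip blank or malformed lines
        lemma, form = line.split("\t")[:2]
        table[form] = lemma
    return table

# io.StringIO stands in for an open list file; the data is invented.
sample = io.StringIO("mask\tmasks\nrun\tran\nrun\trunning\n")
table = load_pairs(sample)
print(table["ran"])  # "run"
```

Merging several such tables per language yields the kind of surface lookup dictionary the non-greedy mode relies on.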

This rule-based approach based on flexion and lemmatization dictionaries is still in use today in popular libraries such as `spaCy <https://spacy.io/usage/adding-languages#lemmatizer>`_.
