From b881e8d50a342e1d5b86a3e3d8972a6f73d543d4 Mon Sep 17 00:00:00 2001
From: Adrien Barbaresi
Date: Wed, 17 Apr 2024 17:08:58 +0200
Subject: [PATCH 1/2] docs: add info on training data

---
 .coveragerc         |  3 ++-
 training/README.rst | 66 +++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 66 insertions(+), 3 deletions(-)

diff --git a/.coveragerc b/.coveragerc
index ef265dd..0d274c9 100644
--- a/.coveragerc
+++ b/.coveragerc
@@ -2,10 +2,11 @@
 source = simplemma
 
 omit =
+    training/*
     tests/*
     setup.py
 
 [report]
 exclude_lines =
     pragma: no cover
-    if __name__ == .__main__.:
\ No newline at end of file
+    if __name__ == .__main__.:
diff --git a/training/README.rst b/training/README.rst
index 06e194a..34e0752 100644
--- a/training/README.rst
+++ b/training/README.rst
@@ -1,5 +1,5 @@
-Instructions to run the evaluation
-----------------------------------
+Running the evaluation
+----------------------
 
 The scores are calculated on `Universal Dependencies <https://universaldependencies.org/>`_ treebanks on single word tokens (including some contractions but not merged prepositions). They can be reproduced by the following steps:
 
@@ -13,3 +13,65 @@ The scores are calculated on `Universal Dependencies
 
 For a list of potential sources see `issue 1 <https://github.com/adbar/simplemma/issues/1>`_.
+
+
+Input data
+^^^^^^^^^^
+
+- The input data has to come in tab-separated columns: first the lemma, then the word form, e.g. ``pelican TAB pelicans`` (see the sketch below).
+- Redundant and noisy cases are mostly filtered out by the input script, but it is best to check the data by hand, as errors are common in word lists and machine-generated data.
+- The data should be reviewed and tested against an authoritative source like Universal Dependencies (see above).
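+
+Such a list can be given a first plausibility check with a few lines of Python, e.g. as follows (a minimal sketch; the file name is only a placeholder):
+
+.. code-block:: python
+
+    # flag lines which do not contain exactly two non-empty columns
+    with open('lemma-list.txt', encoding='utf-8') as infh:
+        for n, line in enumerate(infh, 1):
+            columns = line.rstrip('\n').split('\t')
+            if len(columns) != 2 or not all(columns):
+                print('malformed line', n, repr(line))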
+
+
+Adding languages
+^^^^^^^^^^^^^^^^
+
+- The Simplemma approach currently works best on languages written from left to right; results will not be as good on other languages (e.g. Urdu).
+- The target language has to lend itself to lemmatization, i.e. allow for the reduction of at least two word forms to a single dictionary entry (e.g. Korean is not well suited in the current state of Simplemma).
+- The new language (two- or three-letter ISO code) has to be added to the dictionary data (using the ``dictionary_pickler`` script); it should then be available in ``SUPPORTED_LANGUAGES``.
+
+
+Example using ``kaikki.org``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Since a source has to comprise enough words without sacrificing quality, the `kaikki.org <https://kaikki.org/>`_ project is currently a good place to start. It leverages information from the Wiktionary project and is rather extensive. Its main drawbacks are a lack of coverage for less-resourced languages and errors during the processing of entries, as the Wiktionary form tables are not all alike.
+
+
+1. Find the link to all word senses for a given language, e.g. "Download JSON data for all word senses in the Lithuanian dictionary" leading to `https://kaikki.org/dictionary/Lithuanian/kaikki.org-dictionary-Lithuanian.json`.
+2. Convert the JSON file to the required tabular data by extracting word forms related to a dictionary entry.
+3. Deduplicate the entries.
+4. Control the output by skipping lines which are too short or contain unexpected characters, converting lines if they are not in the right character set, exploring the data by hand to spot inconsistencies.
+
+
+Here is an example of how the data can be extracted, the attributes may not be the same for all languages in Kaikki, hence the two different ways, ``senses`` and ``forms`` mostly corresponding to tables in the source.
+
+.. code-block:: python
+
+    import json
+
+    with open('de-wikt.txt', 'w') as outfh, open('kaikki.org-dictionary-German.json') as infh:
+        for line in infh:
+            item = json.loads(line)
+            i = 0  # counts the pairs found via 'senses'
+            # use senses: the entry is an inflected form pointing to its lemma
+            if 'senses' in item:
+                for s in item['senses']:
+                    if 'form_of' in s and item['word']:
+                        i += 1
+                        lemma = s['form_of'][0]['word']
+                        outfh.write(lemma + '\t' + item['word'] + '\n')
+                    elif 'alt_of' in s and item['word']:
+                        i += 1
+                        lemma = s['alt_of'][0]['word']
+                        outfh.write(lemma + '\t' + item['word'] + '\n')
+            # use forms: the entry is the lemma and lists its inflected forms
+            if i == 0 and 'forms' in item:
+                for f in item['forms']:
+                    if f.get('form'):
+                        lemma = item['word']
+                        outfh.write(lemma + '\t' + f['form'] + '\n')
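+
+
+A possible way to handle steps 3 and 4 above on the resulting file is sketched below (the output file name, the length threshold and the ``isalpha`` check are only examples and have to be adapted to the language at hand):
+
+.. code-block:: python
+
+    pairs = set()
+    with open('de-wikt.txt', encoding='utf-8') as infh:
+        for line in infh:
+            parts = line.rstrip('\n').split('\t')
+            # step 4: skip malformed, too short or non-alphabetic entries
+            if len(parts) != 2:
+                continue
+            lemma, form = parts
+            if len(form) > 1 and form.isalpha():
+                pairs.add((lemma, form))
+
+    # step 3: deduplication via the set, then write the cleaned list
+    with open('de-wikt-dedup.txt', 'w', encoding='utf-8') as outfh:
+        for lemma, form in sorted(pairs):
+            outfh.write(lemma + '\t' + form + '\n')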
From fe629a63f22a40c771a011906c069f41cacc251a Mon Sep 17 00:00:00 2001
From: Adrien Barbaresi
Date: Wed, 17 Apr 2024 17:18:57 +0200
Subject: [PATCH 2/2] improve syntax

---
 training/README.rst | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/training/README.rst b/training/README.rst
index 34e0752..d655e33 100644
--- a/training/README.rst
+++ b/training/README.rst
@@ -4,7 +4,7 @@ Running the evaluation
 The scores are calculated on `Universal Dependencies <https://universaldependencies.org/>`_ treebanks on single word tokens (including some contractions but not merged prepositions). They can be reproduced by the following steps:
 
 1. Install the evaluation dependencies, Python >= 3.8 required (``pip install -r training/requirements.txt``)
-2. Update ``DATA_URL`` in ``training/download-eval-data.py`` to point to the latest treebanks archive from `Universal Dependencies <https://universaldependencies.org/>` (or the version that you wish to use).
+2. Update ``DATA_URL`` in ``training/download-eval-data.py`` to point to the latest treebanks archive from `Universal Dependencies <https://universaldependencies.org/>`_ (or the version that you wish to use).
 3. Run ``python3 training/download-eval-data.py`` which will
    1. Download the archive
    2. Extract relevant data (language and if applicable specific treebank, see notes in the results table)
@@ -24,17 +24,17 @@ For a list of potential sources see `issue 1
 
 Example using ``kaikki.org``
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Since a source has to comprise enough words without sacrificing quality, the `kaikki.org <https://kaikki.org/>`_ project is currently a good place to start. It leverages information from the Wiktionary project and is rather extensive. Its main drawbacks are a lack of coverage for less-resourced languages and errors during the processing of entries, as the Wiktionary form tables are not all alike.
 
 
-1. Find the link to all word senses for a given language, e.g. "Download JSON data for all word senses in the Lithuanian dictionary" leading to `https://kaikki.org/dictionary/Lithuanian/kaikki.org-dictionary-Lithuanian.json`.
+1. Find the link to all word senses for a given language, e.g. "Download JSON data for all word senses in the Lithuanian dictionary" leading to ``https://kaikki.org/dictionary/Lithuanian/kaikki.org-dictionary-Lithuanian.json``.
 2. Convert the JSON file to the required tabular data by extracting word forms related to a dictionary entry.
 3. Deduplicate the entries.
-4. Control the output by skipping lines which are too short or contain unexpected characters, converting lines if they are not in the right character set, exploring the data by hand to spot inconsistencies.
+4. Check the output by skipping lines which are too short or contain unexpected characters, converting lines which are not in the right character set, and exploring the data by hand to spot inconsistencies.
 
 
-Here is an example of how the data can be extracted, the attributes may not be the same for all languages in Kaikki, hence the two different ways, ``senses`` and ``forms`` mostly corresponding to tables in the source.
+Here is an example of how the data can be extracted. The attributes may not be the same for all languages in Kaikki, so two different attributes are used, ``senses`` and ``forms``, which mostly correspond to tables in the Wiktionary.
 
 .. code-block:: python