Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: add info on training data #122

Merged
merged 2 commits into from
Apr 17, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion .coveragerc
Original file line number Diff line number Diff line change
Expand Up @@ -2,10 +2,11 @@
source = simplemma

omit =
training/*
tests/*
setup.py

[report]
exclude_lines =
pragma: no cover
if __name__ == .__main__.:
if __name__ == .__main__.:
68 changes: 65 additions & 3 deletions training/README.rst
Original file line number Diff line number Diff line change
@@ -1,10 +1,10 @@
Instructions to run the evaluation
----------------------------------
Running the evaluation
----------------------

The scores are calculated on `Universal Dependencies <https://universaldependencies.org/>`_ treebanks on single word tokens (including some contractions but not merged prepositions). They can be reproduced by the following steps:

1. Install the evaluation dependencies, Python >= 3.8 required (``pip install -r training/requirements.txt``)
2. Update ``DATA_URL`` in ``training/download-eval-data.py`` to point to the latest treebanks archive from `Universal Dependencies <https://universaldependencies.org/#download>` (or the version that you which to use).
2. Update ``DATA_URL`` in ``training/download-eval-data.py`` to point to the latest treebanks archive from `Universal Dependencies <https://universaldependencies.org/#download>`_ (or the version that you which to use).
3. Run ``python3 training/download-eval-data.py`` which will
1. Download the archive
2. Extract relevant data (language and if applicable specific treebank, see notes in the results table)
Expand All @@ -13,3 +13,65 @@ The scores are calculated on `Universal Dependencies <https://universaldependenc
4. Run the script, e.g. from the home directory ``python3 training/evaluate_simplemma.py``
5. Results are stored at ``training/data/results/results_summary.csv``. Also, errors are written in a CSV file for each dataset under the ``data/results``folder.



Building lemmatization dictionaries
-----------------------------------

For a list of potential sources see `issue 1 <https://github.com/adbar/simplemma/issues/1>`_.


Input data
^^^^^^^^^^

- Tab-separated columns, first lemma, then word form, e.g. ``pelican TAB pelicans``.
- Redundant and noisy cases are mostly filtered out by the input script but it is best to check the data as smaller errors in available lists or machine-generated data are common.
- The data should be tested on an authoritative source like the Universal Dependencies (see above).


Adding languages
^^^^^^^^^^^^^^^^

- The Simplemma approach currently works best on languages written from left to right, results will be impacted otherwise (e.g. Urdu).
- The target language has to be prone to lemmatization by allowing for the reduction of at least two word forms to a single dictionary entry (e.g. Korean does not fit the current scope).
- The new language (two- or three-letter ISO code) has to be added to the dictionary data (using the ``dictionary_pickler`` script), it should then be available in ``SUPPORTED_LANGUAGES``.


Example using ``kaikki.org``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Since a source has to comprise enough words without sacrificing quality, the `kaikki.org <https://kaikki.org>`_ project is currently a good place to start. It leverages information from the Wiktionary project and is rather extensive. Its main drawbacks are lack of coverage for less-resourced languages and errors during processing of entries as the Wiktionary form tables are not all alike.


1. Find the link to all word senses for a given langauge, e.g. "Download JSON data for all word senses in the Lithuanian dictionary" leading to ``https://kaikki.org/dictionary/Lithuanian/kaikki.org-dictionary-Lithuanian.json``.
2. Convert the JSON file to the required tabular data by extracting word forms related to a dictionary entry.
3. Deduplicate the entries.
4. Check the output by skipping lines which are too short or contain unexpected characters, converting lines if they are not in the right character set, exploring the data by hand to spot inconsistencies.


Here is an example of how the data can be extracted. The attributes may not be the same for all languages in Kaikki, so two different attributes are used, ``senses`` and ``forms``, which mostly correspond to tables in the Wiktionary.


.. code-block:: python

with open('de-wikt.txt', 'w') as outfh, open('kaikki.org-dictionary-German.json') as infh:
for line in infh:
item = json.loads(line)
i = 0
# use senses
if 'senses' in item:
for s in item['senses']:
if 'form_of' in s and item['word']:
i += 1
lemma = s['form_of'][0]['word']
outfh.write(lemma + '\t' + item['word'] + '\n')
elif 'alt_of' in s and item['word']:
i += 1
lemma = s['alt_of'][0]['word']
outfh.write(lemma + '\t' + item['word'] + '\n')
# use forms
if i == 0 and 'forms' in item:
for f in item['forms']:
if f['form']:
lemma = item['word']
outfh.write(lemma + '\t' + f['form'] + '\n')
Loading