docs: add README section on advanced usage via classes (#113)
* Add README section on advanced usage via classes

* Update README.rst

---------

Co-authored-by: Adrien Barbaresi <[email protected]>
osma and adbar authored Apr 16, 2024
1 parent 0c1012c commit 8f66a43
Showing 1 changed file with 31 additions and 0 deletions.
31 changes: 31 additions & 0 deletions README.rst
@@ -205,6 +205,37 @@ The ``lang_detector()`` function returns a list of language codes along with sco
The ``greedy`` argument (``extensive`` in past software versions) triggers use of the greedier decomposition algorithm described above, thus extending word coverage and recall of detection at the potential cost of lower accuracy.


Advanced usage via classes
~~~~~~~~~~~~~~~~~~~~~~~~~~

*The following classes will be made available in the next version. To start using them, install the latest version from the git repository.*

The functions described above are suitable for simple usage, but instantiating Simplemma classes and calling their methods gives more control. Lemmatization is handled by the ``Lemmatizer`` class and language detection by the ``LanguageDetector`` class. These in turn rely on different lemmatization strategies, which are implementations of the ``LemmatizationStrategy`` protocol. The ``DefaultStrategy`` implementation uses a combination of different strategies, one of which is ``DictionaryLookupStrategy``. It looks up tokens in a dictionary created by a ``DictionaryFactory``.
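
The layering described above (a protocol, a dictionary lookup strategy, and a strategy that combines others) can be illustrated with a small standalone sketch. It does not import Simplemma and is not its implementation: every ``Toy*`` class is hypothetical, the dictionary contents are made up, and the single ``get_lemma(token, lang)`` method is an assumption about what such a protocol might look like.

```python
from typing import Dict, List, Optional, Protocol


class ToyStrategy(Protocol):
    """Hypothetical stand-in for a lemmatization strategy protocol."""

    def get_lemma(self, token: str, lang: str) -> Optional[str]:
        ...


class ToyLookupStrategy:
    """Looks tokens up in a per-language dictionary (toy data only)."""

    def __init__(self, dictionaries: Dict[str, Dict[str, str]]) -> None:
        self._dictionaries = dictionaries

    def get_lemma(self, token: str, lang: str) -> Optional[str]:
        # Unknown languages and unknown tokens both yield None.
        return self._dictionaries.get(lang, {}).get(token)


class ToyLowercaseStrategy:
    """Retries the wrapped lookup with a lowercased token."""

    def __init__(self, lookup: ToyStrategy) -> None:
        self._lookup = lookup

    def get_lemma(self, token: str, lang: str) -> Optional[str]:
        return self._lookup.get_lemma(token.lower(), lang)


class ToyChainStrategy:
    """Tries each strategy in order and returns the first lemma found."""

    def __init__(self, strategies: List[ToyStrategy]) -> None:
        self._strategies = strategies

    def get_lemma(self, token: str, lang: str) -> Optional[str]:
        for strategy in self._strategies:
            lemma = strategy.get_lemma(token, lang)
            if lemma is not None:
                return lemma
        return None


lookup = ToyLookupStrategy({"en": {"doughnuts": "doughnut"}})
chain = ToyChainStrategy([lookup, ToyLowercaseStrategy(lookup)])
```

Calling ``chain.get_lemma("Doughnuts", "en")`` falls through the exact lookup and succeeds on the lowercased retry. Because each strategy only depends on the shared method, strategies can be swapped or reordered without touching the code that calls them.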

For example, RAM can be conserved by limiting the number of cached language dictionaries (default: 8). To do so, create a custom ``DefaultDictionaryFactory`` with a specific ``cache_max_size`` setting, create a ``DefaultStrategy`` using that factory, and then create a ``Lemmatizer`` and/or a ``LanguageDetector`` using that strategy:

.. code-block:: python

    # import necessary classes
    >>> from simplemma import LanguageDetector, Lemmatizer
    >>> from simplemma.strategies import DefaultStrategy
    >>> from simplemma.strategies.dictionaries import DefaultDictionaryFactory

    # how many language dictionaries to keep in memory at once (max)
    >>> LANG_CACHE_SIZE = 5
    >>> dictionary_factory = DefaultDictionaryFactory(cache_max_size=LANG_CACHE_SIZE)
    >>> lemmatization_strategy = DefaultStrategy(dictionary_factory=dictionary_factory)

    # lemmatize using the above customized strategy
    >>> lemmatizer = Lemmatizer(lemmatization_strategy=lemmatization_strategy)
    >>> lemmatizer.lemmatize('doughnuts', lang='en')
    'doughnut'

    # detect languages using the above customized strategy
    >>> language_detector = LanguageDetector('la', lemmatization_strategy=lemmatization_strategy)
    >>> language_detector.proportion_in_target_languages("opera post physica posita (τὰ μετὰ τὰ φυσικά)")
    0.5

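
To make the memory trade-off behind ``cache_max_size`` more concrete, here is a standalone sketch of one way a bounded dictionary cache could behave, using least-recently-used eviction. It is an illustration only and makes no claim about how ``DefaultDictionaryFactory`` is implemented: ``ToyDictionaryFactory``, its ``loader`` argument, ``load_count`` and ``cached_languages()`` are all hypothetical names.

```python
from collections import OrderedDict
from typing import Callable, Dict, List


class ToyDictionaryFactory:
    """Hypothetical factory keeping at most ``cache_max_size`` dictionaries."""

    def __init__(
        self,
        loader: Callable[[str], Dict[str, str]],
        cache_max_size: int = 8,
    ) -> None:
        self._loader = loader
        self._cache_max_size = cache_max_size
        self._cache: "OrderedDict[str, Dict[str, str]]" = OrderedDict()
        self.load_count = 0  # number of (expensive) loads performed

    def get_dictionary(self, lang: str) -> Dict[str, str]:
        if lang in self._cache:
            # Cache hit: mark this language as most recently used.
            self._cache.move_to_end(lang)
            return self._cache[lang]
        # Cache miss: load the dictionary and store it.
        self.load_count += 1
        dictionary = self._loader(lang)
        self._cache[lang] = dictionary
        if len(self._cache) > self._cache_max_size:
            # Evict the least recently used language to bound memory use.
            self._cache.popitem(last=False)
        return dictionary

    def cached_languages(self) -> List[str]:
        return list(self._cache)


# Toy loader: a real one would read dictionary data from disk.
factory = ToyDictionaryFactory(
    loader=lambda lang: {"token": f"lemma-{lang}"},
    cache_max_size=2,
)
for lang in ["en", "de", "en", "fr"]:
    factory.get_dictionary(lang)
```

With a limit of two, requesting ``fr`` evicts ``de`` (the least recently used entry), so only ``en`` and ``fr`` remain cached. A smaller limit saves memory but causes more reloads when many languages are processed in alternation.
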
Supported languages
-------------------
