Make some older libraries optional #73

Merged (7 commits, Jun 26, 2024)
6 changes: 4 additions & 2 deletions .github/workflows/ci.yml
@@ -32,10 +32,12 @@ jobs:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m ensurepip --upgrade
python -m pip install --upgrade setuptools
python -m pip install --upgrade pip
python -m pip install flake8 pytest wheel
pip install -r ${{ matrix.requirements-file }}
python setup.py install
python -m pip install --no-cache-dir -r ${{ matrix.requirements-file }}
python -m pip install .
- name: Lint with flake8
run: |
# stop the build if there are Python syntax errors or undefined names
2 changes: 1 addition & 1 deletion README.md
@@ -25,7 +25,7 @@ Install from source:

### Troubleshooting

OpusFilter should generally work fine on Python 3.8 to 3.11. In the case of troubles, try installing the exact versions in `requirements.txt`:
OpusFilter should generally work fine on Python 3.8 to 3.12. In the case of troubles, try installing the exact versions in `requirements.txt`:

* `pip install -r requirements.txt`

6 changes: 6 additions & 0 deletions docs/CHANGELOG.md
@@ -7,6 +7,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Changed

- make `pycld2` and `fasttext` libraries optional
- replace `langid.py` library with `py3langid`
- update github workflows and include Python 3.12 tests

## [3.1.0] - 2024-06-05

### Added
4 changes: 2 additions & 2 deletions docs/CONTRIBUTING.md
@@ -5,7 +5,7 @@ issues page. We are also happy to consider pull requests. There are a
few rules for pull requests:

* Make a pull request to the `develop` branch instead of `master`.
* The code should support at least Python versions from 3.8 to 3.11.
* The code should support at least Python versions from 3.8 to 3.12.
* Please follow [PEP 8](https://www.python.org/dev/peps/pep-0008/). Exception: The maximum line length is 127 characters instead of 79.
* Especially for new features, please include test cases for unit testing.

@@ -20,7 +20,7 @@ skips the respective tests if not.)

GitHub workflows defined in the project run automatically `flake8`
checks and unit testing with `pytest` using Python 3.8, 3.9, 3.10,
and 3.11.
3.11, and 3.12.

Especially for larger contributions, consider using a code analysis
tool like [Pylint](https://github.com/PyCQA/pylint). Install it
18 changes: 13 additions & 5 deletions docs/filters/script_and_language_identification_filters.md
@@ -35,7 +35,7 @@ Filter segments based on their language identification confidence scores.
Parameters:

* `languages`: expected languages (ISO639 language codes) for the segments
* `id_method`: language indentification method (`langid` for using the `langid` library, `cld2` for using the `cld2` library, or `fasttext` for using a `fasttext` model; the default is `langid`)
* `id_method`: language identification method (`langid`, `lingua`, `cld2`, `fasttext`; default `langid`)
* `thresholds`: minimum identification confidence score for the segments (a single float or a list of floats per language)
* `fasttext_model_path`: path for a `fasttext` model (required only for the `fasttext` method; default `null`)
* `langid_languages`: limit detection to a list of possible languages (valid only for the `langid` method; default `null`)
@@ -44,7 +44,15 @@ Parameters:

Returned scores are the language identification confidence scores from the given identification method for the segments. The scores range from 0 to 1. In filtering, all values have to be greater than the minimum thresholds. A negative threshold can be used to skip filtering for a language.

See [langid.py](https://github.com/saffsd/langid.py) and
[pycld2](https://github.com/aboSamoor/pycld2) for the method-specific
options. A pretrained `fasttext` model can be downloaded from
[fasttext.cc/docs/en/language-identification.html](https://fasttext.cc/docs/en/language-identification.html).
Currently the following identification methods are supported:

* `langid` (default) :cite:`lui-baldwin-2012-langid`
* See https://github.com/adbar/py3langid
* `lingua`
* See https://github.com/pemistahl/lingua-py
* `cld2`
* See https://github.com/CLD2Owners/cld2
* Requires [installing optional libraries](../installation.md).
* `fasttext` :cite:`joulin-etal-2016-fasttext` and :cite:`joulin-etal-2017-bag`
* A pretrained model can be downloaded from [fasttext.cc/docs/en/language-identification.html](https://fasttext.cc/docs/en/language-identification.html).
* Requires [installing optional libraries](../installation.md).
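The threshold rule described above is simple enough to sketch in a few lines. The following is an illustrative stand-alone Python sketch, not OpusFilter's actual implementation; the function name `accept_pair` is hypothetical:

```python
from typing import List

def accept_pair(scores: List[float], thresholds: List[float]) -> bool:
    """Return True if all confidence scores exceed their thresholds.

    A negative threshold disables filtering for that language.
    """
    return all(
        threshold < 0 or score > threshold
        for score, threshold in zip(scores, thresholds)
    )

# English side passes; the German side is skipped via a negative threshold.
print(accept_pair([0.95, 0.1], [0.8, -1]))   # True
# Here 0.7 is not greater than the 0.8 threshold, so the pair is rejected.
print(accept_pair([0.7, 0.9], [0.8, 0.8]))   # False
```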
34 changes: 25 additions & 9 deletions docs/installation.md
@@ -12,20 +12,18 @@ Install from source:

Note that all required libraries are not available to install via PyPI
on Windows OS. On Linux and MacOS, it should work directly for Python
versions from 3.8 to 3.11.
versions from 3.8 to 3.12.

## Required libraries

* beautifulsoup4
* opus-fast-mosestokenizer
* fasttext
* graphviz
* langid
* py3langid
* matplotlib
* morfessor
* OpusTools
* pandas
* pycld2
* rapidfuzz
* ruamel.yaml
* regex
@@ -41,24 +39,42 @@ See `setup.py` for possible version requirements.

## Optional libraries and tools

### FastText and PyCLD2 language identification

The language identification libraries currently supported out of the
box are [py3langid](https://github.com/adbar/py3langid) and
[lingua](https://github.com/pemistahl/lingua-py). Support for
[PyCLD2](https://github.com/aboSamoor/pycld2) and
[FastText models](https://fasttext.cc/docs/en/language-identification.html)
has been made optional, as these libraries lack support especially
for newer Python versions.

The PyCLD2 support can be installed automatically with pip by
including the extras `[pycld2]` or `[all]` (e.g.
`pip install opusfilter[pycld2]`).

The support for FastText models can be installed automatically with
pip by including the extras `[fasttext]` or `[all]` (e.g.
`pip install opusfilter[fasttext]`).
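Code that depends on such optional extras typically probes for the library at runtime. A minimal stdlib-only sketch of that pattern (the helper name `optional_import_available` is illustrative, not an OpusFilter API):

```python
import importlib.util

def optional_import_available(module_name: str) -> bool:
    """Return True if the optional module can be imported."""
    return importlib.util.find_spec(module_name) is not None

# A filter could use this to fail early with an actionable message:
if not optional_import_available("pycld2"):
    print("pycld2 not installed; try: pip install opusfilter[pycld2]")
```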

### Jieba and MeCab word segmentation

For Chinese tokenization (word segmentation), you can use the
[jieba](https://github.com/fxsjy/jieba) library. It can be installed
automatically with pip by including the extras `[jieba]` or `[all]`
(e.g. `pip install opusfilter[all]`).
(e.g. `pip install opusfilter[jieba]`).

For Japanese tokenization (word segmentation), you can use the
[MeCab](https://github.com/SamuraiT/mecab-python3) library. It can be installed
automatically with pip by including the extras `[mecab]` or `[all]`
(e.g. `pip install opusfilter[all]`).
(e.g. `pip install opusfilter[mecab]`).

### LASER sentence embeddings

For using sentence embeddings filters, you need to install
`laserembeddings` (https://github.com/yannvgn/laserembeddings). It can
be installed automatically with pip by including the extras `[laser]`
or `[all]` (e.g. `pip install opusfilter[all]`). The package will also
or `[all]` (e.g. `pip install opusfilter[laser]`). The package will also
require a number of additional libraries, including PyTorch, jieba,
and MeCab. Note that you need also to download the prebuild models
with `python -m laserembeddings download-models`.
@@ -68,12 +84,12 @@ with `python -m laserembeddings download-models`.
For using n-gram language model filters, you need to install the
Python wrapper for VariKN (https://github.com/vsiivola/variKN). It can
be installed automatically with pip by including the extras `[varikn]`
or `[all]` (e.g. `pip install opusfilter[all]`).
or `[all]` (e.g. `pip install opusfilter[varikn]`).

### Eflomal word alignment

For using word alignment filters, you need to install eflomal
(https://github.com/robertostling/eflomal). It can be installed
automatically with pip by including the extras `[eflomal]` or `[all]`
(e.g. `pip install opusfilter[all]`). Note that you will need `Cython`
(e.g. `pip install opusfilter[eflomal]`). Note that you will need `Cython`
for the installation.
6 changes: 3 additions & 3 deletions opusfilter/autogen.py
@@ -217,7 +217,7 @@ class DefaultParameterFilters(AutoFiltersABC):
'AverageWordLengthFilter', 'AlphabetRatioFilter',
'TerminalPunctuationFilter', 'NonZeroNumeralsFilter',
'LongestCommonSubstringFilter', 'SimilarityFilter', 'RepetitionFilter',
'CharacterScoreFilter', ('LanguageIDFilter', {'id_method': 'cld2'})]
'CharacterScoreFilter', ('LanguageIDFilter', {'id_method': 'lingua'})]

def set_filter_thresholds(self):
"""Set filter thresholds"""
@@ -272,7 +272,7 @@ class PercentileFilters(DataBasedFiltersABC):
'AverageWordLengthFilter', 'AlphabetRatioFilter',
'TerminalPunctuationFilter', 'NonZeroNumeralsFilter',
'LongestCommonSubstringFilter', 'SimilarityFilter', 'RepetitionFilter',
'CharacterScoreFilter', ('LanguageIDFilter', {'id_method': 'cld2'})]
'CharacterScoreFilter', ('LanguageIDFilter', {'id_method': 'lingua'})]

def __init__(self, files, excluded_percentile=0.001, **kwargs):
super().__init__(files, **kwargs)
@@ -512,7 +512,7 @@ class ClusterFilters(DataBasedFiltersABC):
('LengthRatioFilter.word', {'unit': 'word'}),
'NonZeroNumeralsFilter',
'CharacterScoreFilter',
('LanguageIDFilter', {'id_method': 'cld2'}),
('LanguageIDFilter', {'id_method': 'lingua'}),
'TerminalPunctuationFilter']

def __init__(self, files, k=2, max_length=150, **kwargs):
18 changes: 13 additions & 5 deletions opusfilter/filters.py
@@ -334,8 +334,8 @@ def __init__(self, languages=None, id_method='langid', thresholds=None,

def init_langid(self, langid_languages):
"""Initialize langid identifier"""
from langid.langid import LanguageIdentifier, model
self.identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
from py3langid.langid import LanguageIdentifier, MODEL_FILE
self.identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)
if langid_languages:
self.identifier.set_languages(langid_languages)

@@ -344,7 +344,11 @@ def init_fastttext(self, fasttext_model_path):
if not fasttext_model_path:
raise ConfigurationError("FastText language ID method was chosen without specifying "
"any path to fasttext model")
import fasttext
try:
import fasttext
except ImportError:
logger.warning("Could not import fasttext. Select another id_method for LanguageIDFilter.")
raise
self.fasttext_model = fasttext.load_model(os.path.join(self.workdir, fasttext_model_path))

def init_lingua(self, lingua_mode):
@@ -366,7 +370,11 @@ def confidence(self, sentence: str, lan: str) -> float:
return 1.0

if self.id_method == 'cld2':
import pycld2
try:
import pycld2
except ImportError:
logger.warning("Could not import pycld2. Select another id_method for LanguageIDFilter.")
raise
try:
clddetails = pycld2.detect(sentence, **self.cld2_options)
except pycld2.error as err:
@@ -380,7 +388,7 @@

if self.id_method == 'langid':
lidetails = self.identifier.classify(sentence)
lilan, liconf = lidetails[0], round(lidetails[1], 2)
lilan, liconf = lidetails[0], round(float(lidetails[1]), 2)
if lilan != lan:
liconf = 0.0
return liconf
6 changes: 5 additions & 1 deletion opusfilter/opusfilter.py
@@ -556,7 +556,11 @@ def _write_jsonl(objects, fname):
"""Write objects to file as JSON lines"""
with file_open(fname, 'w') as fobj:
for obj in objects:
fobj.write(json.dumps(obj, sort_keys=True)+'\n')
try:
fobj.write(json.dumps(obj, sort_keys=True)+'\n')
except TypeError as err:
logger.error("Could not convert to JSON: %s", obj)
raise err

@staticmethod
def _read_jsonl(fname):
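The error handling added to `_write_jsonl` guards against objects that `json.dumps` cannot serialize (such as sets), logging the offending object before re-raising. A self-contained sketch of the same pattern, under the assumption that the writer receives a file-like object:

```python
import io
import json
import logging

logger = logging.getLogger(__name__)

def write_jsonl(objects, fobj):
    """Write objects as JSON lines, reporting any non-serializable object."""
    for obj in objects:
        try:
            fobj.write(json.dumps(obj, sort_keys=True) + '\n')
        except TypeError:
            # json.dumps raises TypeError for unsupported types, e.g. sets.
            logger.error("Could not convert to JSON: %s", obj)
            raise

buf = io.StringIO()
write_jsonl([{"b": 2, "a": 1}], buf)
print(buf.getvalue())  # {"a": 1, "b": 2}
```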
10 changes: 4 additions & 6 deletions requirements.txt
@@ -1,15 +1,14 @@
setuptools==65.5.1
setuptools_scm==6.4.2
numpy<2.0.0
setuptools>=65.5.1
setuptools_scm>=6.4.2
numpy>=1.24.4
opustools
jieba>=0.42
beautifulsoup4>=4.8.2
graphviz>=0.16
langid==1.1.6
py3langid==0.3.0
matplotlib>=3.3.0
opus-fast-mosestokenizer>=0.0.8.5
pandas>=1.0.0
pycld2==0.41
xxhash==3.2.0
rapidfuzz>=2.0.5
regex>=2019.11.1
@@ -18,7 +17,6 @@ ruamel.yaml>=0.15.0
scikit-learn>=0.24.0
sentence-splitter==1.4
tqdm>=4.38.0
fasttext==0.9.2
mecab-python3>=1.0.8
unidic-lite==1.0.8
subword-nmt==0.3.8
25 changes: 17 additions & 8 deletions setup.py
@@ -5,17 +5,14 @@

install_requires = [
"setuptools",
"numpy<2.0.0",
"opustools",
"beautifulsoup4>=4.8.0",
"fasttext",
"graphviz",
"langid",
"py3langid>=0.2.2",
"matplotlib",
"morfessor",
"opus-fast-mosestokenizer>=0.0.8.5",
"pandas>=1.0.0",
"pycld2",
"xxhash>=3.2.0",
"sentence-splitter",
"rapidfuzz",
@@ -28,6 +25,16 @@
"lingua-language-detector>=1.3.0"
]

pycld2_require = [
"pycld2"
]

fasttext_require = [
"py3langid<0.3.0", # 0.3.0 requires numpy 2.0.0
"numpy<2.0.0",
"fasttext"
]

eflomal_require = [
'eflomal>=2.0.0'
]
@@ -60,7 +67,8 @@
'sphinxcontrib-bibtex'
]

all_require = eflomal_require + jieba_require + mecab_require + laser_require + varikn_require + tests_require + docs_require
all_require = pycld2_require + fasttext_require + eflomal_require + jieba_require + \
mecab_require + laser_require + varikn_require + tests_require + docs_require

setuptools.setup(
name="opusfilter",
@@ -78,9 +86,10 @@
"bin/opusfilter-scores", "bin/opusfilter-test"],
install_requires=install_requires,
tests_require=tests_require,
extras_require={'test': tests_require, 'eflomal': eflomal_require, 'jieba': jieba_require,
'mecab': mecab_require, 'laser': laser_require, 'varikn': varikn_require,
'docs': docs_require, 'all': all_require},
extras_require={'test': tests_require, 'pycld2': pycld2_require, 'fasttext': fasttext_require,
'eflomal': eflomal_require, 'jieba': jieba_require, 'mecab': mecab_require,
'laser': laser_require, 'varikn': varikn_require, 'docs': docs_require,
'all': all_require},
classifiers=(
"Programming Language :: Python :: 3",
"License :: OSI Approved :: MIT License",
2 changes: 1 addition & 1 deletion tests/test_autogen.py
@@ -103,7 +103,7 @@ class TestThresholdFinder(unittest.TestCase):
{'LengthRatioFilter': {'name': 'word', 'threshold': 1, 'unit': 'word'}},
{'NonZeroNumeralsFilter': {'threshold': 1}},
{'CharacterScoreFilter': {'scripts': ['latin', 'latin'], 'thresholds': [1, 1]}},
{'LanguageIDFilter': {'id_method': 'cld2', 'languages': ['en', 'de'], 'thresholds': [1, 1]}},
{'LanguageIDFilter': {'id_method': 'lingua', 'languages': ['en', 'de'], 'thresholds': [1, 1]}},
{'TerminalPunctuationFilter': {'threshold': 1}}
]
