Make some older libraries optional #73

Merged (7 commits, Jun 26, 2024)
6 changes: 4 additions & 2 deletions .github/workflows/ci.yml
@@ -32,10 +32,12 @@ jobs:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m ensurepip --upgrade
python -m pip install --upgrade setuptools
python -m pip install --upgrade pip
python -m pip install flake8 pytest wheel
pip install -r ${{ matrix.requirements-file }}
python setup.py install
python -m pip install --no-cache-dir -r ${{ matrix.requirements-file }}
python -m pip install .
- name: Lint with flake8
run: |
# stop the build if there are Python syntax errors or undefined names
2 changes: 1 addition & 1 deletion README.md
@@ -25,7 +25,7 @@ Install from source:

### Troubleshooting

OpusFilter should generally work fine on Python 3.8 to 3.11. In the case of troubles, try installing the exact versions in `requirements.txt`:
OpusFilter should generally work fine on Python 3.8 to 3.12. In the case of troubles, try installing the exact versions in `requirements.txt`:

* `pip install -r requirements.txt`

6 changes: 6 additions & 0 deletions docs/CHANGELOG.md
@@ -7,6 +7,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Changed

- make `pycld2` and `fasttext` libraries optional
- replace `langid.py` library with `py3langid`
- update github workflows and include Python 3.12 tests

## [3.1.0] - 2024-06-05

### Added
4 changes: 2 additions & 2 deletions docs/CONTRIBUTING.md
@@ -5,7 +5,7 @@ issues page. We are also happy to consider pull requests. There are a
few rules for pull requests:

* Make a pull request to the `develop` branch instead of `master`.
* The code should support at least Python versions from 3.8 to 3.11.
* The code should support at least Python versions from 3.8 to 3.12.
* Please follow [PEP 8](https://www.python.org/dev/peps/pep-0008/). Exception: The maximum line length is 127 characters instead of 79.
* Especially for new features, please include test cases for unit testing.

@@ -20,7 +20,7 @@ skips the respective tests if not.)

GitHub workflows defined in the project run automatically `flake8`
checks and unit testing with `pytest` using Python 3.8, 3.9, 3.10,
and 3.11.
3.11, and 3.12.

Especially for larger contributions, consider using a code analysis
tool like [Pylint](https://github.com/PyCQA/pylint). Install it
18 changes: 13 additions & 5 deletions docs/filters/script_and_language_identification_filters.md
@@ -35,7 +35,7 @@ Filter segments based on their language identification confidence scores.
Parameters:

* `languages`: expected languages (ISO639 language codes) for the segments
* `id_method`: language indentification method (`langid` for using the `langid` library, `cld2` for using the `cld2` library, or `fasttext` for using a `fasttext` model; the default is `langid`)
* `id_method`: language identification method (`langid`, `lingua`, `cld2`, `fasttext`; default `langid`)
* `thresholds`: minimum identification confidence score for the segments (a single float or a list of floats per language)
* `fasttext_model_path`: path for a `fasttext` model (required only for the `fasttext` method; default `null`)
* `langid_languages`: limit detection to a list of possible languages (valid only for the `langid` method; default `null`)
@@ -44,7 +44,15 @@ Parameters:

Returned scores are the language identification confidence scores from the given identification method for the segments. The scores range from 0 to 1. In filtering, all values have to be greater than the minimum thresholds. A negative threshold can be used to skip filtering for a language.

See [langid.py](https://github.com/saffsd/langid.py) and
[pycld2](https://github.com/aboSamoor/pycld2) for the method-specific
options. A pretrained `fasttext` model can be downloaded from
[fasttext.cc/docs/en/language-identification.html](https://fasttext.cc/docs/en/language-identification.html).
Currently the following identification methods are supported:

* `langid` (default) :cite:`lui-baldwin-2012-langid`
* See https://github.com/adbar/py3langid
* `lingua`
* See https://github.com/pemistahl/lingua-py
* `cld2`
* See https://github.com/CLD2Owners/cld2
* Requires [installing optional libraries](../installation.md).
* `fasttext` :cite:`joulin-etal-2016-fasttext` and :cite:`joulin-etal-2017-bag`
* A pretrained model can be downloaded from [fasttext.cc/docs/en/language-identification.html](https://fasttext.cc/docs/en/language-identification.html).
* Requires [installing optional libraries](../installation.md).
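The threshold rule described above is simple enough to sketch in a few lines. The following is an illustrative stand-alone Python sketch, not OpusFilter's actual implementation; the function name `accept_pair` is hypothetical:

```python
from typing import List

def accept_pair(scores: List[float], thresholds: List[float]) -> bool:
    """Return True if all confidence scores exceed their thresholds.

    A negative threshold disables filtering for that language.
    """
    return all(
        threshold < 0 or score > threshold
        for score, threshold in zip(scores, thresholds)
    )

# English side passes; the German side is skipped via a negative threshold.
print(accept_pair([0.95, 0.1], [0.8, -1]))   # True
# Here 0.7 is not greater than the 0.8 threshold, so the pair is rejected.
print(accept_pair([0.7, 0.9], [0.8, 0.8]))   # False
```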
34 changes: 25 additions & 9 deletions docs/installation.md
@@ -12,20 +12,18 @@ Install from source:

Note that all required libraries are not available to install via PyPI
on Windows OS. On Linux and MacOS, it should work directly for Python
versions from 3.8 to 3.11.
versions from 3.8 to 3.12.

## Required libraries

* beautifulsoup4
* opus-fast-mosestokenizer
* fasttext
* graphviz
* langid
* py3langid
* matplotlib
* morfessor
* OpusTools
* pandas
* pycld2
* rapidfuzz
* ruamel.yaml
* regex
@@ -41,24 +39,42 @@ See `setup.py` for possible version requirements.

## Optional libraries and tools

### FastText and PyCLD2 language identification

The language identification libraries currently supported out of the
box are [py3langid](https://github.com/adbar/py3langid) and
[lingua](https://github.com/pemistahl/lingua-py). Support for
[PyCLD2](https://github.com/aboSamoor/pycld2) and
[FastText models](https://fasttext.cc/docs/en/language-identification.html)
has been made optional, as these libraries lack support especially
for newer Python versions.

The PyCLD2 support can be installed automatically with pip by
including the extras `[pycld2]` or `[all]` (e.g.
`pip install opusfilter[pycld2]`).

The support for FastText models can be installed automatically with
pip by including the extras `[fasttext]` or `[all]` (e.g.
`pip install opusfilter[fasttext]`).
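Code that depends on such optional extras typically probes for the library at runtime. A minimal stdlib-only sketch of that pattern (the helper name `optional_import_available` is illustrative, not an OpusFilter API):

```python
import importlib.util

def optional_import_available(module_name: str) -> bool:
    """Return True if the optional module can be imported."""
    return importlib.util.find_spec(module_name) is not None

# A filter could use this to fail early with an actionable message:
if not optional_import_available("pycld2"):
    print("pycld2 not installed; try: pip install opusfilter[pycld2]")
```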

### Jieba and MeCab word segmentation

For Chinese tokenization (word segmentation), you can use the
[jieba](https://github.com/fxsjy/jieba) library. It can be installed
automatically with pip by including the extras `[jieba]` or `[all]`
(e.g. `pip install opusfilter[all]`).
(e.g. `pip install opusfilter[jieba]`).

For Japanese tokenization (word segmentation), you can use the
[MeCab](https://github.com/SamuraiT/mecab-python3) library. It can be installed
automatically with pip by including the extras `[mecab]` or `[all]`
(e.g. `pip install opusfilter[all]`).
(e.g. `pip install opusfilter[mecab]`).

### LASER sentence embeddings

For using sentence embeddings filters, you need to install
`laserembeddings` (https://github.com/yannvgn/laserembeddings). It can
be installed automatically with pip by including the extras `[laser]`
or `[all]` (e.g. `pip install opusfilter[all]`). The package will also
or `[all]` (e.g. `pip install opusfilter[laser]`). The package will also
require a number of additional libraries, including PyTorch, jieba,
and MeCab. Note that you need also to download the prebuild models
with `python -m laserembeddings download-models`.
@@ -68,12 +84,12 @@ with `python -m laserembeddings download-models`.
For using n-gram language model filters, you need to install the
Python wrapper for VariKN (https://github.com/vsiivola/variKN). It can
be installed automatically with pip by including the extras `[varikn]`
or `[all]` (e.g. `pip install opusfilter[all]`).
or `[all]` (e.g. `pip install opusfilter[varikn]`).

### Eflomal word alignment

For using word alignment filters, you need to install eflomal
(https://github.com/robertostling/eflomal). It can be installed
automatically with pip by including the extras `[eflomal]` or `[all]`
(e.g. `pip install opusfilter[all]`). Note that you will need `Cython`
(e.g. `pip install opusfilter[eflomal]`). Note that you will need `Cython`
for the installation.
6 changes: 3 additions & 3 deletions opusfilter/autogen.py
@@ -217,7 +217,7 @@ class DefaultParameterFilters(AutoFiltersABC):
'AverageWordLengthFilter', 'AlphabetRatioFilter',
'TerminalPunctuationFilter', 'NonZeroNumeralsFilter',
'LongestCommonSubstringFilter', 'SimilarityFilter', 'RepetitionFilter',
'CharacterScoreFilter', ('LanguageIDFilter', {'id_method': 'cld2'})]
'CharacterScoreFilter', ('LanguageIDFilter', {'id_method': 'lingua'})]

def set_filter_thresholds(self):
"""Set filter thresholds"""
@@ -272,7 +272,7 @@ class PercentileFilters(DataBasedFiltersABC):
'AverageWordLengthFilter', 'AlphabetRatioFilter',
'TerminalPunctuationFilter', 'NonZeroNumeralsFilter',
'LongestCommonSubstringFilter', 'SimilarityFilter', 'RepetitionFilter',
'CharacterScoreFilter', ('LanguageIDFilter', {'id_method': 'cld2'})]
'CharacterScoreFilter', ('LanguageIDFilter', {'id_method': 'lingua'})]

def __init__(self, files, excluded_percentile=0.001, **kwargs):
super().__init__(files, **kwargs)
@@ -512,7 +512,7 @@ class ClusterFilters(DataBasedFiltersABC):
('LengthRatioFilter.word', {'unit': 'word'}),
'NonZeroNumeralsFilter',
'CharacterScoreFilter',
('LanguageIDFilter', {'id_method': 'cld2'}),
('LanguageIDFilter', {'id_method': 'lingua'}),
'TerminalPunctuationFilter']

def __init__(self, files, k=2, max_length=150, **kwargs):
18 changes: 13 additions & 5 deletions opusfilter/filters.py
@@ -334,8 +334,8 @@ def __init__(self, languages=None, id_method='langid', thresholds=None,

def init_langid(self, langid_languages):
"""Initialize langid identifier"""
from langid.langid import LanguageIdentifier, model
self.identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
from py3langid.langid import LanguageIdentifier, MODEL_FILE
self.identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)
if langid_languages:
self.identifier.set_languages(langid_languages)

@@ -344,7 +344,11 @@ def init_fastttext(self, fasttext_model_path):
if not fasttext_model_path:
raise ConfigurationError("FastText language ID method was chosen without specifying "
"any path to fasttext model")
import fasttext
try:
import fasttext
except ImportError:
logger.warning("Could not import fasttext. Select another id_method for LanguageIDFilter.")
raise
self.fasttext_model = fasttext.load_model(os.path.join(self.workdir, fasttext_model_path))

def init_lingua(self, lingua_mode):
@@ -366,7 +370,11 @@ def confidence(self, sentence: str, lan: str) -> float:
return 1.0

if self.id_method == 'cld2':
import pycld2
try:
import pycld2
except ImportError:
logger.warning("Could not import pycld2. Select another id_method for LanguageIDFilter.")
raise
try:
clddetails = pycld2.detect(sentence, **self.cld2_options)
except pycld2.error as err:
@@ -380,7 +388,7 @@

if self.id_method == 'langid':
lidetails = self.identifier.classify(sentence)
lilan, liconf = lidetails[0], round(lidetails[1], 2)
lilan, liconf = lidetails[0], round(float(lidetails[1]), 2)
if lilan != lan:
liconf = 0.0
return liconf
6 changes: 5 additions & 1 deletion opusfilter/opusfilter.py
@@ -556,7 +556,11 @@ def _write_jsonl(objects, fname):
"""Write objects to file as JSON lines"""
with file_open(fname, 'w') as fobj:
for obj in objects:
fobj.write(json.dumps(obj, sort_keys=True)+'\n')
try:
fobj.write(json.dumps(obj, sort_keys=True)+'\n')
except TypeError as err:
logger.error("Could not convert to JSON: %s", obj)
raise err

@staticmethod
def _read_jsonl(fname):
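The error handling added to `_write_jsonl` guards against objects that `json.dumps` cannot serialize (such as sets), logging the offending object before re-raising. A self-contained sketch of the same pattern, under the assumption that the writer receives a file-like object:

```python
import io
import json
import logging

logger = logging.getLogger(__name__)

def write_jsonl(objects, fobj):
    """Write objects as JSON lines, reporting any non-serializable object."""
    for obj in objects:
        try:
            fobj.write(json.dumps(obj, sort_keys=True) + '\n')
        except TypeError:
            # json.dumps raises TypeError for unsupported types, e.g. sets.
            logger.error("Could not convert to JSON: %s", obj)
            raise

buf = io.StringIO()
write_jsonl([{"b": 2, "a": 1}], buf)
print(buf.getvalue())  # {"a": 1, "b": 2}
```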
10 changes: 4 additions & 6 deletions requirements.txt
@@ -1,15 +1,14 @@
setuptools==65.5.1
setuptools_scm==6.4.2
numpy<2.0.0
setuptools>=65.5.1
setuptools_scm>=6.4.2
numpy>=1.24.4
opustools
jieba>=0.42
beautifulsoup4>=4.8.2
graphviz>=0.16
langid==1.1.6
py3langid==0.3.0
matplotlib>=3.3.0
opus-fast-mosestokenizer>=0.0.8.5
pandas>=1.0.0
pycld2==0.41
xxhash==3.2.0
rapidfuzz>=2.0.5
regex>=2019.11.1
@@ -18,7 +17,6 @@ ruamel.yaml>=0.15.0
scikit-learn>=0.24.0
sentence-splitter==1.4
tqdm>=4.38.0
fasttext==0.9.2
mecab-python3>=1.0.8
unidic-lite==1.0.8
subword-nmt==0.3.8
25 changes: 17 additions & 8 deletions setup.py
@@ -5,17 +5,14 @@

install_requires = [
"setuptools",
"numpy<2.0.0",
"opustools",
"beautifulsoup4>=4.8.0",
"fasttext",
"graphviz",
"langid",
"py3langid>=0.2.2",
"matplotlib",
"morfessor",
"opus-fast-mosestokenizer>=0.0.8.5",
"pandas>=1.0.0",
"pycld2",
"xxhash>=3.2.0",
"sentence-splitter",
"rapidfuzz",
@@ -28,6 +25,16 @@
"lingua-language-detector>=1.3.0"
]

pycld2_require = [
"pycld2"
]

fasttext_require = [
"py3langid<0.3.0", # 0.3.0 requires numpy 2.0.0
"numpy<2.0.0",
"fasttext"
]

eflomal_require = [
'eflomal>=2.0.0'
]
@@ -60,7 +67,8 @@
'sphinxcontrib-bibtex'
]

all_require = eflomal_require + jieba_require + mecab_require + laser_require + varikn_require + tests_require + docs_require
all_require = pycld2_require + fasttext_require + eflomal_require + jieba_require + \
mecab_require + laser_require + varikn_require + tests_require + docs_require

setuptools.setup(
name="opusfilter",
@@ -78,9 +86,10 @@
"bin/opusfilter-scores", "bin/opusfilter-test"],
install_requires=install_requires,
tests_require=tests_require,
extras_require={'test': tests_require, 'eflomal': eflomal_require, 'jieba': jieba_require,
'mecab': mecab_require, 'laser': laser_require, 'varikn': varikn_require,
'docs': docs_require, 'all': all_require},
extras_require={'test': tests_require, 'pycld2': pycld2_require, 'fasttext': fasttext_require,
'eflomal': eflomal_require, 'jieba': jieba_require, 'mecab': mecab_require,
'laser': laser_require, 'varikn': varikn_require, 'docs': docs_require,
'all': all_require},
classifiers=(
"Programming Language :: Python :: 3",
"License :: OSI Approved :: MIT License",
2 changes: 1 addition & 1 deletion tests/test_autogen.py
@@ -103,7 +103,7 @@ class TestThresholdFinder(unittest.TestCase):
{'LengthRatioFilter': {'name': 'word', 'threshold': 1, 'unit': 'word'}},
{'NonZeroNumeralsFilter': {'threshold': 1}},
{'CharacterScoreFilter': {'scripts': ['latin', 'latin'], 'thresholds': [1, 1]}},
{'LanguageIDFilter': {'id_method': 'cld2', 'languages': ['en', 'de'], 'thresholds': [1, 1]}},
{'LanguageIDFilter': {'id_method': 'lingua', 'languages': ['en', 'de'], 'thresholds': [1, 1]}},
{'TerminalPunctuationFilter': {'threshold': 1}}
]
