Use Lingua instead of pycld3 for language detection #615

Closed
wants to merge 5 commits into from
3 changes: 1 addition & 2 deletions .github/workflows/cicd.yml
@@ -49,7 +49,7 @@ jobs:
# Selectively install the optional dependencies for some Python versions
# For Python 3.8:
if [[ ${{ matrix.python-version }} == '3.8' ]]; then
- poetry install -E "nn omikuji yake voikko pycld3";
+ poetry install -E "nn omikuji yake voikko lingua";
fi
# For Python 3.9:
if [[ ${{ matrix.python-version }} == '3.9' ]]; then
@@ -62,7 +62,6 @@ jobs:
poetry install -E "nn omikuji yake";
fi
poetry run python -m nltk.downloader punkt

- name: Lint with flake8
run: |
# stop the build if there are Python syntax errors or undefined names
2 changes: 1 addition & 1 deletion Dockerfile
@@ -2,7 +2,7 @@ FROM python:3.8-slim-bullseye
LABEL maintainer="Juho Inkinen <[email protected]>"
SHELL ["/bin/bash", "-c"]

- ARG optional_dependencies="fasttext voikko pycld3 fasttext nn omikuji yake spacy"
+ ARG optional_dependencies="fasttext voikko lingua fasttext nn omikuji yake spacy"
ARG POETRY_VIRTUALENVS_CREATE=false

# Install system dependencies needed at runtime:
2 changes: 1 addition & 1 deletion annif/transform/__init__.py
@@ -48,4 +48,4 @@ def get_transform(transform_specs, project):
_transforms.update({langfilter.LangFilter.name: langfilter.LangFilter})
except ImportError:
annif.logger.debug(
- "pycld3 not available, not enabling filter_language transform")
+ "Lingua not available, not enabling filter_language transform")
14 changes: 10 additions & 4 deletions annif/transform/langfilter.py
@@ -2,7 +2,7 @@
different from the language of the project."""

import annif
- import cld3
+ import lingua
from . import transform

logger = annif.logger
@@ -16,14 +16,20 @@ def __init__(self, project, text_min_length=500, sentence_min_length=50):
super().__init__(project)
self.text_min_length = int(text_min_length)
self.sentence_min_length = int(sentence_min_length)
+ self.detector = (
+ lingua.LanguageDetectorBuilder
+ .from_all_languages()
+ .with_low_accuracy_mode()
+ .build()
+ )

def _detect_language(self, text):
"""Tries to detect the language of a text input. Outputs a BCP-47-style
language code (e.g. 'en')."""

- lan_info = cld3.get_language(text)
- if lan_info is not None and lan_info.is_reliable:
- return lan_info.language
+ lan_info = self.detector.detect_language_of(text)
+ if lan_info is not None:
+ return lan_info.iso_code_639_1.name.lower()
else:
return None

4 changes: 2 additions & 2 deletions pyproject.toml
@@ -56,7 +56,7 @@ tensorflow-cpu = {version = "2.9.1", optional = true}
lmdb = {version = "1.3.0", optional = true}
omikuji = {version = "0.5.*", optional = true}
yake = {version = "0.4.5", optional = true}
- pycld3 = {version = "*", optional = true}
+ lingua-language-detector = {version = "1.1.3", optional = true}
spacy = {version = "3.3.*", optional = true}

[tool.poetry.dev-dependencies]
@@ -79,7 +79,7 @@ voikko = ["voikko"]
nn = ["tensorflow-cpu", "lmdb"]
omikuji = ["omikuji"]
yake = ["yake"]
- pycld3 = ["pycld3"]
+ lingua = ["lingua-language-detector"]
spacy = ["spacy"]

[tool.poetry.scripts]
1 change: 0 additions & 1 deletion tests/test_transform_langfilter.py
@@ -32,7 +32,6 @@ def test_lang_filter(project):
Kansalliskirjasto on kaikille avoin kulttuuriperintöorganisaatio, joka
palvelee valtakunnallisesti kansalaisia, tiedeyhteisöjä ja muita
yhteiskunnan toimijoita.
- Abc defghij klmnopqr stuwxyz abc defghij klmnopqr stuwxyz.
Member Author

I had to change this test. This is a nonsensical sentence that pycld3 is unsure about (.is_reliable == False), so the language filter gives it the benefit of the doubt and retains it. Lingua simply identifies it as Swahili, so it gets stripped.
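
The behavior change described here can be illustrated with a minimal sketch. It is not the actual `LangFilter` code: `filter_by_language`, `unsure_detector`, and `confident_detector` are hypothetical names, and the two detectors are hard-coded stand-ins that do not call pycld3 or Lingua. The only assumption carried over from the PR is that the filter keeps a sentence when detection returns `None`.

```python
from typing import Callable, Iterable, List, Optional


def filter_by_language(
    sentences: Iterable[str],
    project_lang: str,
    detect: Callable[[str], Optional[str]],
) -> List[str]:
    """Keep sentences detected as project_lang, plus sentences for which
    the detector gives no answer (benefit of the doubt)."""
    kept = []
    for sentence in sentences:
        lang = detect(sentence)
        if lang is None or lang == project_lang:
            kept.append(sentence)
    return kept


NONSENSE = "Abc defghij klmnopqr stuwxyz abc defghij klmnopqr stuwxyz."
FINNISH = "Kansalliskirjasto on kaikille avoin kulttuuriperintöorganisaatio."


def unsure_detector(text: str) -> Optional[str]:
    # Stand-in for a detector that reports "not reliable" on nonsense input.
    return "fi" if text == FINNISH else None


def confident_detector(text: str) -> Optional[str]:
    # Stand-in for a detector that always commits to some language.
    return "fi" if text == FINNISH else "sw"


# With the unsure detector the nonsense sentence is retained.
print(filter_by_language([FINNISH, NONSENSE], "fi", unsure_detector))
# With the confident detector it is stripped.
print(filter_by_language([FINNISH, NONSENSE], "fi", confident_detector))
```

The filter logic is identical in both calls; only the detector's willingness to answer `None` differs, which is why the test sentence had to go.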

I can't see a direct way in the Lingua documentation to tell whether Lingua is unsure about some input. There is the minimum relative distance parameter, which could be tuned, but I'm not sure it would help with nonsensical input. For example, converting PDF documents to text is quite likely to produce garbage "sentences" that aren't in any real language.
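
To make the concern about a distance threshold concrete, here is a rough sketch of the general idea of a gap threshold between the two best candidates. It is an assumption about the concept only, not Lingua's implementation: `pick_language` is a hypothetical helper, and the confidence scores are made up for illustration.

```python
from typing import Dict, Optional


def pick_language(
    confidences: Dict[str, float], min_relative_distance: float
) -> Optional[str]:
    """Return the top-scoring language only if it leads the runner-up by at
    least min_relative_distance; otherwise return None (unsure)."""
    if not confidences:
        return None
    ranked = sorted(confidences.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0][0]
    (best_lang, best), (_, second) = ranked[0], ranked[1]
    if best - second < min_relative_distance:
        return None
    return best_lang


# Made-up scores: a close race is rejected as unsure ...
print(pick_language({"sw": 0.5, "fi": 0.45}, 0.25))
# ... but a clear lead is accepted, whether or not the input is a real language.
print(pick_language({"sw": 0.9, "fi": 0.3}, 0.25))
```

A threshold like this only catches input where two languages score similarly. If a garbage sentence happens to score clearly higher for one language than for all others, it still passes.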

Turvaamme Suomessa julkaistun tai Suomea koskevan julkaistun
kulttuuriperinnön saatavuuden sekä välittämme ja tuotamme
tietosisältöjä tutkimukselle, opiskelulle, kansalaisille ja