Initial addition of the Russian language #66

Draft
wants to merge 4 commits into
base: main
258 changes: 258 additions & 0 deletions src/rules/disallowed_words/russian.txt
@@ -0,0 +1,258 @@
хуй
хуина
хуйло
опизденевшие
пизда
др
доп
ул
им
ст
св
чел
шт
пр
см
мн
пл
мл
уд
ср
др
рус
ед
чл
корр
еп
пп
оз
кг
гв
рр
тд
км
кн
мм
юр
ур
дв
ев
яп
шп
яз
цз
тт
сб
пн
вт
ср
чт
пт
вск
эп
зп
сц
уу
ув
оо
би
мя
ал
сс
уг
ол
сл
узб
эк
кр
хр
кс
рч
вн
ов
аг
уч
хх
дд
тп
мч
вр
ьо
ин
оф
ус
тж
жд
дл
мд
фр
эм
ит
оп
лл
ак
эл
рп
вм
3-бет
аббр
аббрев
абл
абс
абх
авар
Авв
авг
Авд
австр
австрал
авт
Агг
агр
адж
адм
адыг
азерб
азиат
акад
академ
акк
акц
алб
алг
алгебр
алж
алт
алф
альм
альп
ам
Ам
амер
анат
англ
ангол
аннот
антич
ао
ап
Апок
апп
апр
ар
араб
арам
аргент
арифм
арм
арт
артез
арх
археол
архиеп
архим
архип
архит
ас
асб
асс
ассир
ассист
астр
астрон
ат
ата
ати
атм
афг
афр
ацет
б-ка
б-н
б-ца
б-чка
бат-н
башк
бел
белорус
бзн
библ
биогр
биол
бирм
Бл
блгв
блгвв
блж
блр
больн
бр
браз
брет
брит
бц
быв
Быт
бюдж
бюлл
вл
Вл
вс
вт
вып
г-жа
г-н
Гбайт
ГВт
гг
Гкал
гл
глаг
гм
гос
гр
грн
дал
дБ
деепр
дееприч
Дж
диак
долл
дптр
др
зак
зам
Зв
изд-во
кал
кат
кв
кВА
кВт
кВтч
ккал
корп
корр
Мб
Мбит
МВт
мг
МГц
межд
междунар
мес
мест
нареч
Бк
Вт
га
гг
Гг
Ггц
кг
км
кт
мкс
мм
сек
78 changes: 78 additions & 0 deletions src/rules/russian.toml
@@ -0,0 +1,78 @@
min_trimmed_length = 3

# For count = 2 we have sentences like: "Это он", "Будет беда"
# and a lot of trash abbreviations like "до н.", "и доп."
min_word_count = 3

max_word_count = 14

# In Russian, words can consist of one letter, so no restrictions
min_characters = 0

may_end_with_colon = false
quote_start_with_letter = true
needs_punctuation_end = false
needs_letter_start = true

# Apparently, in some places the sentences are cut incorrectly,
# which is why we get some part of the sentence, and not its entirety.
# This is required to fix sentences like:
# с Иваном Галамяном.
needs_uppercase_start = true

allowed_symbols_regex = "[А-Яа-яёЁ\\s:,.\\-‑?;!—­‐–'·=’―−”‘]"
Member

Note that your previous version excluded all quotes and some other characters, as they led to uneven symbol occurrences in the sentences. This version allows quotes, so you might end up with sentences containing unbalanced symbols. I'd suggest adding these to the even_symbols config option to make sure quotes (and possibly other symbols) always occur an even number of times.

Author

The main difference from the previous version is that I allowed symbol - that is OK, as in Russian there can be words like "Côte d'Ivoire".

Could you please provide more info about even_symbols? I can't find a detailed description of it.

Member

With this regex here, you could potentially get the following sentence: This is ”uneven and suddenly stops the quote without terminating symbol

Regarding even_symbols, can you tell me what exactly is not clear from the README? Then we can adjust the README to provide better info.
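
As a rough illustration of the evenness check discussed in this thread, the sketch below counts each listed symbol and flags a sentence when any count is odd. It is written in Python for brevity and is not the project's implementation; the function name `has_even_symbols` and the symbol list `QUOTE_SYMBOLS` are assumptions made for this example.

```python
def has_even_symbols(sentence, symbols):
    """Return True if every symbol in `symbols` occurs an even number of times."""
    return all(sentence.count(symbol) % 2 == 0 for symbol in symbols)


# Hypothetical symbol list; the real even_symbols value is a per-language config choice.
QUOTE_SYMBOLS = ['"', "”", "‘", "’"]

uneven = "This is ”uneven and suddenly stops the quote without terminating symbol"
even = "This is ”even” and closes the quote"

print(has_even_symbols(uneven, QUOTE_SYMBOLS))
print(has_even_symbols(even, QUOTE_SYMBOLS))
```

One design point to keep in mind: a symbol that also serves as an apostrophe would normally appear once per word, so listing it in such a check would reject words like the "Côte d'Ivoire" example. Which symbols to list is therefore a judgment call.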

Member

Did you have a chance to look at this? Is there anything I can help with?


disallowed_symbols = [
Member

Note that this won't be used as allowed_symbols_regex is set. However, given that those invisible chars should not be allowed by the above regex, I guess that's fine and it can be completely removed?

Author

@iLeonidze, Jan 8, 2020

No. As I found out, invisible chars are not detected by the regex for some reason, perhaps because they are part of \s. Therefore, they had to be defined here and can't be removed.

Member

Yeah, that's probably what is happening here. Could you use the specific whitespace character you want instead of \s, or is my regex knowledge completely failing me here?
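
One way to test the hypothesis in this thread is to check, character by character, which code points a whitespace shorthand actually matches and compare that with an explicit character class. The sketch below does this with Python's `re` module, which is an assumption made for illustration: it is not necessarily the regex engine this project uses, and `\s` may cover a different set of characters in another engine. The helper name `describe` and the candidate list are hypothetical.

```python
import re
import unicodedata


def describe(chars, pattern):
    """For each character, report its code point, Unicode name, and whether `pattern` matches it."""
    compiled = re.compile(pattern)
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"), compiled.fullmatch(ch) is not None)
        for ch in chars
    ]


# Escapes are used so the invisible characters stay visible in the source:
# ordinary space, no-break space, soft hyphen, zero width space.
candidates = ["\u0020", "\u00a0", "\u00ad", "\u200b"]

for row in describe(candidates, r"\s"):   # shorthand whitespace class
    print(row)
for row in describe(candidates, r"[ ]"):  # only the ordinary space
    print(row)
```

Running the same kind of check against the engine the tool actually uses would show whether the invisible characters slip through because of `\s` or because they appear literally inside the character class.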

# Here invisible chars: use vim to work with them
'­', '​', '', '་'
]

broken_whitespace = [" ", " ,", " .", " ?", " !", " ;", " \""]

abbreviation_patterns = [
# Abbreviations:

# 1. М.Н.С.
"[А-Я]+\\.*[А-Я]",

# 2. А. Пушкин
"[А-Я]\\.",

# 3. Дж.
"[А-Я][а-я]\\.",

# 4. СССР
"[А-Я]{2,}",

# 5. г. Пушкина.
"\\s[а-я]\\.",

# 6. — —
"— —",

# 7. сайка фито— и зоопланктоном
"[а-я]—\\s[а-я]",

# 8. Повод —первое упоминание
"[а-я]\\s—[а-я]",

# 9. с разрывом связей С—С и образованием
"[А-Я]—[А-Я]",

# 10. Words that are similar to ordinary, but cannot be at the end of a sentence,
# which means they are abbreviations
"\\s(ор|ок|ом|ум|те|ил)\\.",

# 11. по учению св.отцов
"[а-я]\\.[а-я]",

# 12. др.-евр. чл.-корр.
"[а-я]{1,3}\\.-[а-я]{1,4}\\.",

# 13. ул.Тюменская
"[А-Яа-я]\\.[А-Яа-я]",

# 14. Ставропольский кр.,
"\\.,"
]
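
To see how patterns like the ones above behave on sample sentences, the following Python sketch applies a subset of them with `re.search`. It assumes that a sentence is flagged when any pattern matches anywhere in it; that is an assumption about how the tool applies `abbreviation_patterns`, not something confirmed here. The function name `matching_patterns` is hypothetical, and Python's `re` may differ in details from the engine the project uses.

```python
import re

# A subset of the patterns from the config above, written as Python raw strings.
ABBREVIATION_PATTERNS = [
    r"[А-Я]+\.*[А-Я]",      # 1. М.Н.С.
    r"[А-Я]\.",             # 2. А. Пушкин
    r"[А-Я]{2,}",           # 4. СССР
    r"\s[а-я]\.",           # 5. г. Пушкина.
    r"[А-Яа-я]\.[А-Яа-я]",  # 13. ул.Тюменская
    r"\.,",                 # 14. Ставропольский кр.,
]


def matching_patterns(sentence, patterns=ABBREVIATION_PATTERNS):
    """Return the patterns that match somewhere in `sentence` (assumed rejection rule)."""
    return [pattern for pattern in patterns if re.search(pattern, sentence)]


for sample in ["Это был хороший день", "Он жил в СССР", "ул.Тюменская"]:
    print(sample, "->", matching_patterns(sample))
```

A quick check like this can help spot overlapping patterns (for example, an all-caps word is caught by more than one rule) before they are added to the config.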