Initial addition of the Russian language #66
base: main
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,258 @@ | ||
хуй | ||
хуина | ||
хуйло | ||
опизденевшие | ||
пизда | ||
др | ||
доп | ||
ул | ||
им | ||
ст | ||
св | ||
чел | ||
шт | ||
пр | ||
см | ||
мн | ||
пл | ||
мл | ||
уд | ||
ср | ||
др | ||
рус | ||
ед | ||
чл | ||
корр | ||
еп | ||
пп | ||
оз | ||
кг | ||
гв | ||
рр | ||
тд | ||
км | ||
кн | ||
мм | ||
юр | ||
ур | ||
дв | ||
ев | ||
яп | ||
шп | ||
яз | ||
цз | ||
тт | ||
сб | ||
пн | ||
вт | ||
ср | ||
чт | ||
пт | ||
вск | ||
эп | ||
зп | ||
сц | ||
уу | ||
ув | ||
оо | ||
би | ||
мя | ||
ал | ||
сс | ||
уг | ||
ол | ||
сл | ||
узб | ||
эк | ||
кр | ||
хр | ||
кс | ||
рч | ||
вн | ||
ов | ||
аг | ||
уч | ||
хх | ||
дд | ||
тп | ||
мч | ||
вр | ||
ьо | ||
ин | ||
оф | ||
ус | ||
тж | ||
жд | ||
дл | ||
мд | ||
фр | ||
эм | ||
ит | ||
оп | ||
лл | ||
ак | ||
эл | ||
рп | ||
вм | ||
3-бет | ||
аббр | ||
аббрев | ||
абл | ||
абс | ||
абх | ||
авар | ||
Авв | ||
авг | ||
Авд | ||
австр | ||
австрал | ||
авт | ||
Агг | ||
агр | ||
адж | ||
адм | ||
адыг | ||
азерб | ||
азиат | ||
акад | ||
академ | ||
акк | ||
акц | ||
алб | ||
алг | ||
алгебр | ||
алж | ||
алт | ||
алф | ||
альм | ||
альп | ||
ам | ||
Ам | ||
амер | ||
анат | ||
англ | ||
ангол | ||
аннот | ||
антич | ||
ао | ||
ап | ||
Апок | ||
апп | ||
апр | ||
ар | ||
араб | ||
арам | ||
аргент | ||
арифм | ||
арм | ||
арт | ||
артез | ||
арх | ||
археол | ||
архиеп | ||
архим | ||
архип | ||
архит | ||
ас | ||
асб | ||
асс | ||
ассир | ||
ассист | ||
астр | ||
астрон | ||
ат | ||
ата | ||
ати | ||
атм | ||
афг | ||
афр | ||
ацет | ||
б-ка | ||
б-н | ||
б-ца | ||
б-чка | ||
бат-н | ||
башк | ||
бел | ||
белорус | ||
бзн | ||
библ | ||
биогр | ||
биол | ||
бирм | ||
Бл | ||
блгв | ||
блгвв | ||
блж | ||
блр | ||
больн | ||
бр | ||
браз | ||
брет | ||
брит | ||
бц | ||
быв | ||
Быт | ||
бюдж | ||
бюлл | ||
вл | ||
Вл | ||
вс | ||
вт | ||
вып | ||
г-жа | ||
г-н | ||
Гбайт | ||
ГВт | ||
гг | ||
Гкал | ||
гл | ||
глаг | ||
гм | ||
гос | ||
гр | ||
грн | ||
дал | ||
дБ | ||
деепр | ||
дееприч | ||
Дж | ||
диак | ||
долл | ||
дптр | ||
др | ||
зак | ||
зам | ||
Зв | ||
изд-во | ||
кал | ||
кат | ||
кв | ||
кВА | ||
кВт | ||
кВтч | ||
ккал | ||
корп | ||
корр | ||
Мб | ||
Мбит | ||
МВт | ||
мг | ||
МГц | ||
межд | ||
междунар | ||
мес | ||
мест | ||
нареч | ||
Бк | ||
Вт | ||
га | ||
гг | ||
Гг | ||
Ггц | ||
кг | ||
км | ||
кт | ||
мкс | ||
мм | ||
сек |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,78 @@ | ||
min_trimmed_length = 3 | ||
|
||
# For count = 2 we have sentences like: "Это он", "Будет беда" | ||
# and a lot of trash abbreviations like "до н.", "и доп." | ||
min_word_count = 3 | ||
|
||
max_word_count = 14 | ||
|
||
# In Russian, words can consist of one letter, so no restrictions | ||
min_characters = 0 | ||
|
||
may_end_with_colon = false | ||
quote_start_with_letter = true | ||
needs_punctuation_end = false | ||
needs_letter_start = true | ||
|
||
# Apparently, in some places the sentences are cut incorrectly, | ||
# which is why we get some part of the sentence, and not its entirety. | ||
# This is required to fix sentences like: | ||
# с Иваном Галамяном. | ||
needs_uppercase_start = true | ||
|
||
allowed_symbols_regex = "[А-Яа-яёЁ\\s:,.\\-‑?;!—‐–'·=’―−”‘]" | ||
|
||
disallowed_symbols = [ | ||
Note that this won't be used as
No. As I found out, invisible chars are not detected by the regex for some reason. Perhaps because they are part of
Yeah, that's probably what is happening here. Could you just use the specific character you want for whitespace instead of |
||
# Here invisible chars: use vim to work with them | ||
'', '', '', '་' | ||
] | ||
|
||
broken_whitespace = [" ", " ,", " .", " ?", " !", " ;", " \""] | ||
|
||
abbreviation_patterns = [ | ||
# Abbreviations: | ||
|
||
# 1. М.Н.С. | ||
"[А-Я]+\\.*[А-Я]", | ||
|
||
# 2. А. Пушкин | ||
"[А-Я]\\.", | ||
|
||
# 3. Дж. | ||
"[А-Я][а-я]\\.", | ||
|
||
# 4. СССР | ||
"[А-Я]{2,}", | ||
|
||
# 5. г. Пушкина. | ||
"\\s[а-я]\\.", | ||
|
||
# 6. — — | ||
"— —", | ||
|
||
# 7. сайка фито— и зоопланктоном | ||
"[а-я]—\\s[а-я]", | ||
|
||
# 8. Повод —первое упоминание | ||
"[а-я]\\s—[а-я]", | ||
|
||
# 9. с разрывом связей С—С и образованием | ||
"[А-Я]—[А-Я]", | ||
|
||
# 10. Words that are similar to ordinary, but cannot be at the end of a sentence, | ||
# which means they are abbreviations | ||
"\\s(ор|ок|ом|ум|те|ил)\\.", | ||
|
||
# 11. по учению св.отцов | ||
"[а-я]\\.[а-я]", | ||
|
||
# 12. др.-евр. чл.-корр. | ||
"[а-я]{1,3}\\.-[а-я]{1,4}\\.", | ||
|
||
# 13. ул.Тюменская | ||
"[А-Яа-я]\\.[А-Яа-я]", | ||
|
||
# 14. Ставропольский кр., | ||
"\\.," | ||
] | ||
|
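To make the effect of abbreviation_patterns easier to review, here is a minimal Python sketch that runs a subset of the patterns from the config above against a few sample sentences. It assumes that a sentence is rejected as soon as any pattern matches anywhere in it; the diff does not show how the extractor actually applies these patterns, so treat that rule, and the helper name find_abbreviation, as hypothetical.

```python
import re
from typing import Optional

# A subset of abbreviation_patterns from the config above, written as Python
# raw strings (the TOML "\\." becomes r"\.").
ABBREVIATION_PATTERNS = [
    r"[А-Я]\.",                    # 2. А. Пушкин
    r"[А-Я]{2,}",                  # 4. СССР
    r"\s[а-я]\.",                  # 5. г. Пушкина.
    r"[а-я]{1,3}\.-[а-я]{1,4}\.",  # 12. др.-евр. чл.-корр.
    r"\.,",                        # 14. Ставропольский кр.,
]


def find_abbreviation(sentence: str) -> Optional[str]:
    """Return the first pattern found anywhere in the sentence, else None.

    Assumption (not confirmed by the diff): one match is enough to reject.
    """
    for pattern in ABBREVIATION_PATTERNS:
        if re.search(pattern, sentence):
            return pattern
    return None


if __name__ == "__main__":
    samples = [
        "Это написал А. Пушкин",
        "Он родился в СССР",
        "Он жил в г. Пушкине.",
        "Ставропольский кр., Россия",
        "Мы долго шли домой.",
    ]
    for sample in samples:
        print(sample, "->", find_abbreviation(sample))
```

Running candidate sentences through a check like this is a quick way to spot patterns that are broader than their comment suggests before merging.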
Note that your previous version excluded all quotes and some other characters, as they led to uneven symbol occurrences in the sentences. This version allows quotes, so you might end up with sentences containing unbalanced symbols. I'd suggest adding these to the even_symbols config option to make sure quotes (and possibly other symbols) always occur an even number of times.

The main difference from the previous version is that I allowed the ‘ symbol. That is OK: in Russian there can be words like "Côte d'Ivoire". Could you please provide more info about even_symbols? I can't find a detailed description of it.

With this regex, you could potentially get the following sentence:
This is ”uneven and suddenly stops the quote without terminating symbol
Regarding even_symbols, can you tell me what exactly is not clear from the README? Then we can adjust the README to provide better info.

Did you have a chance to look at this? Is there anything I can help with?
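The concern about unbalanced quotes can be illustrated with a small Python sketch. It only shows the idea of an "even count" check as described in this thread; the function name has_even_symbols is hypothetical, and the extractor's real handling of even_symbols may differ from this.

```python
from typing import Iterable


def has_even_symbols(sentence: str, symbols: Iterable[str]) -> bool:
    """Return True if every listed symbol occurs an even number of times.

    Hypothetical helper illustrating the idea behind even_symbols;
    it is not the extractor's actual implementation.
    """
    return all(sentence.count(symbol) % 2 == 0 for symbol in symbols)


if __name__ == "__main__":
    # The example sentence from the review comment: one quote, never closed.
    uneven = "This is ”uneven and suddenly stops the quote without terminating symbol"
    balanced = "Он сказал ”привет” и ушёл"
    print(has_even_symbols(uneven, ["”"]))
    print(has_even_symbols(balanced, ["”"]))
```

A count-based check like this catches a single stray quote, but it cannot tell whether the quotes are correctly nested or placed.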