Initial addition of the Russian language #66

iLeonidze · 2020-01-05T12:26:10Z

How many sentences did you get at the end?

2,118,974 lines on output, 1,871,107 after dedup (cat wiki.ru.txt | sort | uniq | wc -l).

How did you create the blacklist file?

A few bad words, which were found in results
A lot of abbreviations

Review

For review please use sample file.

Reviewers:
@swenkal @spartedo @IvanParadox

MichaelKohler

Thanks, I've added some comments for minor improvements. Can't say anything about the actual sentences though, as I don't speak Russian ;)

src/rules/russian.toml

swenkal · 2020-01-05T14:52:39Z

I checked two segments of "sample file":

169087 - 169300 lines
548455 - 548704 lines

1 segment:
169170 line. "Ей, угодниче Божий!"
Weird sentence. It's look like old slavonic lang

169258 line. "В году в карьере актрисы наметился спад."
Missed year numbers. Therefore the meaning of the sentence is lost

169286 and 169288 lines are the same.
"Исполняли функции на общественных началах."

2 segment - normal. Did not find some abbreviations, syntax or gramma errors.

iLeonidze · 2020-01-06T13:07:26Z

@swenkal thank you.

Weird sentence. It's look like old slavonic lang

It's ok, The language is large, it has many different styles of speech, especially since we do not need any particular one.

Missed year numbers

I found the original Wikipedia article, and actually there were numbers in the original sentence. I don’t know where they went, it’s not my settings, it’s nothing to fix here either.

lines are the same

This issue already exists #14

iLeonidze · 2020-01-07T07:15:54Z

@MichaelKohler I updated the rules file according to your comments, please check.

nukeador · 2020-01-07T11:10:45Z

Thanks for leading this, super excited to see Russian coming!

For the QA of the outputs, my recommendation would be to create 4 files with 100 randoms sentences each, and get a native speaker (ideally linguist) to review them and comment back with the number of sentences (you can use this template spreadsheet for easy evaluation).

On your comment about the blacklist file, it seems you are talking about the rules file, did you tried the process of removing less used words? For example in Spanish we added to the blacklist words with 80 repetitions or less, and that allowed to clean up most complex or foreign words.

Thanks for your efforts!

iLeonidze · 2020-01-07T13:34:38Z

@nukeador first QA already checked ~400 sentences, please see his reply from above. As for others, we are waiting for them to check. Is @swenkal's reply format is ok for you? I suppose spreadsheet is a bit complex for this.

As for blacklist file - I do not see the reason for removing less used words (as I understand you correctly). First of all it is difficult to find out them in 2 million sentences. Second, in Wikipedia articles there is a good quality speech-style, may be there will be a small number of strange sentences, but among two million lines, this is a drop in the bucket.

stefangrotz · 2020-01-07T13:42:41Z

As for blacklist file - I do not see the reason for removing less used words (as I understand you correctly). First of all it is difficult to find out them in 2 million sentences. Second, in Wikipedia articles there is a good quality speech-style, may be there will be a small number of strange sentences, but among two million lines, this is a drop in the bucket.

Hey iLeonidze, I really recommend using a blacklist, it deletes a lot of foreign words and typos. It is not difficult to generate this blacklist, you can find a instruction in the readme.md of this repo. If you don't want to loose too many sentences you can decrease the number of repetitions, e.g. you can say you only delete words that are used less than 40 or 60 times on the russian wikipedia.

iLeonidze · 2020-01-07T13:47:17Z

@stefangrotz thank you, I did not see at first that there is a script that removes rarely used words.
Of course, I will try and make changes with blacklist as you need it, but can you explain or give me a link, why do I need to get rid of such words?

stefangrotz · 2020-01-07T13:51:03Z

@iLeonidze I can only give you examples from other languages. The list will mainly contain typos. For example "Wikipediaa" instead of "Wikipedia". This will maybe appear 60 times in the complete corpus and this blacklist with every word that appears less than 80 time helps to avoid words like this. You can also delete valid words from the blacklist manually if you want to keep them.

MichaelKohler

I've made two comments. After fixing that, I'd suggest to re-run the export and create sample files out of that. When running the export, I don't think you need to wait until it's fully done, once you have a good chunk of sentences, you can abort and create the samples from there.

MichaelKohler · 2020-01-07T17:16:40Z

src/rules/russian.toml

+
+allowed_symbols_regex = "[А-Яа-яёЁ\\s:,.\\-‑?;!—‐–'·=’―−”‘]"
+
+disallowed_symbols = [


Note that this won't be used as allowed_symbols_regex is set. However, given that those invisible chars should not be allowed by the above regex, I guess that's fine and it can be completely removed?

No. As I found out invisible chars are not detected by regex for some reason. Perhaps because they are part of \s. Therefore, they had to be defined here, and can't be removed.

yeah, that's probably what is happening here. Could you just use the specific character you want for whitespace instead of \s or is my Regex knowledge completely failing me here?

MichaelKohler · 2020-01-07T17:18:23Z

src/rules/russian.toml

+# с Иваном Галамяном.
+needs_uppercase_start = true
+
+allowed_symbols_regex = "[А-Яа-яёЁ\\s:,.\\-‑?;!—‐–'·=’―−”‘]"


Note that your previous version excluded all quotes and some other characters as it lead to uneven symbol occurances in the sentences. This now allows quotes, so you might end up with sentences with wrong symbols. I'd suggest to add these to the even_symbols config option to make sure quotes (and possibly other symbols) are always even.

The may difference with the previous version, is that I allowed ‘ symbol - that is ok, in Russian there can be words like "Côte d'Ivoire".

Please, can you provide more info about even_symbols, I can't find detailed description about it.

With this regex here, you could potentially get the following sentence: This is ”uneven and suddenly stops the quote without terminating symbol

Regarding even_symbols, can you tell me what exactly is not clear from the README? Then we can adjust the README to provide better info

Did you have a chance to look at this? Is there anything I can help with?

MichaelKohler · 2020-03-04T21:03:44Z

@iLeonidze did you see my last comment? Is there anything I could help you with?

Can I also ask you to do the following, to make sure you can profit from the automatic sample extraction we just introduced?

Update your branch with the latest code from the master branch
Rename src/rules/russian.toml to src/rules/ru.toml
Rename src/rules/disallowed_words/russian.toml to src/rules/disallowed_words/ru.toml

Also note that the local command for extraction will now be:

cargo run -- extract -l ru -d path/to/files

Happy to answer any question you may have and thanks for your efforts!

iLeonidze · 2020-03-09T15:24:43Z

@MichaelKohler Hi, I’m already a little busy for a couple of months, I really want to get back to your help as soon as possible. Thank you for the information, I will try

Nihisil · 2020-07-04T12:29:15Z

@iLeonidze thanks for your work. Is there any update on your side?

iLeonidze added 3 commits January 5, 2020 14:12

Create russian.toml

206adee

Add files via upload

a9b4fb1

Add files via upload

0005a31

MichaelKohler reviewed Jan 5, 2020

View reviewed changes

src/rules/russian.toml Outdated Show resolved Hide resolved

src/rules/russian.toml Outdated Show resolved Hide resolved

Add files via upload

d824882

MichaelKohler requested changes Jan 7, 2020

View reviewed changes

MichaelKohler added the waiting on feedback label Mar 1, 2020

MichaelKohler marked this pull request as draft September 1, 2020 16:10

MichaelKohler changed the base branch from master to main October 27, 2020 17:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Initial addition of the Russian language #66

Initial addition of the Russian language #66

iLeonidze commented Jan 5, 2020 •

edited

Loading

MichaelKohler left a comment

swenkal commented Jan 5, 2020

iLeonidze commented Jan 6, 2020

iLeonidze commented Jan 7, 2020

nukeador commented Jan 7, 2020

iLeonidze commented Jan 7, 2020

stefangrotz commented Jan 7, 2020 •

edited

Loading

iLeonidze commented Jan 7, 2020

stefangrotz commented Jan 7, 2020 •

edited

Loading

MichaelKohler left a comment

MichaelKohler Jan 7, 2020

iLeonidze Jan 8, 2020 •

edited

Loading

MichaelKohler Jan 8, 2020

MichaelKohler Jan 7, 2020

iLeonidze Jan 8, 2020

MichaelKohler Jan 8, 2020

MichaelKohler Jan 26, 2020

MichaelKohler commented Mar 4, 2020

iLeonidze commented Mar 9, 2020

Nihisil commented Jul 4, 2020


		allowed_symbols_regex = "[А-Яа-яёЁ\\s:,.\\-‑?;!—‐–'·=’―−”‘]"

		disallowed_symbols = [

Initial addition of the Russian language #66

Are you sure you want to change the base?

Initial addition of the Russian language #66

Conversation

iLeonidze commented Jan 5, 2020 • edited Loading

How many sentences did you get at the end?

How did you create the blacklist file?

Review

MichaelKohler left a comment

Choose a reason for hiding this comment

swenkal commented Jan 5, 2020

iLeonidze commented Jan 6, 2020

iLeonidze commented Jan 7, 2020

nukeador commented Jan 7, 2020

iLeonidze commented Jan 7, 2020

stefangrotz commented Jan 7, 2020 • edited Loading

iLeonidze commented Jan 7, 2020

stefangrotz commented Jan 7, 2020 • edited Loading

MichaelKohler left a comment

Choose a reason for hiding this comment

MichaelKohler Jan 7, 2020

Choose a reason for hiding this comment

iLeonidze Jan 8, 2020 • edited Loading

Choose a reason for hiding this comment

MichaelKohler Jan 8, 2020

Choose a reason for hiding this comment

MichaelKohler Jan 7, 2020

Choose a reason for hiding this comment

iLeonidze Jan 8, 2020

Choose a reason for hiding this comment

MichaelKohler Jan 8, 2020

Choose a reason for hiding this comment

MichaelKohler Jan 26, 2020

Choose a reason for hiding this comment

MichaelKohler commented Mar 4, 2020

iLeonidze commented Mar 9, 2020

Nihisil commented Jul 4, 2020

iLeonidze commented Jan 5, 2020 •

edited

Loading

stefangrotz commented Jan 7, 2020 •

edited

Loading

stefangrotz commented Jan 7, 2020 •

edited

Loading

iLeonidze Jan 8, 2020 •

edited

Loading