-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial addition of the Russian language #66
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I've added some comments for minor improvements. Can't say anything about the actual sentences though, as I don't speak Russian ;)
I checked two segments of "sample file":
1 segment: 169258 line. "В году в карьере актрисы наметился спад." 169286 and 169288 lines are the same. 2 segment - normal. Did not find some abbreviations, syntax or gramma errors. |
@swenkal thank you.
It's ok, The language is large, it has many different styles of speech, especially since we do not need any particular one.
I found the original Wikipedia article, and actually there were numbers in the original sentence. I don’t know where they went, it’s not my settings, it’s nothing to fix here either.
This issue already exists #14 |
@MichaelKohler I updated the rules file according to your comments, please check. |
Thanks for leading this, super excited to see Russian coming! For the QA of the outputs, my recommendation would be to create 4 files with 100 randoms sentences each, and get a native speaker (ideally linguist) to review them and comment back with the number of sentences (you can use this template spreadsheet for easy evaluation). On your comment about the blacklist file, it seems you are talking about the rules file, did you tried the process of removing less used words? For example in Spanish we added to the blacklist words with 80 repetitions or less, and that allowed to clean up most complex or foreign words. Thanks for your efforts! |
@nukeador first QA already checked ~400 sentences, please see his reply from above. As for others, we are waiting for them to check. Is @swenkal's reply format is ok for you? I suppose spreadsheet is a bit complex for this. As for blacklist file - I do not see the reason for removing less used words (as I understand you correctly). First of all it is difficult to find out them in 2 million sentences. Second, in Wikipedia articles there is a good quality speech-style, may be there will be a small number of strange sentences, but among two million lines, this is a drop in the bucket. |
Hey iLeonidze, I really recommend using a blacklist, it deletes a lot of foreign words and typos. It is not difficult to generate this blacklist, you can find a instruction in the readme.md of this repo. If you don't want to loose too many sentences you can decrease the number of repetitions, e.g. you can say you only delete words that are used less than 40 or 60 times on the russian wikipedia. |
@stefangrotz thank you, I did not see at first that there is a script that removes rarely used words. |
@iLeonidze I can only give you examples from other languages. The list will mainly contain typos. For example "Wikipediaa" instead of "Wikipedia". This will maybe appear 60 times in the complete corpus and this blacklist with every word that appears less than 80 time helps to avoid words like this. You can also delete valid words from the blacklist manually if you want to keep them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made two comments. After fixing that, I'd suggest to re-run the export and create sample files out of that. When running the export, I don't think you need to wait until it's fully done, once you have a good chunk of sentences, you can abort and create the samples from there.
|
||
allowed_symbols_regex = "[А-Яа-яёЁ\\s:,.\\-‑?;!—‐–'·=’―−”‘]" | ||
|
||
disallowed_symbols = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this won't be used as allowed_symbols_regex
is set. However, given that those invisible chars should not be allowed by the above regex, I guess that's fine and it can be completely removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. As I found out invisible chars are not detected by regex for some reason. Perhaps because they are part of \s
. Therefore, they had to be defined here, and can't be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, that's probably what is happening here. Could you just use the specific character you want for whitespace instead of \s
or is my Regex knowledge completely failing me here?
# с Иваном Галамяном. | ||
needs_uppercase_start = true | ||
|
||
allowed_symbols_regex = "[А-Яа-яёЁ\\s:,.\\-‑?;!—‐–'·=’―−”‘]" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that your previous version excluded all quotes and some other characters as it lead to uneven symbol occurances in the sentences. This now allows quotes, so you might end up with sentences with wrong symbols. I'd suggest to add these to the even_symbols
config option to make sure quotes (and possibly other symbols) are always even.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The may difference with the previous version, is that I allowed ‘
symbol - that is ok, in Russian there can be words like "Côte d'Ivoire".
Please, can you provide more info about even_symbols
, I can't find detailed description about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this regex here, you could potentially get the following sentence: This is ”uneven and suddenly stops the quote without terminating symbol
Regarding even_symbols
, can you tell me what exactly is not clear from the README? Then we can adjust the README to provide better info
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you have a chance to look at this? Is there anything I can help with?
@iLeonidze did you see my last comment? Is there anything I could help you with? Can I also ask you to do the following, to make sure you can profit from the automatic sample extraction we just introduced?
Also note that the local command for extraction will now be:
Happy to answer any question you may have and thanks for your efforts! |
@MichaelKohler Hi, I’m already a little busy for a couple of months, I really want to get back to your help as soon as possible. Thank you for the information, I will try |
@iLeonidze thanks for your work. Is there any update on your side? |
How many sentences did you get at the end?
2,118,974
lines on output,1,871,107
after dedup (cat wiki.ru.txt | sort | uniq | wc -l
).How did you create the blacklist file?
Review
For review please use sample file.
Reviewers:
@swenkal @spartedo @IvanParadox