Imbalanced quotation mark in Mozilla Common Voice Japanese Dataset #2321
Summary
The Mozilla Common Voice 11.0 Japanese dataset contains an unbalanced quotation mark that makes bin/import_cv2.py crash.
Reproduction
Why does it happen?
cv-corpus-11.0-2022-09-21/ja/validated.tsv has 4 lines that can potentially break the csv package's quote handling.
Note that in the second occurrence, the quotation mark is not balanced. I assume this has something to do with the Japanese input system: Japanese often uses 「」 instead of "", so quotes need manual conversion, and for some reason this one did not get converted properly.
At the same time, Python's csv module uses the double quotation mark as its default quote character, so the parser treats everything after the unbalanced mark as one quoted field and keeps reading until the next quotation mark appears. The next occurrence is on line 31236 (3712 lines later), hence the error message:
_csv.Error: field larger than field limit (131072)
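The failure mode can be reproduced in miniature. The sketch below is only an illustration: the sample rows are made up (they are not the actual lines from validated.tsv), and the field limit is lowered to 16 characters purely so that a tiny sample trips the same check that the real file trips at the default limit of 131072.

```python
import csv
import io

# Hypothetical stand-in for validated.tsv: row "b" opens a quote that never closes.
SAMPLE_TSV = (
    "client_id\tsentence\n"
    "a\tfirst sentence\n"
    'b\t"unbalanced sentence\n'
    "c\tthird sentence\n"
    "d\tfourth sentence\n"
)

# Default settings: '"' is the quote character, so the field that starts
# with '"' keeps absorbing tabs and newlines while it waits for a closing quote.
rows = list(csv.reader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
print(len(rows))       # fewer rows than the 5 lines in the sample
print(rows[-1])        # rows "c" and "d" ended up inside row "b"

# Temporarily lower the field limit so the runaway field exceeds it
# even on this tiny sample, then restore the previous value.
previous_limit = csv.field_size_limit(16)
try:
    list(csv.reader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
    error_message = None
except csv.Error as exc:
    error_message = str(exc)
finally:
    csv.field_size_limit(previous_limit)

print(error_message)
```

On the real dataset the swallowed span is thousands of lines long, which is why the default limit is exceeded without any change to `csv.field_size_limit`.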
Fix
Do not use the default quote character. In fact, do not handle quoting at all when parsing the file.
That is what the Common Voice ToolBox package does too.
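With the standard csv module, one way to ignore quoting entirely is `quoting=csv.QUOTE_NONE`. The sketch below assumes that option and uses made-up sample rows and a hypothetical helper name (`read_tsv_without_quoting`); it is not necessarily the exact change made in this pull request.

```python
import csv
import io

# Hypothetical stand-in for validated.tsv: row "b" opens a quote that never closes.
SAMPLE_TSV = (
    "client_id\tsentence\n"
    "a\tfirst sentence\n"
    'b\t"unbalanced sentence\n'
    "c\tthird sentence\n"
    "d\tfourth sentence\n"
)


def read_tsv_without_quoting(handle):
    """Parse tab-separated rows, treating '"' as an ordinary character."""
    # QUOTE_NONE tells the reader to do no special processing of quote characters.
    return list(csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


rows = read_tsv_without_quoting(io.StringIO(SAMPLE_TSV))
for row in rows:
    print(row)
```

Every input line now yields exactly one row, and the stray quotation mark stays in the sentence text as a literal character. When reading the real file, the same keyword arguments can be passed to `csv.DictReader`, with the file opened using `newline=""` and an explicit `encoding="utf-8"`.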