From 50ddd680390573344f11d43b0c8d4d1bfa1b6d10 Mon Sep 17 00:00:00 2001
From: Raivis Dejus
Date: Wed, 25 Oct 2023 15:28:42 +0300
Subject: [PATCH] Will update instructions to use latest wikiextractor

---
 README.md | 22 +++++++++-------------
 1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/README.md b/README.md
index c2a1367..426aeca 100644
--- a/README.md
+++ b/README.md
@@ -41,10 +41,10 @@ git clone https://github.com/Common-Voice/cv-sentence-extractor.git
 
 ### Wikipedia Extraction
 
-You need to download the WikiExtractor:
+Install the WikiExtractor:
 
 ```
-git clone https://github.com/attardi/wikiextractor.git
+pip install wikiextractor
 ```
 
 ## Extraction
@@ -66,9 +66,7 @@ bzip2 -d enwiki-latest-pages-articles-multistream.xml.bz2
 2. Use WikiExtractor to extract a dump (this might take a few hours). In the parameters, we specify to use JSON as the output format instead of the default XML.
 
 ```bash
-cd wikiextractor
-git checkout e4abb4cbd019b0257824ee47c23dd163919b731b
-python WikiExtractor.py --json ../enwiki-latest-pages-articles-multistream.xml
+python -m wikiextractor.WikiExtractor --json enwiki-latest-pages-articles-multistream.xml
 ```
 
 In order to test your setup or create a small test set, you can interrupt the extractor after a few seconds already, as it creates separate files in each step. Those files can be already ingested by the `cv-sentence-extractor`.
@@ -79,9 +79,9 @@ In the beginning, the WikiExtractor prints out how many processes it will use fo
 3. Scrape the sentences into a new file from the WikiExtractor output dir (this might take more than 6h to finish)
 
 ```bash
-cd ../cv-sentence-extractor
+cd cv-sentence-extractor
 pip3 install -r requirements.txt # can be skipped if your language doesn't use the Python segmenter
-cargo run --release -- -l en -d ../wikiextractor/text/ extract >> wiki.en.txt
+cargo run --release -- -l en -d ../text/ extract >> wiki.en.txt
 ```
 
 *Tip: You don't need this last process to finish to start observing the output, wiki.en.txt should get a few thousands sentences in just a few minutes, and you can use that as a way to estimate the quality of the output early on and stop the process if you are not happy.*
@@ -118,16 +116,14 @@ This process is very similar to the Wikipedia process above. We can only extract
 Example (you can change "en" to your locale code)
 
 ```bash
-wget https://dumps.wikimedia.org/enwikisource/latest//enwikisource-latest-pages-articles.xml.bz2
+wget https://dumps.wikimedia.org/enwikisource/latest/enwikisource-latest-pages-articles.xml.bz2
 bzip2 -d enwikisource-latest-pages-articles.xml.bz2
 ```
 
 2. Use WikiExtractor to extract a dump (this might take a few hours)
 
 ```bash
-cd wikiextractor
-git checkout e4abb4cbd019b0257824ee47c23dd163919b731b
-python WikiExtractor.py --json ../enwikisource-latest-pages-articles.xml
+python -m wikiextractor.WikiExtractor --json enwikisource-latest-pages-articles.xml
 ```
 
 *Important note: Please check the section about [creating a rules file](#using-language-rules) and [a blocklist](#create-a-blocklist-based-on-less-common-words) at this point, you might want to consider creating them and that process should happen before step 3.*
@@ -135,9 +131,9 @@
 3. Scrape the sentences into a new file from the WikiExtractor output dir (this might take more than 6h to finish)
 
 ```bash
-cd ../cv-sentence-extractor
+cd cv-sentence-extractor
 pip3 install -r requirements.txt # can be skipped if your language doesn't use the Python segmenter
-cargo run --release -- -l en -d ../wikiextractor/text/ extract-wikisource >> wiki.en.txt
+cargo run --release -- -l en -d ../text/ extract-wikisource >> wiki.en.txt
 ```
 
 *Tip: You don't need this last process to finish to start observing the output, wiki.en.txt should get a few thousands sentences in just a few minutes, and you can use that as a way to estimate the quality of the output early on and stop the process if you are not happy.*
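
With this patch applied, the Wikipedia flow no longer needs a wikiextractor checkout pinned to a specific commit; everything runs from the directory holding the dump. A minimal end-to-end sketch assembled from the patched README's own commands, assuming an English dump, WikiExtractor's default `text/` output directory, and a sibling `cv-sentence-extractor` checkout (swap `en` and the dump name for your language):

```bash
# Install the extractor from PyPI instead of cloning a pinned commit
pip install wikiextractor

# Fetch and unpack the latest English Wikipedia dump
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2
bzip2 -d enwiki-latest-pages-articles-multistream.xml.bz2

# Extract articles as JSON; output lands in ./text/ by default
python -m wikiextractor.WikiExtractor --json enwiki-latest-pages-articles-multistream.xml

# Scrape sentences with cv-sentence-extractor
# (use extract-wikisource instead of extract for Wikisource dumps)
cd cv-sentence-extractor
pip3 install -r requirements.txt  # only if your language uses the Python segmenter
cargo run --release -- -l en -d ../text/ extract >> wiki.en.txt
```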
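The README's tip about interrupting the extractor early is easiest to act on if you can eyeball the partial output. A quick sanity check, assuming WikiExtractor's usual sharded layout (`text/AA/wiki_00`, `text/AA/wiki_01`, ...) in which each line is one JSON document:

```bash
# Pretty-print the first extracted document to confirm --json took effect
head -n 1 text/AA/wiki_00 | python3 -m json.tool

# Rough progress indicator: one line per extracted article
cat text/*/wiki_* | wc -l
```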