This repo enables you to pull a large and up-to-date text corpus from the web. It uses state-of-the-art processing methods to produce a clean text dataset that you can immediately use to pretrain a large language model, like BERT, GPT, or BLOOM. The main use-case for this repo is the Online Language Modelling Project, where we want to keep a language model up-to-date by pretraining it on the latest Common Crawl and Wikipedia dumps every month or so.
Specifically, this repo has modular Python commands that enable you to:
- Specify Common Crawl web snapshots and/or Wikipedia snapshots, and pull the data.
- Filter the data for a particular language, like English or French.
- Run the OSCAR filters used by BigScience for the BLOOM language model. These filters ensure some level of text quality and reduce pornographic content.
- Deduplicate the data.
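As a rough illustration of the kind of step these commands perform, here is a minimal sketch of exact deduplication using the Hugging Face `datasets` library. It is not the repo's actual deduplication code, and the input file `my_corpus.jsonl` and the `text` column are hypothetical placeholders:

```python
# Minimal sketch of exact deduplication: keep only the first occurrence of each
# document, identified by a hash of its text. NOT the repo's actual implementation;
# the file name and "text" column are hypothetical.
from hashlib import sha256

from datasets import load_dataset

ds = load_dataset("json", data_files="my_corpus.jsonl", split="train")

seen = set()

def keep_first_occurrence(example):
    digest = sha256(example["text"].encode("utf-8")).hexdigest()
    if digest in seen:
        return False  # exact duplicate of a document we already kept
    seen.add(digest)
    return True

# Single-process filter so the `seen` set is shared across all examples.
deduped = ds.filter(keep_first_occurrence)
print(f"Removed {len(ds) - len(deduped)} exact duplicates")
```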
This code is also fairly parallelized, although it can certainly be improved further. It can process over a terabyte of Common Crawl data in a day or two, and all of English Wikipedia in less than an hour, if you have:
- A machine with a lot of CPUs and memory.
- A fast internet connection.
Before working with this pipeline on Taurus, allocate a workspace, clone the repository into it, and copy the needed setup scripts to your workspace. The next steps need to be executed from your workspace directory!
```bash
ws_allocate -F beegfs -r 7 -m [email protected] Dataset-pipe 30
cd path/to/Dataset-pipe
git clone --recursive git@github.com:OpenGPTX/olm-datasets.git
cp olm-datasets/setup.sh .
cp olm-datasets/activate.sh .
```
To execute the setup (and all processing steps), allocate resources first:

```bash
srun --pty --ntasks=1 --cpus-per-task=4 --time=1:00:00 --mem-per-cpu=1700 bash -l
```

Afterwards, do the setup:

```bash
bash setup.sh
source activate.sh
```
Before working with it, allocate resources first and execute `source activate.sh` every time you have a new session.
- If you want to use this repo to generate a decent amount of data, get a machine with lots of CPUs and memory. We use an `n2d-standard-224` running `Ubuntu 20.04 LTS` on GCP. Add terabytes of disk space too. You may need an even larger machine if you want to process close to 100% of a Common Crawl snapshot or several snapshots, particularly due to how much memory the deduplication process uses.
- Clone with submodules: `git clone --recursive git@github.com:huggingface/olm-datasets.git`
- Install cargo (the Rust package manager) with `curl https://sh.rustup.rs -sSf | sh`. Then install Ungoliant with `cargo install ungoliant`. You may need to install gcc and cmake first.
- Set up a Python 3.9 environment, and run `pip install -r requirements.txt`
- Run `huggingface-cli login`. This CLI should have been installed from `requirements.txt`. To log in, you need to paste a token from your account at https://huggingface.co. This step is necessary for the pipeline to push the generated datasets to your Hugging Face account.
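To see why the token matters, here is a sketch of what it is ultimately used for: pushing a processed dataset to your Hugging Face account. The pipeline scripts handle the real push for you; the tiny dataset and the repo name `your-username/olm-demo` below are hypothetical:

```python
# Sketch of pushing a dataset to the Hugging Face Hub, which is what the login
# token is needed for. The dataset and repo name are hypothetical; the pipeline
# scripts perform the real push for you.
from datasets import Dataset

ds = Dataset.from_dict({"text": ["first example document", "second example document"]})
ds.push_to_hub("your-username/olm-demo")  # authenticates with the token from `huggingface-cli login`
```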
Follow the instructions at `pipeline_scripts/common_crawl`.
Here is the output dataset to expect from a 20% random segment sample of the August 2022 Common Crawl snapshot: https://huggingface.co/datasets/Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20
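If you only want to use the published data rather than regenerate it, you can load it directly with the `datasets` library. Streaming is used below because the Common Crawl output is large; the `train` split is assumed:

```python
# Stream a few examples from the published Common Crawl dataset without
# downloading the whole thing. The "train" split is assumed here.
from datasets import load_dataset

ds = load_dataset(
    "Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20",
    split="train",
    streaming=True,
)
for example in ds.take(3):
    print(example)
```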
Follow the instructions at `pipeline_scripts/wikipedia`.
Here is the output dataset to expect from a September 2022 snapshot of Wikipedia: https://huggingface.co/datasets/Tristan/olm-wikipedia-20220920
Follow the instructions at `analysis_scripts`.
Here is a tweet thread that uses these scripts: https://twitter.com/TristanThrush/status/1582356055794733057
Here is another tweet thread that dives a little deeper: https://twitter.com/TristanThrush/status/1588156731909029889
And here is a Colab notebook where you can quickly run some of the analysis yourself! https://colab.research.google.com/drive/18Wv7ghW2rRjEe3oWDqh2iz9qqO8O6XcX?usp=sharing