Online Language Modelling Dataset Pipeline

This repo enables you to pull a large and up-to-date text corpus from the web. It uses state-of-the-art processing methods to produce a clean text dataset that you can immediately use to pretrain a large language model, like BERT, GPT, or BLOOM. The main use-case for this repo is the Online Language Modelling Project, where we want to keep a language model up-to-date by pretraining it on the latest Common Crawl and Wikipedia dumps every month or so.

Specifically, this repo has modular Python commands that enable you to:

  • Specify Common Crawl web snapshots, or just Wikipedia snapshots, and then pull the data.
  • Filter the data for a particular language, like English or French.
  • Run the OSCAR filters used by BigScience for the BLOOM language model. These filters ensure some level of text quality and reduce pornographic content.
  • Deduplicate the data (a conceptual sketch of this step follows this list).
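
As a conceptual illustration of the deduplication step above (a minimal sketch, not the repo's actual implementation), exact duplicates can be dropped by hashing each document and keeping only the first occurrence; the repo's own commands handle this and the other steps for you:

from hashlib import sha256
from datasets import Dataset

# Toy corpus standing in for pulled-and-filtered web text.
dataset = Dataset.from_dict({"text": ["a web page", "another page", "a web page"]})

seen_hashes = set()

def is_first_occurrence(example):
    # Keep a document only the first time its exact text is seen.
    digest = sha256(example["text"].encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True

deduplicated = dataset.filter(is_first_occurrence)
print(deduplicated["text"])  # ['a web page', 'another page']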

This code is also fairly parallelized (see the sketch after this list), although it can certainly be improved further. It can process over a terabyte of Common Crawl data in a day or two, and all of English Wikipedia in less than an hour, provided you have:

  • A machine with a lot of CPUs and memory.
  • A fast internet connection.
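
As a rough sketch of where that parallelism comes from (not the repo's exact code), most steps are per-document transforms that the datasets library can spread across every CPU core via num_proc:

import os
from datasets import Dataset

# Toy stand-in for a pulled web corpus.
dataset = Dataset.from_dict({"text": ["  some raw web text  "] * 1000})

def clean(example):
    # Placeholder for a real per-document processing step.
    example["text"] = example["text"].strip()
    return example

# Fan the work out over all available CPU cores.
cleaned = dataset.map(clean, num_proc=os.cpu_count())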

Setup on HPC System Taurus

Before working with this pipeline on Taurus, allocate a workspace, clone the repository into it, and copy the required setup scripts into your workspace. The following steps must be executed from your workspace directory!

ws_allocate -F beegfs -r 7 -m <your-email> Dataset-pipe 30
cd path/to/Dataset-pipe
git clone --recursive [email protected]:OpenGPTX/olm-datasets.git
cp olm-datasets/setup.sh .
cp olm-datasets/activate.sh .

To execute the setup (and all processing steps), allocate resources first:

srun --pty --ntasks=1 --cpus-per-task=4 --time=1:00:00 --mem-per-cpu=1700 bash -l

Afterwards, do the setup:

bash setup.sh
source activate.sh

Working with the Pipeline

Before working with the pipeline, allocate resources first and run source activate.sh every time you start a new session.

Setup

  1. If you want to use this repo to generate a decent amount of data, get a machine with lots of CPUs and memory. We use an n2d-standard-224 running Ubuntu 20.04 LTS on GCP. Add terabytes of disk space too. You may need an even larger machine if you want to process close to 100% of a Common Crawl snapshot or several snapshots, particularly because of how much memory the deduplication process uses.
  2. Clone with submodules: git clone --recursive [email protected]:huggingface/olm-datasets.git
  3. Install cargo (the Rust package manager) with curl https://sh.rustup.rs -sSf | sh. Then install Ungoliant with cargo install [email protected]. You may need to install gcc and cmake first.
  4. Set up a Python 3.9 environment and run pip install -r requirements.txt
  5. Run huggingface-cli login. This CLI should have been installed from requirements.txt. To log in, you need to paste a token from your account at https://huggingface.co. This step is necessary for the pipeline to push the generated datasets to your Hugging Face account (see the quick check after this list).
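
After logging in, you can optionally confirm that the token was stored correctly. This quick check uses the huggingface_hub client installed via requirements.txt and is not part of the repo's own scripts:

from huggingface_hub import HfApi

# Prints your account details if huggingface-cli login stored a valid token.
print(HfApi().whoami())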

Getting a clean and up-to-date Common Crawl corpus

Follow the instructions at pipeline_scripts/common_crawl.

Here is the output dataset to expect from a 20% random segment sample of the August 2022 Common Crawl Snapshot: https://huggingface.co/datasets/Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20
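
Once a dataset like this is on the Hugging Face Hub, it can be loaded directly with the datasets library. A minimal sketch, assuming the usual single train split and a text column produced by the pipeline:

from datasets import load_dataset

# Load the 20% sample of the August 2022 Common Crawl snapshot linked above.
olm_cc = load_dataset("Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20", split="train")
print(olm_cc[0]["text"][:200])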

Getting a clean and up-to-date Wikipedia corpus

Follow the instructions at pipeline_scripts/wikipedia.

Here is the output dataset to expect from a September 2022 snapshot of Wikipedia: https://huggingface.co/datasets/Tristan/olm-wikipedia-20220920
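
The Wikipedia corpus can be loaded the same way; if you would rather not download it all, streaming mode iterates over it lazily. Again a sketch, assuming a train split and a text column:

from datasets import load_dataset

# Stream the September 2022 OLM Wikipedia dataset without a full download.
olm_wiki = load_dataset("Tristan/olm-wikipedia-20220920", split="train", streaming=True)
for example in olm_wiki.take(3):
    print(example["text"][:100])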

Analyzing the corpora

Follow the instructions at analysis_scripts.

Here is a tweet thread which utilizes these scripts: https://twitter.com/TristanThrush/status/1582356055794733057

Here is another tweet thread that dives a little deeper: https://twitter.com/TristanThrush/status/1588156731909029889

And here is a Colab notebook where you can quickly run some of the analysis yourself: https://colab.research.google.com/drive/18Wv7ghW2rRjEe3oWDqh2iz9qqO8O6XcX?usp=sharing
