arXiv source data, and associated code for preprocessing, labeling, and partitioning

georgetown-cset/arxiv-corpus

arXiv data and code

This repo contains arXiv source data and the associated code for preprocessing, labeling, and partitioning it. The source data are stored under data/source as gzipped JSONL files.
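For readers who want to inspect the source files directly, gzipped JSONL can be read with the Python standard library alone. The following is a minimal sketch, not code from this repo: the helper name `iter_jsonl_gz` is hypothetical, and it assumes each non-empty line holds one UTF-8 encoded JSON object. The record schema is not documented here, so the sketch does not assume any field names.

```python
import gzip
import json
from typing import Iterator


def iter_jsonl_gz(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzipped JSONL file.

    Hypothetical helper for illustration; not part of this repo.
    """
    # "rt" opens the gzip stream in text mode so lines arrive as str.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)
```

Because the function is a generator, records are decompressed and parsed one at a time, so a large split file does not need to fit in memory.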

After setting up a Python environment, run:

python runner.py 'data/source/arxiv-data-20200125-split*.jsonl.gz'

This writes a preprocessed corpus to data/processed, and various partitions and samples for training to data/train.
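The glob in the command above is quoted, so the shell passes the pattern to the script as a single unexpanded argument. How runner.py handles that argument is not shown here. One plausible approach, sketched below with a hypothetical helper name `expand_pattern`, is to expand the pattern inside Python with the standard `glob` module:

```python
import glob
from typing import List


def expand_pattern(pattern: str) -> List[str]:
    """Return the paths matching a glob pattern, in sorted order.

    Hypothetical helper for illustration; runner.py may work differently.
    """
    # glob.glob returns matches in arbitrary order, so sort them
    # to process the split files deterministically.
    return sorted(glob.glob(pattern))
```

Expanding the pattern in Python rather than in the shell keeps the behavior the same across shells and avoids argument-length limits when there are many split files.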
