Data Pipeline

Downloading from arXiv

Processing with LaTeXML

The following command extracts an arXiv source tar file and finds the math articles. It requires an internet connection:

python3 process.py \
   /media/hd1/arXiv_src/src/arXiv_src_2101_023.tar \
   $HOME/rm_me_process \
   --term math
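
Before running the command above, it can help to check what a source tar actually contains. The following is a minimal sketch using Python's standard tarfile module. It is not part of process.py, and the function name `summarize_tar` is hypothetical.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_tar(tar_path):
    """Count the regular files in a tar archive, grouped by file extension."""
    counts = Counter()
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if member.isfile():
                counts[PurePosixPath(member.name).suffix] += 1
    return counts


# Example call (path taken from the command above):
# print(summarize_tar("/media/hd1/arXiv_src/src/arXiv_src_2101_023.tar"))
```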

More Processing

Getting Labeled Definitions

Classifying Definitions

Classifying with multiprocessing (also works on a single GPU)

singularity run --nv \
      --bind $HOME/Documents/arxivDownload:/opt/arxivDownload,/media/hd1:/opt/data_dir \
    $HOME/singul/runner.sif python3 embed/mp_classify.py \
     --model /opt/data_dir/trained_models/lstm_classifier/lstm_Aug-19_17-22 \
     --out /rm_me_path/with_mp_classify \
     --mine /opt/data_dir/promath/math94/940{3,4,5}_001.tar.gz

NER

Example with Singularity:

singularity run --nv \
    --bind $HOME/Documents/arxivDownload:/opt/arxivDownload,/media/hd1:/opt/data_dir \
    $HOME/singul/runner.sif python3 embed/inference_ner.py \
    --mine /opt/data_dir/glossary/inference_class_all/math96/*.xml.gz \
    --model /opt/data_dir/trained_models/ner_model/lstm_ner/ner_Sep-29_03-45/exp_001 \
    --out $HOME/rm_me_ner

Joining Phrases

MP_scripts/mpi_only_loop.py
slurm_scripts/mpi_joiner.sh

Jupyter Notebooks

Populating and examples SQLAlchemy databases
- Filling the arxiv metadata database using databases/create_db_define_models.py
- Query join examples in sqlalchemy query language
Parsing Arxiv Manifest and querying metadata.ipynb
- Using the magic module to find file info
- Structure of the data in the manifest file
- Using the dload.py script and its objects
- Basic usage of the arxiv API package
- Very disorganized, mostly scratch work
Time stats check output and logs.ipynb
- Code to read and interpret latexml log files
- Plot the time of latexml processing
getting problem articles for latexml.ipynb
- Identify articles that are not included in the arxmliv database
- Try to process these problematic articles by either removing environments or using LaTeXTual
Word embeddings generation and evaluation.py
- Read the binary files produced by word2vec
- Get the raw text ready for embedders
- Search arxiv.db for the tags of an article
- tSNE visualization of the tags of terms

Scripts

update_db.py
- USAGE: python update_db.py DATABASE MANIFEST.xml tar_src_path [--log ]
- Where database is a sqlite database and manifest is an xml file in the original format
- tar_src_path is the dir where the tar files can be found
- Ex. python3 update_db.py /mnt/databases/arxivDB.db ../arXiv_src_manifest_Oct_2019.xml /mnt/arXiv_src/
process.py
- The Xtraction class reads and extracts arXiv tar files.
- Querying the arxiv metadata with the arxiv API and the arxiv.py package
- Xtraction(tarfilename, db='sqlite:///pathdb') to read metadata from a database instead of api
- Writing arxiv metadata to a database.
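
Since the API handles only around 500 article IDs (see Notes), a long list of IDs probably needs to be split before querying. The sketch below shows one way to do that. The helper `batched_ids` and its default batch size are illustrative assumptions, not code from process.py.

```python
def batched_ids(ids, batch_size=500):
    """Yield consecutive slices of at most batch_size IDs."""
    ids = list(ids)
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


# 1200 IDs are split into batches of 500, 500 and 200.
sizes = [len(batch) for batch in batched_ids(range(1200))]
print(sizes)
```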

Queries

Index the article ID column to speed up queries

CREATE INDEX id_ind on articles(id);

To search for an article, run the following query:

select tags from articles where id between "http://arxiv.org/abs/{0}" and "http://arxiv.org/abs/{0}{{";
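
The query above reads like a Python format string: `{0}` is the article number and `{{` renders as a literal `{`. Because `{` sorts after the ASCII digits and letters, the BETWEEN range covers every ID that starts with the given prefix, and it can use the index on id. Below is a minimal sketch with the standard sqlite3 module and bound parameters. The in-memory table, the sample rows, and the function name `tags_for_prefix` are assumptions made for illustration.

```python
import sqlite3

BASE = "http://arxiv.org/abs/"


def tags_for_prefix(conn, id_prefix):
    """Return (id, tags) rows whose id starts with BASE + id_prefix."""
    lower = BASE + id_prefix
    upper = lower + "{"  # "{" sorts after ASCII digits and letters
    cursor = conn.execute(
        "SELECT id, tags FROM articles WHERE id BETWEEN ? AND ? ORDER BY id",
        (lower, upper),
    )
    return cursor.fetchall()


# Illustrative data only; the real database has more columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id TEXT, tags TEXT)")
conn.execute("CREATE INDEX id_ind ON articles(id)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [
        (BASE + "1503.08375v1", "[{'term': 'math.DG'}]"),
        (BASE + "1503.08375v2", "[{'term': 'math.DG'}]"),
        (BASE + "1503.08376v1", "[{'term': 'cs.LG'}]"),
    ],
)
for row in tags_for_prefix(conn, "1503.08375"):
    print(row)
```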

Count the math articles in a year of tar files (here 2006)

SELECT  count(articles.id) FROM manifest LEFT JOIN articles on manifest.id = articles.tarfile_id WHERE manifest.filename LIKE 'src/arXiv_src_06%' and articles.tags like '[{''term'': ''math%';

Find the authors (in general) with the most publications

SELECT author, count(*) AS c FROM articles GROUP BY author ORDER BY c DESC LIMIT 10;

Hack to find main article tag

 SELECT count(tags) FROM articles where tags LIKE '[{''term'': ''math.DG''%';
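
The LIKE patterns above suggest that the tags column holds the text of a Python list of dicts, with the main category first. Assuming that layout, the main tag can also be read in Python instead of matched as a string. This is only a sketch: the function name `primary_term` is hypothetical and the example value is made up to fit the pattern.

```python
import ast


def primary_term(tags_text):
    """Return the 'term' of the first tag, or None if the list is empty."""
    tags = ast.literal_eval(tags_text)  # safe parsing of Python literals
    return tags[0]["term"] if tags else None


example = "[{'term': 'math.DG'}, {'term': 'math.AP'}]"
print(primary_term(example))
```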

Find repeated entries, where DataId is the repeated term

SELECT DataId, COUNT(*) c FROM DataTab GROUP BY DataId HAVING c > 1;

Left join to quickly find all articles in a tar file

SELECT  articles.id, tags FROM manifest LEFT JOIN articles on manifest.id = articles.tarfile_id WHERE manifest.id = 1747;
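
The same join can be run from Python. The sketch below builds a tiny in-memory database whose column names are taken from the queries in this section; the rows, the tar file names, and the function name `articles_in_tar` are invented for illustration.

```python
import sqlite3


def articles_in_tar(conn, manifest_id):
    """Return (article id, tags) for every article in one tar file."""
    cursor = conn.execute(
        "SELECT articles.id, tags FROM manifest "
        "LEFT JOIN articles ON manifest.id = articles.tarfile_id "
        "WHERE manifest.id = ?",
        (manifest_id,),
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE manifest (id INTEGER PRIMARY KEY, filename TEXT)")
conn.execute("CREATE TABLE articles (id TEXT, tags TEXT, tarfile_id INTEGER)")
conn.executemany(
    "INSERT INTO manifest VALUES (?, ?)",
    [(1, "src/example_a.tar"), (2, "src/example_b.tar")],  # made-up names
)
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [
        ("http://arxiv.org/abs/1503.08375v1", "[{'term': 'math.DG'}]", 1),
        ("http://arxiv.org/abs/1503.08376v1", "[{'term': 'cs.LG'}]", 1),
    ],
)
print(articles_in_tar(conn, 1))
print(articles_in_tar(conn, 2))
```

Because this is a LEFT JOIN, a tar file with no matching articles still returns one row whose columns are all NULL, which is what the second call shows.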

To check the files with unknown encoding:

   find . -name 'latexml_commentary.txt' -exec grep Ignoring {} \;

To convert the first .tex file into an .xml file of the same name and append the last part of the error stream to latexml_commentary.txt:

TEXF=$(ls *.tex | head -n 1); latexml "$TEXF" 2>&1 > "${TEXF%.*}.xml" | tail -15 >> latexml_commentary.txt

To find directories unprocessed by latexml (don't have a latexml_errors_mess.txt file)

find ./* -maxdepth 0 -type d '!' -exec test -e "{}/latexml_errors_mess.txt" ';' -print
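
For use inside a script, the same check can be written with pathlib. This is a rough equivalent of the find command above, not code from the repository; the function name `unprocessed_dirs` is hypothetical. Hidden directories are skipped because the `./*` glob does not match them either.

```python
from pathlib import Path


def unprocessed_dirs(root, marker="latexml_errors_mess.txt"):
    """List the immediate subdirectories of root that lack the marker file."""
    return sorted(
        p
        for p in Path(root).iterdir()
        if p.is_dir()
        and not p.name.startswith(".")
        and not (p / marker).exists()
    )


# Example call for the current directory:
# for d in unprocessed_dirs("."):
#     print(d)
```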

To filter out manually cancelled latexml processes, search the latex_errors file for:

Fatal:perl:die Perl died

When LaTeXML runs out of memory, for example in 1504.06138, the log shows:

(Processing definitions /usOut of memory!
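
The two messages above can be used to sort log files automatically. A minimal sketch follows; the marker strings are copied from this section, while the labels and the function name `classify_log` are assumptions made for illustration.

```python
# Marker strings as they appear in the logs quoted above.
MARKERS = {
    "cancelled": "Fatal:perl:die Perl died",
    "out_of_memory": "Out of memory!",
}


def classify_log(text):
    """Return the labels of all known failure markers found in a log text."""
    return [label for label, marker in MARKERS.items() if marker in text]


print(classify_log("(Processing definitions /usOut of memory!"))
```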

Notes

There is a limit of around 500 article IDs that the API can handle.
Starting in January 2015 (1501), the article name format changed from YYMM.{4 digits} to YYMM.{5 digits}.
In April 2007 (0704), the naming format of the articles changed from the archive-prefixed style, as in 0701/math0701672, to the numeric style, as in 1503/1503.08375.
The distribution of the sizes of the tar files in the manifest:

Counter({Interval(-1857373.906, 382162956.2, closed='right'): 273,
         Interval(382162956.2, 764272737.4, closed='right'): 2222,
         Interval(764272737.4, 1146382518.6, closed='right'): 3,
         Interval(1528492299.8, 1910602081.0, closed='right'): 1})
Large files
src/arXiv_src_1405_008.tar|805505033
src/arXiv_src_1512_003.tar|1910602081
src/arXiv_src_1812_033.tar|835663353
src/arXiv_src_1908_006.tar|803583004
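
The article naming formats mentioned in these notes can be told apart with two regular expressions. This is a sketch based only on the example names in this README (math0701672, 1411.6225, 1503.08375); the function name `parse_article_name` and the returned keys are hypothetical.

```python
import re

# Old style: archive name followed by YYMM and a 3-digit number, e.g. math0701672
OLD_STYLE = re.compile(r"^(?P<archive>[a-z-]+)(?P<yymm>\d{4})(?P<number>\d{3})$")
# New style: YYMM, a dot, then 4 or 5 digits, e.g. 1411.6225 or 1503.08375
NEW_STYLE = re.compile(r"^(?P<yymm>\d{4})\.(?P<number>\d{4,5})$")


def parse_article_name(name):
    """Split an article name into its parts, or return None if unrecognized."""
    match = NEW_STYLE.match(name)
    if match:
        return {"scheme": "new", **match.groupdict()}
    match = OLD_STYLE.match(name)
    if match:
        return {"scheme": "old", **match.groupdict()}
    return None


for name in ["math0701672", "1411.6225", "1503.08375"]:
    print(name, parse_article_name(name))
```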

Definitions Tags

ltx_theorem_df -- /math.0406533

Problems

LaTeXML did not finish 2014/1411.6225/bcdr_en.tex

Testing

All the tests in the ./tests directory are discovered with the following command. Run it from the repo directory:

PYTHONPATH="./tests" python -m unittest discover -s tests

Or, from the tests directory, run:

PYTHONPATH=".." python -m unittest discover -s .

The search.py module modifies xml_file.xml:

processed is False by default.
search exists only when locate has been run on the filesystem. It is True when the file was found and False when the file was searched for and not found.

lab156/arxivDownload
