lab156/arxivDownload

Data Pipeline

Downloading from arXiv

Processing with LaTeXML

The following command unpacks an arXiv source tar file and identifies the math articles. It requires an internet connection.

python3 process.py \
   /media/hd1/arXiv_src/src/arXiv_src_2101_023.tar \
   $HOME/rm_me_process \
   --term math
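
As a rough illustration of the first step (looking inside a source tar), here is a minimal sketch using only the standard library. It is not the code in process.py: the function name `list_article_ids` is hypothetical, and the member layout (names such as `2101/2101.00001.gz`) is an assumption about how the source tars are organized.

```python
import tarfile
from pathlib import PurePosixPath


def list_article_ids(tar_path):
    """Return the sorted article names found in a source tar file.

    Assumption: each article is stored as one member whose file name is
    the article name plus a single extension, e.g. '2101/2101.00001.gz'.
    """
    ids = set()
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directory entries
            filename = PurePosixPath(member.name).name
            # Drop only the last extension so '2101.00001.gz' -> '2101.00001'
            ids.add(filename.rsplit(".", 1)[0])
    return sorted(ids)
```

The returned names could then be looked up in the metadata database or through the API to keep only the articles with a math tag.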

More Processing

Getting Labeled Definitions

Classifying Definitions

Classifying with multiprocessing (also works on a single GPU)

singularity run --nv \
      --bind $HOME/Documents/arxivDownload:/opt/arxivDownload,/media/hd1:/opt/data_dir \
    $HOME/singul/runner.sif python3 embed/mp_classify.py \
     --model /opt/data_dir/trained_models/lstm_classifier/lstm_Aug-19_17-22 \
     --out /rm_me_path/with_mp_classify \
     --mine /opt/data_dir/promath/math94/940{3,4,5}_001.tar.gz

NER

Example with Singularity:

singularity run --nv \
    --bind $HOME/Documents/arxivDownload:/opt/arxivDownload,/media/hd1:/opt/data_dir \
    $HOME/singul/runner.sif python3 embed/inference_ner.py \
    --mine /opt/data_dir/glossary/inference_class_all/math96/*.xml.gz \
    --model /opt/data_dir/trained_models/ner_model/lstm_ner/ner_Sep-29_03-45/exp_001 \
    --out $HOME/rm_me_ner

Joining Phrases

  • MP_scripts/mpi_only_loop.py
  • slurm_scripts/mpi_joiner.sh

Jupyter Notebooks

  • Populating and examples SQLAlchemy databases
    • Filling the arxiv metadata database using databases/create_db_define_models.py
    • Query join examples in sqlalchemy query language
  • Parsing Arxib Manifest and querying metadat.ipynb
    • Using magic module to find file info
    • Structure of the data in the manifest file
    • using the dload.py script and its objects
    • basic usage of the arxiv API package
    • very disorganized, mostly scratch work
  • Time stats check output and logs.ipynb
    • code to read and interpret latexml log files
    • plot time of latexml processing
  • getting problem articles for latexml.ipynb
    • Identify articles that are not included in the arxmliv database
    • Try to process these problematic articles either by removing environments or with LaTeXTual
  • Word embeddings generation and evaluation.py
    • read the binary files produced by word2vec
    • Get the raw text ready for embedders
    • Search arxiv.db for the tags of an article
    • tSNE visualization of the tags of terms

Scripts

  • update_db.py
    • USAGE: python update_db.py DATABASE MANIFEST.xml tar_src_path [--log ]
    • Where database is a sqlite database and manifest is an xml file in the original format
    • tar_src_path is the dir where the tar files can be found
    • Ex. python3 update_db.py /mnt/databases/arxivDB.db ../arXiv_src_manifest_Oct_2019.xml /mnt/arXiv_src/
  • process.py
    • The Xtraction class reads and extracts an arXiv tar file.
    • Querying the arxiv metadata with the arxiv API and the arxiv.py package
    • Xtraction(tarfilename, db='sqlite:///pathdb') to read metadata from a database instead of api
    • Writing arxiv metadata to a database.

Queries

  • Index the article ID column to speed up queries
CREATE INDEX id_ind on articles(id);

To search for an article, run the following query:

select tags from articles where id between "http://arxiv.org/abs/{0}" and "http://arxiv.org/abs/{0}{{";
  • Count the articles in a year of tar files
SELECT  count(articles.id) FROM manifest LEFT JOIN articles on manifest.id = articles.tarfile_id WHERE manifest.filename LIKE 'src/arXiv_src_06%' and articles.tags like '[{''term'': ''math%';
  • Find the authors (in general) with the most publications
SELECT author, count(*) AS c FROM articles GROUP BY author ORDER BY c DESC LIMIT 10;
  • Hack to find main article tag
 SELECT count(tags) FROM articles where tags LIKE '[{''term'': ''math.DG''%';
  • Find repeated entries, where DataId is the repeated term
SELECT DataId, COUNT(*) c FROM DataTab GROUP BY DataId HAVING c > 1;
  • Left join to quickly find all articles in a tar file
SELECT  articles.id, tags FROM manifest LEFT JOIN articles on manifest.id = articles.tarfile_id WHERE manifest.id = 1747;
  • To check the files with unknown encoding:
   find . -name 'latexml_commentary.txt' -exec grep Ignoring {} \;
  • To process the first .tex file into an .xml file of the same name, appending the last part of the error stream to latexml_commentary.txt:
TEXF=$(ls *.tex | head -n 1); latexml "$TEXF" 2>&1 > "${TEXF%.*}.xml" | tail -15 >> latexml_commentary.txt
  • To find directories not yet processed by latexml (those without a latexml_errors_mess.txt file):
find ./* -maxdepth 0 -type d '!' -exec test -e "{}/latexml_errors_mess.txt" ';' -print
  • To find manually cancelled latexml processes, search the latex_errors file for:
Fatal:perl:die Perl died
  • When LaTeXML runs out of memory, for example in 1504.06138, the output looks like:
(Processing definitions /usOut of memory!
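
The article lookup query above uses `{0}` placeholders, which suggests it is filled in from Python. The sketch below shows how the `BETWEEN` range acts as an index-friendly prefix match. It runs against a simplified in-memory table, not the real arxiv.db: the two-column schema, the sample rows, and the helper names `build_demo_db` and `find_tags` are all hypothetical, and the assumption is that stored ids carry a version suffix such as `v1`.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory stand-in for the articles table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE articles (id TEXT, tags TEXT)")
    conn.execute("CREATE INDEX id_ind ON articles(id)")
    conn.executemany(
        "INSERT INTO articles VALUES (?, ?)",
        [
            ("http://arxiv.org/abs/1503.08375v1", "math.DG"),
            ("http://arxiv.org/abs/1503.08376v2", "math.AG"),
            ("http://arxiv.org/abs/1504.06138v1", "math.PR"),
        ],
    )
    return conn


def find_tags(conn, arxiv_id):
    """Return the tags of every article whose id starts with the given arXiv id."""
    prefix = "http://arxiv.org/abs/" + arxiv_id
    # '{' sorts after every ASCII digit and lowercase letter, so the range
    # [prefix, prefix + '{'] covers ids such as prefix + 'v1' or prefix + 'v2'.
    rows = conn.execute(
        "SELECT tags FROM articles WHERE id BETWEEN ? AND ?",
        (prefix, prefix + "{"),
    )
    return [row[0] for row in rows]
```

Passing the bounds as query parameters avoids building the SQL string by hand, and a range comparison lets SQLite use the `id_ind` index.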

Notes

  • The API can handle only around 500 article IDs at a time.
  • Starting with 1501 (January 2015), the article name format changed from YYMM.{4 digits} to YYMM.{5 digits}.
  • Starting with 0704 (April 2007), the naming format of the articles changed from the old style (e.g. 0701/math0701672) to the current style (e.g. 1503/1503.08375).
  • The distribution of the sizes of the tar files in the manifest:
Counter({Interval(-1857373.906, 382162956.2, closed='right'): 273,
         Interval(382162956.2, 764272737.4, closed='right'): 2222,
         Interval(764272737.4, 1146382518.6, closed='right'): 3,
         Interval(1528492299.8, 1910602081.0, closed='right'): 1})
Large files
src/arXiv_src_1405_008.tar|805505033
src/arXiv_src_1512_003.tar|1910602081
src/arXiv_src_1812_033.tar|835663353
src/arXiv_src_1908_006.tar|803583004
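
The two article naming formats mentioned in the notes above can be told apart with a pair of regular expressions. This is a minimal sketch, not code from the repository: the function name `parse_article_name` and the returned dictionary keys are hypothetical, and the old-style pattern assumes names such as `math0701672` (archive name followed by YYMM and a 3-digit number).

```python
import re

# Current style: YYMM.NNNN or YYMM.NNNNN, e.g. 1411.6225 or 1503.08375
NEW_STYLE = re.compile(r"^(\d{2})(\d{2})\.(\d{4,5})$")
# Old style as it appears in file names: archive + YYMM + NNN, e.g. math0701672
OLD_STYLE = re.compile(r"^([a-z-]+)(\d{2})(\d{2})(\d{3})$")


def parse_article_name(name):
    """Split an article name into its style, archive, YYMM and number parts."""
    match = NEW_STYLE.match(name)
    if match:
        yy, mm, number = match.groups()
        return {"style": "new", "archive": None, "yymm": yy + mm, "number": number}
    match = OLD_STYLE.match(name)
    if match:
        archive, yy, mm, number = match.groups()
        return {"style": "old", "archive": archive, "yymm": yy + mm, "number": number}
    raise ValueError(f"unrecognized article name: {name!r}")
```

For example, `parse_article_name("math0701672")` reports the old style with archive `math`, while `parse_article_name("1503.08375")` reports the current style with no archive.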

Definitions Tags

  • ltx_theorem_df -- /math.0406533

Problems

  • LaTeXML did not finish processing 2014/1411.6225/bcdr_en.tex

Testing

  • All the tests in the ./tests directory are discovered with the following command, run from the repo directory:
PYTHONPATH="./tests" python -m unittest discover -s tests

Or, from the tests directory, run:

PYTHONPATH=".." python -m unittest discover -s .

The xml_file.xml is modified by the search.py module:

  • processed is False by default.
  • search exists only when locate has been run on the filesystem. It is True when the file was found and False if the file was searched for and not found.
