NOTE: This project is now hosted here: https://github.com/openreview/open-meta-extraction
This repo will be removed in the future
A set of services to spider and extract metadata (abstracts, titles, authors, pdf links) from given URLs.
Services may be run individually as command line applications, or as service that runs on a schedule (using PM2)
This project provides a set of services to take the URL of a webpage for a research paper, and extract metadata for that paper. Metadata includes the abstract, title, authors, and a URL for a PDF version of the paper. Spidering is done using an automated Chrome browser. Once a URL is loaded into the browser, a series of extraction rules are applied. The extractor checks for commonly used metadata schemas, including Highwire Press, Dublin Core, OpenGraph, along with non-standard variations of the most common schemas, and a growing list of journal and domain-specific methods of embedding metadata in the head or body of a webpage. Once a suitable schema is identified, the metadata fields are saved to a local file system, then returned to the caller. If changes are made to the spidering and/or extractor, such that re-running the system produces different results, that fact is returned along with the results.
- node >= v16.15
- I install via n (https://github.com/tj/n) or other node version management
- rush
- > npm install -g @microsoft/rush
- pm2
- > npm install -g pm2
- Chrome dependencies (runs headless via puppeteer)
- > sudo apt install libnss3-dev libatk1.0-0 libatk-bridge2.0-0 libcups2 libgbm1 libpangocairo-1.0-0 libgtk-3-0
- HTML Tidy
- > sudo apt install tidy
- dotenv (optional, helps with development)
- MongoDB, tested against v4.4.14
Run ‘rush’ commands from project root
- Initial installation
- > rush install
- Update dependencies after version bump in package.json
- > rush update
- Build/Full rebuild
- > rush build
- > rush rebuild
- Run tests
- > rush test
Create at least one of the three config files in project root (or in an ancestor directory):
config-test.json: to run tests config-dev.json: to run command line apps locally with a dev database instance config-prod.json: for pm2 production deployment
{ "openreview": { "restApi": "https://api.openreview.net", "restUser": "openreview-username", "restPassword": "openreview-password" }, "mongodb": { "connectionUrl": "mongodb://localhost:27017/", "dbName": "meta-extract-(dev|prod|test)" } }
All functionality is available from the command line via bin/cli in the root folder.
> bin/cli --env dev run> node ./packages/services/dist/src/cli cli <command> Commands: cli extraction-summary Show A Summary of Spidering/Extraction Progress cli run-relay-fetch Fetch new OpenReview URLs into local DB for spidering/extraction cli run-relay-extract Spider new URLs, extract metadata, and POST results back to OpenReview API cli mongo-tools Create/Delete/Update Mongo Database cli pm2-restart Notification/Restart scheduler cli test-scheduler Testing app for scheduler cli scheduler Notification/Restart scheduler cli echo Echo message to stdout, once or repeating cli preflight-check Run checks and stop pm2 apps if everything looks okay cli spider-url spider the give URL, save results in corpus cli extract-url spider and extract field from given URL cli extract-urls spider and extract field from list of URLs in given file Options: --version Show version number [boolean] --help Show help [boolean]
Spider a URL and save results to local filesystem (delete any previously downloaded files via --clean) > ./bin/cli spider-url --corpus-root local.corpus.d --url 'https://doi.org/10.3389/fncom.2014.00139' --clean Spider, then extract metadata from given URL, filesystem only (no mongo db) > ./bin/cli extract-url --corpus-root local.corpus.d --url 'https://arxiv.org/abs/2204.09028' --log-level debug --clean Drop/recreate collections in mongo db > ./bin/cli --env=dev mongo-tools --clean Fetch a batch of URLs from notes via OpenReview API, store in mongo > ./bin/cli --env=dev run-relay-fetch --offset 100 --count 100 Spider/extract any unprocessed URLs in mongo, optionally posting results back to OpenReview API > ./bin/cli --env=dev run-relay-extract --post-results=false Show extraction stats for dev database > ./bin/cli --env=dev extraction-summary
PM2 wrapper script will set *_ENV evironment variables, flush pm2 logs, then run pm2 with correct *.ecosystem.json and tail the logfiles.
> bin/pm2-control PM2 Control Usage: bin/pm2-control [--(no-)verbose] [--(no-)dry-run] [--env <ENVMODE>] [--start] [--reset] [--restart] [-h|--help] --env: Env Mode; Required. Can be one of: 'dev', 'test' and 'prod' (default: 'unspecified') --start: Start pm2 with *-ecosystem.config --reset: stop/flush logs/del all --restart: reset + start -h, --help: Prints help To restart the system with clean log files: > bin/pm2-control --env=prod --restart