This repository has been archived by the owner on Nov 20, 2022. It is now read-only.

Initial commit
davidmezzetti committed Mar 19, 2020
0 parents commit aa4ae1f
Showing 36 changed files with 8,047 additions and 0 deletions.
6 changes: 6 additions & 0 deletions .gitignore
@@ -0,0 +1,6 @@
build/
dist/
*egg-info/
__pycache__/
.coverage
*.pyc
20 changes: 20 additions & 0 deletions .pylintrc
@@ -0,0 +1,20 @@
[BASIC]
module-rgx=[a-z_][a-zA-Z0-9_]{2,30}$
method-rgx=[a-z_][a-zA-Z0-9_]{2,30}$
function-rgx=[a-z_][a-zA-Z0-9_]{2,30}$
argument-rgx=[a-z_][a-zA-Z0-9_]{0,30}$
variable-rgx=[a-z_][a-zA-Z0-9_]{0,30}$
attr-rgx=[a-z_][a-zA-Z0-9_]{0,30}$

[DESIGN]
max-args=10
max-locals=40
max-returns=10
max-attributes=20
min-public-methods=0

[FORMAT]
max-line-length=150

[MESSAGES CONTROL]
disable=I0011,R0201,W0105,W0108,W0110,W0141,W0621,W0640
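The naming patterns above can be checked directly with Python's `re` module. The following is a minimal sketch, assuming the patterns are applied from the start of each name (as `re.match` does); the constant and function names are hypothetical and only used for illustration.

```python
import re

# Pattern shared by module-rgx, method-rgx and function-rgx: 3 to 31 characters
NAME_RGX = re.compile(r"[a-z_][a-zA-Z0-9_]{2,30}$")

# Pattern shared by argument-rgx, variable-rgx and attr-rgx: 1 to 31 characters
SHORT_RGX = re.compile(r"[a-z_][a-zA-Z0-9_]{0,30}$")


def allowed(pattern, name):
    """Return True when name satisfies the given naming pattern."""
    return pattern.match(name) is not None


# Function names need at least three characters and may use camelCase
print(allowed(NAME_RGX, "buildIndex"))
print(allowed(NAME_RGX, "db"))

# Variable names may be a single character but must not start uppercase
print(allowed(SHORT_RGX, "x"))
print(allowed(SHORT_RGX, "Query"))
```

Both patterns accept upper-case letters after the first character, so camelCase and snake_case names pass alike.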
20 changes: 20 additions & 0 deletions LICENSE
@@ -0,0 +1,20 @@
MIT License
Copyright (c) 2020 NeuML LLC

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
66 changes: 66 additions & 0 deletions README.md
@@ -0,0 +1,66 @@
cord19q: Exploring and indexing the CORD-19 dataset
======

![CORD19](https://pages.semanticscholar.org/hs-fs/hubfs/covid-image.png?width=300&name=covid-image.png)

COVID-19 Open Research Dataset (CORD-19) is a free resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community. The dataset can be found on [Semantic Scholar](https://pages.semanticscholar.org/coronavirus-research) and there is an active competition on [Kaggle](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge).

This Python project builds a sentence embeddings index with FastText + BM25. Background on this method can be found in this [Medium article](https://towardsdatascience.com/building-a-sentence-embedding-index-with-fasttext-and-bm25-f07e7148d240), and [codequestion](https://github.com/neuml/codequestion) is an existing repository that uses it.
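As a rough illustration of the idea, a sentence embedding can be formed by averaging word vectors with each word weighted by its BM25 score, so that rare terms count for more than common ones. The sketch below is not this project's implementation: the toy two-dimensional vectors stand in for FastText vectors, and the function names and the parameters `k1` and `b` are assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_idf(documents):
    """Compute a BM25-style IDF weight for each token in a tokenized corpus."""
    n = len(documents)
    # Document frequency: number of documents that contain each token
    df = Counter(token for doc in documents for token in set(doc))
    return {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}


def bm25_weights(tokens, idf, avgdl, k1=1.2, b=0.75):
    """Score each distinct token of one document with the BM25 term formula."""
    length = len(tokens)
    weights = {}
    for token, tf in Counter(tokens).items():
        norm = tf + k1 * (1 - b + b * length / avgdl)
        weights[token] = idf.get(token, 0.0) * tf * (k1 + 1) / norm
    return weights


def embed(tokens, vectors, idf, avgdl):
    """Average the word vectors of tokens, weighted by their BM25 scores."""
    dims = len(next(iter(vectors.values())))
    total = [0.0] * dims
    weight_sum = 0.0
    for token, weight in bm25_weights(tokens, idf, avgdl).items():
        if token in vectors:  # tokens without a vector are skipped
            for i, x in enumerate(vectors[token]):
                total[i] += weight * x
            weight_sum += weight
    return [x / weight_sum for x in total] if weight_sum else total


# Toy corpus and toy vectors (hypothetical stand-ins for FastText output)
docs = [["virus", "transmission", "rate"], ["virus", "vaccine", "trial"], ["virus", "genome"]]
vectors = {"virus": [1.0, 0.0], "transmission": [0.0, 1.0], "rate": [0.5, 0.5]}

idf = bm25_idf(docs)
avgdl = sum(len(d) for d in docs) / len(docs)
embedding = embed(docs[0], vectors, idf, avgdl)
```

Because "virus" appears in every document, its weight is small and the resulting embedding is dominated by "transmission" and "rate".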

### Tasks
The following files show the top 10 matches for queries on each task in the CORD-19-research-challenge competition, found using this method.

* [What is known about transmission, incubation, and environmental stability?](./tasks/transmission.md)
* [What do we know about COVID-19 risk factors?](./tasks/risk-factors.md)
* [What do we know about virus genetics, origin, and evolution?](./tasks/virus-genome.md)
* [What do we know about non-pharmaceutical interventions?](./tasks/interventions.md)
* [What do we know about vaccines and therapeutics?](./tasks/vaccines.md)
* [What has been published about medical care?](./tasks/virus-genomes.md)
* [What has been published about information sharing and inter-sectoral collaboration?](./tasks/sharing.md)
* [What has been published about ethical and social science considerations?](./tasks/ethics.md)
* [What do we know about diagnostics and surveillance?](./tasks/diagnostics.md)

### Installation
You can use Git to clone the repository from GitHub and install it. It is recommended to do this in a Python Virtual Environment.

    git clone https://github.com/neuml/cord19q.git
    cd cord19q
    pip install .

Python 3.5+ is supported.

### Building a model
Download all the files in the Download CORD-19 section on [Semantic Scholar](https://pages.semanticscholar.org/coronavirus-research). Go to the directory with the files
and run the following commands.

    cd <download_path>
    mv all_sources_metadata*.csv metadata.csv
    mkdir articles

For each tar.gz file, run the following:

    tar -xvzf <file.tar.gz>
    mv <extracted_directory>/* articles

Once completed, the download directory should contain a file called metadata.csv and an articles/ directory with all of the JSON articles.
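The manual steps above could also be scripted. The following is a minimal sketch using only the standard library, not part of this project: the `prepare` function is hypothetical, it assumes the archives come from a trusted source, and it assumes every article is a `.json` file with a unique name, so it flattens them all into articles/.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def prepare(download_path):
    """Rename the metadata file and collect all JSON articles into articles/."""
    root = Path(download_path)

    # Equivalent of: mv all_sources_metadata*.csv metadata.csv (first match only)
    matches = sorted(root.glob("all_sources_metadata*.csv"))
    if matches:
        matches[0].rename(root / "metadata.csv")

    # Equivalent of: mkdir articles
    articles = root / "articles"
    articles.mkdir(exist_ok=True)

    # Extract each archive into a scratch directory, then move the JSON files
    for archive in sorted(root.glob("*.tar.gz")):
        with tempfile.TemporaryDirectory(dir=str(root)) as scratch:
            with tarfile.open(str(archive), "r:gz") as tar:
                tar.extractall(scratch)  # only do this for trusted archives
            for article in list(Path(scratch).rglob("*.json")):
                shutil.move(str(article), str(articles / article.name))

    return articles
```

Calling `prepare` with the download path should leave the directory in the layout described above, ready for the build commands.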

To build the model locally:

    python -m cord19q.etl.execute <download_path>
    python -m cord19q.vectors
    python -m cord19q.index

The model will be stored in ~/.cord19.

### Running queries
The fastest way to run queries is to start a cord19q shell:

    cord19q

A prompt will come up. Queries can be typed directly into the console.

### Building a report file
A report file is simply a markdown file created from a list of queries. An example:

    python -m cord19q.report tasks/diagnostics.txt

Once complete, a file named tasks/diagnostics.md will be created.
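To make the idea concrete, here is a minimal sketch of turning a list of queries into a markdown file. It is not the `cord19q.report` module: the one-query-per-line input format, the output layout, and the `search` callable (which returns `(score, text)` pairs) are all assumptions made for the example.

```python
from pathlib import Path


def build_report(queries_file, search, topn=10):
    """Write a markdown report next to queries_file and return its path."""
    source = Path(queries_file)

    # Assumption: one query per line, blank lines ignored
    text = source.read_text(encoding="utf-8")
    queries = [line.strip() for line in text.splitlines() if line.strip()]

    lines = []
    for query in queries:
        lines.append("## " + query)
        lines.append("")
        # search is a hypothetical callable returning (score, text) pairs
        for score, match in search(query)[:topn]:
            lines.append("* ({:.2f}) {}".format(score, match))
        lines.append("")

    # tasks/diagnostics.txt becomes tasks/diagnostics.md
    output = source.with_suffix(".md")
    output.write_text("\n".join(lines), encoding="utf-8")
    return output
```

Passing the search function in as an argument keeps the sketch independent of how the index is actually queried.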
47 changes: 47 additions & 0 deletions setup.py
@@ -0,0 +1,47 @@
# pylint: disable = C0111
from setuptools import setup

with open("README.md", "r") as f:
    DESCRIPTION = f.read()

setup(name="cord19q",
      version="1.0.0",
      author="NeuML",
      description="CORD19 Dataset exploration and indexing",
      long_description=DESCRIPTION,
      long_description_content_type="text/markdown",
      url="https://github.com/neuml/cord19q",
      project_urls={
          "Documentation": "https://github.com/neuml/cord19q",
          "Issue Tracker": "https://github.com/neuml/cord19q/issues",
          "Source Code": "https://github.com/neuml/cord19q",
      },
      license="MIT License: http://opensource.org/licenses/MIT",
      packages=["cord19q"],
      package_dir={"": "src/python/"},
      keywords="python search embedding machine-learning",
      python_requires=">=3.5",
      entry_points={
          "console_scripts": [
              "cord19q = cord19q.shell:main",
          ],
      },
      install_requires=[
          "faiss-gpu>=1.6.1",
          "fasttext>=0.9.1",
          "html2text>=2019.9.26",
          "mdv>=1.7.4",
          "numpy>=1.17.4",
          "pymagnitude>=0.1.120",
          "scikit-learn>=0.22.1",
          "scipy>=1.4.1",
          "tqdm==4.40.2"
      ],
      classifiers=[
          "License :: OSI Approved :: MIT License",
          "Operating System :: OS Independent",
          "Programming Language :: Python :: 3",
          "Topic :: Software Development",
          "Topic :: Text Processing :: Indexing",
          "Topic :: Utilities"
      ])
Empty file added src/python/cord19q/__init__.py
Empty file.