This repository has been archived by the owner on Nov 20, 2022. It is now read-only.
0 parents
commit aa4ae1f
Showing 36 changed files with 8,047 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
build/ | ||
dist/ | ||
*egg-info/ | ||
__pycache__/ | ||
.coverage | ||
*.pyc |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
[BASIC] | ||
module-rgx=[a-z_][a-zA-Z0-9_]{2,30}$ | ||
method-rgx=[a-z_][a-zA-Z0-9_]{2,30}$ | ||
function-rgx=[a-z_][a-zA-Z0-9_]{2,30}$ | ||
argument-rgx=[a-z_][a-zA-Z0-9_]{0,30}$ | ||
variable-rgx=[a-z_][a-zA-Z0-9_]{0,30}$ | ||
attr-rgx=[a-z_][a-zA-Z0-9_]{0,30}$ | ||
|
||
[DESIGN] | ||
max-args=10 | ||
max-locals=40 | ||
max-returns=10 | ||
max-attributes=20 | ||
min-public-methods=0 | ||
|
||
[FORMAT] | ||
max-line-length=150 | ||
|
||
[MESSAGES CONTROL] | ||
disable=I0011,R0201,W0105,W0108,W0110,W0141,W0621,W0640 |
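The naming patterns in the `[BASIC]` section above differ only in their length bounds: module, method and function names need at least three characters, while argument, variable and attribute names may be a single character. A minimal sketch of how two of those patterns behave, assuming they are applied from the first character of the name (the helper `is_valid` is hypothetical and not part of this commit):

```python
import re

# Patterns copied verbatim from the [BASIC] section of the config above.
FUNCTION_RGX = re.compile(r"[a-z_][a-zA-Z0-9_]{2,30}$")
ARGUMENT_RGX = re.compile(r"[a-z_][a-zA-Z0-9_]{0,30}$")


def is_valid(pattern, name):
    """Return True when the name matches the pattern from its first character."""
    return pattern.match(name) is not None


for name in ("fn", "run", "Run", "getValue"):
    print(name, is_valid(FUNCTION_RGX, name))

# A one-character name is accepted by the argument pattern only.
print("x", is_valid(ARGUMENT_RGX, "x"), is_valid(FUNCTION_RGX, "x"))
```

Because the character class after the first position allows uppercase letters, these patterns accept mixed-case names such as `getValue` as long as the first character is lowercase or an underscore.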
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
MIT License | ||
Copyright (c) 2020 NeuML LLC | ||
|
||
Permission is hereby granted, free of charge, to any person obtaining a copy | ||
of this software and associated documentation files (the "Software"), to deal | ||
in the Software without restriction, including without limitation the rights | ||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
copies of the Software, and to permit persons to whom the Software is | ||
furnished to do so, subject to the following conditions: | ||
|
||
The above copyright notice and this permission notice shall be included in | ||
all copies or substantial portions of the Software. | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN | ||
THE SOFTWARE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,66 @@ | ||
cord19q: Exploring and indexing the CORD-19 dataset | ||
====== | ||
|
||
![CORD19](https://pages.semanticscholar.org/hs-fs/hubfs/covid-image.png?width=300&name=covid-image.png) | ||
|
||
COVID-19 Open Research Dataset (CORD-19) is a free resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community. The dataset can be found on [Semantic Scholar](https://pages.semanticscholar.org/coronavirus-research) and there is an active competition on [Kaggle](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge). | ||
|
||
This is a Python project that builds a sentence embeddings index with FastText + BM25. Background on this method can be found in this [Medium article](https://towardsdatascience.com/building-a-sentence-embedding-index-with-fasttext-and-bm25-f07e7148d240) and in an existing repository that uses it, [codequestion](https://github.com/neuml/codequestion). | ||
|
||
### Tasks | ||
The following files show the top 10 matches this method returns for the queries in each task of the CORD-19-research-challenge competition. | ||
|
||
* [What is known about transmission, incubation, and environmental stability?](./tasks/transmission.md) | ||
* [What do we know about COVID-19 risk factors?](./tasks/risk-factors.md) | ||
* [What do we know about virus genetics, origin, and evolution?](./tasks/virus-genome.md) | ||
* [What do we know about non-pharmaceutical interventions?](./tasks/interventions.md) | ||
* [What do we know about vaccines and therapeutics?](./tasks/vaccines.md) | ||
* [What has been published about medical care?](./tasks/virus-genomes.md) | ||
* [What has been published about information sharing and inter-sectoral collaboration?](./tasks/sharing.md) | ||
* [What has been published about ethical and social science considerations?](./tasks/ethics.md) | ||
* [What do we know about diagnostics and surveillance?](./tasks/diagnostics.md) | ||
|
||
### Installation | ||
You can use Git to clone the repository from GitHub and install it. It is recommended to do this in a Python Virtual Environment. | ||
|
||
git clone https://github.com/neuml/cord19q.git | ||
cd cord19q | ||
pip install . | ||
|
||
Python 3.5+ is supported | ||
|
||
### Building a model | ||
Download all the files in the Download CORD-19 section on [Semantic Scholar](https://pages.semanticscholar.org/coronavirus-research). Go to the directory with the files | ||
and run the following commands. | ||
|
||
cd <download_path> | ||
mv all_sources_metadata*.csv metadata.csv | ||
mkdir articles | ||
|
||
For each tar.gz file run the following | ||
tar -xvzf <file.tar.gz> | ||
mv <extracted_directory>/* articles | ||
|
||
Once completed, there should be a file named metadata.csv and an articles/ directory containing all JSON articles. | ||
|
||
To build the model locally: | ||
|
||
python -m cord19q.etl.execute <download_path> | ||
python -m cord19q.vectors | ||
python -m cord19q.index | ||
|
||
The model will be stored in ~/.cord19 | ||
|
||
### Running queries | ||
The fastest way to run queries is to start a cord19q shell | ||
|
||
cord19q | ||
|
||
A prompt will come up. Queries can be typed directly into the console. | ||
|
||
### Building a report file | ||
A report file is simply a markdown file created from a list of queries. An example: | ||
|
||
python -m cord19q.report tasks/diagnostics.txt | ||
|
||
Once complete, a file named tasks/diagnostics.md will be created. |
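The README describes the index as sentence embeddings built with FastText + BM25 but does not show the weighting itself. The sketch below illustrates one common reading of that idea: each word vector is scaled by a BM25 term weight before averaging, so rare terms dominate the sentence vector. Everything here is an assumption for illustration: the function names, the IDF variant, the `k1` and `b` defaults, and the toy two-dimensional vectors standing in for FastText vectors. It is not the code shipped in this commit.

```python
import math
from collections import Counter


def bm25_idf(corpus):
    """Compute a BM25-style IDF weight for every token in a tokenized corpus."""
    n = len(corpus)
    df = Counter(token for doc in corpus for token in set(doc))
    return {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}


def bm25_weights(tokens, idf, avgdl, k1=1.2, b=0.75):
    """Score each distinct token of one sentence with the BM25 term formula."""
    freqs = Counter(tokens)
    norm = k1 * (1 - b + b * len(tokens) / avgdl)
    return {t: idf.get(t, 0.0) * f * (k1 + 1) / (f + norm) for t, f in freqs.items()}


def embed(tokens, vectors, idf, avgdl):
    """Average the word vectors of a sentence, weighted by BM25 term scores."""
    weights = bm25_weights(tokens, idf, avgdl)
    dims = len(next(iter(vectors.values())))
    total, acc = 0.0, [0.0] * dims
    for token, weight in weights.items():
        if token in vectors:  # tokens without a vector are skipped
            total += weight
            acc = [a + weight * v for a, v in zip(acc, vectors[token])]
    return [a / total for a in acc] if total else acc


# Toy corpus and toy vectors (hypothetical stand-ins for FastText output).
corpus = [
    ["virus", "transmission", "rate"],
    ["virus", "incubation", "period"],
    ["virus", "vaccine", "trial"],
]
vectors = {"virus": [1.0, 0.0], "transmission": [0.0, 1.0], "rate": [0.0, 1.0]}

idf = bm25_idf(corpus)
avgdl = sum(len(doc) for doc in corpus) / len(corpus)
sentence_vector = embed(corpus[0], vectors, idf, avgdl)
print(sentence_vector)
```

In this toy corpus "virus" appears in every sentence, so its weight is small and the resulting vector leans toward the rarer terms "transmission" and "rate".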
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,47 @@ | ||
# pylint: disable = C0111 | ||
from setuptools import setup | ||
|
||
with open("README.md", "r") as f: | ||
DESCRIPTION = f.read() | ||
|
||
setup(name="cord19q", | ||
version="1.0.0", | ||
author="NeuML", | ||
description="CORD19 Dataset exploration and indexing", | ||
long_description=DESCRIPTION, | ||
long_description_content_type="text/markdown", | ||
url="https://github.com/neuml/cord19q", | ||
project_urls={ | ||
"Documentation": "https://github.com/neuml/cord19q", | ||
"Issue Tracker": "https://github.com/neuml/cord19q/issues", | ||
"Source Code": "https://github.com/neuml/cord19q", | ||
}, | ||
license="MIT License: http://opensource.org/licenses/MIT", | ||
packages=["cord19q"], | ||
package_dir={"": "src/python/"}, | ||
keywords="python search embedding machine-learning", | ||
python_requires=">=3.5", | ||
entry_points={ | ||
"console_scripts": [ | ||
"cord19q = cord19q.shell:main", | ||
], | ||
}, | ||
install_requires=[ | ||
"faiss-gpu>=1.6.1", | ||
"fasttext>=0.9.1", | ||
"html2text>=2019.9.26", | ||
"mdv>=1.7.4", | ||
"numpy>=1.17.4", | ||
"pymagnitude>=0.1.120", | ||
"scikit-learn>=0.22.1", | ||
"scipy>=1.4.1", | ||
"tqdm==4.40.2" | ||
], | ||
classifiers=[ | ||
"License :: OSI Approved :: MIT License", | ||
"Operating System :: OS Independent", | ||
"Programming Language :: Python :: 3", | ||
"Topic :: Software Development", | ||
"Topic :: Text Processing :: Indexing", | ||
"Topic :: Utilities" | ||
]) |
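The `console_scripts` entry in the `setup()` call above uses the `name = module:attribute` form, which is what makes the `cord19q` command start `main` in `cord19q.shell` after installation. A rough sketch of how such a spec maps to a callable follows; `resolve_entry_point` is a hypothetical helper written for illustration, not the mechanism setuptools itself uses, and the demo targets a standard-library function so it runs without `cord19q` installed.

```python
import importlib


def resolve_entry_point(spec):
    """Split 'name = module:attr' and import the object it refers to."""
    name, target = (part.strip() for part in spec.split("=", 1))
    module_name, _, attr = target.partition(":")
    obj = importlib.import_module(module_name)
    # Walk dotted attributes such as 'Class.method'; skip an empty attribute.
    for part in filter(None, attr.split(".")):
        obj = getattr(obj, part)
    return name, obj


# The spec in setup.py is "cord19q = cord19q.shell:main"; a standard-library
# target is used here so the example is self-contained.
name, func = resolve_entry_point("dumps-demo = json:dumps")
print(name, func({"ok": True}))
```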
Empty file.