Releases · neuml/cord19q · GitHub

This repository has been archived by the owner on Nov 20, 2022. It is now read-only.

27 May 21:43

v1.3.0

Made the following changes to the process. The next step is trying to determine the level of evidence within a study.

ETL

Process 2020-03-27 dataset
Investigated cord_uid as an identifier, but found duplicate articles that share the same sha yet have different cord_uid values. Will continue using the current id strategy.
Changed the reference field to use url instead of doi. It now includes 3000+ more urls for documents that didn't have a doi.
Filter out full-text sections containing COVID-19 resource centre boilerplate text, to prevent tagging older, non-relevant documents
Add section name to the sections table to help determine the level of evidence of a study
Rebuild vectors
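
The cord_uid finding above can be reproduced with a small check. This is only a sketch, not the project's ETL code: it assumes the metadata rows are available as dicts with `cord_uid` and `sha` keys, and `find_duplicate_shas` is a hypothetical helper name.

```python
from collections import defaultdict


def find_duplicate_shas(rows):
    """Return {sha: sorted cord_uids} for every sha tied to more than one cord_uid.

    `rows` is assumed to be an iterable of dicts with "cord_uid" and "sha" keys.
    """
    uids_by_sha = defaultdict(set)
    for row in rows:
        sha = row.get("sha")
        uid = row.get("cord_uid")
        # Skip rows that are missing either identifier
        if sha and uid:
            uids_by_sha[sha].add(uid)

    return {sha: sorted(uids) for sha, uids in uids_by_sha.items() if len(uids) > 1}


# Made-up sample rows, for illustration only
rows = [
    {"cord_uid": "uid-a", "sha": "sha-1"},
    {"cord_uid": "uid-b", "sha": "sha-1"},
    {"cord_uid": "uid-c", "sha": "sha-2"},
]
duplicates = find_duplicate_shas(rows)
```

Any sha that appears in the result maps to several cord_uid values, which is the situation that made cord_uid unsuitable as a unique article id here.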

Reports

Add parameter for number of article results in output
Add journal column
Modify report.py and add methods to read data from a list and write markdown output to a string.
Escape | characters with an escape sequence in report.py
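
Escaping matters because an unescaped | inside a cell would be read as a column separator in a markdown table. A minimal sketch of the idea follows. The backslash escape is an assumption (the release notes do not say which escape sequence report.py uses), and `escape_cell` and `table_row` are hypothetical names.

```python
def escape_cell(text):
    """Escape pipe characters so they are not treated as column separators."""
    # Assumption: a backslash escape, which GitHub Flavored Markdown tables accept
    return text.replace("|", "\\|")


def table_row(values):
    """Build one markdown table row from a list of cell values."""
    return "| " + " | ".join(escape_cell(str(value)) for value in values) + " |"


# The title contains a pipe that would otherwise split the cell in two
row = table_row(["Title A|B", "Journal"])
```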

Kaggle Notebook

Remove task reports from the main notebook and add a notebook per task. Link to each task from the main notebook.
Add a report query notebook to allow building a report on an ad hoc query


27 May 21:41

v1.2.0

Made a couple of updates to the backing project which will propagate to the notebook.

Modified report formatting to conform with this discussion. Article results are now shown as a table.
Added linguistic rules to identify sentence fragments and questions. These are not used in the embeddings index.
Modified highlighting logic to require uniqueness within each bullet. Previously, there was a lot of duplication.
Added abstract field to word vectors and models. Previously, only full text was used.
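
The uniqueness requirement for highlights could look roughly like the sketch below. It is not the project's highlighting logic: `unique_highlights` is a hypothetical helper, and treating two highlights as duplicates when they match after lowercasing and whitespace normalization is an assumption.

```python
def unique_highlights(highlights):
    """Return highlights in their original order with duplicates dropped."""
    seen = set()
    unique = []
    for text in highlights:
        # Normalize case and whitespace so near-identical strings compare equal
        key = " ".join(text.lower().split())
        if key and key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


# Made-up highlights for a single bullet
bullet = [
    "Incubation period is 5 days.",
    "incubation period  is 5 days.",
    "Symptoms vary by age.",
]
deduped = unique_highlights(bullet)
```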


27 May 21:41

v1.1.0

Added notebook version of cord19q to Kaggle


27 May 21:39

v1.0.0

Initial release of cord19q project
