miniproject: viral epidemics and disease
Priya
Dheeraj Kumar
Aishwarya
If you have any difficulties with this section, please read the INITIAL SUMMARY section first.
- Use the communal corpus epidemic50noCov, consisting of 50 articles.
CREATED
- Scrutinizing the 50 articles to identify the true positives and false positives, that is, whether each article is about a viral epidemic or not.
FINISHED
- Using ami search to find whether the articles mention any comorbidity in a viral epidemic, annotating with dictionaries to create ami DataTables.
FINISHED
- Sectioning the articles using ami:section to extract the relevant information on comorbidity.
FINISHED
- Refining and rerunning the query to get a corpus of 950 articles.
CREATED
- Scrutinizing the 950 articles for true positives and false positives and creating a spreadsheet.
PROGRESSING
- Using ami search to create DataTables and ami section for sectioning the 950 articles.
FINISHED
- Using a relevant ML technique to classify whether the articles are about a viral epidemic and which diseases/disorders co-occur.
PROGRESSING
- Creating a dashboard of knowledge, especially with an annotated map.
NOT STARTED
- A spreadsheet will be developed of the comorbidities during a viral epidemic and their counts:
- for the 50 articles in epidemic50noCov.
FINISHED
- for the 950 articles in the disease corpus.
PROGRESSING
- Development of the ML model for data classification, assessed on accuracy.
PROGRESSING
- Annotated map with the obtained data.
NOT STARTED
- English
- Initially the communal corpus epidemic50noCov will be used (a small test corpus before using the large corpus disease).
- Later, a corpus of 950 articles will be used. It was created in the disease corpus with the command getpapers -q "viral epidemics AND human NOT COVID NOT corona virus NOT SARS-Cov-2" -o disease -f disease/log.txt -k 950 -x -p
- Spanish
- To test the Spanish disease dictionary (as a step toward creating dictionaries in other languages), a Spanish corpus was created from Redalyc.
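The getpapers query above can also be assembled programmatically, which makes it easier to vary the excluded terms. The following is a minimal sketch, not the project's own tooling: the helper names build_query and getpapers_command are hypothetical, and only the flags that appear in the command above (-q, -o, -f, -k, -x, -p) are used. It builds the argument list and prints it; it does not run getpapers.

```python
import shlex


def build_query(required, excluded):
    """Join required terms with AND and append each excluded term with NOT."""
    query = " AND ".join(required)
    for term in excluded:
        query += f" NOT {term}"
    return query


def getpapers_command(query, outdir, limit):
    """Return the getpapers argument list, using only flags from the text above."""
    return [
        "getpapers",
        "-q", query,
        "-o", outdir,
        "-f", f"{outdir}/log.txt",
        "-k", str(limit),
        "-x",
        "-p",
    ]


query = build_query(
    ["viral epidemics", "human"],
    ["COVID", "corona virus", "SARS-Cov-2"],
)
command = getpapers_command(query, "disease", 950)
# shlex.join quotes the query so the line can be pasted into a POSIX shell.
print(shlex.join(command))
```

Keeping the terms in lists means a refined query only needs a changed list entry rather than a hand-edited string.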
- Disease [Details]
- Valid Disease Dictionary [Details]
- getpapers to create the corpus of 950 articles by downloading from EPMC.
- AMI for creating DataTables, creating and using dictionaries, and sectioning.
- SPARQL for creating dictionaries.
- Jupyter Notebook [Python] for binary classification & display.
- 50 articles corpus epidemic50noCov at - https://github.com/petermr/openVirus/tree/master/miniproject/epidemic50noCov
- 950 articles corpus disease at - https://github.com/petermr/openVirus/tree/master/miniproject/disease
- for getpapers - https://github.com/petermr/openVirus/wiki/getpapers#tester-2
- for installing ami - https://github.com/petermr/ami3/wiki/ami-installation
- for updating ami - https://github.com/petermr/openVirus/wiki/Tools:-ami3#updating-ami3
- for amidict/dictionary validation - https://github.com/petermr/openVirus/wiki/Tools:-ami3#amidict-validation
- for ami search - https://github.com/petermr/openVirus/wiki/ami-search
- for ami section - https://github.com/petermr/openVirus/wiki/ami:section
- for SPARQL - https://github.com/petermr/openVirus/wiki/Tools-:-SPARQL
- for the ML technique, Jupyter Notebook is used - https://github.com/petermr/openVirus/wiki/Jupyter-Notebooks#data-preparation-for-ml
- the Spanish corpus is at - https://github.com/petermr/openVirus/tree/master/miniproject/disease/spanish
(by collaborator Dheeraj)
Our first aim is to recognize diseases, because once they are recognized, medicines can be given for them.
In this mini project, we find the diseases mentioned in open access articles about "viral epidemic" with the help of the disease dictionary, using the ContentMine software (getpapers and ami).
- The disease dictionary holds the names of diseases, which helps in searching the articles for particular disease terms, just as an ordinary dictionary holds a store of words.
- Its sources are ICD-10 (by WHO) and Wikidata, and it was created using ami.
- It is a multilingual dictionary (it contains English, Hindi, Tamil, Kannada, Spanish and Portuguese).
- The corpus is a group of articles about viral epidemics and diseases. These articles contain information regarding diseases, which is to be simplified.
- It is a group of 950 articles that have been downloaded from EPMC via getpapers.
EPMC (Europe PubMed Central) is a website holding a large number of scientific research articles. We are analyzing some of the open access articles from EPMC for our mini-project, which are downloaded using getpapers.
- getpapers is ContentMine software capable of downloading large numbers of articles from EPMC.
- See https://github.com/petermr/openVirus/wiki/getpapers#use-of-getpapers for how to use it.
- ami is also ContentMine software. It is used for creating dictionaries, searching for the disease terms held in a dictionary, sectioning downloaded articles and gathering information from them.
- In this project, for example, we have created a dictionary of diseases.
- SPARQL is run through the query service provided by Wikidata, which includes everything from Wikipedia and more.
- In this mini project we needed the ICD-10 codes for diseases and wanted the results in different languages.
- Initially we obtained the following result: CLICK HERE for results in four languages.
- I have read about getpapers and EPMC, including advanced search in EPMC, and have read some of its articles.
- I read about Wikidata and learned to update the dictionary.
- I also updated the dictionary with ICD-10 codes with the help of the Wikidata Query Service.
- So far I have manually classified some articles as true and false positives.
- Created a SPARQL query for the multilingual (six languages) disease dictionary.
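To illustrate what a multilingual ICD-10 query could look like, here is a minimal sketch that only builds the SPARQL text in Python. It is not the project's actual query: the function name build_disease_query and the LIMIT value are hypothetical, the language codes are the standard ones for the six languages listed for the dictionary, and P494 is assumed to be the Wikidata property for ICD-10 codes (please verify before use). The printed query is meant to be pasted into the Wikidata Query Service, where the wdt: and rdfs: prefixes are predefined.

```python
# English, Hindi, Tamil, Kannada, Spanish, Portuguese
LANGUAGES = ["en", "hi", "ta", "kn", "es", "pt"]


def build_disease_query(languages, limit=100):
    """Build a SPARQL query for items with an ICD-10 code and one label per language."""
    select_vars = ["?item", "?icd10"] + [f"?label_{lang}" for lang in languages]
    lines = [
        "SELECT " + " ".join(select_vars) + " WHERE {",
        # Assumption: P494 is the Wikidata property holding the ICD-10 code.
        "  ?item wdt:P494 ?icd10 .",
    ]
    for lang in languages:
        # OPTIONAL keeps items that lack a label in some of the languages.
        lines.append(
            f'  OPTIONAL {{ ?item rdfs:label ?label_{lang} . '
            f'FILTER(LANG(?label_{lang}) = "{lang}") }}'
        )
    lines.append("}")
    lines.append(f"LIMIT {limit}")
    return "\n".join(lines)


sparql = build_disease_query(LANGUAGES)
print(sparql)
```

Generating the OPTIONAL clauses from a list means a further language can be added by appending one code rather than editing the query by hand.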
- As stated above, if diseases are known, then medicines can be given accordingly. Therefore, our main goal will be to find the names of the diseases that co-occur during viral epidemics and work accordingly.
- All the articles now have to be manually classified into true positives and false positives.
- Learning Python code in Jupyter Notebook to use in binary classification.
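As a starting point for the binary classification step, the sketch below compares predicted labels with manual labels and reports accuracy, precision and recall. It is only an assumption of how this could be set up: the page does not say which ML technique will be used, so a plain keyword rule stands in for the model here (it is a baseline, not an ML model), and the keyword list, article titles and manual labels are all invented for illustration.

```python
KEYWORDS = ("epidemic", "outbreak")  # hypothetical keyword list


def predict(text):
    """Baseline rule: call an article a positive if any keyword occurs in it."""
    lowered = text.lower()
    return any(word in lowered for word in KEYWORDS)


def evaluate(manual, predicted):
    """Compare predictions with manual labels (True = about a viral epidemic)."""
    pairs = list(zip(manual, predicted))
    tp = sum(1 for m, p in pairs if m and p)
    tn = sum(1 for m, p in pairs if not m and not p)
    fp = sum(1 for m, p in pairs if not m and p)
    fn = sum(1 for m, p in pairs if m and not p)
    total = len(pairs)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Invented examples: (title, manual label).
articles = [
    ("An outbreak of influenza in a rural district", True),
    ("Zika virus epidemic and associated disorders", True),
    ("Protein folding in yeast", False),
    ("An epidemic of obesity in adults", False),
    ("Dengue transmission dynamics", True),
]

manual = [label for _, label in articles]
predicted = [predict(title) for title, _ in articles]
metrics = evaluate(manual, predicted)
print(metrics)
```

The same evaluate function can score any later classifier, as long as its predictions are given in the same order as the manual labels from the spreadsheet.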
- The 950 article corpus was large, and hence running ami search on it raised an OutOfMemoryError.
- Hence, the disease corpus (Cproject) was split into 4 parts, each consisting of 200-250 Ctrees.
- Then ami search was run on each part successfully, which created DataTables.
- The test details are at https://github.com/petermr/openVirus/wiki/ami-search#running-ami-search-in-disease-dictionary
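The split described above can be scripted rather than done by hand. This is a minimal sketch under stated assumptions, not the procedure the team used: it assumes every immediate subdirectory of the Cproject is one Ctree, skips folders whose names start with an underscore (on the assumption that these are ami output folders such as _cooccurence), and copies rather than moves so the original corpus stays intact. The function name split_cproject and the _partN naming are hypothetical. The demo runs on a throwaway directory with made-up Ctree names.

```python
import shutil
import tempfile
from pathlib import Path


def split_cproject(cproject, n_parts):
    """Copy the Ctree directories of a Cproject into n_parts sibling directories."""
    cproject = Path(cproject)
    # Assumption: each subdirectory is one Ctree; "_"-prefixed folders are ami output.
    ctrees = sorted(
        p for p in cproject.iterdir() if p.is_dir() and not p.name.startswith("_")
    )
    parts = []
    for i in range(n_parts):
        part_dir = cproject.with_name(f"{cproject.name}_part{i + 1}")
        part_dir.mkdir(exist_ok=True)
        # Taking every n-th Ctree keeps the parts close to equal in size.
        for ctree in ctrees[i::n_parts]:
            shutil.copytree(ctree, part_dir / ctree.name, dirs_exist_ok=True)
        parts.append(part_dir)
    return parts


# Demo on a temporary, made-up Cproject with ten empty Ctrees.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "disease"
    for n in range(10):
        (demo / f"PMC{n:07d}").mkdir(parents=True)
    for part in split_cproject(demo, 4):
        print(part.name, len(list(part.iterdir())))
```

Top-level files in the Cproject (for example a log file) are not copied by this sketch; copy them separately if a later step needs them.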
- Initially, on Windows, amisearch created an empty _cooccurence folder.
- After debugging, AMI was updated, which gave the desired result in the _cooccurence folder.
- Thus the error was rectified.
(Reference from Ambreen's update)
- Download VS Code and clone the openVirus repository onto your system.
- Open the openVirus folder in VS Code (don't close it).
- Now open the openVirus folder in your file system and make your changes in it.
- Return to the VS Code window that was minimized. Commit the changes by selecting the commit symbol. This might take time, depending on the size of the files you are uploading.
- After adding the remote repository, push the changes to GitHub. See this video for further clarification.
NOTE: If you had already cloned the repository, first pull the repo and then push the changes.
- The ami search syntax used above relied on the built-in disease dictionary.
- To use the Valid Disease Dictionary, the whole path must be specified in the syntax as follows:
ami -p <Cproject> search --dictionary openVirus/cambiohack2020/dictionaries/disease.xml
NOTE: <Cproject> must be replaced by the name of your Cproject, the one that contains the Ctrees.
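When the same search has to be repeated over several Cprojects, the command above can be generated in a loop. The sketch below is an illustration only: the function names and the Cproject names in PARTS are hypothetical, and the only ami arguments used are the ones shown in the command above (-p, search, --dictionary). By default it prints the commands without running them; setting dry_run=False assumes ami is installed and on the PATH.

```python
import subprocess

# Dictionary path as given in the text above.
DICTIONARY = "openVirus/cambiohack2020/dictionaries/disease.xml"
# Hypothetical Cproject names, for illustration only.
PARTS = ["disease_part1", "disease_part2", "disease_part3", "disease_part4"]


def ami_search_command(cproject, dictionary):
    """Build the ami search command for one Cproject and one dictionary file."""
    return ["ami", "-p", str(cproject), "search", "--dictionary", str(dictionary)]


def run_on_parts(parts, dictionary, dry_run=True):
    """Print (dry run) or execute the ami search command for each Cproject."""
    commands = [ami_search_command(part, dictionary) for part in parts]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops at the first Cproject for which ami fails.
            subprocess.run(command, check=True)
    return commands


commands = run_on_parts(PARTS, DICTIONARY)
```

Passing the command as a list rather than a single string avoids shell quoting problems if a Cproject path contains spaces.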
- Using ami search with the Spanish dictionary gave the results [here].