ami search
From Andy Jackson:
I'm trying to understand the data flow of ami-search @Peter Murray-Rust -- am I right in thinking it goes:
1. Scan text and generate snippets XML per item.
2. Read snippets XML and generate frequencies/counts XML etc. per item.
3. Read per-item XML data and generate top-level/summary XML or CSV (latter for co-occurrence data).
4. Generate HTML versions of XML and CSV for use.
In particular, am I right in thinking that all the outputs are generated from the snippets?
`ami-search` processes a `CProject` and iterates over each `CTree`.
- It creates `scholarly.html`, probably from a "pseudo-make" where it skips [...]
Tester: Ambreen Hamadani
The `ami search` tool was used to test the country dictionary.
- `getpapers` was used to create a directory of 1000 papers (including full texts wherever available): `getpapers -q "viral epidemics" -o countr_dict -f v_epid/log.txt -x -p -k 1000`
- This directory was used to run `ami search` with the `country` dictionary: `ami -p countr_dict search --dictionary country`
- After a successful run, HTML documents were created that classified the papers by country and reported the frequency of each country.
ISSUES:
- `ami search` does not work unless the CProject directory is specified with `-p` before `search --dictionary`. For example, the command `ami search --dictionary country -p countr_dict1` throws the following error:
================================
-v to see generic values
Specific values (AMISearchTool)
================================
created COMMAND: word(frequencies)xpath:@count>20~w.stopwords:pmcstop.txt_stopwords.txt search(country) search(-p) search(countr_dict1)
0 [main] DEBUG org.contentmine.ami.tools.AbstractAMISearchTool - old style search command); to be changed
0 [main] DEBUG org.contentmine.ami.tools.AbstractAMISearchTool - old style search command); to be changed
>ERROR: requires cProject
The correct command in this case is: `ami -p countr_dict1 search --dictionary country`
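The ordering rule can be captured in a small helper. This is a minimal Python sketch (Python 3.8 or later), not part of `ami`: the function name `ami_search_command` is hypothetical, and it only encodes the ordering reported here, with `-p <CProject>` placed before the `search` subcommand.

```python
import shlex


def ami_search_command(cproject, dictionary):
    """Build an `ami search` argument list with -p ahead of the subcommand."""
    # The project option goes before `search`; placing it after
    # `--dictionary` produced "ERROR: requires cProject" in the run above.
    return ["ami", "-p", str(cproject), "search", "--dictionary", dictionary]


command = ami_search_command("countr_dict1", "country")
print(shlex.join(command))  # ami -p countr_dict1 search --dictionary country
```

The resulting list could be passed to `subprocess.run(command, check=True)`, assuming `ami` is on the `PATH`.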
- A large corpus of 950 articles with `XML` and `pdf` files was created (for the mini-project) using the syntax `getpapers -q "viral epidemics AND human NOT COVID NOT corona virus NOT SARS-Cov-2" -o mpc -f mpc/log.txt -x -p -k 950`.
- The corpus was segmented into 4 subfolders, each consisting of 200-250 CTree folders.
- `ami search` was run on each subfolder using the `disease` dictionary. The syntax used for the 1st subfolder was `ami -p 1-subfolder search --dictionary disease`.
- The output showed warnings and debug messages. XML documents and HTML DataTables were created in the subfolder based on the `disease` dictionary, together with their counts and the frequencies of the words that occur in the articles.
The HTML DataTable looked like this: https://drive.google.com/file/d/112nZnbZk-duJGQ88-NvNcIv7ItuA0k_Q/view?usp=sharing
- Initially, `ami search` was run on the complete 950-article corpus. `ami search` was able to create HTML files for some CTree folders, but the errors below appeared.
Caused by: java.lang.OutOfMemoryError: Java heap space
544001 [main] ERROR org.contentmine.cproject.args.DefaultArgProcessor - ERR! java.lang.RuntimeException: cannot run [runTransform] in --transform (OutOfMemoryError: Java heap space)
PMC7259790 java.lang.reflect.InvocationTargetException
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:564)
[...]
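- To rectify the `OutOfMemoryError`, set the environment variable `MAVEN_OPTS` using the command `set MAVEN_OPTS=-Xmx512m -XX:MaxPermSize=128m`. There must be no space before `=`, otherwise `set` defines a variable whose name ends in a space. On Java 8 and later the `-XX:MaxPermSize` option is ignored.
- The CProject (`mpc`) was segmented into 4 subfolders, each consisting of 200-250 CTree folders.
- Then the above syntax was used on each subfolder.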
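The split-and-rerun workaround can be scripted. The following is a rough Python sketch (Python 3.8 or later) under stated assumptions: every immediate subdirectory of the CProject is treated as a CTree, the sub-project names (`mpc-1`, `mpc-2`, ...) and both function names are made up for illustration, and the CTrees are copied rather than moved so the original corpus stays intact.

```python
import shutil
import subprocess
from pathlib import Path


def split_cproject(cproject, chunk_size=250):
    """Copy the CTrees of `cproject` into numbered sibling sub-projects."""
    src = Path(cproject).resolve()
    # Assumption: each immediate subdirectory is one CTree; plain files
    # such as log.txt are skipped.
    ctrees = sorted(p for p in src.iterdir() if p.is_dir())
    subprojects = []
    for start in range(0, len(ctrees), chunk_size):
        # Hypothetical naming scheme: <cproject>-1, <cproject>-2, ...
        dest = src.parent / f"{src.name}-{start // chunk_size + 1}"
        dest.mkdir(exist_ok=True)
        for ctree in ctrees[start:start + chunk_size]:
            shutil.copytree(ctree, dest / ctree.name, dirs_exist_ok=True)
        subprojects.append(dest)
    return subprojects


def search_each(subprojects, dictionary, run=subprocess.run):
    """Run `ami -p <subproject> search --dictionary <dictionary>` per chunk."""
    for sub in subprojects:
        # -p comes before the `search` subcommand, as in the commands above.
        run(["ami", "-p", str(sub), "search", "--dictionary", dictionary],
            check=True)
```

For the corpus described here, the call would be along the lines of `search_each(split_cproject("mpc"), "disease")`. A smaller `chunk_size` means more runs with fewer CTrees each; whether that avoids the heap error still depends on the JVM memory settings.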