How ami search works
Work In Progress!
AMI is a toolset for querying and analyzing a small-to-medium (up to 10,000) collection of documents, normally on local storage. It includes tools for downloading scientific papers, processing documents into sections and XML, analyzing components (text, tables, diagrams), creating dictionaries, and searching.
This document assumes you have AMI installed. If you haven't, please refer to https://github.com/petermr/ami3/blob/master/INSTALL.md or the AMI installation page.
FIXME: @petermr, please review and update the below.
A CProject is a directory structure that the AMI toolset uses to gather and process data. You would usually create a new CProject for each new question you want to answer (like "How effective are masks at protecting against COVID-19?").
Question: do I need to create a project before running `getpapers`?
A CTree is a subdirectory of a CProject that deals with a single paper. It contains the original paper (for example, a PDF), and the files created by the AMI toolset to analyse and transform the paper into other formats.
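For orientation, a small CProject might look roughly like this (the CTree names and files shown are illustrative, and the `results/` subtree only appears after a search has been run):

```
masks/                              <-- the CProject
├── PMC0000001/                     <-- one CTree per paper (names illustrative)
│   ├── fulltext.xml
│   ├── fulltext.pdf
│   ├── scholarly.html
│   └── results/
│       └── search/
│           └── <dictionaryName>/
│               └── results.xml
└── PMC0000002/
    └── fulltext.xml
```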
Use the `ami-makeproject` tool to create a project. Run the tool with `--help` for details.
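For example, something like the following (a sketch only: the `--rawfiletypes` option and its values are an assumption here, so confirm against `--help`):

```bash
# gather raw PDF/XML files under myproject/ into one CTree per document
# (--rawfiletypes is assumed; run `ami makeproject --help` to confirm the exact option)
ami -p myproject makeproject --rawfiletypes pdf,xml
```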
FIXME: *** but you no longer need to do this - getpapers creates the output directory you specify if it doesn't already exist. Remove? ***
WARN: A CProject can contain hundreds of CTrees and become very large. Be careful about committing a CProject to git!
FIXME: @petermr, please review and update the below.
A dictionary is a structured set of terms used by the AMI toolset to create a set of in-context snippets and occurrence counts for each term in each document in a result set. Dictionaries are stored in XML or JSON format. Here is an example.
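As a minimal sketch of the XML form (the entries are invented for illustration, and real AMI dictionaries typically carry more attributes, e.g. Wikidata IDs):

```xml
<dictionary title="masks">
  <entry term="N95" name="N95 respirator"/>
  <entry term="surgical mask" name="surgical mask"/>
  <entry term="FFP2" name="FFP2 respirator"/>
</dictionary>
```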
The AMI toolset contains many built-in dictionaries, and ContentMine provides additional ones, but you may need to create custom dictionaries to do ??? because ???.
Use the `ami-dictionary create` command to create a new dictionary. See https://github.com/petermr/openVirus/wiki/Creating-Dictionaries-from-Wikipedia-pages for details.
TODO: Any pitfalls, things to be careful about?
See the Overview page for an example that walks through the steps for "Search for N95 (masks) on EuropePMC".
It details some commonly used commands and their output.
Would it be an idea to have a diagram with common workflows?
makeproject → getpapers → dictionary create → search → pdf
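As a rough command-line illustration of that flow (the query, project name, and dictionary name are placeholders, and exact flags can differ between versions, so check each tool's `--help`):

```bash
# download up to 100 papers, with full-text XML, into a CProject called "masks"
# (getpapers creates the output directory if it doesn't already exist, as noted above)
getpapers -q "masks AND covid-19" -o masks -x -k 100

# search every CTree in the CProject with a dictionary
ami -p masks search --dictionary country
```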
Given we have a valid CProject folder, with text or HTML versions of each item, `ami-search` can be used to analyse all the items and aggregate the results. The overall data flow is:
- Make sure there is a `scholarly.html` in each CTree. If not, convert `fulltext.xml` to `scholarly.html` using `ami-transform` and a stylesheet (`nlm2html.xsl`, I think). (This uses a `make`-like strategy, i.e. it only converts once.)
- Set up an empty `full.dataTables.html` based on `dataTables.js`. This is 6 years old:

      <link rel="stylesheet" type="text/css" href="http://ajax.aspnetcdn.com/ajax/jquery.dataTables/1.9.4/css/jquery.dataTables.css"/>
      <script src="http://ajax.aspnetcdn.com/ajax/jQuery/jquery-1.8.2.min.js" charset="UTF-8" type="text/javascript"> </script>
      <script src="http://ajax.aspnetcdn.com/ajax/jquery.dataTables/1.9.4/jquery.dataTables.min.js" charset="UTF-8" type="text/javascript"> </script>

- Extract the bibliographic metadata from each `fulltext.html` into (a) col1: links to sources and full text, (b) col2: abbreviated bibliography for title and abstract. NEEDS: decent mouseover and readable display.
- Read the list of dictionaries (`--dictionary` option). NEEDS to check existence of the dictionary.
- Iterate over `fulltext.html` with each dictionary and capture each hit as a "word in context" snippet. Context is "pre", "exact" (the match), "post"; each is limited in length to 200 chars. A hit with `pre` and `post` has enough information to locate it in the document, since absolute coordinates are fragile (see the W3C annotation spec). Snippets are XML, listed in files named `results/search/<dictionaryName>/results.xml` or `empty.xml`. The `empty.xml` means no results (to make it easier to search without reading XML; ugly). A hypothetical snippet file is sketched after this list.
- Calculate word frequencies independently of dictionaries. `ami-search` reads stopword files for (a) common EN words, (b) common words in scholarly discourse (e.g. "journal", "method"...) and omits these. Words are split at whitespace. Stemming is applied (not sure how) and repeat words are stored in caches (Bloom filters - I think I implemented them).
- Read the snippets XML and generate frequencies/counts XML etc. per item. These are copied to the appropriate cell of the `dataTables.html` with a frequency cut-off. NEED much better display with mouseover, histograms or other icons, and links back to fulltext.
- Read the per-item XML data and generate top-level/summary XML or CSV (the latter for co-occurrence data). These data are direct children of `CProject`. This is messy and there should probably be a `__summary` directory child of `CProject`. Note: the only children of `CProject` should be either `CTree`s or `__*`. If we redesigned it, it would be better to have the `CTree`s in a `__ctrees` directory, but 5 years ago we didn't know where we'd be.
- Generate HTML versions of the XML and CSV for use.
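To make the "word in context" snippets concrete, the per-dictionary `results.xml` contains hits along these lines (a hypothetical sketch rather than the exact schema; element and attribute names may differ):

```xml
<results title="country">
  <result pre="cohorts recruited in hospitals across"
          exact="Germany"
          post="during the first wave of the pandemic"/>
  <result pre="compared with participants from"
          exact="India"
          post="the reported effect size was smaller"/>
</results>
```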
See https://github.com/petermr/openVirus/tree/master/examples/n95 for a worked example.
...Rough statement of how the process is implemented, where it has scaling/performance consequences?...
Implemented in Java (`AMISearchTool`), but, unfortunately, it uses an old pre-picocli commandline for message-passing. This needs to be rewritten, but it works. Scaling: there may be memory leaks for thousands of files (I think we create a `CTreeList` of the hits and this is not completely flushed). Otherwise it should be O(n) in time and O(1) in space.
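For the curious, here is a tiny, self-contained Java sketch of the snippet-capture idea described above (scan text for a dictionary term and keep up to 200 characters of pre/post context). It is an illustration only, not the actual `AMISearchTool` implementation:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration of AMI-style "word in context" snippets; not the real AMISearchTool code. */
public class SnippetSketch {

    private static final int MAX_CONTEXT = 200; // pre/post length cap, as described above

    /** Return each hit of `term` in `text` as "pre | exact | post", context capped at MAX_CONTEXT. */
    static List<String> capture(String text, String term) {
        List<String> snippets = new ArrayList<>();
        int from = 0;
        while (true) {
            int hit = text.indexOf(term, from);
            if (hit < 0) {
                break;
            }
            int end = hit + term.length();
            String pre = text.substring(Math.max(0, hit - MAX_CONTEXT), hit);
            String post = text.substring(end, Math.min(text.length(), end + MAX_CONTEXT));
            snippets.add(pre + " | " + term + " | " + post);
            from = end;
        }
        return snippets;
    }

    public static void main(String[] args) {
        String text = "Masks such as N95 respirators were evaluated; N95 filtration efficiency was high.";
        capture(text, "N95").forEach(System.out::println);
    }
}
```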
(more later)
The documentation that resulted from Peter's Tigr3ess workshop in Delhi may be useful. For example:
- Quick overview
- AMI installation - for Windows, Unix, MacOS. Detailed steps with screenshots.
- AMI getpapers tutorial
- AMI search tutorial
- AMI dictionaries tutorial
- AMI clean to clean an AMI corpus and start again, without deleting the files you downloaded
- Trouble-shooting - just in case ;-)