Maintaining and Updating Dictionaries
Tester: Ambreen H: When downloading dictionaries from SPARQL, the downloaded XML document has each column as a separate element rather than as an attribute within the element tag. Will this format work for ami search, or is there a way to convert it to the format ami search requires? I used SPARQL to get all relevant information about countries, including abbreviations, synonyms, URLs, country codes etc., which is not available in the country dictionary in ami. Downloaded dictionary for reference: https://github.com/petermr/openVirus/blob/master/dictionaries/test/country_wikidata.xml.xml
e.g.:
<?xml version='1.0' encoding='UTF-8'?>
<sparql xmlns='http://www.w3.org/2005/sparql-results#'>
<head>
<variable name='wikidata'/>
<variable name='wikidataLabel'/>
<variable name='wikipedia'/>
<variable name='wikidataAltLabel'/>
<variable name='synonyms'/>
</head>
<results>
<result>
<binding name='wikidata'>
<uri>http://www.wikidata.org/entity/Q16</uri>
</binding>
<binding name='synonyms'>
<literal>🇨🇦</literal>
</binding>
<binding name='wikipedia'>
<uri>https://en.wikipedia.org/wiki/Canada</uri>
</binding>
<binding name='wikidataLabel'>
<literal xml:lang='en'>Canada</literal>
</binding>
<binding name='wikidataAltLabel'>
<literal xml:lang='en'>CA, ca, CDN, can, CAN, British North America, 🇨🇦, Dominion of Canada</literal>
</binding>
</result>
<!-- further <result> elements omitted -->
</results>
</sparql>
PMR: You can use the existing dictionary for ami search at present, but the dictionary itself has many shortcomings and needs extensive editing. See https://github.com/petermr/ami3/blob/master/src/main/resources/org/contentmine/ami/plugins/dictionary/country.xml
Because of that, and because almost all the content for country will be in Wikidata, SPARQL will give a better dictionary. I will write an amidict tool to convert the SPARQL output to amidict format.
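Until such a tool exists, the conversion can be sketched in a few lines of Python. This is only an illustration of the idea, not the amidict tool itself: the function name `sparql_to_dictionary` and the output attribute names (`term`, `name`, `wikidataID`, `wikipediaURL`) and the `synonym` child element are assumptions and may not match what ami search actually expects. The binding names (`wikidata`, `wikidataLabel`, `wikipedia`, `wikidataAltLabel`) are taken from the sample above.

```python
import xml.etree.ElementTree as ET

# Namespace used by the SPARQL XML results format (see the sample above).
SPARQL_NS = {"sr": "http://www.w3.org/2005/sparql-results#"}


def sparql_to_dictionary(sparql_xml, title):
    """Turn SPARQL XML results into a <dictionary> with one <entry> per result.

    The output attribute and element names are assumptions, not the
    confirmed amidict schema.
    """
    root = ET.fromstring(sparql_xml)
    dictionary = ET.Element("dictionary", {"title": title})
    for result in root.findall("sr:results/sr:result", SPARQL_NS):
        # Map each binding name to the text of its <uri> or <literal> child.
        values = {}
        for binding in result.findall("sr:binding", SPARQL_NS):
            child = next(iter(binding), None)
            text = child.text if child is not None and child.text else ""
            values[binding.get("name")] = text.strip()
        label = values.get("wikidataLabel", "")
        entry = ET.SubElement(dictionary, "entry", {
            "term": label,
            "name": label,
            # Keep only the Q-number from the entity URI.
            "wikidataID": values.get("wikidata", "").rsplit("/", 1)[-1],
            "wikipediaURL": values.get("wikipedia", ""),
        })
        # Alternative labels arrive as one comma-separated string.
        seen = set()
        for alt in values.get("wikidataAltLabel", "").split(","):
            alt = alt.strip()
            if alt and alt not in seen:
                seen.add(alt)
                ET.SubElement(entry, "synonym").text = alt
    return dictionary


# A cut-down copy of the sample result shown above.
SAMPLE = """<sparql xmlns='http://www.w3.org/2005/sparql-results#'>
<results>
<result>
<binding name='wikidata'><uri>http://www.wikidata.org/entity/Q16</uri></binding>
<binding name='wikipedia'><uri>https://en.wikipedia.org/wiki/Canada</uri></binding>
<binding name='wikidataLabel'><literal xml:lang='en'>Canada</literal></binding>
<binding name='wikidataAltLabel'><literal xml:lang='en'>CA, ca, CDN, can, CAN</literal></binding>
</result>
</results>
</sparql>"""

if __name__ == "__main__":
    converted = sparql_to_dictionary(SAMPLE, "country")
    print(ET.tostring(converted, encoding="unicode"))
```

Each result becomes one entry whose values sit in attributes rather than in child elements, which is the shape the question asks about. Short alternative labels such as "can" are kept as they come from Wikidata, so they would still need manual review before being used as search terms.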
A dictionary has several purposes:
- to collect together concepts we are interested in under a single label (e.g. country)
- to provide an identifier for each concept
- to provide search terms for each concept to locate it in the documents we search
- to link the concept to the world's knowledge graph.
- to help human readers understand the concepts.
- to provide a record of provenance and maintenance
All the words for countries that could appear in any research paper must be available in our country dictionary
or better "all the words describing countries where viral epidemics have been reported/discussed". That can be hard ("Himalayan", "North Atlantic", "Sub-Saharan", etc.). But generally, academic papers will mention one or more countries specifically. @Emanuel Faria has done this for plants ("where do essential oils come from?"). For that a country ("India") is too broad: we might want "Goa" or "Rajasthan". We may have to be more specific, e.g. "Wuhan" rather than "China". But for the moment let's work with countries.
We must also ensure that the country names in the dictionary are currently recognized countries (not, for instance, ancient empires). In my opinion, it must also contain the following:
- All the synonyms of the country: synonyms. Yes. "England", "Scotland", "Britain", "United Kingdom" are all widely used.
- All the common abbreviations: Yes. UK, GB, NI, for example. Abbreviations often cause ambiguity.
- Maybe even translations in other important world languages: Translations. Absolutely. If we are going to explore Hindi we will need a term.hi attribute. Wikidata has these if they are the titles of Wikipedia pages.
- The current dictionary has empty entity-tags for Wikipedia as well as Wikidata, which must also be present for redirection to the source pages: Yes. The tags were autogenerated to show they should be filled by hand.
The amidict software is, in principle, able to find Wikidata and Wikipedia links. But these are often ambiguous. In cases like country I expect it will be the leading one found. Manual checking is always required. This is an excellent thing for incoming INYAS to help with.
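As a starting point for that manual checking, a small script can list the entries whose Wikidata or Wikipedia links are still empty. The sketch below is illustrative only: it assumes entries are `entry` elements carrying `term`, `wikidata` and `wikipedia` attributes, which may not match the real country.xml, and the function name `entries_missing_links` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed names of the link attributes on each <entry>; adjust to the real file.
LINK_ATTRIBUTES = ("wikidata", "wikipedia")


def entries_missing_links(dictionary_xml):
    """Return {term: [link attributes that are absent or empty]}."""
    root = ET.fromstring(dictionary_xml)
    missing = {}
    for entry in root.iter("entry"):
        empty = [
            name for name in LINK_ATTRIBUTES
            if not (entry.get(name) or "").strip()
        ]
        if empty:
            missing[entry.get("term", "")] = empty
    return missing


# Hypothetical dictionary fragment used only to exercise the function.
SAMPLE_DICTIONARY = """
<dictionary title="country">
  <entry term="Canada" wikidata="Q16" wikipedia="https://en.wikipedia.org/wiki/Canada"/>
  <entry term="India" wikidata="" wikipedia=""/>
  <entry term="United Kingdom" wikipedia="https://en.wikipedia.org/wiki/United_Kingdom"/>
</dictionary>
"""

if __name__ == "__main__":
    for term, attributes in entries_missing_links(SAMPLE_DICTIONARY).items():
        print(term, "is missing:", ", ".join(attributes))
```

The script only reports gaps. Deciding which Wikidata item or Wikipedia page is the right one for an ambiguous term remains a job for a human checker.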