Dictionary: Overview
The purpose of Dictionaries in the openVirus project is:
- to identify words and phrases ("entities") in the documents (running text and images).
- to provide (computable) links to their meaning and context ("ontologies").
- to collect a subset of terms representing a high-level concept ("virus", "disease", "country"...).
The benefits include:
- understanding the meanings of words.
- background reading.
- aggregation ("searching") for the same or related entities in the corpus (collection of documents).
- building computable knowledge networks/graphs.
- classifying documents.
This can be described as ontological annotations in semantic networks.
There are many established uses of such annotations:
We are often put off by unfamiliar terms, e.g. "nosocomial infections". Wikipedia has an article on https://en.wikipedia.org/wiki/Hospital-acquired_infection:
A hospital-acquired infection (HAI), also known as a nosocomial infection (from the Greek "νοσοκομιακός" / "nosokomiakos", meaning "of the hospital"), is an infection that is acquired in a hospital or other health care facility.
With mouseover or footnotes this can dramatically improve speed of reading.
Annotations are easily aggregated in indexes or search engines.
People may confuse COVID-19 (disease) with coronavirus (a virus).
As an example from Wikipedia (https://en.wikipedia.org/wiki/Coronavirus_disease_2019):
Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).
This sentence links Coronavirus disease 2019 (COVID-19) to severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Indeed we can write:
- COVID-19 isA disease
- COVID-19 isCausedBy SARS-CoV-2
Ami's annotations allow software to discover and use such annotations. For example, we can find all diseases that are caused by viruses (every disease with an isCausedBy link to a virus).
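The two statements above can be treated as (subject, predicate, object) triples and queried. The following is a minimal sketch, not ami's actual implementation: the list-of-tuples representation and the helper names `instances_of` and `diseases_caused_by_viruses` are hypothetical, and the third triple (SARS-CoV-2 isA virus) is added here only so that the query has something to match.

```python
# Triples as (subject, predicate, object); the predicate names follow the text.
triples = [
    ("COVID-19", "isA", "disease"),
    ("COVID-19", "isCausedBy", "SARS-CoV-2"),
    ("SARS-CoV-2", "isA", "virus"),  # added for illustration, not in the list above
]


def instances_of(triples, kind):
    """Return every subject that has an isA link to `kind`."""
    return {s for s, p, o in triples if p == "isA" and o == kind}


def diseases_caused_by_viruses(triples):
    """Return (disease, virus) pairs joined by an isCausedBy link."""
    diseases = instances_of(triples, "disease")
    viruses = instances_of(triples, "virus")
    return sorted(
        (s, o)
        for s, p, o in triples
        if p == "isCausedBy" and s in diseases and o in viruses
    )


pairs = diseases_caused_by_viruses(triples)
print(pairs)  # [('COVID-19', 'SARS-CoV-2')]
```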
What's "Zika"?
https://en.wikipedia.org/wiki/Zika_(disambiguation) tells us:
Zika, or Zika fever, is an illness caused by the Zika virus. Zika or Žika may also refer to:
- Zika virus, a member of the Flaviviridae virus family
- Zika Forest, a forest in Uganda
- Zika rabbit, a breed of rabbit
People
Surname
Adolf Zika (born 1972), Czech photographer
... many more ...
We can label the different concepts by using a unique identifier system as in Wikidata.
Dictionaries have a simple format, best supported by XML or JSON (currently mainly XML). This defines certain elements and attributes (in <element att1="attval1" att2="attval2" ... >). We are developing validation software. In general:
- unknown elements are ignored
- <desc> and <entry> and <alternative> are optional and repeatable.
- all attributes except dictionary/@title are optional (at this stage)
- order of elements and attributes is irrelevant (but worth making pretty and consistent)
The <dictionary> element is the root element and contains the title, which MUST be a single word and MUST be the base of the filename, e.g. virus.xml must have the structure:
<dictionary title="virus">
...
</dictionary>
There is no XML namespace.
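The rules for the root element can be checked mechanically. The sketch below is one possible check and is not the validation software mentioned above. The function name `check_dictionary` is hypothetical, and "single word" is interpreted here as "contains no whitespace", which is an assumption. Unknown child elements are not inspected at all, matching the rule that they are ignored.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path


def check_dictionary(xml_text, filename):
    """Return a list of problems; an empty list means the basic rules hold."""
    problems = []
    root = ET.fromstring(xml_text)
    # A namespaced root would appear as "{uri}dictionary" and fail this test,
    # which is consistent with "there is no XML namespace".
    if root.tag != "dictionary":
        problems.append(f"root element is <{root.tag}>, expected <dictionary>")
        return problems
    title = root.get("title")
    if title is None:
        problems.append("dictionary/@title is missing")
        return problems
    # Assumption: a "single word" is a non-empty string without whitespace.
    if not re.fullmatch(r"\S+", title):
        problems.append(f"title {title!r} is not a single word")
    if title != Path(filename).stem:
        problems.append(f"title {title!r} is not the base of {filename!r}")
    return problems


print(check_dictionary('<dictionary title="virus"><foo/></dictionary>', "virus.xml"))
print(check_dictionary('<dictionary title="virus"/>', "disease.xml"))
```

The first call reports no problems even though <foo> is an unknown element; the second reports the mismatch between the title and the filename.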
There is a header of zero or more <desc> description elements, though we may enforce mandatory elements later. These can describe metadata such as dates, maintenance, provenance, etc. They are not yet standardised but will be.
<dictionary title="virus">
<desc date="2020-06-21" author="Peter Murray-Rust">created dictionary from Wikipedia https://en.wikipedia.org/wiki/List_of_virus_taxa after manual removal of invertebrate hosts</desc>
<desc date="2020-06-22" author="Peter Murray-Rust">removed further non-relevant viruses (Q1234567, Q2345678 ...)</desc>
<desc date="2020-06-23" author="Peter Murray-Rust">reassigned Wikidata IDs (Q9876543, Q9876876) for incorrect
automatic assignments</desc>
</dictionary>
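Because the header is ordinary XML, the provenance records can be read with any XML parser. The sketch below uses Python's standard library. The function name `header_descriptions` is hypothetical, the description texts in the sample are shortened, and since the <desc> attributes are not yet standardised, `date` and `author` are simply read if present.

```python
import xml.etree.ElementTree as ET

# Shortened version of the header shown above.
VIRUS_XML = """<dictionary title="virus">
  <desc date="2020-06-21" author="Peter Murray-Rust">created dictionary from Wikipedia</desc>
  <desc date="2020-06-22" author="Peter Murray-Rust">removed further non-relevant viruses</desc>
</dictionary>"""


def header_descriptions(xml_text):
    """Return (date, author, text) for each <desc> directly under <dictionary>."""
    root = ET.fromstring(xml_text)
    # findall("desc") matches direct children only, so any <desc> nested
    # inside an <entry> is not mistaken for dictionary-level metadata.
    return [
        (d.get("date"), d.get("author"), (d.text or "").strip())
        for d in root.findall("desc")
    ]


history = header_descriptions(VIRUS_XML)
for date, author, text in history:
    print(date, author, text)
```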
The main components of a dictionary are entries, which are still slightly evolving. An entry is a well-defined object which can normally be mapped / linked to a Wikidata item. This gives it a unique identifier (Q-number), removing the need to maintain identifiers. A typical entry (with the new element synonym and more use of desc with new attributes):
<dictionary title="miniterpenes">
<entry term="borneol" wikipedia="borneol" wikidata="Q27089413" name="(-)-borneol" description="chemical compound" id="CM.myterpenes.0" term.hi="बोर्निऑल" term.it="borneolo" term.zh="冰片" regex="(\([+-]\)\-)?[Bb]orneol">
<desc date="2020-07-22">added Bornyl-alcohol synonym</desc>
<alternative>(-)-Bornyl alcohol</alternative>
</entry>
...
</dictionary>
- the term is the unique lexical string (word) defining the entry. Terms are always lowercase and always start with a letter. The term may or may not be the linguistic entity in documents.
- the name is the preferred name for the term. It is case-sensitive, and will often occur in text. name and term may or may not be identical words.
- term.xx can occur as language equivalents where xx is the appropriate 2- or 3-letter language code. See https://en.wikipedia.org/wiki/ISO_639-2. These can often be picked up from the links to Wikipedia pages from a Wikidata item (bottom of page). (Experimental).
- regex is a regular expression for locating possible matches in text. This one finds (-)-borneol, (+)-borneol, and borneol.
- description is a human-readable string describing the entry. However it is often created directly from Wikidata and may be used for grouping or disambiguation.
- wikipedia is the name of the Wikipedia page. It is often the term (for single words). It may not have spaces and may have escaped punctuation. It resolves to (e.g. for EN) https://en.wikipedia.org/wiki/<wikipedia>
- wikidata is the identifier of the Wikidata item, always of the form Qddddd.. (occasionally Pddd...). It resolves to https://wikidata.org/wiki/<wikidata>. There is only one identifier for a Wikidata item and the relationships and graphs are language-independent.
- id is a local autogenerated ID and is not stable.
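To make the roles of regex, wikipedia and wikidata concrete, the sketch below parses a cut-down copy of the borneol entry, runs its regex over a made-up sentence, and builds the two URLs using the patterns given in the list above. The function names and the sample sentence are hypothetical. The regex is written here as `(\([+-]\)\-)?[Bb]orneol` so that it matches the three forms named in the list.

```python
import re
import xml.etree.ElementTree as ET

# Cut-down copy of the borneol entry (raw string keeps the regex backslashes).
ENTRY_XML = r'''<entry term="borneol" wikipedia="borneol" wikidata="Q27089413"
    name="(-)-borneol" regex="(\([+-]\)\-)?[Bb]orneol"/>'''


def find_matches(entry, text):
    """Return every substring of `text` matched by the entry's regex attribute."""
    pattern = re.compile(entry.get("regex"))
    return [m.group(0) for m in pattern.finditer(text)]


def entry_links(entry, lang="en"):
    """Build the Wikipedia and Wikidata URLs from the entry attributes."""
    return {
        "wikipedia": f"https://{lang}.wikipedia.org/wiki/{entry.get('wikipedia')}",
        "wikidata": f"https://wikidata.org/wiki/{entry.get('wikidata')}",
    }


entry = ET.fromstring(ENTRY_XML)
sample = "Both (-)-borneol and (+)-borneol occur; Borneol is also used."
matches = find_matches(entry, sample)
links = entry_links(entry)
print(matches)  # ['(-)-borneol', '(+)-borneol', 'Borneol']
print(links["wikidata"])  # https://wikidata.org/wiki/Q27089413
```

The optional group `(\([+-]\)\-)?` is what allows the bare word to match as well as the two prefixed forms.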
We are introducing 2 children of entry:
- desc has the same semantics as desc for dictionary
- <alternative>. These are alternative lexical forms for the term. There are deliberately no semantics. They may or may not be exact synonyms, and may or may not be narrower/broader terms. These ontological relations can often be obtained from Wikidata.
- dictionaries will provide search terms (term, name, regex, alternative) for ami, Lucene/Solr or KNIME.
- dictionaries provide a link to Wikipedia pages or Wikidata Items. Annotation software can create hyperlinks for humans to follow.
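As an illustration of both points, the sketch below collects the literal search strings of each entry (term, name and alternative) and maps them to the entry's Wikidata URL, which an annotator could then turn into a hyperlink. This is a plain lookup table, not how ami, Lucene/Solr or KNIME actually consume dictionaries. The function name `search_index` is hypothetical, lowercasing the keys is an assumption, and regex is left out because it is a pattern rather than a literal string.

```python
import xml.etree.ElementTree as ET

# Cut-down dictionary based on the borneol example.
DICTIONARY_XML = """<dictionary title="miniterpenes">
  <entry term="borneol" name="(-)-borneol" wikidata="Q27089413">
    <alternative>(-)-Bornyl alcohol</alternative>
  </entry>
</dictionary>"""


def search_index(xml_text):
    """Map each lowercased search string to the Wikidata URL of its entry."""
    index = {}
    for entry in ET.fromstring(xml_text).findall("entry"):
        wikidata = entry.get("wikidata")
        if wikidata is None:
            continue  # all attributes are optional; skip entries with no link target
        url = f"https://wikidata.org/wiki/{wikidata}"
        strings = [entry.get("term"), entry.get("name")]
        strings += [alt.text for alt in entry.findall("alternative")]
        for s in strings:
            if s:
                index[s.strip().lower()] = url
    return index


index = search_index(DICTIONARY_XML)
for key, url in index.items():
    print(key, "->", url)
```

Because <alternative> carries no semantics, every alternative points at the same item as the term; distinguishing exact synonyms from broader or narrower terms would need the relations held in Wikidata.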
Conventional dictionaries take a lot of effort to create and maintain, particularly if they contain ontological relationships. Often only specialist maintainers can do this. ContentMine dictionaries remove this problem by reducing the problem to a selection of relevant terms. Often this selection is already made, in Wikipedia pages, or other collections. Many dictionaries are thus "views" (subsets) of Wikidata. There are several ways of doing this.
Create a list of terms that you think are relevant