
Dictionaries: Deployment and Location

petermr edited this page Jul 27, 2020 · 3 revisions

deployment and location of dictionaries

Currently dictionaries are deployed in several ways, none of them well documented. The dictionaries are not always easily editable and may be incompatible with one another. Here we suggest a more structured approach. Please comment.

current structure

Currently (2020-07) dictionaries are addressable by:

  • resources. These are accessed from within ami as resources on the classpath. This makes them relatively easy to address by name, but very difficult to browse, index, search, edit or copy. This is currently the commonest approach, but such dictionaries are essentially read-only except for ami developers. Their main advantage is that they can be bundled in a jar, although they take up space.
  • local files. This is the primary way of developing personal dictionaries. It's flexible but requires the user to know absolute or relative filenames.
  • dictionary directory. Distribution of all "approved" dictionaries as (nested) directories. This requires the user to provide the address of the directory, or the program to have a symbol for it (e.g. from an environment variable such as $DICTIONARY_TOP).
  • URL. Read the dictionary from a (public) URL. Only useful for single dictionaries (unless we develop an index system).

criteria for a new system

symbols where possible

Users are used to simple names such as country or funder. Supporting these will require an address resolver (e.g. one or more root directories or URLs).

cascading

A symbol can be matched in more than one place. A typical priority could be:

  • personal dictionary ($PERSONAL)
  • builtin dictionaries (AMI, only available with source code or jars)
  • $DICTIONARY (downloaded from community site)
  • URLs

The cascading can be overridden by giving fully qualified addresses. The suffix .xml can be omitted, but dictionaries in other formats must be specified.
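The cascade described above can be sketched in a few lines of Python. This is only an illustration of the proposal, not ami code: the function names resolve_symbol and cascade_roots are hypothetical, the environment variable names PERSONAL, AMI and DICTIONARY are taken from the example on this page, URL lookup (the last step in the priority list) is left out, and symbols are assumed to contain no dots other than a format suffix.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional, Sequence


def cascade_roots(env: Mapping[str, str] = os.environ) -> List[Path]:
    """Build the search order PERSONAL, AMI (builtin), DICTIONARY.

    Variables that are unset or empty are skipped.
    """
    names = ("PERSONAL", "AMI", "DICTIONARY")
    return [Path(env[name]) for name in names if env.get(name)]


def resolve_symbol(symbol: str, roots: Sequence[Path]) -> Optional[Path]:
    """Return the first existing file for `symbol`, trying `roots` in order."""
    # The .xml suffix may be omitted; any other format must be spelled out.
    filename = symbol if Path(symbol).suffix else symbol + ".xml"
    for root in roots:
        candidate = root / filename
        if candidate.is_file():
            return candidate
    return None  # not found in any root; a real resolver might try URLs next
```

With this sketch, resolve_symbol("funder", cascade_roots()) returns the personal copy of funder.xml if one exists, otherwise the builtin one, otherwise the community one.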

example

A user might issue the command

ami search --dictionary /usr/local/pm286/dictionary/disease.xml https://contentmine.org/dictionary/virus.xml $AMI/drug $DICTIONARY/country.xml funder 
  • the disease dictionary is located precisely
  • so is virus
  • drug is bundled with the ami system (obsolescent?)
  • country is installed by the user in a separate directory, whose location is set by the environment variable DICTIONARY
  • funder is resolved by cascade. (a) search PERSONAL ; if fail (b) search (builtin) AMI; if fail (c) search DICTIONARY.
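To make the distinction between the five arguments concrete, here is a minimal Python sketch of how a resolver might decide which mechanism applies to each one. The function classify_address and its three labels are hypothetical, not part of ami. It assumes the shell has already expanded $AMI and $DICTIONARY, so those arguments arrive as ordinary paths.

```python
from urllib.parse import urlparse


def classify_address(arg: str) -> str:
    """Classify a --dictionary argument as 'url', 'file' or 'symbol'."""
    if urlparse(arg).scheme in ("http", "https"):
        return "url"      # fetched directly from the given address
    if "/" in arg or "\\" in arg:
        return "file"     # fully qualified or relative path, no cascade
    return "symbol"       # bare name, resolved by the cascade


for arg in (
    "/usr/local/pm286/dictionary/disease.xml",
    "https://contentmine.org/dictionary/virus.xml",
    "funder",
):
    print(arg, "->", classify_address(arg))
```

Only the bare name funder is classified as a symbol and goes through the cascade; the other two are used exactly as given.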

namespaces

It will be easy to get several dictionaries with the same name. We therefore prefix them with semi-unique strings, similar to XML namespaces. We reserve

  • CM for ContentMine

and a semi-unique string such as pmr, cev, ov... When we start getting collisions we'll be successful enough to implement namespace-mapping to domain names.
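As an illustration of how prefixes keep same-named dictionaries apart, the following Python sketch splits a prefixed name into its namespace and local name. The page does not fix a syntax for the prefix, so the ":" separator (borrowed from XML qualified names) and the function split_namespace are assumptions made purely for this example.

```python
from typing import Optional, Tuple

# CM is reserved for ContentMine (from this page); other prefixes are semi-unique.
RESERVED_PREFIXES = {"CM": "ContentMine"}


def split_namespace(name: str, sep: str = ":") -> Tuple[Optional[str], str]:
    """Split 'prefix:local' into (prefix, local); prefix is None if absent.

    The separator is an assumption; the proposal does not specify one.
    """
    prefix, found, local = name.partition(sep)
    if not found:
        return None, name
    return prefix, local
```

Under this assumption pmr:country and cev:country are distinct dictionaries even though both have the local name country.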

deployment and use

The most likely strategy is for us to create a communal dictionary folder which is distributed to those wanting a range of dictionaries. We need your feedback. With this strategy:

  • authors will commit their public dictionaries to a single repo site (e.g. https://github.com/contentmine/dictionary)
  • users will download the same resource and link it to DICTIONARY
  • users will create their own local dictionaries and link them to PERSONAL
  • users can also share single dictionaries with URLs.