Dictionaries: Deployment and Location
Dictionaries are currently deployed in several ways, and these are not well documented. The dictionaries are not always easily editable and may be incompatible. Here we suggest a more structured approach. Please comment.
Currently (2020-07) dictionaries are addressable by:
- resources. These are accessed from within ami as resources in the classpath. This makes them relatively easy to locate by name, but very difficult to browse, index, search, edit, or copy. This is the commonest current usage, but the dictionaries are essentially read-only except for ami developers. Their main advantage is that they can be bundled in a jar, but they take up space.
- local files. This is the primary way of developing personal dictionaries. It is flexible but requires the user to know absolute or relative filenames.
- dictionary directory. All "approved" dictionaries are distributed as (nested) directories. This requires the user to provide the address of the directory, or the program to have a symbol for it (e.g. from an environment variable such as $DICTIONARY_TOP).
- URL. Read the dictionary from a (public) URL. Only useful for single dictionaries (unless we develop an index system).
Users are used to simple names such as country, funder. This will require an address resolver (e.g. one or more root directories or URLs).
A symbol can be matched in more than one place. A typical priority could be:
- personal dictionary ($PERSONAL)
- builtin dictionaries (AMI, only available with source code or jars)
- $DICTIONARY (downloaded from community site)
- URLs
The cascading can be overridden by giving fully qualified addresses. The suffix .xml can be omitted, but dictionaries in other formats must give their suffix explicitly.
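The cascade could be sketched roughly as below. This is only an illustration, not how ami resolves dictionaries: the function name resolve_by_cascade and the roots argument are hypothetical, every root is assumed to be an ordinary directory (so the builtin classpath resources and URLs are left out), and "no suffix" is assumed to mean "append .xml".

```python
from pathlib import Path
from typing import Optional, Sequence


def resolve_by_cascade(name: str, roots: Sequence[Path]) -> Optional[Path]:
    """Return the first existing dictionary file for `name`, trying roots in order.

    Hypothetical helper: `roots` stands in for the priority list above,
    e.g. the directories behind $PERSONAL and then $DICTIONARY.
    """
    # Assumption: a name without any suffix is an XML dictionary;
    # a name that already has a suffix (.xml, .json, ...) is used as given.
    filename = name if Path(name).suffix else name + ".xml"
    for root in roots:
        candidate = root / filename
        if candidate.is_file():
            # The first match wins, so earlier roots override later ones.
            return candidate
    # Nothing matched in any root; the caller decides how to report this.
    return None
```

With roots given as the personal directory followed by the community directory, a country.xml in the personal directory would shadow one of the same name in the community directory, which is the override behaviour the priority list implies.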
A user might issue the command
ami search --dictionary /usr/local/pm286/dictionary/disease.xml https://contentmine.org/dictionary/virus.xml $AMI/drug $DICTIONARY/country.xml funder
- the disease dictionary is located precisely
- so is virus
- drug is bundled with the ami system (obsolescent?)
- country is installed by the user in a separate directory, whose location is set by the environment variable DICTIONARY
- funder is resolved by cascade: (a) search PERSONAL; if that fails (b) search (builtin) AMI; if that fails (c) search DICTIONARY.
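To make the distinction between these cases concrete, here is a minimal sketch that sorts each --dictionary argument into one of three kinds. The function name classify_address and the labels "url", "qualified" and "bare" are invented for this illustration, and the rule "anything containing a slash is fully qualified" is an assumption rather than documented ami behaviour.

```python
from urllib.parse import urlparse


def classify_address(arg: str) -> str:
    """Label a dictionary argument as 'url', 'qualified', or 'bare' (hypothetical labels)."""
    # A web address is read directly from the network.
    if urlparse(arg).scheme in ("http", "https"):
        return "url"
    # Assumption: any path separator means the user gave a full address,
    # whether absolute (/usr/local/...) or rooted at a symbol ($AMI/..., $DICTIONARY/...).
    if "/" in arg:
        return "qualified"
    # A simple name such as funder is left to the cascade.
    return "bare"


# The arguments from the example command above.
args = [
    "/usr/local/pm286/dictionary/disease.xml",
    "https://contentmine.org/dictionary/virus.xml",
    "$AMI/drug",
    "$DICTIONARY/country.xml",
    "funder",
]
kinds = [classify_address(a) for a in args]
```

Only funder comes out as "bare" here, so it is the only argument that would go through the cascade.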
It will be easy to get several dictionaries with the same name. We therefore prefix them with semi-unique strings, similar to XML namespaces. We reserve
- CM for ContentMine
- semi-unique strings such as pmr, cev, ov ...
When we start getting collisions we'll be successful enough to implement namespace-mapping to domain names.
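This page does not say how a prefix is attached to a dictionary name. Purely as an illustration, the sketch below assumes a colon separator, as in XML qualified names (so CM:country); the separator, the function name split_prefix and the RESERVED_PREFIXES table are all assumptions. It is meant for simple names only, not for URLs or file paths.

```python
from typing import Optional, Tuple

# The only prefix this page reserves; others (pmr, cev, ov ...) are per-author strings.
RESERVED_PREFIXES = {"CM": "ContentMine"}


def split_prefix(name: str, sep: str = ":") -> Tuple[Optional[str], str]:
    """Split 'prefix:local' into (prefix, local); unprefixed names give (None, name).

    The ':' separator is an assumption borrowed from XML namespaces.
    """
    prefix, found, local = name.partition(sep)
    if not found:
        # No separator present, so the whole string is the dictionary name.
        return None, name
    return prefix, local
```

For example, split_prefix("CM:country") would give ("CM", "country"), and the prefix could then be looked up in RESERVED_PREFIXES.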
The most likely strategy is for us to create a communal dictionary folder which is distributed to those wanting a range of dictionaries. We need your feedback. With this strategy:
- authors will commit their public dictionaries to a single repo site (e.g. https://github.com/contentmine/dictionary)
- users will download the same resource and link it to DICTIONARY
- users will create their own local dictionaries and link them to PERSONAL
- users can also share single dictionaries with URLs.