
Dictionaries Validation

Ambreen H edited this page Sep 16, 2020 · 23 revisions

validation

This wikipage can be used for peer-review of the dictionaries in the run-up to the hackathon.

general principles of validation

A project-pair (A and B) should review each other's dictionaries. Here are some criteria:

  • do they use standard fields and names?
  • do they have provenance (how they were created)?
  • do they work in ami search?

the content.

  • are there entries which should be removed?
  • are there mislabelled or mislinked entries (e.g. wikidata links to scientific articles)?
  • are there syntax or encoding problems?
  • are there multilingual entries?

===== country =====

overview

The dictionary was created using Wikidata SPARQL Query: SPARQL QUERY HYPERLINK

It was later converted into the standard format using amidict

amidict -vv --dictionary country --directory ami_12_08_2020/amidict --input ami_12_08_2020/country.xml create --informat=wikisparqlxml --sparqlmap wikidata=wikidata,term=term,name=wikidataLabel,description=wikidataDescription,wikipedia=wikipedia,_iso3166=_iso3166 --synonyms=synonym

ami -p ami_12_08_2020\corpus_950 search --dictionary ami_12_08_2020\xml.xml

COUNTRY DICTIONARY HYPERLINK

purpose

The country dictionary is created and maintained for annotation during ami search, to extract the frequencies of countries in viral epidemics.

scope (including limitations)

It may help answer the question: in which countries do viral epidemics frequently occur?

LIMITATION: The local name of the country is still not present within the dictionary

peer-review

reviewer A

PMR syntax review

  • I removed the root/base URL from wikipediaPage
  • The validator:
The XML document is valid.

===== disease =====

validation 2020-09-15

Cvc-complex-type.3.2.2: Attribute 'Tamil' Is Not Allowed To Appear In Element 'entry'., Line '8', Column '465'.
Cvc-complex-type.3.2.2: Attribute 'Urdu' Is Not Allowed To Appear In Element 'entry'., Line '8', Column '465'.
Cvc-complex-type.3.2.2: Attribute 'altLabel' Is Not Allowed To Appear In Element 'entry'., Line '8', Column '465'.
Cvc-complex-type.3.2.2: Attribute 'Hindi' Is Not Allowed To Appear In Element 'entry'., Line '13', Column '465'.
Cvc-complex-type.3.2.2: Attribute 'Hindi_description' Is Not Allowed To Appear In Element 'entry'., Line '13', Column '465'.
Cvc-complex-type.3.2.2: Attribute 'ICD-10_code' Is Not Allowed To Appear In Element 'entry'., Line '13', Column '465'.

PMR comments

language attributes

Cvc-complex-type.3.2.2: Attribute 'Tamil' Is Not Allowed To Appear In Element 'entry'., Line '8', Column '465'.
Cvc-complex-type.3.2.2: Attribute 'Urdu' Is Not Allowed To Appear In Element 'entry'., Line '8', Column '465'.
Cvc-complex-type.3.2.2: Attribute 'Hindi' Is Not Allowed To Appear In Element 'entry'., Line '13', Column '465'.

These need code to resolve. PMR will do this. See below for <synonym xml:lang="hi">

altLabel

Cvc-complex-type.3.2.2: Attribute 'altLabel' Is Not Allowed To Appear In Element 'entry'., Line '8', Column '465'.

occurs in

altLabel="mental or behavioural disorder, psychotic disorder, psychotic disorders"

This attribute is not supported. Its values should be transformed into synonyms; that has already been done, so altLabel can simply be omitted.

description in non-EN

Cvc-complex-type.3.2.2: Attribute 'Hindi_description' Is Not Allowed To Appear In Element 'entry'., Line '18', Column '465'.

I need to think about this. It is VERY useful and should probably become a new child element:

<entry ...>
  <description xml:lang="hi">सचीज़ोफ्रेनिया के बारे में मेरा विचार</description>
  <synonym xml:lang="hi">मनोविक्षिप्ति</synonym>
</entry>

is probably best at this stage
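
The child-element idea above could be sketched in Python as follows. This is only an illustration, not the amidict code: the mapping of the attribute names Hindi, Tamil and Urdu to the codes hi, ta and ur, and the function name move_language_attributes, are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from the attribute names in the validation report
# to ISO 639-1 language codes.
LANG_CODES = {"Hindi": "hi", "Tamil": "ta", "Urdu": "ur"}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def move_language_attributes(entry):
    """Turn Hindi="..." / Hindi_description="..." attributes into child elements."""
    for attr_name, code in LANG_CODES.items():
        for suffix, tag in (("", "synonym"), ("_description", "description")):
            value = entry.attrib.pop(attr_name + suffix, None)
            if value:
                child = ET.SubElement(entry, tag)
                child.set(XML_LANG, code)  # serialized as xml:lang
                child.text = value
    return entry


# Hypothetical entry; only the Hindi string is taken from the example above.
entry = ET.fromstring('<entry term="example" Hindi="मनोविक्षिप्ति"/>')
move_language_attributes(entry)
print(ET.tostring(entry, encoding="unicode"))
```

Whether a label belongs in synonym or in some other element is still an open design question, as noted above.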

wikidata properties

Cvc-complex-type.3.2.2: Attribute 'ICD-10_code' Is Not Allowed To Appear In Element 'entry'., Line '13', Column '465'.

This should be translated to include the wikidata identifier for the property

<entry _p494_icd10="F25" ...
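
A minimal sketch of that renaming, assuming the dictionary is processed with Python's standard library. The RENAMES table and the function name are hypothetical; only the pair ICD-10_code / _p494_icd10 comes from the text above.

```python
import xml.etree.ElementTree as ET

# Assumed lookup from SPARQL column names to property-based attribute names.
RENAMES = {"ICD-10_code": "_p494_icd10"}


def rename_property_attributes(entry, renames=RENAMES):
    """Rename attributes in place, keeping their values."""
    for old, new in renames.items():
        if old in entry.attrib:
            entry.set(new, entry.attrib.pop(old))
    return entry


entry = ET.fromstring('<entry term="example" ICD-10_code="F25"/>')
rename_property_attributes(entry)
print(entry.attrib)
```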

PMR syntax validation

I HAVE NOT YET TACKLED THE LANGUAGE ATTRIBUTES

  • When SPARQL cannot find a language equivalent it puts in the Wikidata ID instead, so I remove those attributes with the regex
(Hindi|Tamil|Urdu)="Q\d+" => ""
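
The same substitution could be scripted, for instance in Python. This is a sketch rather than the tool PMR used; the leading `\s*\b` is an addition made here so that the space before the attribute is removed too, and the function name is hypothetical.

```python
import re

# Matches a language attribute whose value is only a Wikidata ID, e.g. Hindi="Q123".
# Hindi_description="..." is not matched because "_" follows the language name.
PLACEHOLDER = re.compile(r'\s*\b(Hindi|Tamil|Urdu)="Q\d+"')


def strip_placeholder_labels(xml_text):
    """Delete language attributes that hold a Wikidata ID instead of a label."""
    return PLACEHOLDER.sub("", xml_text)


# Hypothetical entry line for illustration.
line = '<entry term="x" Hindi="Q123" Tamil="abc" Urdu="Q456"/>'
print(strip_placeholder_labels(line))
```

Real labels (anything that is not Q followed by digits) are left untouched and still need the language handling discussed above.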

validation report

Cvc-complex-type.3.2.2: Attribute 'Hindi' Is Not Allowed To Appear In Element 'entry'., Line '3', Column '337'.
Cvc-complex-type.3.2.2: Attribute 'Tamil' Is Not Allowed To Appear In Element 'entry'., Line '3', Column '337'.
Cvc-complex-type.3.2.2: Attribute 'Urdu' Is Not Allowed To Appear In Element 'entry'., Line '3', Column '337'.
Cvc-complex-type.3.2.2: Attribute 'altLabel' Is Not Allowed To Appear In Element 'entry'., Line '3', Column '337'.
Cvc-complex-type.3.2.2: Attribute 'Hindi' Is Not Allowed To Appear In Element 'entry'., Line '8', Column '457'.
Cvc-complex-type.3.2.2: Attribute 'Hindi_description' Is Not Allowed To Appear In Element 'entry'., Line '8', Column '457'.
Cvc-complex-type.3.2.2: Attribute 'ICD-10_code' Is Not Allowed To Appear In Element 'entry'., Line '8', Column '457'.
Cvc-complex-type.3.2.2: Attribute 'Urdu' Is Not Allowed To Appear In Element 'entry'., Line '8', Column '457'.
Cvc-complex-type.3.2.2: Attribute 'altLabel' Is Not Allowed To Appear In Element 'entry'., Line '8', Column '457'.
Cvc-complex-type.3.2.2: Attribute 'Hindi' Is Not Allowed To Appear In Element 'entry'., Line '13', Column '457'.
Cvc-complex-type.3.2.2: Attribute 'Hindi_description' Is Not Allowed To Appear In Element 'entry'., Line '13', Column '457'.
Cvc-complex-type.3.2.2: Attribute 'ICD-10_code' Is Not Allowed To Appear In Element 'entry'., Line '13', Column '457'.

PMR - we need to manage the languages. Remove altLabel and replace ICD10 (Have done this...)

===== drug =====

validation 2020-09-15

PMR comments

wikidata properties

Cvc-complex-type.3.2.2: Attribute '_formula' Is Not Allowed To Appear In Element 'entry'., Line '3', Column '202'.
Cvc-complex-type.3.2.2: Attribute '_picture' Is Not Allowed To Appear In Element 'entry'., Line '3', Column '202'.

These are not (yet) amidict attributes. Can they be changed to wikidata properties?

Also the content is problematic (the formula uses small caps) and we don't have a clear way to store the picture. Let's work on this...

wikidata/wikipedia attributes

Cvc-complex-type.3.2.2: Attribute 'wikidata' Is Not Allowed To Appear In Element 'entry'., Line '3', Column '202'.
Cvc-complex-type.3.2.2: Attribute 'wikipedia' Is Not Allowed To Appear In Element 'entry'., Line '17', Column '227'.

These should be replaced by the URL, Page or ID forms (wikidataURL or wikidataID; wikipediaURL or wikipediaPage).

===== funders =====

validation 2020-09-15

Cvc-complex-type.2.4.a: Invalid Content Was Found Starting With Element 'entry'. One Of '{desc}' Is Expected., Line '2', Column '131'.
Cvc-complex-type.2.4.a: Invalid Content Was Found Starting With Element 'entry'. One Of '{desc}' Is Expected., Line '2', Column '173'.
Cvc-complex-type.3.2.2: Attribute 'wikipedia' Is Not Allowed To Appear In Element 'entry'., Line '2', Column '173'.
Cvc-complex-type.3.2.2: Attribute 'wikidata' Is Not Allowed To Appear In Element 'entry'., Line '2', Column '173'.
Cvc-complex-type.3.2.2: Attribute 'wikipedia' Is Not Allowed To Appear In Element 'entry'., Line '3', Column '100'.

PMR comments

desc

Add one or more desc elements as child elements of dictionary.

wikidata and wikipedia

Cvc-complex-type.3.2.2: Attribute 'wikipedia' Is Not Allowed To Appear In Element 'entry'., Line '2', Column '173'.
Cvc-complex-type.3.2.2: Attribute 'wikidata' Is Not Allowed To Appear In Element 'entry'., Line '2', Column '173'.

Choose from the valid attributes

NOW VALIDATES BUT NEEDS WIKIPEDIA

===== non-pharmaceutical interventions =====

validation 2020-09-15

Cvc-pattern-valid: Value 'alternative Medicine ' Is Not Facet-valid With Respect To Pattern '[^/\s]+' For Type '#AnonType_wikipediaPageentrydictionary'., Line '4', Column '125'.
Cvc-attribute.3: The Value 'alternative Medicine ' Of Attribute 'wikipediaPage' On Element 'entry' Is Not Valid With Respect To Its Type, '#AnonType_wikipediaPageentrydictionary'., Line '4', Column '125'.
Cvc-pattern-valid: Value 'cleanroom Suit' Is Not Facet-valid With Respect To Pattern '[^/\s]+' For Type '#AnonType_wikipediaPageentrydictionary'., Line '5', Column '104'.
Cvc-attribute.3: The Value 'cleanroom Suit' Of Attribute 'wikipediaPage' On Element 'entry' Is Not Valid With Respect To Its Type, '#AnonType_wikipediaPageentrydictionary'., Line '5', Column '104'.

PMR comments

We now have:

Cvc-pattern-valid: Value 'alternative Medicine ' Is Not Facet-valid With Respect To Pattern '[^/\s]+' For Type '#AnonType_wikipediaPageentrydictionary'., Line '3', Column '125'.
Cvc-attribute.3: The Value 'alternative Medicine ' Of Attribute 'wikipediaPage' On Element 'entry' Is Not Valid With Respect To Its Type, '#AnonType_wikipediaPageentrydictionary'., Line '3', Column '125'.

This is because the wikipediaPage attribute contains spaces (and some values are not correct anyway). Unfortunately you have to copy these by hand from Wikipedia. For example, https://en.wikipedia.org/wiki/Flattening_the_curve => wikipediaPage="Flattening_the_curve" and wikipediaURL="https://en.wikipedia.org/wiki/Flattening_the_curve"
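
Once the URL has been copied, both attribute values can be derived from it. The following Python sketch assumes an English-Wikipedia style URL with a /wiki/ path; the function name is hypothetical, and the pattern is the one quoted in the validator message.

```python
import re
from urllib.parse import urlparse

# Pattern quoted in the validator message for wikipediaPage.
PAGE_PATTERN = re.compile(r"[^/\s]+")


def wikipedia_attributes(url):
    """Derive wikipediaPage and wikipediaURL values from an article URL."""
    path = urlparse(url).path
    prefix = "/wiki/"
    if not path.startswith(prefix):
        raise ValueError(f"not a Wikipedia article URL: {url}")
    page = path[len(prefix):]
    if not PAGE_PATTERN.fullmatch(page):
        raise ValueError(f"page name would not validate: {page!r}")
    return {"wikipediaPage": page, "wikipediaURL": url}


print(wikipedia_attributes("https://en.wikipedia.org/wiki/Flattening_the_curve"))
```

Simply replacing spaces with underscores in an existing value is not enough, because page names are case-sensitive and some of the current values do not match a real page.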

===== test and trace =====

validation 2020-09-15

Now valid

PMR comments

BUT Wikipedia entries with embedded spaces have been deleted and need re-entering with underscores, etc.

===== virus =====

validation 2020-09-15

Now valid

PMR comments

BUT Wikipedia entries with embedded spaces have been deleted and need re-entering with underscores, etc.

===== zoonoses =====

validation 2020-09-15

Validates against Schema

PMR comments

Wikipedia has to be added

desc

see previous comments

"hasCause" attribute

Cvc-complex-type.3.2.2: Attribute '_pP828_hasCause' Is Not Allowed To Appear In Element 'entry'., Line '5', Column '56'.

This is a new attribute and I have added it and will commit shortly. I will change the name to _p828_hasCause. I'll also edit your dictionary

matching tags

You are missing a </synonym>. This suggests the file was hand-edited. It is very easy to make this sort of error by hand, which is why we should try to use software where possible. Your dictionary now validates.
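
A quick way to catch unbalanced tags before running the schema validator is to parse the file with any XML parser. Here is a small sketch using Python's standard library; the function name and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if the text parses, otherwise the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return str(err)
    return None


# Hypothetical dictionary fragment with a missing </synonym>.
broken = "<dictionary title='t'><entry term='a'><synonym>b</entry></dictionary>"
fixed = "<dictionary title='t'><entry term='a'><synonym>b</synonym></entry></dictionary>"
print(check_well_formed(broken))
print(check_well_formed(fixed))
```

This only checks well-formedness; it says nothing about whether the attributes and elements are the ones the schema allows.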

XSD schema generation

online tools

There are several. Here's one: https://www.freeformatter.com/xsd-generator.html. It takes a dictionary (we use country.xml) and analyzes which elements occur and in what context (e.g. as children). Then it analyzes each element to see whether it has attributes and what their types are.

schema v0.1

Here's the result of the xsd-generator's first guess at a "Russian Doll" schema (https://www.oracle.com/technical-resources/articles/java/design-patterns.html). XSD Schema can be very confusing, so just take some of it for granted at this stage (I and others tried to get a simpler version in 2000 but were overruled). Luckily, for dictionaries we don't need anything complicated. This will evolve as we try to accommodate all dictionaries.

<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="dictionary">
    <xs:complexType>
      <xs:sequence>
        <xs:element type="xs:string" name="desc"/>
        <xs:element name="entry" maxOccurs="unbounded" minOccurs="0">
          <xs:complexType>
            <xs:sequence>
              <xs:element type="xs:string" name="synonym" maxOccurs="unbounded" minOccurs="0"/>
            </xs:sequence>
            <xs:attribute type="xs:string" name="_p297_country" use="optional"/>
            <xs:attribute type="xs:string" name="description" use="optional"/>
            <xs:attribute type="xs:string" name="name" use="optional"/>
            <xs:attribute type="xs:string" name="term" use="optional"/>
            <xs:attribute type="xs:anyURI" name="wikidataURL" use="optional"/>
            <xs:attribute type="xs:string" name="wikipediaURL" use="optional"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute type="xs:string" name="title"/>
    </xs:complexType>
  </xs:element>
</xs:schema>

This is the schema creator's guess at what the author intended, but it needs editing. We will want to make some attributes required and look at the type of wikipediaURL.

Interpretation

  • schema structure: the current example involves nested definitions, sometimes called "Russian Doll".
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">

(just copy this accurately - we don't need to understand it. It sets the namespaces).

  • root element definition
  <xs:element name="dictionary">
...

      <xs:attribute type="xs:string" name="title"/>
...
  </xs:element>

This defines an element dictionary and allows it to have an attribute title (an attribute declared without use="required" is optional), so our documents should look something like:

<dictionary title="foobar" >
 ...
</dictionary>

(The title can be anything at this stage - string is the least constraining).

  • child elements
  <xs:element name="dictionary">
    <xs:complexType>

the dictionary element can have many children, but in a given order

      <xs:sequence>
        <xs:element type="xs:string" name="desc"/>

There must be a single <desc>...</desc> child element (we will revise this later...), followed by

        <xs:element name="entry" maxOccurs="unbounded" minOccurs="0">
          <xs:complexType>
...
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute type="xs:string" name="title"/>
    </xs:complexType>
  </xs:element>

any number of <entry>...</entry> elements.

  • grandchild elements
        <xs:element name="entry" maxOccurs="unbounded" minOccurs="0">
          <xs:complexType>
            <xs:sequence>
              <xs:element type="xs:string" name="synonym" maxOccurs="unbounded" minOccurs="0"/>
            </xs:sequence>
...

each <entry> element can contain any number of <synonym> elements

  • string content
              <xs:element type="xs:string" name="synonym" maxOccurs="unbounded" minOccurs="0"/>

The <synonym> elements have no element-children but can contain a text string

  • attributes: the <entry> element can have many attributes:
        <xs:element name="entry" maxOccurs="unbounded" minOccurs="0">
          <xs:complexType>
...
            <xs:attribute type="xs:string" name="_p297_country" use="optional"/>
            <xs:attribute type="xs:string" name="description" use="optional"/>
            <xs:attribute type="xs:string" name="name" use="optional"/>
            <xs:attribute type="xs:string" name="term" use="optional"/>
            <xs:attribute type="xs:anyURI" name="wikidataURL" use="optional"/>
            <xs:attribute type="xs:string" name="wikipediaURL" use="optional"/>
          </xs:complexType>
        </xs:element>

The generator has typed almost every attribute as string (wikidataURL was guessed as anyURI) and marked them all as optional. We'll now refine that...

schema v0.2

We want some attributes to be mandatory, so here's the next version:

<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="dictionary">
    <xs:complexType>
      <xs:sequence>
        <xs:element type="xs:string" name="desc"/>
        <xs:element name="entry" maxOccurs="unbounded" minOccurs="1">
          <xs:complexType>
            <xs:sequence>
              <xs:element type="xs:string" name="synonym" maxOccurs="unbounded" minOccurs="0"/>
            </xs:sequence>
            <!-- this only applies to country so we'll make it optional -->
            <xs:attribute type="xs:string" name="_p297_country" use="optional"/>
            <!-- but these 3 are mandatory (use="required" is needed; the default is optional) -->
            <xs:attribute type="xs:string" name="description" use="required"/>
            <xs:attribute type="xs:string" name="name" use="required"/>
            <xs:attribute type="xs:string" name="term" use="required"/>
            <!-- these two are optional (there may not be wikipedia or wikidata values) -->
            <xs:attribute type="xs:anyURI" name="wikidataURL" use="optional"/>
            <xs:attribute type="xs:string" name="wikipediaURL" use="optional"/>
            <!-- and we'll add these ones -->
            <xs:attribute type="xs:string" name="wikidataID" use="optional"/>
            <xs:attribute type="xs:string" name="wikipediaPage" use="optional"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute type="xs:string" name="title"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
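
In XSD an xs:attribute declaration with no use value is optional, so it is worth checking which attributes a schema really marks as required. The Python sketch below lists each declared attribute with its effective use. SAMPLE_XSD is a cut-down, hypothetical schema for illustration, not the project schema.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"


def attribute_uses(xsd_text):
    """Map each declared attribute name to its effective 'use' value."""
    root = ET.fromstring(xsd_text)
    return {
        attr.get("name"): attr.get("use", "optional")  # XSD default is optional
        for attr in root.iter(XS + "attribute")
    }


SAMPLE_XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="entry">
    <xs:complexType>
      <xs:attribute type="xs:string" name="term" use="required"/>
      <xs:attribute type="xs:string" name="name"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""
print(attribute_uses(SAMPLE_XSD))
```

Note that attributes with the same name on different elements would overwrite each other in this simple dictionary; that does not matter for a schema as small as ours.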

validation

We can check whether a dictionary is valid against the current schema. Use https://www.freeformatter.com/xml-validator-xsd.html and either link to your file or paste it in. If your dictionary is very large (> 2 MB), cut and paste a sample and test that (remember that the sample must have balanced elements, e.g. it must end with </dictionary>). Use the latest schema (currently dictionaries/openVirus_schema.xsd).

Here's a typical run

Cvc-complex-type.2.4.a: Invalid Content Was Found Starting With Element 'entry'. One Of '{desc}' Is Expected., Line '2', Column '266'.
Cvc-complex-type.3.2.2: Attribute 'wikipedia' Is Not Allowed To Appear In Element 'entry'., Line '2', Column '266'.
Cvc-complex-type.3.2.2: Attribute 'wikidata' Is Not Allowed To Appear In Element 'entry'., Line '2', Column '266'.
Cvc-complex-type.3.2.2: Attribute 'wikipedia' Is Not Allowed To Appear In Element 'entry'., Line '3', Column '107'.
...
  • Invalid Content Was Found Starting With Element 'entry'. One Of '{desc}' Is Expected. We have required at least one desc child of dictionary, so create one or more to record what you have done.

  • Attribute 'wikipedia' Is Not Allowed To Appear In Element 'entry'. The allowed attributes on entry are wikipediaURL and wikipediaPage. Decide which is meant.

  • Attribute 'wikidata' Is Not Allowed To Appear In Element 'entry'. The allowed attributes on entry are wikidataURL and wikidataID. Decide which is meant.

In this way we will work towards consensus.
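
For a rough local pre-check before using the online validator, the two kinds of message above can be approximated with Python's standard library. This is not an XSD validator: the allowed-attribute list is copied by hand from schema v0.2 and is an assumption that will drift as the schema evolves, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Attribute names copied from schema v0.2 (assumption: update when the schema changes).
ALLOWED_ENTRY_ATTRIBUTES = {
    "_p297_country", "description", "name", "term",
    "wikidataURL", "wikipediaURL", "wikidataID", "wikipediaPage",
}


def report_problems(xml_text):
    """Return human-readable problems similar to the validator's messages."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.find("desc") is None:
        problems.append("dictionary has no <desc> child")
    for number, entry in enumerate(root.iter("entry"), start=1):
        for name in sorted(set(entry.attrib) - ALLOWED_ENTRY_ATTRIBUTES):
            problems.append(f"entry {number}: attribute '{name}' is not allowed")
    return problems


# Hypothetical dictionary with the problems discussed above.
sample = '<dictionary title="t"><entry term="a" wikipedia="x" wikidata="Q1"/></dictionary>'
for problem in report_problems(sample):
    print(problem)
```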

PLEASE LET US KNOW ON SLACK IF YOU CAN'T UNDERSTAND THE MESSAGES OR WANT TO ADD MORE ATTRIBUTES
