Dictionaries Validation
This wikipage can be used for peer-review of the dictionaries in the run-up to the hackathon.
Each project pair (A and B) should review each other's dictionaries. Here are some criteria:
- do they use standard fields and names?
- do they have provenance (how they were created)?
- do they work in ami search?
the content.
- are there entries which should be removed?
- are there mis-labelled or mislinked entries (e.g. wikidata links to scientific articles)?
- are there syntax or encoding problems?
- are there multilingual entries?
The dictionary was created using Wikidata SPARQL Query: SPARQL QUERY HYPERLINK
It was later converted into the standard format using amidict
amidict -vv --dictionary country --directory ami_12_08_2020/amidict --input ami_12_08_2020/country.xml create --informat=wikisparqlxml --sparqlmap wikidata=wikidata,term=term,name=wikidataLabel,description=wikidataDescription,wikipedia=wikipedia,_iso3166=_iso3166 --synonyms=synonym
ami -p ami_12_08_2020\corpus_950 search --dictionary ami_12_08_2020\xml.xml
The country dictionary is created and maintained for annotation during ami search, to extract the frequencies of countries in viral epidemics.
It may help answer the question: in which countries do viral epidemics frequently occur?
LIMITATION: The local name of the country is still not present within the dictionary
- I removed the root/base URL from wikipediaPage
- The validator: The XML document is valid.
Cvc-complex-type.3.2.2: Attribute 'Tamil' Is Not Allowed To Appear In Element 'entry'., Line '8', Column '465'.
Cvc-complex-type.3.2.2: Attribute 'Urdu' Is Not Allowed To Appear In Element 'entry'., Line '8', Column '465'.
Cvc-complex-type.3.2.2: Attribute 'altLabel' Is Not Allowed To Appear In Element 'entry'., Line '8', Column '465'.
Cvc-complex-type.3.2.2: Attribute 'Hindi' Is Not Allowed To Appear In Element 'entry'., Line '13', Column '465'.
Cvc-complex-type.3.2.2: Attribute 'Hindi_description' Is Not Allowed To Appear In Element 'entry'., Line '13', Column '465'.
Cvc-complex-type.3.2.2: Attribute 'ICD-10_code' Is Not Allowed To Appear In Element 'entry'., Line '13', Column '465'.
Cvc-complex-type.3.2.2: Attribute 'Tamil' Is Not Allowed To Appear In Element 'entry'., Line '8', Column '465'.
Cvc-complex-type.3.2.2: Attribute 'Urdu' Is Not Allowed To Appear In Element 'entry'., Line '8', Column '465'.
Cvc-complex-type.3.2.2: Attribute 'Hindi' Is Not Allowed To Appear In Element 'entry'., Line '13', Column '465'.
These need code to resolve; PMR will do this. See below for <synonym xml:lang="hi">.
Cvc-complex-type.3.2.2: Attribute 'altLabel' Is Not Allowed To Appear In Element 'entry'., Line '8', Column '465'.
occurs in
altLabel="mental or behavioural disorder, psychotic disorder, psychotic disorders"
This is not supported and should be transformed to synonyms (which has been done, so simply omit).
Cvc-complex-type.3.2.2: Attribute 'Hindi_description' Is Not Allowed To Appear In Element 'entry'., Line '18', Column '465'.
I need to think about this. It is VERY useful and should probably become a new child element:
<entry ...>
<description xml:lang="hi">सचीज़ोफ्रेनिया के बारे में मेरा विचार</description>
<synonym xml:lang="hi">मनोविक्षिप्ति</synonym>
</entry>
is probably best at this stage
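The proposal above (language-specific values as child elements rather than attributes) could be automated along these lines. This is only a sketch, not the amidict implementation: the function name and the mapping of the column names Hindi, Tamil and Urdu to the language codes hi, ta and ur are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Key that ElementTree uses internally for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Assumed mapping from the attribute names seen in the errors to language codes.
LANG_CODES = {"Hindi": "hi", "Tamil": "ta", "Urdu": "ur"}


def move_language_attributes(entry):
    """Replace Hindi="..." by <synonym xml:lang="hi"> and
    Hindi_description="..." by <description xml:lang="hi"> (same for other languages)."""
    for name, code in LANG_CODES.items():
        for attr, tag in ((name + "_description", "description"), (name, "synonym")):
            value = entry.attrib.pop(attr, None)
            if value:
                child = ET.SubElement(entry, tag)
                child.set(XML_LANG, code)
                child.text = value
    return entry


# Hypothetical entry; only the Hindi synonym is taken from the example above.
entry = ET.fromstring('<entry term="example" Hindi="मनोविक्षिप्ति"/>')
move_language_attributes(entry)
print(ET.tostring(entry, encoding="unicode"))
```

The schema would also have to allow the new child elements and xml:lang before such entries validate.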
Cvc-complex-type.3.2.2: Attribute 'ICD-10_code' Is Not Allowed To Appear In Element 'entry'., Line '13', Column '465'.
This should be translated to include the wikidata identifier for the property
<entry _p494_icd10="F25" ...
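A rename like this is easy to script. The sketch below assumes a small rename table (only the ICD-10_code pair comes from this page) and a hypothetical helper name; it is not part of amidict.

```python
import xml.etree.ElementTree as ET

# Assumed rename table: old attribute name -> name carrying the Wikidata property number.
RENAMES = {"ICD-10_code": "_p494_icd10"}


def rename_attributes(entry, renames=RENAMES):
    """Rename attributes on one <entry> element, keeping their values."""
    for old, new in renames.items():
        if old in entry.attrib:
            entry.set(new, entry.attrib.pop(old))
    return entry


entry = ET.fromstring('<entry term="example" ICD-10_code="F25"/>')
rename_attributes(entry)
print(ET.tostring(entry, encoding="unicode"))
```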
I HAVE NOT YET TACKLED THE LANGUAGE ATTRIBUTES
- When SPARQL cannot find a language equivalent it puts in the Wikidata ID instead, so I remove these with the regex
(Hindi|Tamil|Urdu)="Q\d+" => ""
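The same substitution can be run from Python. This sketch uses the regex above with one small change that is my own assumption: it also swallows the whitespace in front of the attribute so that no double spaces are left behind. The function name and the sample line are made up.

```python
import re

# A language attribute whose value is a bare Wikidata ID (e.g. Tamil="Q123"),
# plus any whitespace in front of it.
PLACEHOLDER = re.compile(r'\s*\b(Hindi|Tamil|Urdu)="Q\d+"')


def strip_placeholder_labels(xml_text):
    """Delete language attributes that only contain a Wikidata ID."""
    return PLACEHOLDER.sub("", xml_text)


# Hypothetical line: the Hindi label is real text, the other two are placeholders.
line = '<entry term="example" Hindi="मनोविक्षिप्ति" Tamil="Q123" Urdu="Q123"/>'
print(strip_placeholder_labels(line))
```

Attributes such as Hindi_description are left alone because the pattern requires the = sign directly after the language name.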
Cvc-complex-type.3.2.2: Attribute 'Hindi' Is Not Allowed To Appear In Element 'entry'., Line '3', Column '337'.
Cvc-complex-type.3.2.2: Attribute 'Tamil' Is Not Allowed To Appear In Element 'entry'., Line '3', Column '337'.
Cvc-complex-type.3.2.2: Attribute 'Urdu' Is Not Allowed To Appear In Element 'entry'., Line '3', Column '337'.
Cvc-complex-type.3.2.2: Attribute 'altLabel' Is Not Allowed To Appear In Element 'entry'., Line '3', Column '337'.
Cvc-complex-type.3.2.2: Attribute 'Hindi' Is Not Allowed To Appear In Element 'entry'., Line '8', Column '457'.
Cvc-complex-type.3.2.2: Attribute 'Hindi_description' Is Not Allowed To Appear In Element 'entry'., Line '8', Column '457'.
Cvc-complex-type.3.2.2: Attribute 'ICD-10_code' Is Not Allowed To Appear In Element 'entry'., Line '8', Column '457'.
Cvc-complex-type.3.2.2: Attribute 'Urdu' Is Not Allowed To Appear In Element 'entry'., Line '8', Column '457'.
Cvc-complex-type.3.2.2: Attribute 'altLabel' Is Not Allowed To Appear In Element 'entry'., Line '8', Column '457'.
Cvc-complex-type.3.2.2: Attribute 'Hindi' Is Not Allowed To Appear In Element 'entry'., Line '13', Column '457'.
Cvc-complex-type.3.2.2: Attribute 'Hindi_description' Is Not Allowed To Appear In Element 'entry'., Line '13', Column '457'.
Cvc-complex-type.3.2.2: Attribute 'ICD-10_code' Is Not Allowed To Appear In Element 'entry'., Line '13', Column '457'.
PMR - we need to manage the languages. Remove altLabel and replace ICD10. (Have done this...)
Cvc-complex-type.3.2.2: Attribute '_formula' Is Not Allowed To Appear In Element 'entry'., Line '3', Column '202'.
Cvc-complex-type.3.2.2: Attribute '_picture' Is Not Allowed To Appear In Element 'entry'., Line '3', Column '202'.
These are not (yet) amidict attributes. Can they be changed to wikidata properties?
Also the content is problematic (the formula uses small caps) and we don't have a clear way to store the picture. Let's work on this...
Cvc-complex-type.3.2.2: Attribute 'wikidata' Is Not Allowed To Appear In Element 'entry'., Line '3', Column '202'.
Cvc-complex-type.3.2.2: Attribute 'wikipedia' Is Not Allowed To Appear In Element 'entry'., Line '17', Column '227'.
These should be given as URLs, pages or IDs (wikidataURL or wikidataID, wikipediaURL or wikipediaPage).
Cvc-complex-type.2.4.a: Invalid Content Was Found Starting With Element 'entry'. One Of '{desc}' Is Expected., Line '2', Column '131'
Cvc-complex-type.2.4.a: Invalid Content Was Found Starting With Element 'entry'. One Of '{desc}' Is Expected., Line '2', Column '173'.
Cvc-complex-type.3.2.2: Attribute 'wikipedia' Is Not Allowed To Appear In Element 'entry'., Line '2', Column '173'.
Cvc-complex-type.3.2.2: Attribute 'wikidata' Is Not Allowed To Appear In Element 'entry'., Line '2', Column '173'.
Cvc-complex-type.3.2.2: Attribute 'wikipedia' Is Not Allowed To Appear In Element 'entry'., Line '3', Column '100'.
Cvc-complex-type.3.2.2: Attribute 'wikipedia' Is Not Allowed To Appear In Element 'entry'., Line '2', Column '173'.
Cvc-complex-type.3.2.2: Attribute 'wikidata' Is Not Allowed To Appear In Element 'entry'., Line '2', Column '173'.
Cvc-complex-type.3.2.2: Attribute 'wikipedia' Is Not Allowed To Appear In Element 'entry'., Line '3', Column '100'.
Add one or more desc elements as child elements of dictionary.
Cvc-complex-type.3.2.2: Attribute 'wikipedia' Is Not Allowed To Appear In Element 'entry'., Line '2', Column '173'.
Cvc-complex-type.3.2.2: Attribute 'wikidata' Is Not Allowed To Appear In Element 'entry'., Line '2', Column '173'.
Choose from the valid attributes
NOW VALIDATES BUT NEEDS WIKIPEDIA
Cvc-pattern-valid: Value 'alternative Medicine ' Is Not Facet-valid With Respect To Pattern '[^/\s]+' For Type '#AnonType_wikipediaPageentrydictionary'., Line '4', Column '125'.
Cvc-attribute.3: The Value 'alternative Medicine ' Of Attribute 'wikipediaPage' On Element 'entry' Is Not Valid With Respect To Its Type, '#AnonType_wikipediaPageentrydictionary'., Line '4', Column '125'.
Cvc-pattern-valid: Value 'cleanroom Suit' Is Not Facet-valid With Respect To Pattern '[^/\s]+' For Type '#AnonType_wikipediaPageentrydictionary'., Line '5', Column '104'.
Cvc-attribute.3: The Value 'cleanroom Suit' Of Attribute 'wikipediaPage' On Element 'entry' Is Not Valid With Respect To Its Type, '#AnonType_wikipediaPageentrydictionary'., Line '5', Column '104'.
We now have:
Cvc-pattern-valid: Value 'alternative Medicine ' Is Not Facet-valid With Respect To Pattern '[^/\s]+' For Type '#AnonType_wikipediaPageentrydictionary'., Line '3', Column '125'.
Cvc-attribute.3: The Value 'alternative Medicine ' Of Attribute 'wikipediaPage' On Element 'entry' Is Not Valid With Respect To Its Type, '#AnonType_wikipediaPageentrydictionary'., Line '3', Column '125'.
This is because the wikipediaPage attribute has spaces (and some are not correct anyway). Unfortunately you have to copy these by hand. So https://en.wikipedia.org/wiki/Flattening_the_curve
=> wikipediaPage="Flattening_the_curve" and wikipediaURL="https://en.wikipedia.org/wiki/Flattening_the_curve"
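Once the URL has been copied by hand, the page name can be derived from it and checked against the pattern quoted in the validator messages ([^/\s]+). A minimal sketch, assuming URLs of the form .../wiki/<Page>; the function names are hypothetical.

```python
import re
from urllib.parse import urlparse

# Pattern quoted in the validator messages for wikipediaPage: no slashes, no whitespace.
PAGE_PATTERN = re.compile(r"[^/\s]+")


def is_valid_page(page):
    """True if the whole value matches the wikipediaPage pattern."""
    return PAGE_PATTERN.fullmatch(page) is not None


def page_from_url(url):
    """Return the page name from a URL whose path starts with /wiki/."""
    path = urlparse(url).path
    prefix = "/wiki/"
    if not path.startswith(prefix):
        raise ValueError(f"not a /wiki/ URL: {url}")
    page = path[len(prefix):]
    if not is_valid_page(page):
        raise ValueError(f"page name does not match the schema pattern: {page!r}")
    return page


url = "https://en.wikipedia.org/wiki/Flattening_the_curve"
print(f'wikipediaPage="{page_from_url(url)}" wikipediaURL="{url}"')
print(is_valid_page("alternative Medicine "))  # the value rejected by the validator
```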
Now valid
BUT Wikipedia entries with embedded spaces have been deleted and need re-entering with underscores, etc.
Now valid
BUT Wikipedia entries with embedded spaces have been deleted and need re-entering with underscores, etc.
Validates against Schema
Wikipedia has to be added
see previous comments
Cvc-complex-type.3.2.2: Attribute '_pP828_hasCause' Is Not Allowed To Appear In Element 'entry'., Line '5', Column '56'.
This is a new attribute; I have added it and will commit shortly. I will change the name to _p828_hasCause. I'll also edit your dictionary.
You are missing a </synonym>. This suggests the file was hand-edited - such errors are very easy to make, which is why we should try to use software where possible.
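A quick well-formedness check catches a missing end tag before the file ever reaches the schema validator. This sketch uses Python's standard XML parser; the function name and the sample dictionary are invented for illustration.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_text):
    """Return None if xml_text is well-formed XML, otherwise the parser's message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return str(err)
    return None


# Hypothetical hand-edited dictionary with a missing </synonym>.
broken = '<dictionary title="demo"><entry term="a"><synonym>b</entry></dictionary>'
fixed = '<dictionary title="demo"><entry term="a"><synonym>b</synonym></entry></dictionary>'

print(well_formedness_error(broken))  # a parser message pointing at the mismatch
print(well_formedness_error(fixed))   # None
```

Well-formed is weaker than valid: a file can pass this check and still fail against the schema.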
Your dictionary now validates.
There are several schema generators. Here's one: https://www.freeformatter.com/xsd-generator.html. It takes a dictionary (we use country.xml) and analyzes which elements occur and in what context (e.g. as children). Then it analyzes each element to see whether it has attributes and what their types are.
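To get a feel for what such a generator looks at, here is a toy survey in Python. It only records which child elements and attribute names occur on each element; it does not guess types or occurrence counts, and it is not how the freeformatter tool is implemented. The sample dictionary is made up.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def survey(xml_text):
    """Map each element name to the child element names and attribute names seen on it."""
    children = defaultdict(set)
    attributes = defaultdict(set)
    for element in ET.fromstring(xml_text).iter():
        attributes[element.tag].update(element.attrib)
        children[element.tag].update(child.tag for child in element)
    return {
        tag: {"children": sorted(children[tag]), "attributes": sorted(attributes[tag])}
        for tag in attributes
    }


sample = """<dictionary title="country">
  <desc>demo</desc>
  <entry term="france" name="France"><synonym>French Republic</synonym></entry>
</dictionary>"""

for tag, seen in survey(sample).items():
    print(tag, seen)
```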
Here's the result of the xsd-generator's first guess at a "Russian Doll" schema (https://www.oracle.com/technical-resources/articles/java/design-patterns.html). XSD Schema can be very confusing, so just take some of it for granted at this stage (I and others tried to get a simpler version in 2000 but were overruled). Luckily, for dictionaries we don't need anything complicated. This will evolve as we try to accommodate all dictionaries.
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="dictionary">
    <xs:complexType>
      <xs:sequence>
        <xs:element type="xs:string" name="desc"/>
        <xs:element name="entry" maxOccurs="unbounded" minOccurs="0">
          <xs:complexType>
            <xs:sequence>
              <xs:element type="xs:string" name="synonym" maxOccurs="unbounded" minOccurs="0"/>
            </xs:sequence>
            <xs:attribute type="xs:string" name="_p297_country" use="optional"/>
            <xs:attribute type="xs:string" name="description" use="optional"/>
            <xs:attribute type="xs:string" name="name" use="optional"/>
            <xs:attribute type="xs:string" name="term" use="optional"/>
            <xs:attribute type="xs:anyURI" name="wikidataURL" use="optional"/>
            <xs:attribute type="xs:string" name="wikipediaURL" use="optional"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute type="xs:string" name="title"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
This is the schema generator's guess at what the author intended, but it needs editing. We will want to make some attributes required and look at the type of wikipediaURL.
- schema structure The current example involves nested definitions sometimes called "Russian Doll".
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
(just copy this accurately - we don't need to understand it. It sets the namespaces).
- root element definition
<xs:element name="dictionary">
...
<xs:attribute type="xs:string" name="title"/>
...
</xs:element>
This defines an element dictionary and allows it to have an attribute title (an attribute declared without use="required" is optional), so our documents should look something like:
<dictionary title="foobar" >
...
</dictionary>
(The title can be anything at this stage - string is the least constraining).
- child elements
<xs:element name="dictionary">
<xs:complexType>
the dictionary element can have many children, but in a given order
<xs:sequence>
<xs:element type="xs:string" name="desc"/>
There must be a single <desc>...</desc> child element (we will revise this later...), followed by
<xs:element name="entry" maxOccurs="unbounded" minOccurs="0">
<xs:complexType>
...
</xs:complexType>
</xs:element>
</xs:sequence>
<xs:attribute type="xs:string" name="title"/>
</xs:complexType>
</xs:element>
any number of <entry>...</entry> elements.
- grandchild elements
<xs:element name="entry" maxOccurs="unbounded" minOccurs="0">
<xs:complexType>
<xs:sequence>
<xs:element type="xs:string" name="synonym" maxOccurs="unbounded" minOccurs="0"/>
</xs:sequence>
...
Each <entry> element can contain any number of <synonym> elements.
- string content
<xs:element type="xs:string" name="synonym" maxOccurs="unbounded" minOccurs="0"/>
The <synonym> elements have no element-children but can contain a text string.
- attributes
The <entry> element can have many attributes:
<xs:element name="entry" maxOccurs="unbounded" minOccurs="0">
<xs:complexType>
...
<xs:attribute type="xs:string" name="_p297_country" use="optional"/>
<xs:attribute type="xs:string" name="description" use="optional"/>
<xs:attribute type="xs:string" name="name" use="optional"/>
<xs:attribute type="xs:string" name="term" use="optional"/>
<xs:attribute type="xs:anyURI" name="wikidataURL" use="optional"/>
<xs:attribute type="xs:string" name="wikipediaURL" use="optional"/>
</xs:complexType>
</xs:element>
The generator has guessed type string for every attribute except wikidataURL (anyURI), and has marked them all as optional. We'll now refine that...
We want some attributes to be mandatory, so here's the next version:
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="dictionary">
    <xs:complexType>
      <xs:sequence>
        <xs:element type="xs:string" name="desc"/>
        <xs:element name="entry" maxOccurs="unbounded" minOccurs="1">
          <xs:complexType>
            <xs:sequence>
              <xs:element type="xs:string" name="synonym" maxOccurs="unbounded" minOccurs="0"/>
            </xs:sequence>
            <!-- this only applies to country so we'll make it optional -->
            <xs:attribute type="xs:string" name="_p297_country" use="optional"/>
            <!-- but these 3 are mandatory (attributes are optional unless use="required" is given) -->
            <xs:attribute type="xs:string" name="description" use="required"/>
            <xs:attribute type="xs:string" name="name" use="required"/>
            <xs:attribute type="xs:string" name="term" use="required"/>
            <!-- these two are optional (there may not be wikipedia or wikidata values) -->
            <xs:attribute type="xs:anyURI" name="wikidataURL" use="optional"/>
            <xs:attribute type="xs:string" name="wikipediaURL" use="optional"/>
            <!-- and we'll add these ones -->
            <xs:attribute type="xs:string" name="wikidataID" use="optional"/>
            <xs:attribute type="xs:string" name="wikipediaPage" use="optional"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute type="xs:string" name="title"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
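For orientation, here is a sketch that builds a small dictionary with the shape this schema describes: a title, one desc, then entries carrying description, name and term. The helper name and all values are hypothetical, and the result has not been run through a validator.

```python
import xml.etree.ElementTree as ET


def build_dictionary(title, desc, entries):
    """Build <dictionary title=...> with one <desc> followed by one <entry> per dict of attributes."""
    root = ET.Element("dictionary", {"title": title})
    ET.SubElement(root, "desc").text = desc
    for attrs in entries:
        ET.SubElement(root, "entry", attrs)
    return root


# Hypothetical content, for illustration only.
root = build_dictionary(
    "country",
    "Hand-made example",
    [{"term": "france", "name": "France", "description": "country in Western Europe"}],
)
print(ET.tostring(root, encoding="unicode"))
```

Generating the file with software rather than by hand also guarantees balanced tags.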
We can check whether our dictionary is valid against the current schema. Use https://www.freeformatter.com/xml-validator-xsd.html and either link to your file or paste it in. If your dictionary is very large (> 2 MB), cut and paste a sample and test that (remember that the dictionary must have balanced elements, e.g. end with </dictionary>). Use the latest schema (currently dictionaries/openVirus_schema.xsd).
Here's a typical run
Cvc-complex-type.2.4.a: Invalid Content Was Found Starting With Element 'entry'. One Of '{desc}' Is Expected., Line '2', Column '266'.
Cvc-complex-type.3.2.2: Attribute 'wikipedia' Is Not Allowed To Appear In Element 'entry'., Line '2', Column '266'.
Cvc-complex-type.3.2.2: Attribute 'wikidata' Is Not Allowed To Appear In Element 'entry'., Line '2', Column '266'.
Cvc-complex-type.3.2.2: Attribute 'wikipedia' Is Not Allowed To Appear In Element 'entry'., Line '3', Column '107'.
...
- Invalid Content Was Found Starting With Element 'entry'. One Of '{desc}' Is Expected. We have required at least one desc child of dictionary, so create one or more to record what you have done.
- Attribute 'wikipedia' Is Not Allowed To Appear In Element 'entry'. The allowed attributes on entry are wikipediaURL and wikipediaPage. Decide which is meant.
- Attribute 'wikidata' Is Not Allowed To Appear In Element 'entry'. The allowed attributes on entry are wikidataURL and wikidataID. Decide which is meant.
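Before pasting a dictionary into the online validator, a rough local pre-check can flag the same three problems. This sketch is not XSD validation: the allowed and required attribute sets are copied from the tutorial schema on this page and are assumptions (the real openVirus_schema.xsd may allow more), and the function name is invented.

```python
import xml.etree.ElementTree as ET

# Attribute names copied from the tutorial schema on this page (assumed, may be incomplete).
ALLOWED = {"_p297_country", "description", "name", "term",
           "wikidataURL", "wikipediaURL", "wikidataID", "wikipediaPage"}
REQUIRED = {"description", "name", "term"}
# Hints for the two attribute errors explained above.
SUGGESTIONS = {"wikipedia": "wikipediaURL or wikipediaPage",
               "wikidata": "wikidataURL or wikidataID"}


def check_dictionary(xml_text):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    root = ET.fromstring(xml_text)
    if root.find("desc") is None:
        problems.append("dictionary has no <desc> child")
    for number, entry in enumerate(root.iter("entry"), start=1):
        for name in sorted(set(entry.attrib) - ALLOWED):
            hint = SUGGESTIONS.get(name)
            message = f"entry {number}: attribute {name!r} is not allowed"
            problems.append(message + (f"; use {hint}" if hint else ""))
        for name in sorted(REQUIRED - set(entry.attrib)):
            problems.append(f"entry {number}: missing attribute {name!r}")
    return problems


# Hypothetical dictionary showing the errors from the typical run.
bad = '<dictionary title="t"><entry term="a" name="A" description="d" wikipedia="x" wikidata="Q1"/></dictionary>'
for problem in check_dictionary(bad):
    print(problem)
```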
In this way we will work towards consensus.
PLEASE LET US KNOW ON SLACK IF YOU CAN'T UNDERSTAND THE MESSAGES OR WANT TO ADD MORE ATTRIBUTES