FreeDict HOWTO – Writing Text Encoding Initiative XML Files
This chapter explains the format used within FreeDict to encode dictionaries. It is not complete and serves as an introduction to the format. A basic understanding of XML is required, mainly familiarity with its syntax.
For more advanced examples, the example dictionary within the fd-dictionaries repository may be a useful resource.
At this stage, you should have obtained a version of our tools from https://github.com/freedict/fd-dictionaries that may assist you in validating your dictionary. However, this is not required.
Before we dive into the TEI format specifics, it is useful to understand that every dictionary consists of a header and a body. The header stores all the general information about the dictionary, such as the author, the number of headwords and the edition. The body contains the actual content, i.e., the headwords and translations. If this is your first time with TEI (or even with XML), you might want to skip the header section and start straight away with the section on the TEI body to experience some progress in your efforts. The header can always be added later. If you follow this approach, you might want to copy a header from one of the existing hand-written dictionaries, see https://github.com/freedict/fd-dictionaries.
The header of a dictionary encodes general information about the dictionary. This includes, but is not limited to, the author, the sources, the edition along with the headword count and the title. For brevity, we also include XML specifics in this section. The XML-aware reader may ignore this sloppiness. For people with XML knowledge and an interest in a more fine-grained overview of the TEI header, a look into the TEI Guidelines, Chapter 2 is recommended.
Each XML document starts with the XML declaration. FreeDict dictionaries are encoded in XML and use the UTF-8 encoding. If you do not know what UTF-8 is, we advise you to check your editor settings and change the encoding to UTF-8. UTF-8 allows each viewer of the dictionary to read the entered characters, no matter which language the computer uses. For our dictionaries, the XML declaration, placed on the first line, looks like this:
<?xml version='1.0' encoding="UTF-8" ?>
The next lines of the TEI dictionaries consist of the Document Type (Doctype):
<?xml-stylesheet type="text/css" href="freedict-dictionary.css"?>
<?oxygen RNGSchema="freedict-P5.rng" type="xml"?>
<!DOCTYPE TEI SYSTEM "freedict-P5.dtd">
The Doctype configures the way parsing tools interpret TEI XML. For FreeDict, it defines the structure of the document and defines the subset of TEI that we use. These lines rarely change for our dictionaries and can be safely copied from an existing dictionary.
In the following, an XPath expression is given, together with an explanation of what this particular one is meant for. If you don't know how XPath works, imagine the text to be an XML tag name and the slash a symbol representing a parent-child relation. So for instance, a/b speaks of an element b contained in a, or in XML: <a><b>...</b></a>.
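To make the path notation concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath. The element names a and b are the placeholders from the paragraph above; the wrapper element is an assumption added only so that the path has something to start from.

```python
import xml.etree.ElementTree as ET

# The placeholder document from the text: an element b contained in a.
# "wrapper" is a hypothetical root element used only for this illustration.
root = ET.fromstring("<wrapper><a><b>hello</b></a></wrapper>")

# "a/b" selects every b that is a direct child of an a,
# searched relative to the element we start from (here: wrapper).
matches = root.findall("a/b")
texts = [b.text for b in matches]
print(texts)
```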
It is recommended to have a look at the actual dictionaries to get a feeling for a complete TEI header.
The title of the dictionary. This becomes the short description of the dictionary in dictionary programs.
Declare the author and optionally a maintainer for this dictionary. A maintainer is responsible for keeping a dictionary in good shape and for handling incoming feature requests. This may or may not be the same person as the author.
Example:
<titleStmt>
<!-- ... -->
<respStmt>
<resp>created by</resp>
<name>your name</name>
</respStmt>
<respStmt>
<!-- maintainer for the freedict database -->
<resp>Maintainer</resp>
<name>your name again</name>
</respStmt>
</titleStmt>
Name and email address of the person carrying the responsibility named in ../resp, i.e. the contents should follow the form FirstName LastName <user@host>.
This stanza must contain a tag called edition.
This becomes the release version number. It is shown on the website and used in building filenames for releases. It is recommended to use only numbers and dots. Also, two levels of versioning should be enough, i.e. 0.1 is a good start.
For automated imports, it is advisable to use a version number such as YYYY.MM.DD (year.month.day).
Example
<editionStmt>
<edition>0.8</edition>
</editionStmt>
This tag has to contain the approximate number of headwords, including the unit "headwords". It is put into 00-database-info. The headword count for the website is extracted from the .index file of the dictd database format.
Ideally the headword count should be exact. If it cannot be exact, the size must be prepended with "about".
Example:
<extent>4500 headwords</extent>
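If you want to derive the number rather than estimate it, the count can be scripted. The following Python sketch is only an illustration and not part of the FreeDict tools: it assumes that every orth element inside an entry's form counts as one headword, and the function names local_name and extent_text are hypothetical. It ignores XML namespaces so that it works both on the namespace-free snippets in this chapter and on files that declare the TEI namespace.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def extent_text(tei_source, exact=True):
    """Build the text for <extent>, assuming one headword per entry/form/orth."""
    root = ET.fromstring(tei_source)
    count = 0
    for entry in root.iter():
        if local_name(entry.tag) != "entry":
            continue
        for form in entry:
            if local_name(form.tag) != "form":
                continue
            count += sum(1 for child in form if local_name(child.tag) == "orth")
    # The size must be prefixed with "about" when it is not exact.
    prefix = "" if exact else "about "
    return f"{prefix}{count} headwords"

# Tiny made-up body: one entry with one spelling, one entry with two.
sample = """<body>
  <entry><form><orth>dog</orth></form></entry>
  <entry><form><orth>colour</orth><orth>color</orth></form></entry>
</body>"""
print(extent_text(sample))
```

The headword count shown on the website is still taken from the .index file, so the two numbers may differ slightly.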
In the publication stanza, most information about the project is gathered. The publisher tag should contain the word "FreeDict". While you might be the author, the publisher is the project as a whole.
Example:
<publicationStmt>
<publisher>FreeDict</publisher>
...
</publicationStmt>
This stanza documents licensing information.
Example:
<publicationStmt>
<availability>
<p>Available under the terms of the GNU General Public Licence, version 3 (or at your option any later version, published by the FSF).</p>
</availability>
</publicationStmt>
This tag contains the release date of the database. For machine-readability, an additional attribute containing an ISO-8601-formatted date in the format "YYYY-MM-DD" may be supplied.
Example:
<publicationStmt>
<date when="2016-05-07">7th May 2016</date>
The place identifier from where this dictionary can be obtained. It has to be set to https://freedict.org.
Example:
<publicationStmt>
<pubPlace>
<ref>https://freedict.org/</ref>
</pubPlace>
...
Within the notesStmt stanza, multiple <note/> tags may be used to document any special things regarding this dictionary.
One note tag with the attribute type="status" has to exist: it documents the size of the database for the FreeDict XML API. See the example below. The possible values are documented in the FAQ.
The notes from this tag are put into 00-database-info and are available on the website as well. However, currently a "more info on this dictionary" page doesn't exist.
Example:
<notesStmt>
<note type="status">small</note>
<note>Some note documenting anything special...</note>
</notesStmt>
Any information about the source. Can be formatted in tags like p or list.
fileDesc/sourceDesc/xptr
This is an optional source URL for the dictionary, most commonly used if the dictionary is or was automatically converted.
The value becomes the source URL on the website and is put into 00-database-url when converting to the dictd database format.
This tag contains a stanza concerning the project. It should be contained in every dictionary and looks like this:
<encodingDesc>
<projectDesc>
<p>This dictionary comes to you through nice people making it available for free and for good.</p>
<p>It is part of the FreeDict project, http://www.freedict.org.</p>
<p>This project aims to make many translating dictionaries available for free. Your contributions are welcome!</p>
</projectDesc>
</encodingDesc>
It makes sure that people can figure out where a particular dictionary comes from and why it is available.
This tag contains the changes which people have made to this dictionary. It is not meant to release the maintainer from writing useful commit messages in the VCS, but rather to give the end user an overview of what happened.
Ideally the ChangeLog files, distributed along with the released dictionaries, could be automatically generated using this information.
Each version should have one change. The attribute n documents the version number to which a change corresponds. Changes that lack the version number will implicitly be counted towards the more recent versioned change.
It is advised to give the date as an attribute in the ISO format YYYY-MM-DD; the example below uses when for this.
Optionally the date can be documented in a subtag called date.
Within the change, paragraphs (p) as well as lists (list) are allowed.
Example:
<change n="0.1" who="#some_user" when="2016-05-20">
<date>2016-05-20</date> <!-- optional, may be localised -->
<list>
<item>increase readability by doing xyz</item>
<item>add more words</item>
</list>
</change>
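As an illustration of how a ChangeLog could be generated from this information, here is a minimal Python sketch. It is not an existing FreeDict tool: the function name changelog_lines, the output layout and the wrapping revisionDesc element in the sample are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def changelog_lines(revision_xml):
    """Turn change elements into plain-text ChangeLog lines (hypothetical layout)."""
    root = ET.fromstring(revision_xml)
    lines = []
    for change in root.iter():
        if local_name(change.tag) != "change":
            continue
        version = change.get("n", "unversioned")
        when = change.get("when", "undated")
        lines.append(f"Version {version} ({when})")
        # Paragraphs and list items become one bullet each, in document order.
        for node in change.iter():
            if local_name(node.tag) in ("item", "p"):
                text = " ".join("".join(node.itertext()).split())
                lines.append(f"  * {text}")
    return lines

sample = """<revisionDesc>
  <change n="0.1" who="#some_user" when="2016-05-20">
    <date>2016-05-20</date>
    <list>
      <item>increase readability by doing xyz</item>
      <item>add more words</item>
    </list>
  </change>
</revisionDesc>"""
print("\n".join(changelog_lines(sample)))
```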
In the following sections, a few example entries are discussed. Please note that these are not all possible entries and, depending on your particular needs, it might be necessary to use a slightly different encoding. On the other hand, FreeDict slightly limits the TEI standard to achieve some consistency. When in doubt, please ask on the mailing list.
The explanation of the tags is postponed to the next section.
<entry>
<form>
<orth>dog</orth>
</form>
<sense>
<cit type="trans">
<quote>Hund</quote>
</cit>
</sense>
</entry>
This entry would be formatted into something like this:
dog
Hund
Have a look at the next example for explanations.
<entry>
<form>
<orth>dog</orth>
<pron>dɔg</pron> <!-- IPA pronunciation -->
</form>
<gramGrp> <!-- grammatical information -->
<pos>n</pos> <!-- part of speech -->
</gramGrp>
<sense>
<cit type="trans">
<quote>Hund</quote><gen>m</gen>
</cit>
<cit type="example">
<quote>The dog is barking.</quote>
<cit type="trans" xml:lang="de">
<quote>Der Hund bellt.</quote>
</cit>
</cit>
<note>Dogs bite as well.</note>
</sense>
</entry>
After formatting, it might look like this:
dog [dɔg] n.
Hund m.
"The dog is barking." = "Der Hund bellt."
(Dogs bite as well.)
orth stands for the orthography (spelling) and is used as the headword of an entry. If multiple spellings exist, multiple orth elements can be used. The pron element contains the pronunciation of a word. It is optional and, for many languages, generated automatically by eSpeak-NG. FreeDict uses the IPA (International Phonetic Alphabet), whose symbols are part of Unicode.
In a gramGrp element, all grammatical information is grouped. You can give the part of speech (here n for noun), the gender and also the number (singular, plural) of the headword (number is not given in this example). The values allowed for part of speech are prescribed in this document, so that one contributor does not write n, another noun and a third N. See Table 5.1, “Part of Speech Typology (recommended contents of the pos element)”.
Translations for headwords are specified in a cit element, containing a quote. A cit element may contain multiple quotes, just as a sense may contain multiple cit elements. The type="trans" marks a word as a translation, while type="example" marks an example for the dictionary reader. As an exception, a cit of type trans within an example counts as the translation of the example, not as that of the headword.
You can group the different senses of homographs with sense. The numbering (given with the attribute n) is optional. Translations are given in cit elements of type trans, as described above. Multiple such elements may be given. For each, grammatical information is optional.
With usg you can give examples of usage and optionally their translation.
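To show how the markup relates to the rendering shown above, the following Python sketch formats the annotated dog entry into plain text. It is only an illustration, not the conversion FreeDict actually uses: the function names text_of and format_entry are hypothetical, and the code assumes a namespace-free entry as printed in this chapter (files that declare the TEI namespace would need qualified tag names).

```python
import xml.etree.ElementTree as ET

def text_of(element):
    """Return the element's text with whitespace normalised."""
    return " ".join("".join(element.itertext()).split())

def format_entry(entry_xml):
    """Render one entry as plain text (hypothetical layout)."""
    entry = ET.fromstring(entry_xml)
    line = ", ".join(text_of(o) for o in entry.findall("form/orth"))
    pron = entry.find("form/pron")
    if pron is not None:
        line += f" [{text_of(pron)}]"
    pos = entry.find("gramGrp/pos")
    if pos is not None:
        line += f" {text_of(pos)}."
    lines = [line]
    for sense in entry.findall("sense"):
        for child in sense:
            kind = child.get("type")
            if child.tag == "cit" and kind == "trans":
                out = ", ".join(text_of(q) for q in child.findall("quote"))
                gen = child.find("gen")
                if gen is not None:
                    out += f" {text_of(gen)}."
                lines.append(out)
            elif child.tag == "cit" and kind == "example":
                quote = child.find("quote")
                if quote is None:
                    continue
                out = f'"{text_of(quote)}"'
                # A trans cit nested in an example translates the example.
                nested = child.findall("cit[@type='trans']/quote")
                if nested:
                    out += " = " + ", ".join(f'"{text_of(q)}"' for q in nested)
                lines.append(out)
            elif child.tag == "note":
                lines.append(f"({text_of(child)})")
    return "\n".join(lines)

sample = """<entry>
  <form><orth>dog</orth><pron>dɔg</pron></form>
  <gramGrp><pos>n</pos></gramGrp>
  <sense>
    <cit type="trans"><quote>Hund</quote><gen>m</gen></cit>
    <cit type="example">
      <quote>The dog is barking.</quote>
      <cit type="trans" xml:lang="de"><quote>Der Hund bellt.</quote></cit>
    </cit>
    <note>Dogs bite as well.</note>
  </sense>
</entry>"""
print(format_entry(sample))
```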
<entry>
<form>
<orth>ban</orth>
</form>
<gramGrp>
<pos>prep</pos>
</gramGrp>
<sense>
<cit type="trans"><quote>to</quote></cit>
<def>denotes infinitive of the following verb</def>
<cit type="example">
<quote>U nang ban thoh.</quote>
<cit type="trans">
<quote>Come here!</quote>
</cit>
</cit>
</sense>
</entry>
Please note that definitions should be used with care. Please use the usg tag for usage information (which can be further narrowed down with specific types named in the TEI standard) or notes if appropriate. def should only contain definitions.
<entry>
<form>
<orth>pynhiar</orth>
</form>
<gramGrp>
<pos>v</pos>
</gramGrp>
<sense>
<cit type="trans"><quote>abase</quote></cit>
<xr type="syn"><ref target="#pynrit">pynrit</ref></xr>
</sense>
</entry>
<entry xml:id="pynrit">
...
</entry>
This way of referencing might seem a bit counter-intuitive at first. xr groups all references of one type. The type attribute is optional, but may be given to make the database machine-readable as well. Values for the type include:
Value | Meaning |
---|---|
syn | synonym |
etym | etymological |
cf | compare or consult |
illus | illustration |
see | for loosely related entries |
Within the xr tag, you may also put in a short word like "compare" to please the eye of the human reader.
The ref tag marks a reference to another entry. The target attribute is also optional, but makes it unambiguous which headword you are linking to (useful for machine parsing). Only the text within the ref tag is shown in the human-readable dictionary.
The second, not fully shown, entry shows how the label used in the target attribute of the ref tag has to be defined, namely with the xml:id attribute. Note that in the reference, the label must be preceded by a hash sign #.
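Since a target only helps if the label it names actually exists, it can be worth checking this mechanically. The Python sketch below is an illustration under stated assumptions, not a FreeDict tool: the function name dangling_targets is hypothetical, and only targets starting with # are treated as references to entries in the same file.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under this qualified name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def dangling_targets(tei_source):
    """Return all '#label' targets of ref elements that match no xml:id."""
    root = ET.fromstring(tei_source)
    known_ids = {el.get(XML_ID) for el in root.iter() if el.get(XML_ID)}
    missing = []
    for el in root.iter():
        if local_name(el.tag) != "ref":
            continue
        target = el.get("target")
        if target and target.startswith("#") and target[1:] not in known_ids:
            missing.append(target)
    return missing

# Made-up sample: one reference resolves, one does not.
sample = """<body>
  <entry>
    <form><orth>pynhiar</orth></form>
    <sense>
      <xr type="syn"><ref target="#pynrit">pynrit</ref></xr>
      <xr type="cf"><ref target="#missing">missing</ref></xr>
    </sense>
  </entry>
  <entry xml:id="pynrit"><form><orth>pynrit</orth></form></entry>
</body>"""
print(dangling_targets(sample))
```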
<entry>
<form>
<orth>pungkjat</orth>
</form>
<gramGrp>
<pos>n</pos><gen>f</gen>
</gramGrp>
<sense>
<usg type="dom">bio</usg>
<cit type="trans">
<quote>leg</quote>
</cit>
</sense>
</entry>
The TEI standard lists a few example domains which we currently support. This is by no means complete and can be extended over time.
Tag | Definition |
---|---|
form | groups orth and pron elements |
orth | orthography; becomes a headword |
pron | pronunciation; optional |
gramGrp | grouping of grammatical information like pos (part of speech), gen (gender) and num (number); gramGrp is optional, but recommended |
sense | groups senses (translations); can be numbered with the n attribute |
usg | usage hints; it is suggested to use a type to further clarify the exact usage hint, see the TEI standard on usg |
cit | groups information relating to a translation equivalent; a type needs to be given: type="trans" for translations or type="example" for examples; additional grammatical information may be given as well and should be grouped again in a gramGrp element |
def | contains a definition; is taken over verbatim; in a bilingual dictionary, grammatical particles sometimes cannot be translated, therefore a definition of their function is more appropriate |
xr | a cross reference (to a different headword or translation), see examples section |
Please refer to the TEI Guidelines and take a look at the XML markup of a dictionary for more examples.
You can use the FreeDict Dictionary TEI XML example dictionary as a template. You can find it in our repository under shared/lg1-lg2 or here.
As dictionaries are used as authoritative sources for people looking up the spelling of words or learning foreign languages, it is important that they maintain a high quality standard.
The following quality criteria are important:
Quality Criteria for Dictionaries
Correctness: The reason for this was mentioned in the introductory paragraph.
Headword Count: It is frustrating not to find a word in a dictionary.
Usability: This is mainly a question for the platforms that our dictionaries are provided for. Dictionaries should be easy to install and word lookup should be easy as well. Electronic dictionaries have a great advantage in the speed of entry lookups and can provide lookup strategies that paper dictionaries can't. Paper dictionaries have the advantage of being quite portable, up to a weight that depends on the capacity of the bearer. Nowadays a PDA can replace a whole bunch of dictionaries while maintaining portability.
So what can be done for the above criteria? Indirectly you can always support the people working on dictionaries, sparing resources for them. To contribute directly, the following can be done:
Having no comfortable editor presently makes this a bit hard. It requires profound knowledge of the languages of the dictionary, so this task is reserved for experts.
When you write a dictionary for a language where no spellchecker wordlist exists yet, of course you can't do this. But mostly dictionaries will be from some language into English. The English part of the dictionary can be spellchecked! This requires a suitable tool.
Can help to spot entries without parts of speech information or carrying editorial marks requesting clarification. Requires a supporting tool.
This is a natural activity for Open Source Software.
Provided there is a way to enter/submit new entries, the question arises from where to get new entries or information to extend existing entries. This is a quite complex topic, so it might go into its own section one day when it has grown up (are there parallels to "How to grow a language" from Guy Steele?).
Having a miss during word lookup creates a likely candidate to be added to the dictionary if the query was not misspelled. Usually the translation can be found in another dictionary (that is what I do when I have a miss). This combination of headword and translation can be added to the dictionary you want to grow.
Doing it systematically can be called copying a dictionary. Watch out for author's rights here. Nobody can own words, but compiling a dictionary can be quite some work, so acknowledge this!
Often, having parts can help get you going. Having wordlists of the headwords or translation equivalents of one of the languages of the dictionary can help to grow a dictionary. With a wordlist you only have to answer slightly easier questions like "What is the translation of word XXX in the other language?" or "What is the part of speech of word XXX?".
Sometimes wordlists are quite easy to get. You can for example extract them from existing dictionaries or spellchecker databases. The tool index2wordlist.pl in the tools/testing directory can make a word list out of a dictd database index file. The aspell dump command can give you the wordlist of a database of the aspell spellchecker.
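As a rough illustration of the idea (not a port of index2wordlist.pl), the following Python sketch extracts headwords from the text of a dictd index. It assumes the usual layout of one tab-separated line per headword with the headword in the first column, and it assumes that metadata entries start with 00-database or 00database; the function name wordlist_from_index and the sample data are made up for this example.

```python
def wordlist_from_index(index_text):
    """Return the sorted, de-duplicated headwords of a dictd index."""
    words = set()
    for line in index_text.splitlines():
        if not line.strip():
            continue
        # Assumed layout: headword <TAB> offset <TAB> length
        headword = line.split("\t", 1)[0]
        # Skip metadata entries such as 00-database-info (assumed prefixes).
        if headword.startswith(("00-database", "00database")):
            continue
        words.add(headword)
    return sorted(words)

# Made-up index content; the offset and length columns are placeholders.
sample_index = "00-database-info\tA\tB\ndog\tC\tD\ncat\tE\tF\ndog\tG\tH\n"
print(wordlist_from_index(sample_index))
```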
For languages where you cannot reuse existing word lists, e.g. when you are spearheading the development of the first-ever dictionary of a minority language, the situation is slightly more difficult.
If electronic documents, preferably websites, exist in that language, you can use a Natural Language Processing technique that employs seed words as input. The seeds you give should be specific to your language, i.e. they should not be used in other languages. Then you can identify the electronic documents containing those seeds. They are likely to be in your language. From the documents in your language you can extract additional words, which you can reuse to find more documents in your language and more words in turn. In hypertexts you can exploit a locality feature of documents in a certain language: the links are likely to lead to documents in the same language. So you can get more words from there as well.
An implementation of this technique was done by Prof. Kevin Scannell with An Crúbadán, a crawler that uses the Google API(s) to find websites.
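The bootstrapping loop described above can be sketched in a few lines of Python. This is a simplified illustration and not how the crawler mentioned above works: it runs over a fixed, in-memory collection instead of crawling the web, the function names, the min_hits threshold and the number of rounds are assumptions, and the sample words are invented placeholders rather than words of a real language.

```python
import re

def tokenize(text):
    """Split text into a set of lower-cased words (letters only)."""
    return set(re.findall(r"[^\W\d_]+", text.lower()))

def bootstrap(seeds, documents, min_hits=2, rounds=3):
    """Grow a vocabulary from seed words over a fixed document collection."""
    vocabulary = set(seeds)
    accepted = set()
    for _ in range(rounds):
        newly_accepted = False
        for doc_id, text in documents.items():
            if doc_id in accepted:
                continue
            words = tokenize(text)
            # A document sharing enough known words is assumed to be
            # written in the target language; its words are harvested.
            if len(words & vocabulary) >= min_hits:
                accepted.add(doc_id)
                vocabulary |= words
                newly_accepted = True
        if not newly_accepted:
            break
    return accepted, vocabulary

# Invented placeholder data: page3 stands for a document in another language.
documents = {
    "page1": "alfa beto gamo delto",
    "page2": "gamo delto epsilo",
    "page3": "the dog is barking",
}
accepted, vocabulary = bootstrap({"alfa", "beto"}, documents)
print(sorted(accepted))
```

Note that page2 contains no seed word and is only found through words harvested from page1. A real collection would also need a check against words borrowed from other languages, which this sketch leaves out.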