
FreeDict HOWTO – Writing Text Encoding Initiative XML Files

John Dovey edited this page Feb 8, 2021 · 10 revisions

Writing Text Encoding Initiative XML Files

This chapter explains the format used within FreeDict to encode dictionaries. It is not complete and serves as an introduction to the format. A basic understanding of XML is required, mainly familiarity with its syntax.
For more advanced examples, the example dictionary within the fd-dictionaries repository may be a useful resource.

At this stage, you should have obtained a version of our tools from https://github.com/freedict/fd-dictionaries that may assist you in validating your dictionary. However, this is not required.

Before we dive into the TEI format specifics, it is useful to understand that every dictionary consists of a header and a body. The header stores all the general information of the dictionary, such as the author, number of headwords and its edition. The body contains the actual content, i.e., the headwords and translations. If this is your first time with TEI (or even with XML), you might want to skip the header section and start straight away with the section on the TEI body to experience some progress in your efforts. The header can always be added later. If you follow this approach, you might want to copy a header from one of the existing hand-written dictionaries, see https://github.com/freedict/fd-dictionaries.
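
As a rough orientation, the overall layout of a dictionary file can be sketched as follows. The root element and namespace are those of TEI P5 in general; the comments are placeholders, not real content.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- general information: title, author, edition, headword count, ... -->
  </teiHeader>
  <text>
    <body>
      <!-- the actual content: one entry element per headword -->
      <entry>
        <!-- ... -->
      </entry>
    </body>
  </text>
</TEI>
```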

The TEI Dictionary Header

The header of a dictionary encodes general information about the dictionary. This includes, but is not limited to, the author, the sources, the edition along with the headword count and the title. For brevity, we also include XML specifics in this section. The XML-aware reader may ignore this sloppiness. For people with XML knowledge and an interest in a more fine-grained overview of the TEI header, a look into the TEI Guidelines, Chapter 2 is recommended.

Declaration and Doctype

Each XML document starts with the XML declaration. FreeDict dictionaries are encoded in XML and use the UTF-8 encoding. If you do not know what UTF-8 is, we advise you to check your editor settings and change the encoding to UTF-8. UTF-8 allows each viewer of the dictionary to read the entered characters, no matter which language the computer uses. For our dictionaries, the XML declaration, placed on the first line, looks like this:

<?xml version="1.0" encoding="UTF-8"?>

The next lines of the TEI dictionaries consist of processing instructions for the stylesheet and the schema, followed by the Document Type (Doctype):

<?xml-stylesheet type="text/css" href="freedict-dictionary.css"?>
<?oxygen RNGSchema="freedict-P5.rng" type="xml"?>
<!DOCTYPE TEI SYSTEM "freedict-P5.dtd">

The Doctype configures the way parsing tools interpret TEI XML. For FreeDict, it defines the structure of the document and defines the subset of TEI that we use. These lines rarely change for our dictionaries and can be safely copied from an existing dictionary.

Header Elements

In the following, an XPath expression is given together with an explanation of what it is meant for. If you don't know how XPath works, imagine each name to be an XML tag name and the slash a symbol for a parent-child relation. So, for instance, a/b speaks of an element b contained in a, or in XML: <a><b>...</b></a>.

It is recommended to have a look at the actual dictionaries to get a feeling for a complete TEI header.
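
The XPath expressions in the following subsections can be read against a skeleton like the one below. It is only a sketch of how the stanzas nest inside the teiHeader element, in the order used by the TEI Guidelines. The content of every stanza is elided and should be taken from the descriptions below or from an existing dictionary.

```xml
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title><!-- ... --></title>
      <respStmt><!-- resp and name --></respStmt>
    </titleStmt>
    <editionStmt><!-- edition --></editionStmt>
    <extent><!-- ... --></extent>
    <publicationStmt><!-- publisher, availability, date, pubPlace --></publicationStmt>
    <notesStmt><!-- note --></notesStmt>
    <sourceDesc><!-- ... --></sourceDesc>
  </fileDesc>
  <encodingDesc>
    <projectDesc><!-- ... --></projectDesc>
  </encodingDesc>
  <revisionDesc><!-- change --></revisionDesc>
</teiHeader>
```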

fileDesc/titleStmt/title

The title of the dictionary. This becomes the short description of the dictionary in dictionary programs.

fileDesc/titleStmt/respStmt/resp

Declare the author and optionally a maintainer for this dictionary. A maintainer is responsible for keeping a dictionary in good shape and for handling incoming feature requests. This may or may not be the same person as the author.

Example:

  <titleStmt>
    <!-- ... -->
    <respStmt>
      <resp>created by</resp>
      <name>your name</name>
    </respStmt>
    <respStmt>
      <!-- maintainer for the freedict database -->
      <resp>Maintainer</resp>
      <name>your name again</name>
    </respStmt>
  </titleStmt>

fileDesc/titleStmt/respStmt/name

Name and email address of the person carrying the responsibility named in ../resp, i.e. the contents should follow the form FirstName LastName <user@host>.
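
Because < starts a tag in XML, the opening angle bracket around the email address has to be written as the entity &lt;; writing the closing one as &gt; is customary. A minimal sketch, with a made-up name and address:

```xml
<respStmt>
  <resp>Maintainer</resp>
  <!-- hypothetical person; note the escaped angle brackets -->
  <name>Jane Doe &lt;jane.doe@example.org&gt;</name>
</respStmt>
```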

fileDesc/editionStmt

This stanza must contain a tag called edition. This becomes the release version number. It is shown on the website and used in building filenames for releases. It is recommended to use only numbers and dots. Also, two levels of versioning should be enough, i.e. 0.1 is a good start.

For automated imports, it is advisable to use a version number such as YYYY.MM.DD (year.month.day).

Example:

<editionStmt>
  <edition>0.8</edition>
</editionStmt>
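
For an automated import, the same stanza could carry a date-based version instead. The date shown is only a placeholder:

```xml
<editionStmt>
  <!-- YYYY.MM.DD of the import run; illustrative value -->
  <edition>2021.02.08</edition>
</editionStmt>
```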

fileDesc/extent

This tag has to contain the approximate number of headwords, including the unit "headwords". It is put into 00-database-info. The headword count for the website is extracted from the .index file of the dictd database format.

Ideally the headword count should be exact. If it cannot be exact, the size must be prepended with "about".

Example:

  <extent>4500 headwords</extent>
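
If the count is only an estimate, the same tag might read as follows (the number is illustrative):

```xml
<extent>about 4500 headwords</extent>
```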

publicationStmt/publisher

In the publication stanza, most information about the project is gathered. The publisher tag should contain the word "FreeDict". While you might be the author, the publisher is the project as a whole.

Example:

<publicationStmt>
  <publisher>FreeDict</publisher>
  ...
</publicationStmt>

publicationStmt/availability

This stanza documents licensing information.

Example:

<publicationStmt>
  <availability>
    <p>Available under the terms of the GNU General Public Licence, version 3 (or at your option any later version, published by the FSF).</p>
  </availability>
</publicationStmt>

publicationStmt/date

This tag contains the release date of the database. For machine readability, an additional when attribute containing an ISO-8601-formatted date in the format "YYYY-MM-DD" may be supplied.

Example:

<publicationStmt>
  <date when="2016-05-07">7th May 2016</date>
  ...
</publicationStmt>

publicationStmt/pubPlace

The place identifier from where this dictionary can be obtained. It has to be set to https://freedict.org.

Example:

<publicationStmt>
  <pubPlace>
    <ref>https://freedict.org/</ref>
  </pubPlace>
  ...
</publicationStmt>

notesStmt/note

Within the notesStmt stanza, multiple <note/> tags may be used to document any special things regarding this dictionary.

One note tag with the attribute type="status" has to exist: it documents the size of the database for the FreeDict XML API. See the example below. The possible values are documented in the FAQ.

The notes from this tag are put into 00-database-info and are available on the website as well. However, currently a "more info on this dictionary" page doesn't exist.

Example:

<notesStmt>
  <note type="status">small</note>
  <note>Some note documenting anything special...</note>
</notesStmt>

fileDesc/sourceDesc

Any information about the source. Can be formatted in tags like p or list.
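
A sketch of a sourceDesc stanza that uses both a paragraph and a list. The wording and the sources named are invented for illustration. The list is nested inside the paragraph here, on the assumption that the schema does not accept p and list side by side at this level.

```xml
<sourceDesc>
  <p>Compiled by hand from the following sources:
    <list>
      <item>field notes of the author</item>
      <item>a public-domain word list</item>
    </list>
  </p>
</sourceDesc>
```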

fileDesc/sourceDesc/xptr

This is an optional source URL for the dictionary, most commonly used if the dictionary is or was automatically converted.

The value becomes the source URL on the website and is put into 00-database-url when converting to the dictd database format.

encodingDesc/projectDesc

This tag contains a stanza concerning the project. It should be contained in every dictionary and looks like this:

<encodingDesc>
  <projectDesc>
    <p>This dictionary comes to you through nice people making it available for free and for good.</p>
    <p>It is part of the FreeDict project, http://www.freedict.org.</p>
    <p>This project aims to make many translating dictionaries available for free. Your contributions are welcome!</p>
  </projectDesc>
</encodingDesc>

It makes sure that people can figure out where a particular dictionary comes from and why it is available.

revisionDesc

This tag contains the changes which people have made to this dictionary. It is not meant to release the maintainer from writing useful commit messages in the VCS, but rather to give the end user an overview of what has changed.

Ideally the ChangeLog files, distributed along with the released dictionaries, could be automatically generated using this information.

revisionDesc/change

Each version should have one change. The attribute n documents the version number to which a change corresponds. Changes that lack a version number are implicitly counted towards the next more recent versioned change. It is advised to give the date in the when attribute, using the ISO format YYYY-MM-DD. Optionally, the date can also be documented in a subtag called date.

Within the change, paragraphs (p) as well as lists (list) are allowed.

Example:

<change n="0.1" who="#some_user" when="2016-05-20">
  <date>2016-05-20</date> <!-- optional, may be localised -->
  <list>
    <item>increase readability by doing xyz</item>
    <item>add more words</item>
  </list>
</change>

Entry Examples

In the following sections, a few example entries are discussed. Please note that these do not cover all possible entries; depending on your particular needs, it might be necessary to use a slightly different encoding. On the other hand, FreeDict slightly limits the TEI standard to achieve some consistency. When in doubt, please ask on the mailing list.

The explanation of the tags is postponed to the next section.

TEI Body

A minimal entry

<entry>
  <form>
    <orth>dog</orth>
  </form>
  <sense>
    <cit type="trans">
      <quote>Hund</quote>
    </cit>
  </sense>
</entry>

This entry would be formatted into something like this:

dog
    Hund

Have a look at the next example for explanations.

A more complete entry

<entry>
  <form>
    <orth>dog</orth>
    <pron>dɔg</pron> <!-- IPA pronunciation -->
  </form>
  <gramGrp> <!-- grammatical information -->
    <pos>n</pos> <!-- part of speech -->
  </gramGrp>
  <sense>
    <cit type="trans">
      <quote>Hund</quote><gen>m</gen>
    </cit>
    <cit type="example">
      <quote>The dog is barking.</quote>
      <cit type="trans" xml:lang="de">
        <quote>Der Hund bellt.</quote>
      </cit>
    </cit>
    <note>Dogs bite as well.</note>
  </sense>
</entry>

After formatting, it might look like this:

dog [dɔg] n.
  Hund m.
  "The dog is barking." = "Der Hund bellt."
  (Dogs bite as well.)

orth stands for orthography (spelling) and is used as the headword of an entry. If multiple spellings exist, multiple orth elements can be used. The pron element contains the pronunciation of a word. It is optional and is generated automatically for many languages by eSpeak-NG. FreeDict uses the IPA (International Phonetic Alphabet), whose symbols are also part of Unicode.
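
As a small sketch of an entry with two spellings, both orth elements are placed in the same form. The words are chosen for illustration only:

```xml
<entry>
  <form>
    <!-- each spelling becomes a headword -->
    <orth>colour</orth>
    <orth>color</orth>
  </form>
  <sense>
    <cit type="trans">
      <quote>Farbe</quote>
    </cit>
  </sense>
</entry>
```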

In a gramGrp element, all grammatical information is grouped. You can give the part of speech (here n for noun), the gender and also the number (singular, plural) of the headword (number is not given in this example). The values allowed for the part of speech are prescribed in this document, so that one author doesn't write n, another noun and a third N or something else. See Table 5.1, “Part of Speech Typology (recommended contents of the pos element)”.

Translations for headwords are specified in a cit element containing a quote. A cit element may contain multiple quote elements, and a sense may contain multiple cit elements. type="trans" marks a word as a translation, while type="example" marks an example for the dictionary reader. As an exception, a cit of type trans within an example counts as the translation of the example, not of the headword.

You can group the different senses of homographs with sense. The numbering (given with the attribute n) is optional. Each sense holds its translations as cit elements of type trans, as described above; multiple translations may be given, and grammatical information for each is optional. With cit elements of type example you can give examples of usage and optionally their translation.
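
Numbered senses might be encoded as in the following sketch, which uses the same cit and quote encoding as the dog example. The headword and translations are chosen for illustration:

```xml
<entry>
  <form>
    <orth>bank</orth>
  </form>
  <gramGrp>
    <pos>n</pos>
  </gramGrp>
  <sense n="1"> <!-- financial institution -->
    <cit type="trans">
      <quote>Bank</quote>
    </cit>
  </sense>
  <sense n="2"> <!-- edge of a river -->
    <cit type="trans">
      <quote>Ufer</quote>
    </cit>
  </sense>
</entry>
```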

Entry with a definition as well as a translation and an example sentence

<entry>
  <form>
    <orth>ban</orth>
  </form>
  <gramGrp>
    <pos>prep</pos>
  </gramGrp>
  <sense>
    <cit type="trans"><quote>to</quote></cit>
    <def>denotes infinitive of the following verb</def>
    <cit type="example">
      <quote>U nang ban thoh.</quote>
      <cit type="trans">
        <quote>Come here!</quote>
      </cit>
    </cit>
  </sense>
</entry>

Please note that definitions should be used with care. Please use the usg tag for usage information (which can be further narrowed down with the specific types named in the TEI standard), or notes if appropriate. def should only contain definitions.

Entry with a cross reference to a synonym

<entry>
  <form>
    <orth>pynhiar</orth>
  </form>
  <gramGrp>
    <pos>v</pos>
  </gramGrp>
  <sense>
    <cit type="trans"><quote>abase</quote></cit>
    <xr type="syn"><ref target="#pynrit">pynrit</ref></xr>
  </sense>
</entry>

<entry xml:id="pynrit">
  ...
</entry>

This way of referencing might seem a bit counter-intuitive at first. xr groups all references of one type. The type attribute is optional, but may be given to make the database also machine-readable. Values for the type include:


syn    synonym
etym   etymological
cf     compare or consult
illus  illustration
see    for loosely related entries


Within the xr tag, you may also put in a short word like "compare" to please the eye of the human reader.

The ref tag marks a reference to another entry. The target attribute is optional, but makes it unambiguous which headword you are linking to (useful for machine parsing). Only the text within the ref tag is shown in the human-readable dictionary.

The second, not fully shown, entry shows how the label within the target attribute of the ref tag has to be defined. Note that in the reference, the label must be preceded with a hash sign #.
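
A sketch of a reference that carries a short word for the human reader inside the xr tag. The type cf and the word "compare" are used here purely for illustration, reusing the headwords from the example above:

```xml
<sense>
  <cit type="trans"><quote>abase</quote></cit>
  <!-- "compare" is plain text for the reader; ref points at xml:id="pynrit" -->
  <xr type="cf">compare <ref target="#pynrit">pynrit</ref></xr>
</sense>
```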

Entry with domain for translation

<entry>
  <form>
    <orth>pungkjat</orth>
  </form>
  <gramGrp>
    <pos>n</pos><gen>f</gen>
  </gramGrp>
  <sense>
    <usg type="dom">bio</usg>
    <cit type="trans">
      <quote>leg</quote>
    </cit>
  </sense>
</entry>

The TEI standard lists a few example domains which we currently support. This is by no means complete and can be extended over time.

Supported elements in entries (children of Entry)

Tag      Definition
form     groups orth and pron elements
orth     orthography; becomes a headword
pron     pronunciation; optional
gramGrp  grouping of grammatical information like pos (part of speech), gen (gender) and num (number); gramGrp is optional, but recommended
sense    groups senses (translations); can be numbered with the n attribute
usg      usage hints; a type is suggested to further clarify the exact usage hint, see the TEI standard on usg
cit      groups information relating to a translation equivalent; a type needs to be given: type="trans" for translations or type="example" for examples; additional grammatical information may be given as well and should be grouped again in a gramGrp element
def      contains a definition; is taken over verbatim; in a bilingual dictionary, grammatical particles sometimes cannot be translated, therefore a definition of their function is more appropriate
xr       a cross reference (to a different headword or translation), see examples section

Please refer to the TEI Guidelines and take a look at the XML markup of a dictionary for more examples.
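
As a sketch of what the cit row in the table describes, grammatical information about a translation can be wrapped in its own gramGrp. The words are taken from the dog example:

```xml
<cit type="trans">
  <quote>Hund</quote>
  <gramGrp>
    <gen>m</gen> <!-- gender of the translation, not of the headword -->
  </gramGrp>
</cit>
```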

TEI Dictionary Template

You can use the FreeDict Dictionary TEI XML example dictionary as a template. You can find it in our repository under shared/lg1-lg2.

Dictionary Quality

As dictionaries are used as authoritative sources for people looking up the spelling of words or learning foreign languages, it is important that they maintain a high quality standard.

The following quality criteria are important:

Quality Criteria for Dictionaries

Correctness: The reason for this was mentioned in the introductory paragraph.

Headword Count: It is frustrating not to find a word in a dictionary.

Usability: This is mainly a question for the platforms that our dictionaries are provided for. Dictionaries should be easy to install and word lookup should be easy as well. Electronic dictionaries have a great advantage in the speed of entry lookups and can provide lookup strategies that paper dictionaries can't. Paper dictionaries have the advantage of being quite portable, up to a number of kg that depends on the capacity of the bearer. Nowadays a PDA can replace a whole stack of dictionaries while remaining portable.

Means to Improve Dictionary Quality

So what can be done for the above criteria? Indirectly you can always support the people working on dictionaries, sparing resources for them. To contribute directly, the following can be done:

Revise entries manually

The current lack of a comfortable editor makes this a bit hard. It requires profound knowledge of the languages of the dictionary, so this task is reserved for experts.

Spellcheck the dictionary

When you write a dictionary for a language where no spellchecker wordlist exists yet, of course you can't do this. But mostly dictionaries will be from some language into English. The English part of the dictionary can be spellchecked! This requires a suitable tool.

Check sanity/completeness of entries

This can help to spot entries that lack part-of-speech information or carry editorial marks requesting clarification. It requires a supporting tool.

Report and fix bugs

This is a natural activity for Open Source Software.

Grow the dictionary

Provided there is a way to enter/submit new entries, the question arises from where to get new entries or information to extend existing entries. This is a quite complex topic, so it might go into its own section one day when it has grown up (are there parallels to "How to grow a language" from Guy Steele?).

Having a miss during word lookup creates a likely candidate to be added to the dictionary if the query was not misspelled. Usually the translation can be found in another dictionary (that is what I do when I have a miss). This combination of headword and translation can be added to the dictionary you want to grow.

Doing it systematically can be called copying a dictionary. Watch out for author's rights here. Nobody can own words, but compiling a dictionary can be quite some work, so acknowledge this!

Often, having parts can help get you going. Wordlists of the headwords or translation equivalents of one of the languages of the dictionary can help grow a dictionary. With a wordlist, you only have to answer slightly easier questions like "What is the translation of word XXX in the other language?" or "What is the Part of Speech of word XXX?".

Sometimes wordlists are quite easy to get. You can for example extract them from existing dictionaries or spellchecker databases. The tool index2wordlist.pl in the tools/testing directory can make a word list out of a dictd database index file. The aspell dump command can give you the wordlist of a database of the aspell spellchecker.

For languages where you cannot reuse existing word lists, e.g. when you are spearheading the development of the first-ever dictionary of a minority language, the situation is slightly more difficult.

If electronic documents (preferably websites) in that language exist, you can use a Natural Language Processing technique that employs seed words as input. The seeds you give should be specific to your language, i.e. they should not be used in other languages. Then you can identify the electronic documents containing those seeds. They are likely to be in your language. From the documents in your language you can extract additional words, which you can reuse to find more documents in your language and more words in turn. In hypertexts you can exploit a locality feature of documents in a certain language: the links are likely to lead to documents in the same language. So you can get more words from there as well.

Prof. Kevin Scannell implemented this technique in An Crúbadán, a crawler that uses the Google API(s) to find websites.
