discussion TEI
- Points 1-9 are taken directly from an ML thread, where they are numbered A1-A9.
- Points 10-X also stem from around that thread.
-
Questions
- What about the TEI Lex-0 standard?
- Should it be followed?
-
Examples
- a) <gram type="gender"/> instead of <gen/>.
- b) <usg> with @type (and possibly @norm)
-
Potential advantages
- good, fixed list of usg types (see this comparison table)
  - The useful @types textType and attribute have no equivalents in the TEI Guidelines' suggested values.
    - textType examples: bibl., poet., admin., journalese
    - attribute examples: derog., euph.
- Requirement to fully annotate with @xml:id and @xml:lang
-
Further questions:
- Should textType and attribute just be borrowed from TEI Lex-0?
- Where to annotate with @xml:id and @xml:lang?
-
Answers
- The FreeDict conversion style sheets do not support TEI Lex-0. (FreeDict TEI is in parts incompatible with TEI Lex-0)
- "It all boils down to somebody reading the document, defining our specific requirements and potentially modification and implementing it." / @shumenda
- The TEI Lex-0 guidelines may be used in addition wherever they do not contradict the FreeDict or TEI guidelines.
- TEI Lex-0 is meant to encode retrodigitized dictionaries including presentational information, while FreeDict TEI is not concerned with such.
- Consider switching someday to another (related) standard: ISO LMF-4
- No public information yet.
- ISO standard is not available for free
- There is a skeletal example document
-
See also: this thread on the mailing list.
-
Status quo
-
Question: How to annotate transitivity information?
-
Answer: The use of subc is strongly recommended.
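A minimal sketch of how transitivity might be recorded with subc. The headword and the label "trans" are illustrative assumptions, not values prescribed by the answer above.

```xml
<entry>
  <form><orth>notify</orth></form>
  <gramGrp>
    <pos>v</pos>
    <!-- "trans" is an assumed label for a transitive verb -->
    <subc>trans</subc>
  </gramGrp>
</entry>
```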
-
Question: How can I enrich my dictionary with pronunciation, as annotated in <pron> tags?
-
Answer: Unless already present, the standard build process, using make, adds phonetics information using the teiaddphonetics script (which internally uses espeak[-ng]).
-
Question: Should usage annotations (the content of <usg> tags) be normalized?
- different languages (e.g. "[Sprw.]" ~ "[prov.]")
- same language (e.g. "[coll.]" ~ "[slang]")
-
Notes:
- Recommended by TEI Lex-0.
- The use of @norm in <usg> might make this less of an issue.
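A sketch of how @norm might tie differing surface labels to one normalised value. This is a fragment, and the value "prov" is an assumption chosen for illustration; no fixed list of normalised labels is defined here.

```xml
<!-- German label, kept verbatim, with an assumed normalised value -->
<usg norm="prov">Sprw.</usg>
<!-- English label pointing at the same normalised value -->
<usg norm="prov">prov.</usg>
```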
-
Sub-questions
- Should they be normalised to a single label?
- Should they be normalised to some standard labels?
- ISO 12620 (cf. Wikipedia:Registers) (full standard only commercially available)
-
Answers
- An ontology should be defined.
  - Questions:
    - Similar to / linked to shared/FreeDict_ontology.xml?
      - This seems to only allow linking equivalent annotations in different languages, however not "coll." and "slang" (if these should even be considered equivalent).
    - Where to find documentation on writing such an ontology?
-
Examples
- "mainly Am."
- "bes. Süddt.", "especially Am."
-
Question
- How to represent the determiner ("mainly", "bes.", ...)?
-
Notes
- TEI Lex-0 suggests a separate attribute, but not which (there is a TODO in the doc).
  - None of the <usg> annotations really fit, maybe @subtype?
-
Answer
- Likely the easiest: <usg type="hint">mainly Am.</usg>
-
classes of such annotation
- a) dialect
  - Ex.: "[Br.]", "[Am.]", "[Ös.]", "[Sächs.]"
  - distinction from b) partially unclear (e.g., "Am.")
- b) Region or country
  - Ex.: "[South Africa]", "[Hessen]", "[Berlin]", "[Wien]"
- c) Ex.: "[French]", "[Lat.]"
-
Questions
- How to annotate/distinguish the above classes?
-
Notes
- TEI Lex-0: usg[@type="geographic"]: "marker which identifies the place or region where a lexical unit is mainly used"
  - Matches b), potentially partly a).
-
Answers
- a), b): usg[@type="geo"]
- c): usg[@type="lang"]
- See the TEI Guidelines' corresponding section.
- Alternatively: Craft a new type and document it in the header (usg types may be freely chosen according to the TEI Guidelines).
  - Also consider adopting such a new type in the FreeDict guidelines.
  - Use plain text but name the tag and attribute name explicitly.
- Consider using a list of languages (e.g., this).
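The answers for the three classes could look roughly like this inside a sense. The labels are taken from the examples listed earlier; grouping them in a single sense is only for brevity.

```xml
<sense>
  <!-- a), b): dialect, region or country -->
  <usg type="geo">Am.</usg>
  <usg type="geo">Wien</usg>
  <!-- c): language label -->
  <usg type="lang">Lat.</usg>
</sense>
```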
-
Notes example (p and list elements are both fine):
<notesStmt>
  <note type="status">small</note> <!-- mandatory for our DB -->
  <note xml:lang="de"> <!-- can be freely chosen -->
    <list><item>blah</item></list>
  </note>
</notesStmt>
-
Cases
- a) Headwords, which are annotations.
  - rare
- b) Annotated on headwords.
-
Question: How to represent in TEI?
-
Notes
-
Answers
- An entry should only contain a single form tag.
- An entry/form may contain a nested form[@type="abbrev"] element.
- In the case of a standalone abbreviation, the corresponding form element right below entry should be annotated with @type="abbrev".
  - potential issue: Shouldn't the topmost form elements have @type="lemma"?
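A sketch of the two abbreviation cases described in the answers above. The headword and its abbreviation are hypothetical examples, and the fragment shows two separate entries.

```xml
<!-- Headword carrying an abbreviation: nested form inside the single top-level form -->
<entry>
  <form>
    <orth>for example</orth>
    <form type="abbrev"><orth>e.g.</orth></form>
  </form>
</entry>

<!-- Standalone abbreviation: the form right below entry is typed "abbrev" -->
<entry>
  <form type="abbrev"><orth>e.g.</orth></form>
</entry>
```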
- Answer: Both are fine (also in parallel).
- Consider putting gramGrp inside form, when also in sense.
-
Question: Currently <availability> is suggested and used exclusively (for licensing information). Why not <license>?
-
Answer: The style sheets do not permit <license>, so the validation would fail.
- Consider changing this in a future style sheet update.
-
Q: Where to annotate a date specific to the source from which the final TEI was imported?
-
A: Annotate within sourceDesc.
- Q: As plain text?
-
HowTo:
<ref>https://freedict.org/</ref>
-
(example) TEI:
<ref target="http://freedict.org/">http://freedict.org/</ref>
-
A: The HowTo is right.
-
Question: What to use when the TEI output is both influenced by a source's version and an importer's version?
-
Answers
- Whatever works or seems logical.
- Options: srcver.importerver | date | srcver | srcver.date
- Q: Set author of importer as editor?
- TEI Guidelines: "[...] acting as editor, compiler, translator, etc."
- A: Permitted.
A: Content!
-
Options:
- superEntry/entry
- entry/sense
- entry/hom
- entry/entry
  - illegal in (FreeDict) TEI, suggested in TEI Lex-0.
-
Q: Is superEntry ok?
- A: No. "It doesn't seem necessary at all and is on its way out, in general." / @bansp
- A: Not handled by the style sheets. Also, hom is ignored.
-
Q: [imported dictionaries] What if it is not clear from the source whether two homographs qualify as senses of the same word?
- Note: The "Ding" dictionary contains many words repeatedly, usually with (close to) identical meaning.
- Q: If grouping, what to do with potentially differing annotations, including abbreviations, gramGrp, inflected forms?
  - Q: Only keep what applies to all on the top level?
  - Q: Are all the tags valid e.g. on the sense level?
- Examples: "{v}" - the braces, ";", "~" - for references
- A: Drop
- A: Drop.
- Ex.: "Avis {m,n}" (German)
- A: Two <gen> in a single gramGrp.
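For the "Avis {m,n}" case, the answer could be sketched as follows. The gender values are copied from the source notation and the pos value is an assumption; a dictionary may use other labels.

```xml
<entry>
  <form><orth>Avis</orth></form>
  <gramGrp>
    <pos>n</pos>
    <!-- both genders side by side in one gramGrp -->
    <gen>m</gen>
    <gen>n</gen>
  </gramGrp>
</entry>
```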
-
Examples:
- "bread (baked in an oven)"
- "bread (wheat product)"
-
Options:
- <note>
- <usg> -- @type="hint"?
  - Usually used for more specific usages, e.g. "Am.", "med.".
- <def>
-
Answers:
- [imported dictionary] When indistinguishable, use <note>.
- When writing by hand, try to distinguish (def, usg with specific @type).
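A sketch contrasting the two answers, using the "bread" examples from above. Treating "wheat product" as a definition is an assumption made for illustration only.

```xml
<!-- Imported data: the kind of parenthetical is unknown, so it stays generic -->
<sense>
  <note>baked in an oven</note>
</sense>

<!-- Hand-written data: the parenthetical is recognised as a definition -->
<sense>
  <def>wheat product</def>
</sense>
```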
-
Cases:
- a) case information: "wegen {+Gen.}"
  - see 17.2)
- b) auxiliary words representing an object
  - b.1) suffixing: "eat sth."
  - b.2) prefixing: "etw. essen"
  - b.3) alternatives: "notify sth./sb."
    - b.3.1) switchable words: "to file away <> sth." (indicating the alternatives "to file away sth." and "to file sth. away")
  - b.4) several: "give sth. to sb."
    - potentially both prefixing and suffixing
- c) specific word(s)
  - c.1) suffixing: "dismounting (of a machine)"
  - c.2) prefixing
  - c.3) combinations
- d) combinations of a), b), c)
-
Available tags
- <colloc> (occurs in <gramGrp>)
  - attribute @type="left"?
    - possible conflict with @type as suggested in 17.2).
- <usg type="colloc">
  - attribute @subtype="left"?
- <cit type="colloc">
  - Nested inside <cit type="trans">, seen in eng-pol.
    - See also 25.3)
-
Answers
- For a), see 17.2).
- b) colloc
- c) usg[@type="colloc"]
-
Proposed answers:
- b.i): <colloc>. This is grammar information.
- b.ii): @type or @subtype with value obj (or similar).
- c): <usg type="colloc"> / <cit type="colloc">. This is not grammar information.
- location: @subtype="left" resp. "right".
- order: keep both <colloc> and <usg type="colloc"> (resp. cit) in the original order.
  - Keeping the order of the union of both is impossible with the given suggestion, but things like "(of a machine)" are supposed to be optional anyway.
- b.3) (alternatives)
  - i) group in <choice> or similar.
  - ii-iv) see below
  - v) Use @n to define an order. Interchangeable collocates get the same @n.
  - iii) conflicts with several subsequent <colloc>s
<form><!-- ii) -->
  <orth>notify</orth>
  <gramGrp><colloc>sth.</colloc></gramGrp>
  <form type="alternate">
    <orth>notify</orth>
    <gramGrp><colloc>sb.</colloc></gramGrp>
  </form>
</form>
<!-- OR iii) -->
<form>
  <orth>notify</orth>
  <gramGrp>
    <colloc>sth.</colloc>
    <colloc>sb.</colloc>
  </gramGrp>
</form>
<!-- OR iv) -->
<form>
  <orth>notify</orth>
  <gramGrp>
    <colloc>sth./sb.</colloc>
  </gramGrp>
</form>
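The proposed answers for b), c) and location might combine as sketched here, using "eat sth." and "dismounting (of a machine)" from the cases above. Whether @subtype on colloc and usg validates against the FreeDict schema is an assumption that has not been checked.

```xml
<!-- b) object placeholder following the headword: grammar information -->
<entry>
  <form>
    <orth>eat</orth>
    <gramGrp><colloc subtype="right">sth.</colloc></gramGrp>
  </form>
</entry>

<!-- c) specific words following the headword: not grammar information -->
<entry>
  <form><orth>dismounting</orth></form>
  <sense>
    <usg type="colloc" subtype="right">of a machine</usg>
  </sense>
</entry>
```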
-
How to encode "{+Gen.}", indicating that an object in the genitive case should follow?
- Special case: "{wo?, wann? +Dat.}" -- further enriched with corresponding interrogative pronoun(s)
- Similarly for POS: "{+conj}"
-
Option:
- <colloc>[+ Gen.]</colloc> (where "Gen." might be changed to something else)
  - Derived from TEI Lex-0
  - [] is not very nice.
  - Likely use a non-language-specific case-abbreviation (i.e., "gen")
- <colloc type="case">
  - Would require a corresponding type for regular collocates, such as in the TEI Guidelines' example "médire de".
    - Option: @type="plain".
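A sketch of the second option for "wegen {+Gen.}", with the case written as a non-language-specific abbreviation. The pos value "prep" is an assumption, and the option itself is still under discussion above.

```xml
<entry>
  <form><orth>wegen</orth></form>
  <gramGrp>
    <pos>prep</pos>
    <!-- "gen": the following object takes the genitive case -->
    <colloc type="case">gen</colloc>
  </gramGrp>
</entry>
```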
-
See also: 17), in particular @type="left".
-
Consider "[formal/Am.]" vs. "[formal] [Am.]".
- The former indicates a disjunction, the latter a conjunction of the two annotations.
- Also possible with grammar annotations.
-
Q: How to differentiate?
-
Options:
- a) Don't.
- b) For grammar annotations: Several gramGrps.
- c) Literal retaining of the slash (or similar separator).
  - May prevent setting a common @type (such as in the example above).
- d) Something like <choice> for disjunctions.
- Options
  - Short English forms from shared/FreeDict_ontology.xml
  - Anything, but link to that ontology, as done in eng-pol.tei.
-
Example: "biological breakdown/degradation"
-
Q: How to encode?
-
Options:
- literally
- derive two distinct headwords/translations
  - headwords:
    - link with xr/ref
    - sub-form with @type="alternate" or similar.
  - translations: separate cit elements
- Something else (e.g. something like choice)
  - likely only an option for translations.
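For the translation side, deriving two distinct translations from "biological breakdown/degradation" might look like this. It sketches only the "separate cit elements" option; the other options are not shown.

```xml
<sense>
  <!-- one cit per derived alternative -->
  <cit type="trans"><quote>biological breakdown</quote></cit>
  <cit type="trans"><quote>biological degradation</quote></cit>
</sense>
```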
- A: Only if they contain any information within a sense, such as a reference (<ref>).
  - only gramGrp or inflected forms are insufficient.
-
Cases
- a) same main part: "v/trans" + "v/intr"
  - Example: "essen {vt;vi}"
- b) different main part (awkward): "v/trans" + "pron/rel"
-
Options
- a.1) One pos followed by several subc.
- *.2) Two pairs of pos, subc
- *.3) two gramGrp
- *.4) only (two) pos, content e.g. "vt".
- a.5) trans/intr
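Two of the options, sketched as a fragment for "essen {vt;vi}". The labels "v", "trans" and "intr" are assumptions; the list above does not settle on one option.

```xml
<!-- a.1) one pos followed by several subc -->
<gramGrp>
  <pos>v</pos>
  <subc>trans</subc>
  <subc>intr</subc>
</gramGrp>

<!-- *.3) two gramGrp elements, one per reading -->
<gramGrp><pos>v</pos><subc>trans</subc></gramGrp>
<gramGrp><pos>v</pos><subc>intr</subc></gramGrp>
```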
-
Status quo
- ML, Wiki, lg1-lg2.tei: infl
- TEI Guidelines, TEI Lex-0: inflected
-
A: infl
- FreeDict-TEI specific
- Consider changing this someday.
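Assuming the value in question is the @type of a nested form for inflected forms, the FreeDict convention might look like this. The headword and its inflected form are illustrative.

```xml
<entry>
  <form>
    <orth>go</orth>
    <!-- FreeDict TEI uses "infl"; the TEI Guidelines and TEI Lex-0 use "inflected" -->
    <form type="infl"><orth>went</orth></form>
  </form>
</entry>
```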
- A: OK.
- Not permitted by the TEI Guidelines.
- Neither is usg inside usg (where the latter might have @type="colloc").
- Example (from Ding): "{prp; +Gen.; +Dat. [ugs.]}"
- See also 17.2) on why "+Dat." becomes a colloc element.
-
Possible annotations
- [answered] usg
  - Depending on @type?
  - Q: Use nested cit instead?
    - eng-pol has e.g.: <cit type="colloc">
- gramGrp
  - Q: Exclude information that can be safely derived from the corresponding source language's gramGrp?
- colloc -- probably yes
- note
  - Example: "Kleinbären {pl} (Procyonidae) (zoologische Familie) [zool.] :: procyonids (zoological family)"
    - Suggestion: first two () become <note>s inside entry/sense, the last one a <note> inside <cit type="trans">.
- [answered] Abbreviations
  - How?
- inflected forms // was 26)
  - Likely yes.
  - How?
- [answered] examples // was 21)
  - (It's common to have an example for a headword, together with a translation.)
  - (Question is, what about examples particular to the translation.)
  - Likely realisation: <cit type="trans"><quote /><cit type="example" /></cit>
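The suggestion for the "Kleinbären" example could be sketched as follows. The usg type "dom" for "[zool.]" and the placement of gramGrp are assumptions; only the note placement comes from the suggestion above.

```xml
<entry>
  <form><orth>Kleinbären</orth></form>
  <gramGrp><num>pl</num></gramGrp>
  <sense>
    <usg type="dom">zool.</usg>
    <!-- the first two parentheticals belong to the headword side -->
    <note>Procyonidae</note>
    <note>zoologische Familie</note>
    <cit type="trans">
      <quote>procyonids</quote>
      <!-- the last parenthetical belongs to the translation -->
      <note>zoological family</note>
    </cit>
  </sense>
</entry>
```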
-
Answers
- Anything that is valid TEI is OK.
- abbreviations: cit[@type="abbrev"]
- examples: options:
  - a) Even if particular to the translation, keep on the <sense> level.
    - a.1) <cit type="example"><quote xml:lang="SRCLANG" /><quote xml:lang="TGTLANG" /></cit>
    - a.2) <cit type="example"><quote xml:lang="SRCLANG" /><cit type="trans" xml:lang="TGTLANG"><quote xml:lang="TGTLANG" /></cit></cit>
    - (There may be several more quote elements.)
  - b) Add inside <cit type="trans">, next to the <quote> element.
    - Translation in the source language may be added, within a nested <cit type="trans">, like in a.2).
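Option a.2) filled in with concrete content might read as follows. The German-English pairing and the phrases are hypothetical and stand in for SRCLANG and TGTLANG.

```xml
<sense>
  <cit type="trans"><quote>bread</quote></cit>
  <!-- the example stays on the sense level, with its translation nested inside -->
  <cit type="example">
    <quote xml:lang="de">Brot backen</quote>
    <cit type="trans" xml:lang="en">
      <quote xml:lang="en">to bake bread</quote>
    </cit>
  </cit>
</sense>
```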
- A singulare tantum or plurale tantum is a noun that only occurs in singular or plural form, respectively.
- Q: How to encode?
  - Likely: <num>pl</num><subc>no sg</subc> (plurale tantum)
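Placed in a full entry, the likely encoding of a plurale tantum might look like this. The headword "Leute" and the pos value are illustrative assumptions.

```xml
<entry>
  <form><orth>Leute</orth></form>
  <gramGrp>
    <pos>n</pos>
    <num>pl</num>
    <!-- marks that no singular form exists -->
    <subc>no sg</subc>
  </gramGrp>
</entry>
```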