improves README
pruizf committed Apr 30, 2023
1 parent 6d1c7d0 commit 52cc0b0

Inspired by earlier literature (e.g. [Grobid](https://grobid.readthedocs.io/en/latest/Introduction/) among others), the tool uses Conditional Random Fields (CRF) as implemented in [sklearn-crfsuite](https://github.com/TeamHG-Memex/sklearn-crfsuite). Lexical and typographical cues present in OCR output, besides token coordinates on the page, are exploited to generate TEI elements.
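As an illustration only, a per-token feature extractor combining the three cue types mentioned above (lexical, typographical, positional) might look like the sketch below. The function name, the cue list, and the bounding-box convention are assumptions for this example, not the tool's actual code, which lives in `sklearn_crfsuite/features.py`:

```python
def token_features(token, bbox, page_width):
    """Hypothetical feature dict for one OCR token, in the format
    sklearn-crfsuite expects (one dict per token)."""
    x0, y0, x1, y1 = bbox  # assumed (left, top, right, bottom) pixel coordinates
    return {
        "lower": token.lower(),
        "is_upper": token.isupper(),          # typographical cue (e.g. speaker names)
        "is_title": token.istitle(),
        "ends_with_period": token.endswith("."),
        "rel_x": round(x0 / page_width, 2),   # positional cue: indentation on the page
        # lexical cue: illustrative act/scene keywords in German and French
        "is_act_keyword": token.lower() in {"akt", "aufzug", "acte"},
    }
```

A sequence of such dicts (one per token, one sequence per page or play) is what a CRF implementation like sklearn-crfsuite takes as training input.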

The tool was developed by Andrew Briand (University of Washington), in the context of work supervised by Pablo Ruiz within the [Methal](https://methal.pages.unistra.fr) project (University of Strasbourg); the project is creating a large TEI-encoded corpus of theater in Alsatian varieties.

# Application structure

- `example`: example input, XML output obtained with it and CRF model used to predict the output.
- `hocr2alto`: Scripts to convert between HOCR and ALTO formats.
- Usage is documented in the script
- Requires the [`ocr-fileformat`](https://github.com/UB-Mannheim/ocr-fileformat) package
- `sklearn_crfsuite`: The main program is in this directory, see [Generating TEI](#prediction) and [Training a model](#training) below for its usage.
```
python train.py html tei ../example/models/model-exp3-new.crf
```

The `exp3` infix in the model filename reflects the following: several feature combinations were implemented in the tool; the best-performing one was called `exp3`, and this model was trained with it, so we included `exp3` in the filename (output files are named manually).

# Postprocessing
# Postprocessing the output XML

Let's show this with an example. If you trained a model using the example command above and use it to predict TEI for `../example/input/hocr-verbotte-fahne`, your results should reproduce `../example/outputs/verbotte-fahne-exp3.xml`.

The prediction doesn't look bad, but you'll see it is not valid XML. This is because the model is designed to handle a play's body, from the start of the first act to the final curtain, but not the front matter and back matter that may precede and follow it. Since we did not remove the HOCR files for the front matter and back matter, the model attempted to generate TEI from them, which predictably produced errors. Once the portions generated from the front matter and back matter are removed, the file is valid XML. You can compare the file before and after by diffing `../example/outputs/verbotte-fahne-exp3.xml` against `../example/outputs/verbotte-fahne-exp3-postpro.xml`.

Instead of postprocessing the output XML by removing the front- and backmatter content, we could also remove the input HOCR files (or paragraphs if the body does not start and end on its own page) for such content before generating the XML output.
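A quick way to verify the postprocessing step is a well-formedness check with Python's standard library. This is a minimal sketch, not part of the tool; the function name is ours:

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the predicted TEI parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

Run it on the raw prediction and again after stripping the front- and backmatter portions; the check should fail before postprocessing and pass afterwards.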

# Adapting to other languages

The lexical cues used by the tool are currently suitable for Alsatian theater. Paratext in Alsatian theater is often in German and sometimes in French. Accordingly, lexical cues are now provided in Alsatian varieties, besides German and French.

The tool's lexical features (see [`sklearn_crfsuite/features.py`](./sklearn_crfsuite/features.py)) could be adapted to further languages. For training, a corpus of HOCR (or ALTO) plays and their corresponding TEI-encoded versions is needed (see [Training a model](#training) above).
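One plausible way to organize such an adaptation is to key the cue lists by language, so that supporting a new language only means adding an entry. The structure and the example cue words below are hypothetical, not the contents of `features.py`:

```python
# Hypothetical language-keyed lexical cues for stage-direction detection.
STAGE_DIRECTION_CUES = {
    "de": {"auftritt", "ab", "szene"},
    "fr": {"entre", "sort", "scène"},
    "it": {"entra", "esce", "scena"},  # a new language would be added like this
}

def lexical_cues(token: str, lang: str) -> dict:
    """Return the lexical-cue feature for a token in the given language."""
    cues = STAGE_DIRECTION_CUES.get(lang, set())
    return {"is_structure_cue": token.lower() in cues}
```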

# How to cite

