Pimp your OCR with structural information from METS.
OCR as delivered by commercial service providers (e.g. https://www.semantics.de/visual_library/) is usually completely independent of the structural data that is collected at considerable cost and represented in METS files. This tool aims to provide the missing connection. It was initially developed in the course of the DFG-funded project Digitale Sammlung Deutscher Kolonialismus. Development continues at the Saxon State and University Library Dresden.
To create a mapping between the manually collected structural information in a METS file and the corresponding OCR fulltext, the <mets:structMap TYPE="LOGICAL" /> is evaluated. The labels of the logical entities are mapped to the OCR fulltext using a moving window and Levenshtein distance as the measure of confidence.
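The matching idea described above can be illustrated with a minimal sketch. It is not tocrify's actual implementation: the function names `levenshtein` and `best_window`, the token-based window, and the length normalization are all assumptions made for illustration, and a pure-Python edit distance stands in for python-Levenshtein so that the example has no dependencies.

```python
def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, two rows)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # drop char_a
                               current[j - 1] + 1,       # drop char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def best_window(label, tokens):
    """Slide a window over OCR tokens and return (start, score) of the best match.

    The window is as wide as the label has tokens (an assumption of this
    sketch). The score is the edit distance divided by the longer string
    length, so 0.0 is a perfect match and 1.0 is no overlap at all.
    """
    if not tokens:
        return None, float("inf")
    label_tokens = label.split()
    size = max(len(label_tokens), 1)
    target = " ".join(label_tokens).lower()
    best_start, best_score = None, float("inf")
    for start in range(max(len(tokens) - size, 0) + 1):
        candidate = " ".join(tokens[start:start + size]).lower()
        distance = levenshtein(target, candidate)
        score = distance / max(len(target), len(candidate), 1)
        if score < best_score:
            best_start, best_score = start, score
    return best_start, best_score


# Hypothetical OCR output containing a recognition error ("Kapitcl").
ocr_tokens = "I. Einleitung Die Kolonien Zweites Kapitcl Handel und Verkehr".split()
start, score = best_window("Zweites Kapitel", ocr_tokens)
print(start, round(score, 3))
```

A low score indicates that the label was found with few OCR errors; a caller would typically reject matches above some threshold, whose value is a tuning decision not specified here.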
tocrify is implemented in Python 3. In the following, we assume a working Python 3 installation (tested versions: 3.5 and 3.6).
The first installation step is to clone the repository:
$ git clone https://github.com/deutschestextarchiv/tocrify.git
$ cd tocrify
Using virtualenv is highly recommended, although not strictly necessary for installing tocrify. It may be installed via:
$ [sudo] pip install virtualenv
Create a virtual environment in a subdirectory of your choice (e.g. env) using
$ virtualenv -p python3 env
and activate it:
$ . env/bin/activate
tocrify depends on python-Levenshtein. To build it, you may have to install the Python development library. For example, on an apt-based Linux distribution:
$ sudo apt install libpython3-dev
tocrify uses various third-party Python packages, which are best installed using pip:
(env) $ pip install -r requirements.txt
Finally, tocrify itself can be installed via pip:
(env) $ pip install .
tocrify comes with a help message explaining its usage:
(env) $ tocrify --help
Usage: tocrify [OPTIONS] METS

  METS: Input METS XML

Options:
  -o, --out-dir PATH              Existing directory for storing the updated OCR
                                  files [required]
  -O, --order-file FILENAME       Destination for file order information
  -m, --mapping FILENAME          METS to hOCR structural types mapping
  -l, --log-level [DEBUG|INFO|WARN|ERROR|OFF]
  --help                          Show this message and exit.
The METS file bundles all information on tocrify's input, namely the hOCR files, which have to be referenced in dedicated file groups (fileGrp[@USE='FULLTEXT HOCR']). With the parameter -o, the output directory for the updated hOCR files can be specified. If given the parameter -O, tocrify writes the physical order of the hOCR files (which is not necessarily equal to their alphanumeric order) to the specified destination.
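To check whether a METS file provides what tocrify expects, the relevant parts can be inspected with Python's standard library. The following is only a sketch under stated assumptions: the sample document, its file names and labels, and the helper names `hocr_hrefs` and `logical_labels` are hypothetical, and tocrify itself may read the file differently.

```python
import xml.etree.ElementTree as ET

# Standard METS and XLink namespace URIs.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Hypothetical, heavily reduced METS document for illustration only.
SAMPLE = """\
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="FULLTEXT HOCR">
      <mets:file ID="hocr_0001">
        <mets:FLocat LOCTYPE="URL" xlink:href="hocr/0001.html"/>
      </mets:file>
      <mets:file ID="hocr_0002">
        <mets:FLocat LOCTYPE="URL" xlink:href="hocr/0002.html"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="LOGICAL">
    <mets:div TYPE="monograph" LABEL="Sample title">
      <mets:div TYPE="chapter" LABEL="Einleitung"/>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""


def hocr_hrefs(root):
    """Return the locations of all files in the FULLTEXT HOCR file group."""
    xpath = "./mets:fileSec/mets:fileGrp[@USE='FULLTEXT HOCR']/mets:file/mets:FLocat"
    href = "{%s}href" % NS["xlink"]
    return [flocat.get(href) for flocat in root.findall(xpath, NS)]


def logical_labels(root):
    """Return (TYPE, LABEL) pairs of all labelled divs in the logical structMap."""
    xpath = "./mets:structMap[@TYPE='LOGICAL']//mets:div[@LABEL]"
    return [(div.get("TYPE"), div.get("LABEL")) for div in root.findall(xpath, NS)]


root = ET.fromstring(SAMPLE)
print(hocr_hrefs(root))
print(logical_labels(root))
```

If the first list is empty, the METS file has no file group with USE="FULLTEXT HOCR"; if the second is empty, there are no labels in the logical structure to map onto the OCR text.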
A sample invocation could look like:
(env) $ tocrify -o hocr_plus -O order.txt export_mets_hocr.xml
The name of this tool was proposed by @kba. Parts of the code for METS handling were inspired by metsrw.