This repository has been archived by the owner on Jul 24, 2024. It is now read-only.

Modify python word segmentation script so that it handles <div>s with @textParts #132

Open
emylonas opened this issue Jun 23, 2021 · 1 comment


emylonas commented Jun 23, 2021

The Python script that does word segmentation currently looks for //div[@subtype="transcription"]/p and applies the word segmentation rules to the text and element nodes inside that <p> element.

However, some inscriptions carry multiple texts, or have text on more than one part of the object. In these cases, the structure of the transcription div is as follows:

//div[@subtype="transcription"]/div[@type="textpart"]/p where there is more than one textpart.
For example, caes0509.xml:

<div type="edition" subtype="transcription" ana="b1">
    <div type="textpart" subtype="obverse">
        <p>βονόσου</p>
    </div>
    <div type="textpart" subtype="reverse">
        <p><foreign xml:lang="lat">Bonosu</foreign></p>
    </div>
</div>

Other examples: jeru0522.xml, mare0437
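
To make the two layouts concrete, here is a minimal sketch of how both could be located in one pass. It uses the standard library's xml.etree.ElementTree rather than whatever the actual script uses, ignores the TEI namespace that the real files presumably declare, and the function name find_transcription_paragraphs is hypothetical.

```python
import xml.etree.ElementTree as ET

# The caes0509.xml fragment quoted above, without any TEI namespace (an assumption).
SAMPLE = """<div type="edition" subtype="transcription" ana="b1">
    <div type="textpart" subtype="obverse">
        <p>βονόσου</p>
    </div>
    <div type="textpart" subtype="reverse">
        <p><foreign xml:lang="lat">Bonosu</foreign></p>
    </div>
</div>"""


def find_transcription_paragraphs(root):
    """Collect <p> elements from both layouts of the transcription div.

    Hypothetical helper: handles <p> directly under the transcription div
    and <p> nested one level down inside <div type="textpart">.
    """
    paragraphs = []
    # root.iter() also yields root itself, so this works whether the
    # transcription div is the root or a descendant.
    for div in root.iter("div"):
        if div.get("subtype") != "transcription":
            continue
        # Layout 1: div[@subtype="transcription"]/p
        paragraphs.extend(div.findall("p"))
        # Layout 2: div[@subtype="transcription"]/div[@type="textpart"]/p
        paragraphs.extend(div.findall("div[@type='textpart']/p"))
    return paragraphs


paragraphs = find_transcription_paragraphs(ET.fromstring(SAMPLE))
for p in paragraphs:
    print("".join(p.itertext()))
```

Note that the attribute comparison is case-sensitive, so a query for "textPart" would not match the type="textpart" value used in the file.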

The script currently locates and segments the contents of the <p> in the first textpart. It either converts or ignores any subsequent ones; in either case, only the first one is written to the segmented output.

The script should convert and output each of the textPart divs.
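
As a rough illustration of the requested behaviour, the sketch below segments every textpart and writes each one out under its own div. It is not the project's script: the whitespace split stands in for the real word segmentation rules, inline elements such as <foreign> are flattened to plain text, the TEI namespace is ignored, and the names segment_paragraph and segment_transcription are hypothetical. Keeping the textpart wrappers in the output is only one possible choice.

```python
import xml.etree.ElementTree as ET

# The caes0509.xml fragment from this issue, without any TEI namespace (an assumption).
SAMPLE = """<div type="edition" subtype="transcription" ana="b1">
    <div type="textpart" subtype="obverse"><p>βονόσου</p></div>
    <div type="textpart" subtype="reverse">
        <p><foreign xml:lang="lat">Bonosu</foreign></p>
    </div>
</div>"""


def segment_paragraph(p):
    """Return a new <p> whose whitespace-separated tokens are wrapped in <w>.

    Placeholder for the real segmentation rules; child markup is dropped.
    """
    new_p = ET.Element("p")
    words = "".join(p.itertext()).split()
    for i, word in enumerate(words):
        w = ET.SubElement(new_p, "w")
        w.text = word
        if i < len(words) - 1:
            w.tail = " "  # keep a space between adjacent <w> elements
    return new_p


def segment_transcription(div):
    """Build a transcription_segmented div covering every textpart."""
    out = ET.Element("div", {"type": "edition", "subtype": "transcription_segmented"})
    textparts = div.findall("div[@type='textpart']")
    if not textparts:
        # Simple layout: <p> elements sit directly under the transcription div.
        for p in div.findall("p"):
            out.append(segment_paragraph(p))
        return out
    for part in textparts:
        # Copy the textpart's attributes (type, subtype) onto the output div.
        new_part = ET.SubElement(out, "div", dict(part.attrib))
        for p in part.findall("p"):
            new_part.append(segment_paragraph(p))
    return out


result = segment_transcription(ET.fromstring(SAMPLE))
print(ET.tostring(result, encoding="unicode"))
```

Looping over all matches, rather than taking only the first one, is what ensures that the reverse text appears in the output alongside the obverse.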

Python script
folder with output files

Will add example output - current and desired

@atbradley
Collaborator

The current word segmenting script in https://github.com/atbradley/iip-texts/tree/atb-dev/scripts/word-segmentation does this with jeru0522:

<div type="edition" subtype="transcription_segmented"><p><w>Ἰοῦστος</w> <w>Χαλχιδηνός</w></p>
<p><w>Θεέννας</w></p>
</div>

Do we want to drop the <p> tags? Keep the <div type="textpart">s?

word_indexer.py currently drops the second <p>--I'm trying to work out why now.
