This repository has been archived by the owner on Jul 24, 2024. It is now read-only.

Create separate script to add IDs to the elements in the divs that have word segmentation #131

Open
emylonas opened this issue Jun 22, 2021 · 3 comments

@emylonas
Contributor

emylonas commented Jun 22, 2021

As part of the word segmentation process, we will have to correct or add some <w> markup by hand to inscriptions with complicated features. Also, in the future, inscriptions may be amended or corrected, so that the segmented <div> will change in order to mirror the changes in the transcription div. It would be very useful to be able to run a separate script to (re)generate the @xml:id attributes.

Files and Folders
The original word segmentation script that does this is here (see line 216):
https://github.com/lukehollis/iip-word-lists/blob/master/word_segmentation/word_segmentation.py

Folder that has files with word segmentation
I'm not sure this is worth copying, but this is how it's done now.

Input and Results
This new script should read in an inscription that has a <div type="edition" subtype="transcription_segmented">.
It should take the content of that <div> and add an @xml:id to each element in it. These elements are likely to be <w>, <num>, <orig>, and <g>.

The @xml:id should be in the form @xml:id="IIPID-001", where IIPID is the IIP number of the file (for example, beth0345; don't include the .xml extension), followed by the number of the element in sequence in the div.

Ex: <w xml:id="beth0100-004"> would be the 4th element in the div, for inscription beth0100.xml

Note that most inscriptions have names like this: caes0002.xml, but they can also appear in the form idum0003a.xml
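
To make the ID format concrete, here is a minimal Python sketch of how an ID could be built from a file name and a position. The function names `iip_id_from_filename` and `make_xml_id` are hypothetical, and the three-digit zero padding is an assumption read off the "IIPID-001" form above, not something the existing script is known to do.

```python
from pathlib import Path


def iip_id_from_filename(filename: str) -> str:
    """Return the IIP ID: the file name without its directory or .xml extension."""
    # Path.stem drops only the final suffix, so a trailing letter
    # (as in idum0003a.xml) stays part of the ID.
    return Path(filename).stem


def make_xml_id(filename: str, position: int, width: int = 3) -> str:
    """Build an ID such as 'beth0345-001' (the padding width is an assumption)."""
    return f"{iip_id_from_filename(filename)}-{position:0{width}d}"


if __name__ == "__main__":
    for name, pos in [("beth0345.xml", 1), ("idum0003a.xml", 12)]:
        print(make_xml_id(name, pos))
```

Keeping the width as a parameter leaves room to change the padding if inscriptions turn out to have more elements than three digits allow.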

If this script is written in XSLT, it will be easier to run in Oxygen. If, however, it is written in Python, then it can become part of a pipeline that is run on the command line.
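
If the Python route is taken, the core of the script might look roughly like the sketch below. It uses only the standard library's `xml.etree.ElementTree`. The names `add_ids` and `process_file` are hypothetical, and it rests on several assumptions: that the files use the standard TEI namespace, that only <w>, <num>, <orig> and <g> are numbered, and that the counter is padded to three digits. It is not the code from the existing word segmentation script.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

TEI_NS = "http://www.tei-c.org/ns/1.0"  # assumption: files use the TEI namespace
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # ElementTree key for xml:id
# Assumption: only these element types receive an ID.
TARGET_TAGS = {f"{{{TEI_NS}}}{name}" for name in ("w", "num", "orig", "g")}


def add_ids(root, iip_id):
    """(Re)number target elements in each segmented div; return how many were numbered."""
    count = 0
    for div in root.iter(f"{{{TEI_NS}}}div"):
        if (div.get("type") != "edition"
                or div.get("subtype") != "transcription_segmented"):
            continue
        # iter() walks descendants in document order, so nested elements
        # are numbered in the order they appear in the text.
        for elem in div.iter():
            if elem.tag in TARGET_TAGS:
                count += 1
                # set() overwrites any existing xml:id, which is what
                # makes the script safe to re-run after hand corrections.
                elem.set(XML_ID, f"{iip_id}-{count:03d}")
    return count


def process_file(path):
    """Rewrite one inscription file in place with freshly generated IDs."""
    ET.register_namespace("", TEI_NS)  # keep TEI as the default namespace on output
    tree = ET.parse(path)
    add_ids(tree.getroot(), Path(path).stem)
    tree.write(path, encoding="utf-8", xml_declaration=True)
```

One caveat with this approach: ElementTree's default parser discards comments and processing instructions, so a real pipeline may want a parser that preserves them before writing files back.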

@atbradley
Collaborator

There's a script here that does this. There's sample output here.

My thinking at this point is this can be part of a single command-line tool that handles all the NLP tasks.

@atbradley
Collaborator

There are now three separate scripts in my fork of iip-texts at https://github.com/atbradley/iip-texts/tree/atb-dev/scripts/word-segmentation. I've reorganized some code to make it easier to reuse these scripts as components of a larger tool.

@emylonas
Contributor Author

emylonas commented Aug 9, 2021

@emylonas needs to check this and then close.
