Create separate script to add IDs to the elements in the divs that have word segmentation #131

emylonas · 2021-06-22T23:50:41Z

As part of the word segmentation process, we will have to correct or add some <w> markup by hand in inscriptions with complicated features. Inscriptions may also be amended or corrected in the future, in which case the segmented <div> will change to mirror the changes in the transcription <div>. It would be very useful to be able to run a separate script to (re)generate the @xml:id attributes.

Files and Folders
The original word segmentation script that does this is here:
https://github.com/lukehollis/iip-word-lists/blob/master/word_segmentation/word_segmentation.py (line 216)

Folder that has files with word segmentation
I'm not sure this is worth copying, but this is how it's done now.

Input and Results
This new script should read in an inscription that has a <div type="edition" subtype="transcription_segmented">.
It should take the content of that <div> and add an @xml:id to each element in it. These elements are likely to be <w>, <num>, <orig> and <g>.

The @xml:id should take the form xml:id="IIPID-001", where IIPID is the IIP number of the file, e.g. beth0345 (without the .xml extension), followed by the sequence number of the element within the div.

Ex: <w xml:id="beth0010-004"> would be the 4th element in the div for inscription beth0010.xml.
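
To make the numbering rule concrete, here is a minimal Python sketch using only the standard library. It is not the existing script. The function name `add_ids` is hypothetical, and it assumes that the file uses the TEI namespace, that "each element in the div" means every descendant element in document order, and that the counter is zero-padded to three digits as in "IIPID-001". Existing IDs are overwritten, so it can be re-run after hand corrections.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# ElementTree writes this attribute back out as xml:id.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def add_ids(root, iip_id):
    """Set xml:id on every element inside the segmented div.

    Hypothetical helper: numbers all descendants of the div in document
    order (an assumption) and returns how many elements were numbered.
    """
    count = 0
    for div in root.iter(f"{{{TEI_NS}}}div"):
        if (div.get("type") == "edition"
                and div.get("subtype") == "transcription_segmented"):
            for elem in div.iter():
                if elem is div:
                    continue  # the div itself does not get an ID
                count += 1
                elem.set(XML_ID, f"{iip_id}-{count:03d}")
    return count


# Tiny made-up sample, not a real IIP file.
sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<div type="edition" subtype="transcription_segmented">'
    '<w>a</w><num>b</num><g/>'
    '</div></body></text></TEI>'
)

ET.register_namespace("", TEI_NS)  # keep TEI as the default namespace on output
root = ET.fromstring(sample)
numbered = add_ids(root, "beth0010")
print(numbered, ET.tostring(root, encoding="unicode"))
```

If wrapper elements such as <p> inside the div should not be numbered, the inner loop could be restricted to the tag names listed above.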

Note that most inscriptions have names like this: caes0002.xml, but they can also appear in the form idum0003a.xml
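
A small Python sketch of how the IIP ID might be derived from a file name. The helper name `iip_id_from_path` and the pattern (four lowercase letters, four digits, an optional trailing letter) are assumptions inferred only from the examples in this issue. Real file names may vary more than that, so the check may need loosening.

```python
import re
from pathlib import Path

# Assumed shape, based only on caes0002, beth0345 and idum0003a.
IIP_ID_PATTERN = re.compile(r"[a-z]{4}\d{4}[a-z]?")


def iip_id_from_path(path):
    """Return the file name without its extension, e.g. 'idum0003a'."""
    stem = Path(path).stem  # drops the directory part and the .xml extension
    if not IIP_ID_PATTERN.fullmatch(stem):
        raise ValueError(f"unexpected IIP file name: {path}")
    return stem


print(iip_id_from_path("caes0002.xml"))
print(iip_id_from_path("some-folder/idum0003a.xml"))
```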

If this script is written in XSLT, it will be easier to run in Oxygen. If, however, it is written in Python, it can become part of a pipeline that is run on the command line.


atbradley · 2021-07-02T17:02:29Z

There's a script here that does this. There's sample output here.

My thinking at this point is this can be part of a single command-line tool that handles all the NLP tasks.

atbradley · 2021-07-16T14:56:11Z

There are now three separate scripts in my fork of iip-texts at https://github.com/atbradley/iip-texts/tree/atb-dev/scripts/word-segmentation. I've reorganized some code to make it easier to reuse these scripts as components of a larger tool.

emylonas · 2021-08-09T22:23:07Z

@emylonas needs to check this and then close.

emylonas assigned atbradley Jun 22, 2021

emylonas self-assigned this Aug 9, 2021

emylonas added the word list label Jan 28, 2022

emylonas mentioned this issue Mar 29, 2022

merge Adam's word segmentation improvements into main branch Brown-University-Library/iip-texts#186

Open
