This repository has been archived by the owner on Jul 24, 2024. It is now read-only.

word segmentation and lemmatization should ignore <orig> when it's alone #119

Open
emylonas opened this issue Feb 26, 2021 · 2 comments

@emylonas
Contributor

When an <orig> element appears alone, and not as a child of <choice>, its content should be ignored. Characters inside that kind of <orig> are not words; they represent something that is not recognizable as a word.

Examples:
masa0836 has the string <orig>CB</orig>, which appears in the word list
masa0838 has the same problem
caes0062 has the strings <orig>C</orig> and <orig>DO</orig>, which seem not to be in the list
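
The rule described above could be sketched roughly as follows. This is only an illustration, not the project's actual segmentation code: it assumes the inscriptions are TEI/EpiDoc XML in the standard TEI namespace, and the function names `strip_standalone_orig` and `words` are hypothetical. It drops every <orig> whose parent is not <choice> before splitting the remaining text on whitespace; deciding between <orig> and <reg> inside a <choice> is left out of the sketch.

```python
import xml.etree.ElementTree as ET

# Assumption: the files use the standard TEI namespace.
TEI = "{http://www.tei-c.org/ns/1.0}"


def strip_standalone_orig(root):
    """Remove each <orig> whose parent is not <choice>, keeping the text that follows it."""
    for parent in list(root.iter()):
        if parent.tag == TEI + "choice":
            continue  # <orig> inside <choice> is left alone
        prev = None  # last sibling that was kept
        for child in list(parent):
            if child.tag == TEI + "orig":
                # Re-attach the text after the removed element so it is not lost.
                tail = child.tail or ""
                if prev is None:
                    parent.text = (parent.text or "") + tail
                else:
                    prev.tail = (prev.tail or "") + tail
                parent.remove(child)
            else:
                prev = child
    return root


def words(xml_string):
    """Naive whitespace segmentation after standalone <orig> elements are dropped."""
    root = strip_standalone_orig(ET.fromstring(xml_string))
    return "".join(root.itertext()).split()


# Hypothetical fragment, not taken from masa0836 or caes0062.
sample = (
    '<ab xmlns="http://www.tei-c.org/ns/1.0">Dis Manibus <orig>CB</orig> sacrum '
    "<choice><orig>vixsit</orig> <reg>vixit</reg></choice></ab>"
)
print(words(sample))
```

With this fragment, `CB` no longer shows up in the word list, while the <orig> inside <choice> is still there for later handling.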

@atbradley
Collaborator

Do we want these to have @xml:ids?

@emylonas
Contributor Author

emylonas commented Jul 6, 2021 via email
