Add Pāli Stemmer #197
@@ -0,0 +1,20 @@
externals ( stem )

stringescapes {}

stringdef a- '{U+0101}' // ā
stringdef i- '{U+012B}' // ī
stringdef u- '{U+016B}' // ū
stringdef m1 '{U+1E41}' // ṁ
stringdef m2 '{U+1E43}' // ṃ

define stem as (
    backwards (
        [substring] delete among (
            'm{a-}na' 'anta' 'onta' 'unta' 'enta' 'issa'
            'ati' '{a-}ti' 'ant' 'esi' '{a-}si' 'eti'
            'a{m1}' 'a{m2}' 'u{m1}' 'u{m2}' 'i{m1}' 'i{m2}'
            'i' '{i-}' 'a' '{a-}' 'o' 'u' '{u-}'
        )
This will produce an empty stem if the input is exactly one of these suffixes.

It can also produce very short stems (even a single character): there are 86 cases of single-character stems in your test vocabulary. I don't really know anything about Pali, so maybe some or even all of these are actually appropriate, but this is typically a sign of overstemming, since there are only a very limited number of single-character stems. Most of the existing stemmers define R1 and R2 regions based on vowels and non-vowels and limit suffix removal to one of these regions, which has proven to be a good approach in general. See https://snowballstem.org/texts/r1r2.html for more details.
    )
)
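As a rough illustration of the R1 suggestion in the comment above, the stemmer could be restructured along the following lines. This is only a sketch: the grouping `v`, the routine `mark_regions`, the integer `p1` and the extra `do` wrapper are assumptions modelled on the existing Snowball stemmers and are not part of this PR, and the vowel set would need checking by someone who knows the language.

```snowball
routines ( mark_regions R1 )

externals ( stem )

integers ( p1 )

groupings ( v )

stringescapes {}

stringdef a- '{U+0101}' // ā
stringdef i- '{U+012B}' // ī
stringdef u- '{U+016B}' // ū
stringdef m1 '{U+1E41}' // ṁ
stringdef m2 '{U+1E43}' // ṃ

// Assumed vowel set (not taken from the PR).
define v 'aeiou{a-}{i-}{u-}'

define mark_regions as (
    // By default R1 is empty: it starts at the end of the word.
    $p1 = limit
    // R1 starts after the first non-vowel that follows a vowel.
    do ( gopast v  gopast non-v  setmark p1 )
)

backwardmode (
    // Succeeds only if the cursor is inside R1.
    define R1 as $p1 <= cursor
)

define stem as (
    do mark_regions
    backwards (
        do (
            // The suffix is only deleted if it lies entirely within R1.
            [substring] R1 among (
                'm{a-}na' 'anta' 'onta' 'unta' 'enta' 'issa'
                'ati' '{a-}ti' 'ant' 'esi' '{a-}si' 'eti'
                'a{m1}' 'a{m2}' 'u{m1}' 'u{m2}' 'i{m1}' 'i{m2}'
                'i' '{i-}' 'a' '{a-}' 'o' 'u' '{u-}'
                (delete)
            )
        )
    )
)
```

With a guard like this, an input that is exactly one of the suffixes should be left unchanged rather than reduced to an empty stem, because the matched suffix would start at the beginning of the word, which is always before the start of R1. Whether R1 or R2 gives the better trade-off for this language would have to be checked against the test vocabulary.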
While this code is correct, I'd suggest removing the `delete` here and instead adding `(delete)` after line 17 below. Then it's clearer to a reader less familiar with the Snowball language that the deletion only happens when one of these suffixes matches. Framing it as an explicit action also more easily allows for different actions for different suffixes, if that is useful.
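
Applied to the file in this PR, the suggested restructuring would look roughly like this. It is a sketch of the reviewer's suggestion, not a tested change; the suffix list and declarations are copied from the diff unchanged, and only the position of `delete` moves.

```snowball
externals ( stem )

stringescapes {}

stringdef a- '{U+0101}' // ā
stringdef i- '{U+012B}' // ī
stringdef u- '{U+016B}' // ū
stringdef m1 '{U+1E41}' // ṁ
stringdef m2 '{U+1E43}' // ṃ

define stem as (
    backwards (
        [substring] among (
            'm{a-}na' 'anta' 'onta' 'unta' 'enta' 'issa'
            'ati' '{a-}ti' 'ant' 'esi' '{a-}si' 'eti'
            'a{m1}' 'a{m2}' 'u{m1}' 'u{m2}' 'i{m1}' 'i{m2}'
            'i' '{i-}' 'a' '{a-}' 'o' 'u' '{u-}'
            // Explicit action: runs only when one of the strings above matched.
            (delete)
        )
    )
)
```

Splitting the list into several groups, each followed by its own parenthesised action, is how the existing stemmers attach different behaviour to different suffixes.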