Add Pāli Stemmer #197

Closed
20 changes: 20 additions & 0 deletions algorithms/pali.sbl
@@ -0,0 +1,20 @@
externals ( stem )

stringescapes {}

stringdef a- '{U+0101}' // ā
stringdef i- '{U+012B}' // ī
stringdef u- '{U+016B}' // ū
stringdef m1 '{U+1E41}' // ṁ
stringdef m2 '{U+1E43}' // ṃ

define stem as (
backwards (
[substring] delete among (
Member

While this code is correct, I'd suggest removing the `delete` here and instead adding `(delete)` after line 17 below. Then it's clearer to a reader less familiar with the Snowball language that the deletion only happens when one of these suffixes matches. Framing it as an explicit action also makes it easier to apply different actions to different suffixes, should that become useful.
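
A minimal sketch of that rearrangement, assuming the suffix list and the `stringdef` lines from this file stay exactly as submitted. The only change is that `delete` moves out of the command in front of `among` and becomes the bracketed action attached to the suffix list:

```snowball
externals ( stem )

stringescapes {}

stringdef a- '{U+0101}' // ā
stringdef i- '{U+012B}' // ī
stringdef u- '{U+016B}' // ū
stringdef m1 '{U+1E41}' // ṁ
stringdef m2 '{U+1E43}' // ṃ

define stem as (
    backwards (
        // [substring] marks the longest matching suffix as the slice
        [substring] among (
            'm{a-}na' 'anta' 'onta' 'unta' 'enta' 'issa'
            'ati' '{a-}ti' 'ant' 'esi' '{a-}si' 'eti'
            'a{m1}' 'a{m2}' 'u{m1}' 'u{m2}' 'i{m1}' 'i{m2}'
            'i' '{i-}' 'a' '{a-}' 'o' 'u' '{u-}'
            // the action applies to every suffix listed above it
            (delete)
        )
    )
)
```

In an `among`, a bracketed action applies to all the strings listed since the previous action, so a group of suffixes that needs different handling can be given its own action later without restructuring the routine.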

'm{a-}na' 'anta' 'onta' 'unta' 'enta' 'issa'
'ati' '{a-}ti' 'ant' 'esi' '{a-}si' 'eti'
'a{m1}' 'a{m2}' 'u{m1}' 'u{m2}' 'i{m1}' 'i{m2}'
'i' '{i-}' 'a' '{a-}' 'o' 'u' '{u-}'
)
Member

This will produce an empty stem if the input is exactly one of these suffixes, for example:

$ echo anta|./stemwords -l pali

$

Both `anta` and `esi` are in your test vocabulary and suffer from this, but it would be undesirable even if it only affected strings that aren't real words in the language, since such strings can occur as proper nouns, words from other languages, typos, OCR errors, etc.

It can also produce very short stems (even a single character): there are 86 cases of single-character stems in your test vocabulary. I don't really know anything about Pali, so maybe some or even all of these are actually appropriate, but typically this is a sign of overstemming, since there are only a very limited number of single-character stems.

Most of the existing stemmers define R1 and R2 regions based on vowels and non-vowels, and limit suffix removal to one of these regions. This has proven to be a good approach in general; see https://snowballstem.org/texts/r1r2.html for more details.
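
A rough sketch of how an R1 restriction might be added here, modelled on the pattern used in existing stemmers. The names `v`, `p1`, `mark_regions` and `R1` follow the conventions of those stemmers and are not part of this PR. Treating a, ā, i, ī, u, ū, e and o as the vowel set is an assumption that someone who knows Pāli should check:

```snowball
externals ( stem )
integers ( p1 )
routines ( mark_regions R1 )
groupings ( v )

stringescapes {}

stringdef a- '{U+0101}' // ā
stringdef i- '{U+012B}' // ī
stringdef u- '{U+016B}' // ū
stringdef m1 '{U+1E41}' // ṁ
stringdef m2 '{U+1E43}' // ṃ

// Assumed vowel set; adjust if this is wrong for Pāli.
define v 'aeiou{a-}{i-}{u-}'

define mark_regions as (
    // Default: R1 is empty (it starts at the end of the word).
    $p1 = limit
    // R1 starts after the first non-vowel that follows a vowel.
    do ( gopast v  gopast non-v  setmark p1 )
)

backwardmode (
    // True when the cursor is inside R1.
    define R1 as $p1 <= cursor
)

define stem as (
    do mark_regions
    backwards (
        // Only delete when the whole matched suffix lies in R1.
        [substring] R1 among (
            'm{a-}na' 'anta' 'onta' 'unta' 'enta' 'issa'
            'ati' '{a-}ti' 'ant' 'esi' '{a-}si' 'eti'
            'a{m1}' 'a{m2}' 'u{m1}' 'u{m2}' 'i{m1}' 'i{m2}'
            'i' '{i-}' 'a' '{a-}' 'o' 'u' '{u-}'
            (delete)
        )
    )
)
```

With this definition, an input such as `anta` or `esi` should be left unchanged, because the matched suffix starts before R1 begins. Whether R1 is too strict or too lenient for Pāli would need checking against the test vocabulary.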

)
)
1 change: 1 addition & 0 deletions libstemmer/modules.txt
@@ -29,6 +29,7 @@
italian UTF_8,ISO_8859_1 italian,it,ita
lithuanian UTF_8 lithuanian,lt,lit
nepali UTF_8 nepali,ne,nep
norwegian UTF_8,ISO_8859_1 norwegian,no,nor
pali UTF_8 pali,pi,pli
portuguese UTF_8,ISO_8859_1 portuguese,pt,por
romanian UTF_8 romanian,ro,rum,ron
russian UTF_8,KOI8_R russian,ru,rus