Add Pāli Stemmer #197

Closed

khemarato

This is currently a draft PR for starting a discussion.

The test cases are defined in snowballstem/snowball-data#26

The current, super naive, implementation of just removing common suffixes achieves an admirable accuracy of 97.7% on the test set. See this gist for the failing cases.

Any feedback at all is appreciated.


define stem as (
    backwards (
        [substring] delete among (
Member

While this code is correct, I'd suggest removing the delete here and instead adding (delete) after line 17 below. Then it's clearer to a reader less familiar with the Snowball language that the deletion only happens when one of these suffixes matches. Framing it as an explicit action also more easily allows for different actions for different suffixes if that is useful.

            'ati' '{a-}ti' 'ant' 'esi' '{a-}si' 'eti'
            'a{m1}' 'a{m2}' 'u{m1}' 'u{m2}' 'i{m1}' 'i{m2}'
            'i' '{i-}' 'a' '{a-}' 'o' 'u' '{u-}'
        )
Member

This will produce an empty stem if the input is exactly one of these suffixes, for example:

$ echo anta|./stemwords -l pali

$

Both anta and esi are in your test vocabulary and suffer from this, but it would be undesirable even if it only affected strings which weren't real words in the language as such strings can occur as proper nouns, words from other languages, typos, OCR errors, etc.
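
To make the failure mode concrete, here is a minimal Python model of the draft rule (not the Snowball code itself). It assumes only the plain-ASCII suffixes from the list above, leaving out the `{a-}`, `{m1}` and similar stringdef escapes. `guarded_stem` and its `min_stem_len` threshold are a hypothetical guard for illustration, not something the PR implements.

```python
# Simplified model of the draft rule: delete the longest matching suffix.
# Assumption: only the plain-ASCII suffixes are listed here; the stringdef
# escapes ({a-}, {m1}, ...) from the real list are left out for brevity.
SUFFIXES = ["ati", "ant", "esi", "eti", "i", "a", "o", "u"]


def naive_stem(word: str) -> str:
    # Snowball's `among` selects the longest matching string,
    # so the candidates are tried longest first.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)]
    return word


def guarded_stem(word: str, min_stem_len: int = 2) -> str:
    # Hypothetical guard: leave the word unchanged if stripping the
    # suffix would leave fewer than `min_stem_len` characters.
    stem = naive_stem(word)
    return stem if len(stem) >= min_stem_len else word


if __name__ == "__main__":
    for w in ["esi", "gacchati"]:
        print(w, repr(naive_stem(w)), repr(guarded_stem(w)))
```

A fixed length threshold is the bluntest possible fix. It only shows that the empty-stem case is easy to detect.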

It can also produce very short stems (even a single character) - there are 86 cases of single character stems in your test vocabulary. I don't really know anything about Pali so maybe some or even all of these are actually appropriate, but typically it is a sign of overstemming since there are a very limited number of single character stems.

Most of the existing stemmers define R1 and R2 regions based on vowels and non-vowels, and limit suffix removal to one of these regions, which has proven to be a good approach in general. See https://snowballstem.org/texts/r1r2.html for more details.
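
As a rough illustration of the R1 idea, the Python sketch below computes R1 following the definition on the page linked above (the region after the first non-vowel that follows a vowel). It deletes a suffix only if the suffix lies entirely inside R1. The vowel set, the reduced ASCII suffix list and the function names are assumptions made for this example. A real Pāli stemmer would need to choose these carefully.

```python
# Assumed vowel set for illustration: a, ā, i, ī, u, ū, e, o.
# Escapes are used so the long vowels are single code points.
VOWELS = set("a\u0101i\u012bu\u016beo")

# Assumption: plain-ASCII subset of the suffix list from the draft.
SUFFIXES = ["ati", "ant", "esi", "eti", "i", "a", "o", "u"]


def r1_start(word: str) -> int:
    # R1 begins just after the first non-vowel that follows a vowel.
    # If there is no such non-vowel, R1 is the empty region at the end.
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def r1_stem(word: str) -> str:
    # Find the longest matching suffix, then delete it only if it
    # starts at or after the beginning of R1. If the longest match
    # falls outside R1, nothing is removed (shorter suffixes are not
    # retried), which models a single longest-match test followed
    # by a region check.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            cut = len(word) - len(suffix)
            return word[:cut] if cut >= r1_start(word) else word
    return word


if __name__ == "__main__":
    for w in ["esi", "gacchati"]:
        print(w, r1_start(w), r1_stem(w))
```

With this restriction a word that consists solely of a suffix, such as esi, is left unchanged, because the suffix starts before R1 does.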

Member

ojwb commented Sep 29, 2024

I'm going to close this - I'm certainly open to adding a Pāli algorithm, but the current state here needs much more work and progress seems to have stalled.

Also, Pāli is a rather unusual case in that the language is essentially fixed, so algorithmic stemming may not be the best approach. There's more discussion in snowballstem/snowball-data#26; in particular, snowballstem/snowball-data#26 (comment) lays out the pros and cons of a Snowball Pāli algorithm.

ojwb closed this on Sep 29, 2024