Add Pāli Stemmer #197

khemarato · 2024-04-28T09:50:33Z

This is currently a draft PR for starting a discussion.

The test cases are defined in snowballstem/snowball-data#26

The current, super naive, implementation of just removing common suffixes achieves an admirable accuracy of 97.7% on the test set. See this gist for the failing cases.

Any feedback at all is appreciated.

ojwb · 2024-05-01T00:16:21Z

algorithms/pali.sbl

+
+define stem as (
+  backwards (
+    [substring] delete among (


While this code is correct, I'd suggest removing the delete here and instead adding (delete) after line 17 below. Then it's clearer to a reader less familiar with the Snowball language that the deletion only happens when one of these suffixes matches. Framing it as an explicit action also more easily allows for different actions for different suffixes if that is useful.

ojwb · 2024-05-01T00:31:27Z

algorithms/pali.sbl

+        'ati' '{a-}ti' 'ant' 'esi' '{a-}si' 'eti'
+        'a{m1}' 'a{m2}' 'u{m1}' 'u{m2}' 'i{m1}' 'i{m2}'
+        'i' '{i-}' 'a' '{a-}' 'o' 'u' '{u-}'
+    )


This will produce an empty stem if the input is exactly one of these suffixes, for example:

$ echo anta|./stemwords -l pali

$

Both anta and esi are in your test vocabulary and suffer from this, but it would be undesirable even if it only affected strings that aren't real words in the language, since such strings can occur as proper nouns, words from other languages, typos, OCR errors, etc.

It can also produce very short stems (even a single character) - there are 86 cases of single character stems in your test vocabulary. I don't really know anything about Pali so maybe some or even all of these are actually appropriate, but typically it is a sign of overstemming since there are a very limited number of single character stems.
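To make the failure mode concrete outside Snowball, here is a minimal Python sketch of unconditional longest-suffix removal. It is not the PR's code: `SUFFIXES` is only the ASCII subset of the entries visible in the diff above, `naive_stem` is a hypothetical helper name, and the input strings are chosen purely for illustration.

```python
# Assumption: only the ASCII-only suffixes visible in the diff above;
# the real list in algorithms/pali.sbl is longer and uses stringdefs.
SUFFIXES = ["ati", "ant", "esi", "eti", "i", "a", "o", "u"]


def naive_stem(word: str) -> str:
    """Remove the longest matching suffix with no length or region check."""
    # Try longer suffixes first, mimicking a longest-match suffix search.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)]
    return word


# Illustrative inputs only.
for w in ["esi", "deti", "gacchati"]:
    print(w, "->", repr(naive_stem(w)))
```

With this subset, `esi` stems to the empty string and `deti` to the single character `d`, while a longer input such as `gacchati` becomes `gacch`.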

Most of the existing stemmers define R1 and R2 regions based on vowels and non-vowels, and limit suffix removal to one of these regions, which has proven to be a good approach in general. See https://snowballstem.org/texts/r1r2.html for more details.
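As a rough illustration of the R1 idea, the Python sketch below computes where R1 starts (just after the first non-vowel that follows a vowel, or at the end of the word if there is none) and removes a suffix only when it lies entirely inside R1. Everything here is an assumption for demonstration: the vowel set is plain ASCII (a real Pāli stemmer would need the long vowels too), `SUFFIXES` is the ASCII subset of the entries visible in the diff, and `r1_start` and `stem_in_r1` are hypothetical names.

```python
# Assumption: plain ASCII vowels only; Pāli long vowels are omitted here.
VOWELS = set("aeiou")

# Assumption: ASCII-only subset of the suffixes visible in the diff.
SUFFIXES = ["ati", "ant", "esi", "eti", "i", "a", "o", "u"]


def r1_start(word: str) -> int:
    """Index just after the first non-vowel that follows a vowel."""
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    # No such non-vowel: R1 is the empty region at the end of the word.
    return len(word)


def stem_in_r1(word: str) -> str:
    """Remove the longest matching suffix only if it lies wholly within R1."""
    p1 = r1_start(word)
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            if len(word) - len(suffix) >= p1:
                return word[: len(word) - len(suffix)]
            # Longest match falls outside R1: leave the word unchanged
            # rather than falling back to a shorter suffix.
            return word
    return word


for w in ["esi", "deti", "gacchati"]:
    print(w, "->", repr(stem_in_r1(w)))
```

Under these assumptions `esi` and `deti` are left unchanged, because their longest matching suffix starts before R1, while `gacchati` still stems to `gacch`. Whether R1, R2, or some other minimum-length rule suits Pāli is a separate question that the test vocabulary would need to answer.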

ojwb · 2024-09-29T21:28:46Z

I'm going to close this - I'm certainly open to adding a Pāli algorithm, but the current state here needs much more work and progress seems to have stalled.

Also, Pāli is a rather unusual case where the language is essentially fixed, so algorithmic stemming may not be the best approach. There's more discussion in snowballstem/snowball-data#26; in particular, snowballstem/snowball-data#26 (comment) lays out the pros and cons of a Snowball Pāli algorithm.

Add super naive pali stemmer

c89c561

ojwb reviewed May 1, 2024

ojwb closed this Sep 29, 2024
