Merge Czech stemmer #151
base: master
Commits on Sep 11, 2024
-
This has been on the web site since 2012, but never actually got included in the code distribution.
Commit 918e0cb
-
This helps avoid overstemming.
Co-authored-by: Jim O’Regan <[email protected]>
Commit 17d8527
-
Implement the "light" version of the stemmer
The "aggressive" version is known to overstem. According to the original paper, the aggressive version performs slightly better, but the difference isn't statistically significant and conflation from overstemming can be problematic.
Co-authored-by: Jim O’Regan <[email protected]>
Commit 18016f5
-
Improve comment about origin of algorithm
Co-authored-by: Jim O’Regan <[email protected]>
Commit fe8fe84
-
czech: Strip out unused "aggressive" code
Avoids snowball and C compiler warnings.
Commit d391357
-
czech: Remove -ům ending in do_case
The Java code removes this ending, but it was missing from the Snowball version. Looking at the changes resulting from this, it seems a clear improvement, so I've concluded it was an accidental omission.
See snowballstem#151
Commit 1ee3d2f
-
Add initial version of CzechStemmerLight.java
Temporary addition to allow easy comparison with Snowball implementation. As downloaded, except for comment and whitespace tweaks, plus addition of main() to allow testing.
Commit 17e83a1
-
Commit 5ef5479
-
The Java implementation removes če but has an incorrect comment saying it removes č. Comparing before and after on the test vocabulary, this is a clear improvement.
Commit 8c88ddf
-
czech: Change -čté/-šté to -čtí/-ští
The Java implementation removes the latter but has incorrect comments saying it removes the former. Changing the Snowball implementation makes no difference here (probably due to the oddness around when to remove a character vs calling do_palatalise), but changing Java to use the Snowball suffixes here leads to a clear regression, so adjust the Snowball implementation to match the Java implementation.
Commit ae49598
-
CzechStemmerLight: Remove one char for -es/-ém/-ím
This case was inconsistent with all the other cases where we call palatalise, as we remove the whole suffix here but leave the first character in every other case. Checking the vocabulary list, this means palatalise will almost never match one of the suffixes, as the only words with this as an ending in the list are the following, which look like they're actually English words (except "abies"): abies, cookies, hippies, series, studies. This means palatalise will just remove the last character, which seems odd. This change alters a lot of stems but seems to be an improvement in pretty much every instance I checked in Google Translate.
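To make the reasoning above concrete, here is a minimal Python sketch. The rule table is a small illustrative subset assumed for this example: only the endings -če, -čtí and -ští are taken from the commit messages in this PR, and their replacements are assumptions, so this is not the table from CzechStemmerLight.java. The behaviour taken from the description is the fallback, where palatalise removes the last character when no rule matches. The helper name `strip_case_suffix` and its `keep_first_char` flag are hypothetical.

```python
# Hypothetical, reduced rule table (ending -> replacement).
# The replacements are assumptions made for illustration only.
PALATALISE_RULES = {
    "če": "k",
    "čtí": "ck",
    "ští": "sk",
}


def palatalise(word):
    """Apply the first matching rule; otherwise drop the last character."""
    for ending, replacement in PALATALISE_RULES.items():
        if word.endswith(ending):
            return word[: -len(ending)] + replacement
    # Fallback described in the commit message: remove the last character.
    return word[:-1]


def strip_case_suffix(word, suffix, keep_first_char):
    """Remove a case suffix, then call palatalise on what is left.

    keep_first_char=False removes the whole suffix (the old -es/-ém/-ím
    behaviour); keep_first_char=True leaves the suffix's first character
    in place for palatalise to inspect (the behaviour of the other cases).
    """
    if not word.endswith(suffix):
        return word
    cut = len(suffix) - 1 if keep_first_char else len(suffix)
    return palatalise(word[: len(word) - cut])


old_style = strip_case_suffix("velkém", "ém", keep_first_char=False)
new_style = strip_case_suffix("velkém", "ém", keep_first_char=True)
```

In this sketch, removing the whole suffix first means the fallback in palatalise eats one more character of the stem, while leaving the first suffix character in place lets palatalise consume that character instead.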
Commit 67a5863
-
Fix handling of possessive removal
There are two issues here. One seems clearly unintentional: the cursor position from do_case wasn't reset. The other is that do_possessive was only called if do_case did something, which does not match the Java implementation. It seems likely this was not intended, and testing suggests it's not a helpful change.
Commit 7f2e797
-
Adjust palatalise to work like the Java version
For the test vocabulary, this results in 1877 merges of groups of stems (all seem reasonable), 427 splits (all seem unhelpful) and 300 reshufflings of stems between existing groups (all seem neutral). Overall this seems a very clear improvement, but we should see if we can address the splits.
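One way such a comparison could be computed is sketched below in Python. The definitions are assumptions, since the commit message does not say how the counts were produced: a "merge" is taken to be a new group that is exactly the union of several old groups, a "split" is an old group that is exactly the union of several new groups, and a "reshuffling" is any other changed group. The function names are hypothetical, and `old_stem` and `new_stem` stand for any two callables mapping a word to its stem.

```python
from collections import defaultdict


def stem_groups(vocabulary, stem):
    """Partition the vocabulary into sets of words sharing a stem."""
    by_stem = defaultdict(set)
    for word in vocabulary:
        by_stem[stem(word)].add(word)
    return {frozenset(words) for words in by_stem.values()}


def is_union_of(group, partition):
    """True if every part of the partition that touches group lies inside it."""
    return all(part <= group for part in partition if part & group)


def classify_changes(vocabulary, old_stem, new_stem):
    """Return (merges, splits, reshuffles) under the assumed definitions."""
    old = stem_groups(vocabulary, old_stem)
    new = stem_groups(vocabulary, new_stem)
    # New groups built from whole old groups.
    merges = [g for g in new - old if is_union_of(g, old)]
    # Old groups that fall apart into whole new groups.
    splits = [g for g in old - new if is_union_of(g, new)]
    # New groups that are just the pieces of a split are not counted again.
    split_parts = {part for g in splits for part in new if part <= g}
    reshuffles = [g for g in new - old
                  if g not in merges and g not in split_parts]
    return merges, splits, reshuffles
```

Calling `classify_changes(words, old_stemmer, new_stemmer)` on a vocabulary list gives three lists of word groups that can then be reviewed by hand, in the spirit of the "all seem reasonable" and "all seem unhelpful" judgements above.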
Commit ac70135
-
Commit 6fdd8fa
Commits on Oct 7, 2024
-
czech: Don't remove -os suffix
Testing seems to show this was never helpful and sometimes harmful.
Commit 401a2c9
Commits on Oct 8, 2024
-
-es seems to be a valid suffix (e.g. diabetes) but there seem to be more cases where it is harmful to remove. -ich seems to only be a suffix for two pronouns. -iho doesn't seem to be a valid suffix and removing it makes no difference on the test vocabulary.
Commit d3fbcd9
-
This is a valid Czech suffix and removing it seems beneficial (88 cases in the sample vocabulary, all seem to be improvements).
Commit baaa66d
-
czech: Use a better definition of R1
Use a definition of R1 more like the usual Snowball one, but take syllabic consonants 'l' and 'r' into account. It seems 'm' and 'n' can also be syllabic consonants but are much rarer so we ignore these for now at least. Testing suggests enforcing a minimum of 3 characters before R1 (like the Danish, Dutch and German stemmers do) helps so we do that here too. See snowballstem#151
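A rough Python sketch of an R1 computation along these lines follows. It is not the Snowball code from this PR. The usual Snowball R1 (the region after the first non-vowel that follows a vowel) and the 3-character minimum are as described above, but the test for a syllabic consonant is an assumption made for this sketch: 'l' or 'r' counts as a syllable nucleus when it is preceded by a consonant and not followed by a vowel. The names `is_syllabic`, `is_nucleus` and `r1_start` are hypothetical.

```python
VOWELS = set("aeiouyáéěíóúůý")


def is_syllabic(word, i):
    """Assumed rule: 'l'/'r' after a consonant and not before a vowel."""
    if i == 0 or word[i] not in "lr":
        return False
    prev_is_vowel = word[i - 1] in VOWELS
    next_is_vowel = i + 1 < len(word) and word[i + 1] in VOWELS
    return not prev_is_vowel and not next_is_vowel


def is_nucleus(word, i):
    return word[i] in VOWELS or is_syllabic(word, i)


def r1_start(word, min_prefix=3):
    """Return the index at which R1 begins (len(word) if R1 is empty)."""
    n = len(word)
    i = 0
    # Find the first vowel or syllabic consonant.
    while i < n and not is_nucleus(word, i):
        i += 1
    # Skip to the first non-vowel after it.
    while i < n and is_nucleus(word, i):
        i += 1
    # R1 starts after that non-vowel, or is empty if there is none.
    start = min(i + 1, n)
    # Enforce at least min_prefix characters before R1.
    return max(start, min(min_prefix, n))


word = "prstem"
r1 = word[r1_start(word):]
```

Under these assumptions the 'r' in "prstem" acts as the first nucleus, so R1 is "tem" and an ending such as -em falls inside it. With vowels alone, R1 for this word would be empty.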
Commit c2d63e9
-
We can just handle the first character specially - after that we know the previous character is a consonant because otherwise we'd have already stopped. See snowballstem#151
Commit fffc540
-
Commit 360d722
Commits on Oct 9, 2024
-
There seems no benefit from having a separate region we can remove possessive suffixes in. See snowballstem#151
Commit 4b23362
Commits on Oct 10, 2024
-
Commit bfccdb2