Merge Czech stemmer #151

Open: wants to merge 22 commits into base: master

Commits on Sep 11, 2024

  1. Merge Czech stemmer

    This has been on the web site since 2012, but never actually got
    included in the code distribution.
    ojwb committed Sep 11, 2024
    918e0cb
  2. Only apply do_case in R1

    This helps avoid overstemming.
    
    Co-authored-by: Jim O’Regan
    ojwb and jimregan committed Sep 11, 2024
    17d8527
  3. Implement the "light" version of the stemmer

    The "aggressive" version is known to overstem.  According to the
    original paper, the aggressive version performs slightly better, but
    the difference isn't statistically significant and conflation from
    overstemming can be problematic.
    
    Co-authored-by: Jim O’Regan
    ojwb and jimregan committed Sep 11, 2024
    18016f5
  4. Improve comment about origin of algorithm

    Co-authored-by: Jim O’Regan
    ojwb and jimregan committed Sep 11, 2024
    fe8fe84
  5. czech: Strip out unused "aggressive" code

    Avoids snowball and C compiler warnings.
    ojwb committed Sep 11, 2024
    d391357
  6. czech: Remove -ům ending in do_case

    The Java code removes this ending but it was missing from the Snowball
    version.  Looking at the changes resulting from this, it seems a clear
    improvement so I've concluded it was an accidental omission.
    
    See snowballstem#151
    ojwb committed Sep 11, 2024
    1ee3d2f
  7. Add initial version of CzechStemmerLight.java

    Temporary addition to allow easy comparison with Snowball
    implementation.
    
    As downloaded, except for comment and whitespace tweaks, plus addition
    of main() to allow testing.
    ojwb committed Sep 11, 2024
    17e83a1
  8. 5ef5479
  9. Change č suffix check to če

    The Java implementation removes če but has an incorrect comment
    saying it removes č.  Comparing before and after on the test vocabulary,
    this is a clear improvement.
    ojwb committed Sep 11, 2024
    8c88ddf
  10. czech: Change -čté/-šté to -čtí/-ští

    The Java implementation removes the latter but has incorrect comments
    saying it removes the former.
    
    Changing the Snowball implementation makes no difference here (probably
    due to the oddness around when to remove a character vs calling
    do_palatalise) but changing Java to use the Snowball suffixes here leads
    to a clear regression, so adjust the Snowball implementation to match
    the Java implementation.
    ojwb committed Sep 11, 2024
    ae49598
  11. CzechStemmerLight: Remove one char for -es/-ém/-ím

    This case was inconsistent with all the other cases where we call
    palatalise, as we remove the whole suffix here but leave the first
    character in every other case.
    
    Checking the vocabulary list, this means palatalise will almost never
    match one of the suffixes, as the only words with this as an ending in
    the list are these, which look like they're actually English words
    (except "abies"):
    
    abies
    cookies
    hippies
    series
    studies
    
    This means palatalise will just remove the last character, which seems
    odd.
    
    This changes a lot of stems but seems to be an improvement in
    pretty much every instance I checked in Google Translate.
    ojwb committed Sep 11, 2024
    67a5863
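The inconsistency described in the -es/-ém/-ím commit can be made concrete with a minimal Python sketch. This is not the PR's Snowball or Java code: the rule table `PALATALISE_RULES` is an abbreviated, illustrative assumption, and the names `strip_whole_suffix` and `strip_all_but_first` are hypothetical. Only the fallback (dropping the last character when no rule matches) is taken from the commit message.

```python
# Hypothetical, abbreviated rule table: (ending, replacement).
# The real stemmer's table is longer and may differ.
PALATALISE_RULES = [
    ("ci", "k"), ("ce", "k"), ("či", "k"), ("če", "k"),
    ("zi", "h"), ("ze", "h"), ("ži", "h"), ("že", "h"),
]


def palatalise(word: str) -> str:
    """Rewrite a palatalised ending if one matches, else drop one character."""
    for ending, replacement in PALATALISE_RULES:
        if word.endswith(ending):
            return word[: -len(ending)] + replacement
    # Fallback described in the commit message: just remove the last character.
    return word[:-1]


def strip_whole_suffix(word: str, suffix: str) -> str:
    """Earlier behaviour: remove the whole suffix, then call palatalise."""
    return palatalise(word[: len(word) - len(suffix)])


def strip_all_but_first(word: str, suffix: str) -> str:
    """Behaviour after the commit: keep the suffix's first character for palatalise."""
    return palatalise(word[: len(word) - len(suffix) + 1])


if __name__ == "__main__":
    print(strip_whole_suffix("dobrém", "ém"))   # fallback eats a stem character
    print(strip_all_but_first("dobrém", "ém"))  # fallback eats the kept "é"
```

With the whole suffix removed first, palatalise rarely sees one of its endings, so the fallback removes a character that belongs to the stem. Keeping the first suffix character gives the fallback something harmless to remove.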
  12. Fix handling of possessive removal

    There are two issues here:
    
    One seems clearly unintentional, which is that the cursor position from
    do_case wasn't reset.
    
    The other is that do_possessive was only called if do_case did something,
    which does not match the Java implementation.  It seems likely this
    was not intended, and testing suggests it's not a helpful change.
    ojwb committed Sep 11, 2024
    7f2e797
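The second issue in the possessive-removal commit (possessive removal running only when case removal matched) can be sketched in Python as follows. The suffix tuples are small hypothetical subsets chosen for illustration, and `stem_chained` / `stem_independent` are made-up names; none of this is the PR's code. The first issue, the stale cursor position, has no direct analogue here because `str.endswith` always tests at the end of the string.

```python
CASE_SUFFIXES = ("ech", "ům", "em", "y", "a")   # hypothetical subset
POSSESSIVE_SUFFIXES = ("ov", "ův", "in")        # hypothetical subset


def strip_one(word: str, suffixes: tuple) -> tuple:
    """Remove the longest matching suffix; return (new_word, matched)."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], True
    return word, False


def stem_chained(word: str) -> str:
    """Flow before the fix: possessive removal depends on case removal matching."""
    word, matched = strip_one(word, CASE_SUFFIXES)
    if matched:
        word, _ = strip_one(word, POSSESSIVE_SUFFIXES)
    return word


def stem_independent(word: str) -> str:
    """Flow after the fix: both steps always run, each from the end of the word."""
    word, _ = strip_one(word, CASE_SUFFIXES)
    word, _ = strip_one(word, POSSESSIVE_SUFFIXES)
    return word


if __name__ == "__main__":
    for w in ("otcův", "bratrovy"):
        print(w, stem_chained(w), stem_independent(w))
```

A word such as "otcův" has no case ending in this toy table, so the chained flow leaves it untouched while the independent flow still removes the possessive ending.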
  13. Adjust palatalise to work like the Java version

    For the test vocabulary, this results in 1877 merges of groups of
    stems (all seem reasonable), 427 splits (all seem unhelpful) and
    300 reshufflings of stems between existing groups (all seem
    neutral).
    
    Overall this seems a very clear improvement, but we should see if we can
    address the splits.
    ojwb committed Sep 11, 2024
    ac70135
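Counts like the merges, splits and reshufflings quoted in the palatalise commit can be produced by comparing how two stemmer versions partition the same vocabulary. The commit message does not say how its numbers were computed, so the definitions below are assumptions: a "merge" is a new group that is exactly the union of two or more old groups, a "split" is the reverse, and everything else that changed is counted as "other". The function names are hypothetical.

```python
from collections import defaultdict


def groups(stem, vocabulary):
    """Partition the vocabulary into frozensets of words sharing a stem."""
    by_stem = defaultdict(set)
    for word in vocabulary:
        by_stem[stem(word)].add(word)
    return {frozenset(g) for g in by_stem.values()}


def is_union_of(group, partition):
    """True if `group` is exactly covered by two or more whole groups of `partition`."""
    inside = [p for p in partition if p <= group]
    return len(inside) >= 2 and frozenset().union(*inside) == group


def compare(old_stem, new_stem, vocabulary):
    """Count merges, splits and other regroupings between two stemmers."""
    old, new = groups(old_stem, vocabulary), groups(new_stem, vocabulary)
    changed_old, changed_new = old - new, new - old
    merges = [g for g in changed_new if is_union_of(g, old)]
    splits = [g for g in changed_old if is_union_of(g, new)]
    split_parts = {p for g in splits for p in new if p <= g}
    other = [g for g in changed_new if g not in merges and g not in split_parts]
    return {"merges": len(merges), "splits": len(splits), "other": len(other)}


if __name__ == "__main__":
    # Toy stemmers expressed as lookup tables (invented data, not real output).
    old = {"kniha": "knih", "knihy": "knih", "knize": "kniz", "pes": "pes"}
    new = {"kniha": "knih", "knihy": "knih", "knize": "knih", "pes": "pes"}
    print(compare(old.get, new.get, old))
```

Listing the groups in each category, rather than only counting them, is what allows the kind of manual review the commit message describes.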
  14. 6fdd8fa

Commits on Oct 7, 2024

  1. czech: Don't remove -os suffix

    Testing seems to show this was never helpful and sometimes harmful.
    ojwb committed Oct 7, 2024
    401a2c9

Commits on Oct 8, 2024

  1. czech: Remove more suffixes

    -es seems to be a valid suffix (e.g. diabetes) but there seem to be
    more cases where it is harmful to remove.
    
    -ich seems to only be a suffix for two pronouns.
    
    -iho doesn't seem to be a valid suffix and removing it makes no
    difference on the test vocabulary.
    ojwb committed Oct 8, 2024
    d3fbcd9
  2. czech: Remove -ímu ('{i'}mu' in Snowball notation)

    This is a valid Czech suffix and removing it seems beneficial (88
    cases in the sample vocabulary, all seem to be improvements).
    ojwb committed Oct 8, 2024
    baaa66d
  3. czech: Use a better definition of R1

    Use a definition of R1 more like the usual Snowball one, but take
    syllabic consonants 'l' and 'r' into account.
    
    It seems 'm' and 'n' can also be syllabic consonants but are much
    rarer so we ignore these for now at least.
    
    Testing suggests enforcing a minimum of 3 characters before R1 (like
    the Danish, Dutch and German stemmers do) helps so we do that here
    too.
    
    See snowballstem#151
    ojwb committed Oct 8, 2024
    c2d63e9
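The R1 definition in the "better definition of R1" commit can be sketched roughly in Python. This is an illustration under stated assumptions, not the PR's Snowball code: it takes the usual Snowball rule (R1 starts after the first non-vowel that follows a vowel), assumes 'l' or 'r' counts as vowel-like when it follows a consonant and is not itself followed by a vowel, and then enforces at least 3 characters before R1. The vowel set, the exact syllabicity test and the names `is_nucleus` and `r1_start` are all assumptions. Input is assumed to be lowercase.

```python
VOWELS = set("aeiouyáéěíóúůý")
SYLLABIC = set("lr")  # 'm' and 'n' are ignored, as the commit message says


def is_nucleus(word: str, i: int) -> bool:
    """True if word[i] is a vowel, or an 'l'/'r' assumed to be syllabic."""
    ch = word[i]
    if ch in VOWELS:
        return True
    if ch in SYLLABIC and i > 0 and word[i - 1] not in VOWELS:
        # Assumption: syllabic after a consonant unless a vowel follows.
        return i + 1 == len(word) or word[i + 1] not in VOWELS
    return False


def r1_start(word: str, min_prefix: int = 3) -> int:
    """Index where R1 starts; len(word) means R1 is empty."""
    for i in range(len(word) - 1):
        if is_nucleus(word, i) and not is_nucleus(word, i + 1):
            # Keep at least `min_prefix` characters before R1.
            return min(max(i + 2, min_prefix), len(word))
    return len(word)


if __name__ == "__main__":
    for w in ("hradem", "vlkem", "oko"):
        print(w, repr(w[r1_start(w):]))
```

Under these assumptions "vlkem" gets a non-empty R1 because the 'l' acts as the first syllable nucleus, whereas a vowels-only rule would leave R1 empty. A suffix would then only be removed when it lies entirely within `word[r1_start(word):]`.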
  4. czech: Optimise R1 check

    We can just handle the first character specially - after that we
    know the previous character is a consonant because otherwise we'd
    have already stopped.
    
    See snowballstem#151
    ojwb committed Oct 8, 2024
    fffc540
  5. Improve comments

    ojwb committed Oct 8, 2024
    360d722

Commits on Oct 9, 2024

  1. czech: Use R1 instead of RV

    There seems to be no benefit from having a separate region in which we
    can remove possessive suffixes.
    
    See snowballstem#151
    ojwb committed Oct 9, 2024
    4b23362

Commits on Oct 10, 2024

  1. bfccdb2