Czech and Slovak algorithms. #149

gaboull · 2021-07-12T06:09:55Z

No description provided.

ojwb · 2021-08-31T03:38:49Z

libstemmer/modules.txt

@@ -32,6 +33,7 @@ portuguese      UTF_8,ISO_8859_1        portuguese,pt,por
 romanian        UTF_8,ISO_8859_2        romanian,ro,rum,ron
 russian         UTF_8,KOI8_R            russian,ru,rus
 serbian         UTF_8                   serbian,sr,srp
+slovak          UTF_8,ISO_8859_2        slovak,sk,svk


The 2 and 3 letter codes should be those specified by ISO 639:

https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes

So for Slovak that should be: slovak,sk,slk,slo (those for Czech above are also wrong but I've opened #151 to merge the Czech stemmer since I can write that one up for the website).
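As an illustration of the check described above, here is a minimal Python sketch that parses a `modules.txt`-style line and flags aliases that are neither the algorithm name nor an ISO 639 code. The helper names and the `EXPECTED_CODES` table are hypothetical and not part of the Snowball build. The table is hand-written from ISO 639-1 and ISO 639-2 (both T and B forms) and covers only the two languages discussed here.

```python
# Hypothetical checker for illustration only; not part of the Snowball build.
# Codes are hand-copied from ISO 639-1 and ISO 639-2 (T and B forms).
EXPECTED_CODES = {
    "slovak": {"sk", "slk", "slo"},
    "czech": {"cs", "ces", "cze"},
}


def parse_modules_line(line):
    """Split a modules.txt-style line into (algorithm, encodings, aliases)."""
    name, encodings, aliases = line.split()
    return name, encodings.split(","), aliases.split(",")


def unexpected_aliases(line):
    """Return aliases that are neither the algorithm name nor a known ISO 639 code."""
    name, _encodings, aliases = parse_modules_line(line)
    if name not in EXPECTED_CODES:
        # Nothing to compare against for languages outside the small table.
        return []
    allowed = EXPECTED_CODES[name] | {name}
    return [alias for alias in aliases if alias not in allowed]


# The line as proposed in this PR, then the line as suggested in review.
print(unexpected_aliases("slovak          UTF_8,ISO_8859_2        slovak,sk,svk"))
print(unexpected_aliases("slovak          UTF_8,ISO_8859_2        slovak,sk,slk,slo"))
```

The first call reports `svk` as unexpected; the second reports nothing.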

ojwb · 2021-08-31T04:30:00Z

The process for submitting a new stemmer is documented in CONTRIBUTING.rst. In particular, we need a test vocabulary added to snowball-data so there's test coverage, and a page added to the website with some background on the algorithm to aid future maintenance (if a bug is reported and all we have is the Snowball implementation, it can be hard to tell if it's an intentional design trade-off or an oversight).

The Czech stemmer is already on the website, so I know that it comes from a paper and who implemented it. I can easily fill that in, and I've created a test vocabulary from Wikipedia data (in #151).

I don't know any background to the Slovak algorithm here though.

jimregan · 2021-08-31T09:11:06Z

algorithms/slovak.sbl

+	)
+)
+
+define lower_case as repeat (


I'm not sure if this is necessary: the stemmers usually receive lower-cased input, no? And Unicode-aware case folding generally does the right thing with Slovak.
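To illustrate the point about case folding, here is a small Python sketch, assuming the caller lower-cases each word before handing it to a stemmer. The function name `normalise_for_stemming` is hypothetical; Python's built-in `str.lower()` is Unicode-aware and handles the Slovak letters with carons, acute accents, and the circumflex.

```python
def normalise_for_stemming(word):
    """Lower-case a word with Unicode-aware rules before stemming (hypothetical helper)."""
    return word.lower()


# Slovak capitals with diacritics map to their lower-case counterparts.
for word in ["ŽABA", "Ľúbosť", "KÔŇ", "MÄSO"]:
    print(word, "->", normalise_for_stemming(word))
```

If the input is already normalised this way, a `lower_case` routine inside the stemmer itself would be redundant.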

gaboull added 3 commits July 12, 2021 08:07

Czech and Slovak algorithms.

3bd57d1

[ADD] Better slovak stemmer.

4574c77

Slovak stemmer: some frequent exceptions and suffixes

02ef0dd

gaboull closed this by deleting the head repository Aug 31, 2023
