Czech and Slovak algorithms. #149

Closed
wants to merge 3 commits into from

@gaboull gaboull commented Jul 12, 2021

No description provided.

@@ -32,6 +33,7 @@ portuguese UTF_8,ISO_8859_1 portuguese,pt,por
romanian UTF_8,ISO_8859_2 romanian,ro,rum,ron
russian UTF_8,KOI8_R russian,ru,rus
serbian UTF_8 serbian,sr,srp
slovak UTF_8,ISO_8859_2 slovak,sk,svk
Member


The 2 and 3 letter codes should be those specified by ISO 639:

https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes

So for Slovak that should be: slovak,sk,slk,slo (those for Czech above are also wrong but I've opened #151 to merge the Czech stemmer since I can write that one up for the website).
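To illustrate the point about the codes, here is a minimal Python sketch that flags aliases which are not ISO 639 codes. It assumes the three whitespace-separated columns seen in the diff above (name, encodings, aliases); the `ISO_639` table and the `non_iso_aliases` helper are hypothetical and not part of Snowball.

```python
# Hypothetical lookup table: ISO 639-1 plus ISO 639-2 (T and B) codes.
ISO_639 = {
    "czech": {"cs", "ces", "cze"},
    "slovak": {"sk", "slk", "slo"},
}


def non_iso_aliases(line):
    """Return aliases that are neither the language name nor an ISO 639 code.

    Assumes a line of the form: <name> <encodings> <comma-separated aliases>.
    """
    name, _encodings, aliases = line.split()
    allowed = ISO_639[name] | {name}
    return [alias for alias in aliases.split(",") if alias not in allowed]


# The line as submitted in this PR: "svk" is not an ISO 639 code.
print(non_iso_aliases("slovak UTF_8,ISO_8859_2 slovak,sk,svk"))      # ['svk']
# The line with the suggested codes.
print(non_iso_aliases("slovak UTF_8,ISO_8859_2 slovak,sk,slk,slo"))  # []
```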

@ojwb
Member

ojwb commented Aug 31, 2021

The process for submitting a new stemmer is documented in CONTRIBUTING.rst. In particular, we need a test vocabulary added to snowball-data so there's test coverage, and a page added to the website with some background on the algorithm to aid future maintenance (if a bug is reported and all we have is the Snowball implementation, it can be hard to tell whether it's an intentional design trade-off or an oversight).

The Czech stemmer is already on the website, so I know that it comes from a paper and who implemented it. I can easily fill that in, and I've created a test vocabulary from Wikipedia data (in #151).

I don't know any background to the Slovak algorithm here, though.

)
)

define lower_case as repeat (
Contributor


I'm not sure this is necessary: the stemmers usually receive lower-cased input, no? And Unicode-aware case folding generally does the right thing with Slovak.
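As a rough illustration of the case-folding point, the Python sketch below lower-cases the Slovak letters that carry diacritics using the standard library's Unicode-aware string methods. Python is used only for illustration here; the `fold_case` helper is hypothetical and says nothing about how callers of the stemmer actually prepare their input.

```python
# Slovak upper-case letters with diacritics and their lower-case forms.
UPPER = "ÁÄČĎÉÍĹĽŇÓÔŔŠŤÚÝŽ"
LOWER = "áäčďéíĺľňóôŕšťúýž"


def fold_case(word):
    """Lower-case a word with Python's Unicode-aware str.lower()."""
    return word.lower()


# Each of these letters has a simple one-to-one lower-case mapping,
# so lower() and casefold() agree on them.
assert fold_case(UPPER) == LOWER
assert UPPER.casefold() == LOWER

print(fold_case("ŠTÚR"))  # štúr
```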

@gaboull gaboull closed this by deleting the head repository Aug 31, 2023