Commit
The website says about german2:

"In the sample German vocabulary of 35,000 words, the main stemmer and the variant stemmer exhibit about 90 differences. Of these about half are in words of foreign language origin (raphael, poesie etc). Of the native German words, about half seem to be improved by the variant stemming, and the other half made worse. In any case the differences are little more than one word per thousand among the native German words."

I did my own comparison of the output from german and german2 on snowball-data/german/voc.txt, which has 35033 entries (so it appears to be the same "sample German vocabulary of 35,000 words"; the only change to this file since it was added to version control in 2005 has been to convert it to UTF-8).

Comparing the results from stemming this with german and german2, the first interesting thing is that I get 77 different stems (rather than "about 90"). The algorithm has changed a little over time: an extra rule to handle "-nisse" was added in 2009, and a bug in handling "qu" was fixed so that the code matches the algorithm description. I undid these two algorithm changes and got 76 different stems. Maybe "90" was a typo for "80"? I don't have a better theory.

I also noticed a significant proportion of foreign words, as well as some proper nouns. Some cases definitely seem improved, and quite a few are just different: they change the stem for a word or group of words to a stem that isn't otherwise generated. Contrary to that quote from the website, however, I didn't spot any differences I would classify as clearly worse, though there are some changes that have both good and bad aspects.

An example is that german2 changes "Bluet" (Alemannic German word for "blood") to stem to "blut", which is the same stem as "Blut" (German word for "blood"), so that seems beneficial. The downside is that "Blüte" ("blossom") stems to "blut" with both the german and german2 algorithms, but this "Blut"/"Blüte" conflation is an already present minor problem, so overall I'd view the change to "Bluet" as neutral at worst.

Replacing umlauts with "e" suffixes is presumably much less common in newly created text than it once was, as modern computer systems generally don't have the limitations which motivated this, but there will still be large amounts of legacy text, so I think it makes sense to just replace german with german2.

Fixes #92
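A comparison like the one described above can be sketched in a few lines of Python. This is only an illustration, not the script actually used for the numbers in this message. It assumes the stems from each algorithm have already been written to two files, one stem per line and in the same order as the vocabulary file; the function names and the command-line file arguments are hypothetical.

```python
import sys


def read_lines(path):
    """Read a UTF-8 text file and return its lines without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def differing_stems(words, stems_a, stems_b):
    """Return (word, stem_a, stem_b) for each word whose two stems differ.

    All three sequences are assumed to be aligned, one entry per word.
    """
    if not (len(words) == len(stems_a) == len(stems_b)):
        raise ValueError("inputs must have one entry per vocabulary word")
    return [(w, a, b) for w, a, b in zip(words, stems_a, stems_b) if a != b]


# Hypothetical usage: compare.py voc.txt stems_german.txt stems_german2.txt
if __name__ == "__main__" and len(sys.argv) == 4:
    vocabulary, old_stems, new_stems = (read_lines(p) for p in sys.argv[1:4])
    diffs = differing_stems(vocabulary, old_stems, new_stems)
    for word, old, new in diffs:
        print(f"{word}\t{old}\t{new}")
    print(f"{len(diffs)} of {len(vocabulary)} words stem differently",
          file=sys.stderr)
```

Printing the word next to both stems makes it easier to review each difference by hand and judge whether it is an improvement, a regression, or merely a different but equivalent stem.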