Replace german with german2
The website says about german2:

  In the sample German vocabulary of 35,000 words, the main stemmer and
  the variant stemmer exhibit about 90 differences. Of these about half
  are in words of foreign language origin (raphael, poesie etc). Of the
  native German words, about half seem to be improved by the variant
  stemming, and the other half made worse. In any case the differences
  are little more than one word per thousand among the native German
  words.

I did my own comparison of the output from german and german2 on
snowball-data/german/voc.txt which has 35033 entries (so appears to be
the same "sample German vocabulary of 35,000 words"; also the only
change to this file since it was added to version control in 2005 has
been to convert it to UTF-8).

Comparing the results from stemming this with german and german2, the
first interesting thing is I get 77 different stems (rather than "about
90"). The algorithm has changed a little over time - there was an extra
rule to handle "-nisse" in 2009 and a fix for a bug handling "qu" so that
it matches the algorithm description. I undid these two algorithm
changes and got 76 different stems. Maybe "90" was a typo for "80"? I
don't have a better theory.
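
As an illustration of the kind of comparison described above, here is a
minimal Python sketch. The helper name `differing_stems` and the two toy
stemmers are hypothetical stand-ins, not the real german and german2
implementations; in an actual run the two callables would wrap the compiled
stemmers and the word list would be read from snowball-data/german/voc.txt.

```python
from typing import Callable, Iterable, List, Tuple


def differing_stems(
    words: Iterable[str],
    stem_a: Callable[[str], str],
    stem_b: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Return (word, stem_a(word), stem_b(word)) for each word whose stems differ."""
    diffs = []
    for word in words:
        a, b = stem_a(word), stem_b(word)
        if a != b:
            diffs.append((word, a, b))
    return diffs


# Toy stand-ins (assumptions for illustration only, not real stemmers).
def toy_a(word: str) -> str:
    # Strip any trailing "e" characters.
    return word.rstrip("e")


def toy_b(word: str) -> str:
    # Rewrite "ue" to "u" first, then strip any trailing "e" characters.
    return word.replace("ue", "u").rstrip("e")


if __name__ == "__main__":
    for word, a, b in differing_stems(["bluet", "blut", "blume"], toy_a, toy_b):
        print(word, a, b)
```

The length of the returned list is the "number of different stems" figure
discussed here.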

I also noticed a significant proportion of foreign words, as well as
some proper nouns. Some cases definitely seem improved, and quite a few
are neither better nor worse: they just move a word or group of words to
a stem that isn't otherwise generated. Contrary to that quote from the
website, however, I didn't spot any differences I would classify as
clearly worse, though some changes have both good and bad aspects.

An example is that german2 changes "Bluet" (Alemannic German word for
"blood") to stem to "blut" which is the same stem as "Blut" (German word
for "blood"), so that seems beneficial. The downside is that "Blüte"
("blossom") stems to "blut" with both the german and german2 algorithms,
but this "Blut"/"Blüte" conflation is an already present minor problem
so I think overall I'd view the change to "Bluet" as neutral at worst.

The replacing of umlauts with "e" suffixes is presumably much less
common in newly created text than it once was, as modern computer systems
generally don't have the limitations which motivated this, but there
will still be large amounts of legacy text, so I think it makes sense to
just replace german with german2.
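
For readers who don't read Snowball, the following Python sketch
approximates the new prelude from the algorithms/german.sbl change in this
commit. It is an illustration under assumptions, not the generated stemmer:
the function names are made up, input is assumed to be lower case already,
and only the prelude is modelled (the later steps of the stemmer are not).

```python
VOWELS = set("aeiouyäöü")


def mark_u_y(word: str) -> str:
    """Upper-case 'u' or 'y' between two vowels so later rules treat it as a consonant."""
    chars = list(word)
    i = 0
    while i + 2 < len(chars):
        if chars[i] in VOWELS and chars[i + 1] in "uy" and chars[i + 2] in VOWELS:
            chars[i + 1] = chars[i + 1].upper()
            i += 3  # continue after the second vowel, like the Snowball cursor
        else:
            i += 1
    return "".join(chars)


def expand_digraphs(word: str) -> str:
    """Map ß to ss and ae/oe/ue to ä/ö/ü, leaving 'qu' untouched."""
    out = []
    i = 0
    while i < len(word):
        if word[i] == "ß":
            out.append("ss")
            i += 1
        elif word.startswith("ae", i):
            out.append("ä")
            i += 2
        elif word.startswith("oe", i):
            out.append("ö")
            i += 2
        elif word.startswith("ue", i):
            out.append("ü")
            i += 2
        elif word.startswith("qu", i):
            out.append("qu")  # skip past "qu" so "que" is not read as q + "ue"
            i += 2
        else:
            out.append(word[i])
            i += 1
    return "".join(out)


def prelude(word: str) -> str:
    # Marking comes first, so the "u" in "bauer" is protected before "ue" is examined.
    return expand_digraphs(mark_u_y(word))


if __name__ == "__main__":
    for w in ["bluet", "bauer", "quelle", "straße", "poesie"]:
        print(w, prelude(w))
```

In this sketch "bluet" becomes "blüt", while "bauer" and "quelle" keep their
"ue" letter pairs; foreign words such as "poesie" are rewritten as well,
which matches the kind of difference the website quote mentions.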

Fixes #92
ojwb committed Sep 22, 2023
1 parent 634c8a9 commit b08bdc5
Showing 4 changed files with 14 additions and 156 deletions.
2 changes: 1 addition & 1 deletion GNUmakefile
@@ -65,7 +65,7 @@ tarball_ext = .tar.gz
 # * KOI8_R_algorithms
 include algorithms.mk

-other_algorithms = german2 kraaij_pohlmann lovins
+other_algorithms = kraaij_pohlmann lovins

 all_algorithms = $(libstemmer_algorithms) $(other_algorithms)

20 changes: 13 additions & 7 deletions algorithms/german.sbl
@@ -32,16 +32,22 @@ define st_ending s_ending - 'r'

 define prelude as (

-    test repeat (
-        (
-            ['{ss}'] <- 'ss'
-        ) or next
-    )
-
-    repeat goto (
+    test repeat goto (
         v [('u'] v <- 'U') or
            ('y'] v <- 'Y')
     )
+
+    repeat (
+        [substring] among(
+            '{ss}' (<- 'ss')
+            'ae'   (<- '{a"}')
+            'oe'   (<- '{o"}')
+            'ue'   (<- '{u"}')
+            'qu'   ()
+            ''     (next)
+        )
+    )
+
 )

 define mark_regions as (
145 changes: 0 additions & 145 deletions algorithms/german2.sbl

This file was deleted.

3 changes: 0 additions & 3 deletions libstemmer/modules.txt
@@ -52,9 +52,6 @@ porter UTF_8,ISO_8859_1 porter english
 # intended for general use, and use of them is is not fully supported. These
 # algorithms are:
 #
-# german2 - This is a slight modification of the german stemmer.
-#german2 UTF_8,ISO_8859_1 german2 german
-#
 # kraaij_pohlmann - This is a different dutch stemmer.
 #kraaij_pohlmann UTF_8,ISO_8859_1 kraaij_pohlmann dutch
 #
