
Add a compound splitting strategy to improve on affix decomposition #141

Open
juanjoDiaz opened this issue Aug 8, 2024 · 11 comments
Labels: enhancement (New feature or request)

@juanjoDiaz
Collaborator

Hi @adbar,

Recently I started noticing that some inflected words are not correctly lemmatized.
However, when adding German to the list of languages processed by the affix decomposition strategy, most of these cases are solved.

Here are some examples:

| Word | Real Lemma | Simplemma Lemma | Simplemma with affix strategy for German |
| --- | --- | --- | --- |
| Motorschütz | Motorschütz | Motorschütz ✅ | Motorschütz ✅ |
| Motorschütze | Motorschütz | Motorschütze ❌ | Motorschütz ✅ |
| Motorschützes | Motorschütz | Motorschützes ❌ | Motorschütz ✅ |
| Motorschützen | Motorschütz | Motorschützen ❌ | Motorschützen ❌ |
| Motorschützüberwachung | Motorschützüberwachung | Motorschützüberwachung ✅ | Motorschützüberwachung ✅ |
| Motorschützüberwachungen | Motorschützüberwachung | Motorschützüberwachung ✅ | Motorschützüberwachung ✅ |
| Distanzstück | Distanzstück | Distanzstück ✅ | Distanzstück ✅ |
| Distanzstücke | Distanzstück | Distanzstücke ❌ | Distanzstück ✅ |
| Distanzstücks | Distanzstück | Distanzstücks ❌ | Distanzstück ✅ |
| Distanzstücken | Distanzstück | Distanzstücken ❌ | Distanzstück ✅ |
| Durchgangsprüfung | Durchgangsprüfung | Durchgangsprüfung ✅ | Durchgangsprüfung ✅ |
| Durchgangsprüfungen | Durchgangsprüfung | Durchgangsprüfung ✅ | Durchgangsprüfung ✅ |
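
For reference, a minimal sketch of how the "Simplemma Lemma" column can be reproduced with simplemma's `lemmatize()` function (the affix column required a local edit adding "de" to the internal AFFIX_LANGS constant, not shown here):

```python
from simplemma import lemmatize

# Inflected compounds from the table above; without the affix strategy
# several of these come back unchanged instead of reduced to their lemma.
words = [
    "Motorschütze", "Motorschützes", "Motorschützen",
    "Distanzstücke", "Distanzstücks", "Distanzstücken",
]
for word in words:
    print(word, "->", lemmatize(word, lang="de"))
```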

Adding the affix strategy for German does increase the execution time a bit, but it doesn't change the precision numbers when running the evaluation script against the latest UD treebanks.
But, to be honest, even removing all the rules and only keeping the dictionary lookup barely changes the evaluation results 😅

So, the questions are:

- Why is German not included in the affix search?
- Should we include it?

@adbar
Owner

adbar commented Aug 8, 2024

These are just German examples, so the affix search works. I wouldn't add it for other languages, or maybe I don't understand the question?

Depending on the language, the UD data does not include a lot of variation; real-world scenarios are different, and UD is just a way to run tests.
Languages like German or Finnish will have many more words outside of the dictionary than French or Spanish, for example.

@juanjoDiaz
Collaborator Author

My questions are:

- Why is German not included in the affix search by default, as some other languages are?
- Should we include it?

@adbar
Owner

adbar commented Aug 8, 2024

I see! There are already rules for German, I guess affixes are not included because it would harm precision but I'll check again.

adbar added the question (Further information is requested) label Aug 8, 2024
@adbar
Owner

adbar commented Aug 8, 2024

Performance is degraded for German with non-greedy searches, and the affixes don't improve the greedy mode. They also slow things down a bit. So I would be against it and in favor of adding more rules if necessary.

@juanjoDiaz
Collaborator Author

What dataset are you using to measure performance?
I measured with the evaluation script and the latest UD treebanks as defined in the README, but the results were exactly the same whether affix was used or not.

What kind of rules do you think would help with my examples?

@adbar
Owner

adbar commented Aug 8, 2024

I see a difference when I add "de" to AFFIX_LANGS and lemmatize with greedy=False; the rest is the same.

I checked again, your examples rather hint at a failing decomposition into subwords. "Distanz" and "Stück" are in the language data, thus it should be possible to break the token apart and find the right ending. I'm not sure why it happens here.
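
For instance, both parts can be checked against the language data with simplemma's `is_known()` helper:

```python
from simplemma import is_known

# Both compound parts are present in the German language data,
# so a correct decomposition of e.g. "Distanzstücke" should be reachable.
print(is_known("Distanz", lang="de"))  # True
print(is_known("Stück", lang="de"))    # True
```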

@adbar
Owner

adbar commented Aug 8, 2024

My best guess is that it's just because the decomposition strategy is missing; there can be several ways to break words apart. That being said, it could be worth implementing this (a sketch follows the list):

  1. Start from the end until a valid subword is found
  2. See if the other part of the token is in the dictionary
  3. Apply the lemmatization to the identified subword at the end
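
A minimal sketch of these three steps, with a hypothetical `lemmatize_word` callback standing in for a dictionary-backed lookup that returns `None` for unknown tokens:

```python
from typing import Callable, Optional

def decompose(token: str,
              lemmatize_word: Callable[[str], Optional[str]]) -> Optional[str]:
    # 1. Start from the end until a valid subword is found.
    for split in range(len(token) - 1, 0, -1):
        # Capitalizing is a German-specific heuristic: compound tails are
        # lowercase inside the token but capitalized in the dictionary.
        subword_lemma = lemmatize_word(token[split:].capitalize())
        if subword_lemma is None:
            continue
        # 2. See if the other part of the token is in the dictionary.
        head = token[:split]
        if lemmatize_word(head) is None:
            continue
        # 3. Apply the lemmatization to the identified subword at the end.
        return head + subword_lemma.lower()
    return None
```

With a lookup that knows "Distanz" and "Stück", `decompose("Distanzstücke", ...)` would yield "Distanzstück".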

@juanjoDiaz
Collaborator Author

Isn't that how the affix decomposition strategy already works?

The problem is that the strategy is not applied to German.
I could simply solve this by adding German to the list of languages that use the affix decomposition strategy.

The question is also: why not enable the strategy for all languages?

@adbar
Owner

adbar commented Aug 12, 2024

Because it degrades performance on the benchmark, at least in my experiments. Currently the script only evaluates accuracy; it could look different with an f-score, but I'm not sure.

In German, the morphology of compound words can be complex, with linking elements of up to 3 characters (often 0 or 1) between the parts of a compound, e.g. the "s" in Durchgang+s+prüfung. I guess it's even trickier for other morphologically rich languages. So the approach used for affixes reaches its limits there and does not guarantee error-free decomposition.

My best guess is that the method needed to solve your cases is a compound splitting strategy. It is not explicitly included in Simplemma (only indirectly, through the affix decomposition), but it would be a nice addition to strategies/.

It would be the same idea as for the affixes, but with a further search until two or more valid parts (i.e. dictionary words) are found. We would need to specify how many characters are allowed between the components; I suggest determining that empirically, by testing. A rough sketch of the idea:
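
The following is a minimal recursive sketch under stated assumptions: the `in_dictionary` callback and the `MAX_LINKING_CHARS` constant are hypothetical, and a real implementation under strategies/ would also need to handle casing and choose among competing splits:

```python
from typing import Callable, List, Optional

MAX_LINKING_CHARS = 3  # upper bound between components, to be tuned empirically

def split_compound(token: str,
                   in_dictionary: Callable[[str], bool]) -> Optional[List[str]]:
    word = token.lower()  # casing is ignored in this sketch
    if in_dictionary(word):
        return [word]
    # Look for the longest valid leading part first.
    for split in range(len(word) - 1, 0, -1):
        head = word[:split]
        if not in_dictionary(head):
            continue
        # Allow 0 to MAX_LINKING_CHARS linking characters after the head,
        # e.g. the "s" in "durchgang(s)prüfung".
        for skip in range(MAX_LINKING_CHARS + 1):
            rest = word[split + skip:]
            if not rest:
                break
            parts = split_compound(rest, in_dictionary)
            if parts is not None:
                return [head] + parts
    return None
```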

@juanjoDiaz
Collaborator Author

I see.
How could we try such a new strategy if these compound words are not present in the UD treebanks?
(If they were, the precision would improve when adding the affix search.)

@adbar
Owner

adbar commented Aug 12, 2024

Yes, evaluating performance on rare words is an issue. We can implement the additional strategy and let the users decide if they want to use it.

adbar changed the title from "Why does the affix decomposition strategy only work for certain languages?" to "Add a compound splitting strategy to improve on affix decomposition" Aug 12, 2024
adbar added the enhancement (New feature or request) label and removed the question (Further information is requested) label Aug 12, 2024