
Add a compound splitting strategy to improve on affix decomposition #141

Open
juanjoDiaz opened this issue Aug 8, 2024 · 11 comments
Labels: enhancement (New feature or request)

@juanjoDiaz
Collaborator

Hi @adbar,

Recently I started noticing that some inflected words are not correctly lemmatized.
However, when adding German to the list of languages processed by the affix decomposition strategy, most of these cases are solved.

Here are some examples:

| Word | Real Lemma | Simplemma Lemma | Simplemma with affix strategy for German |
| --- | --- | --- | --- |
| Motorschütz | Motorschütz | Motorschütz ✅ | Motorschütz ✅ |
| Motorschütze | Motorschütz | Motorschütze ❌ | Motorschütz ✅ |
| Motorschützes | Motorschütz | Motorschützes ❌ | Motorschütz ✅ |
| Motorschützen | Motorschütz | Motorschützen ❌ | Motorschützen ❌ |
| Motorschützüberwachung | Motorschützüberwachung | Motorschützüberwachung ✅ | Motorschützüberwachung ✅ |
| Motorschützüberwachungen | Motorschützüberwachung | Motorschützüberwachung ✅ | Motorschützüberwachung ✅ |
| Distanzstück | Distanzstück | Distanzstück ✅ | Distanzstück ✅ |
| Distanzstücke | Distanzstück | Distanzstücke ❌ | Distanzstück ✅ |
| Distanzstücks | Distanzstück | Distanzstücks ❌ | Distanzstück ✅ |
| Distanzstücken | Distanzstück | Distanzstücken ❌ | Distanzstück ✅ |
| Durchgangsprüfung | Durchgangsprüfung | Durchgangsprüfung ✅ | Durchgangsprüfung ✅ |
| Durchgangsprüfungen | Durchgangsprüfung | Durchgangsprüfung ✅ | Durchgangsprüfung ✅ |
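
For reference, a minimal sketch of how the "Simplemma Lemma" column can be reproduced with simplemma's `lemmatize()` function (the affix column required a local edit adding "de" to the internal AFFIX_LANGS constant, not shown here):

```python
from simplemma import lemmatize

# Inflected compounds from the table above; without the affix strategy
# several of these come back unchanged instead of reduced to their lemma.
words = [
    "Motorschütze", "Motorschützes", "Motorschützen",
    "Distanzstücke", "Distanzstücks", "Distanzstücken",
]
for word in words:
    print(word, "->", lemmatize(word, lang="de"))
```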

Adding the affix strategy for German does increase the execution time a bit, but it doesn't change the precision numbers when running the evaluation script against the latest UD treebanks.
But, to be honest, even removing all the rules and only keeping the dictionary lookup barely changes the evaluation results 😅

So, the questions are:

- Why is German not included in the affix search?
- Should we include it?

@adbar
Owner

adbar commented Aug 8, 2024

These are just German examples, so the affix search works. I wouldn't add it for other languages, or maybe I don't understand the question?

Depending on the language, the UD data does not include a lot of variation; real-world scenarios are different, and UD is just a way to run tests.
Languages like German or Finnish will have many more words outside of the dictionary than French or Spanish, for example.

@juanjoDiaz
Collaborator Author

My questions are:

- Why is German not included in the affix search by default, as some other languages are?
- Should we include it?

@adbar
Owner

adbar commented Aug 8, 2024

I see! There are already rules for German, I guess affixes are not included because it would harm precision but I'll check again.

adbar added the question (Further information is requested) label Aug 8, 2024
@adbar
Owner

adbar commented Aug 8, 2024

Performance is degraded for German with non-greedy searches, and the affixes don't improve the greedy mode. They also slow things down a bit. So I would be against it and in favor of adding more rules if necessary.

@juanjoDiaz
Collaborator Author

What dataset are you using to measure performance?
I measured with the evaluation script and the latest UD treebanks as defined in the README, but the results were exactly the same whether affix was used or not.

What kind of rules do you think would help with my examples?

@adbar
Owner

adbar commented Aug 8, 2024

I see a difference when I add "de" to AFFIX_LANGS and lemmatize with greedy=False; the rest is the same.

I checked again, your examples rather hint at a failing decomposition into subwords. "Distanz" and "Stück" are in the language data, thus it should be possible to break the token apart and find the right ending. I'm not sure why it happens here.
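
For instance, both parts can be checked against the language data with simplemma's `is_known()` helper:

```python
from simplemma import is_known

# Both compound parts are present in the German language data,
# so a correct decomposition of e.g. "Distanzstücke" should be reachable.
print(is_known("Distanz", lang="de"))  # True
print(is_known("Stück", lang="de"))    # True
```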

@adbar
Owner

adbar commented Aug 8, 2024

My best guess is that it's just because the decomposition strategy is missing; there can be several ways to break words apart. That being said, it could be worth implementing this (a sketch follows the list):

  1. Start from the end until a valid subword is found
  2. See if the other part of the token is in the dictionary
  3. Apply the lemmatization to the identified subword at the end
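
A minimal sketch of these three steps, with a hypothetical `lemmatize_word` callback standing in for a dictionary-backed lookup that returns `None` for unknown tokens:

```python
from typing import Callable, Optional

def decompose(token: str,
              lemmatize_word: Callable[[str], Optional[str]]) -> Optional[str]:
    # 1. Start from the end until a valid subword is found.
    for split in range(len(token) - 1, 0, -1):
        # Capitalizing is a German-specific heuristic: compound tails are
        # lowercase inside the token but capitalized in the dictionary.
        subword_lemma = lemmatize_word(token[split:].capitalize())
        if subword_lemma is None:
            continue
        # 2. See if the other part of the token is in the dictionary.
        head = token[:split]
        if lemmatize_word(head) is None:
            continue
        # 3. Apply the lemmatization to the identified subword at the end.
        return head + subword_lemma.lower()
    return None
```

With a lookup that knows "Distanz" and "Stück", `decompose("Distanzstücke", ...)` would yield "Distanzstück".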

@juanjoDiaz
Collaborator Author

Isn't that how the affix decomposition strategy already works?

The problem is that the strategy is not applied to German.
I could simply solve this by adding German to the list of languages that use the affix decomposition strategy.

The question is also: why not enable the strategy for all languages?

@adbar
Owner

adbar commented Aug 12, 2024

Because it degrades performance on the benchmark, at least in my experiments. Currently the script only evaluates accuracy; it could look different with an f-score, but I'm not sure.

In German, the morphology of compound words can be complex, with linking elements of up to 3 characters (often 0 or 1) between the parts of a compound, e.g. the "s" in Durchgang+s+prüfung. I guess it's even trickier for other morphologically rich languages. So the approach used for affixes reaches its limits there and does not guarantee error-free decomposition.

My best guess is that the method needed to solve your cases is a compound splitting strategy. It is not explicitly included in Simplemma (only indirectly, through the affix decomposition), but it would be a nice addition to strategies/.

It would be the same idea as for the affixes, but with a further search until two or more valid parts (i.e. dictionary words) are found. We would need to specify how many characters are allowed between the components; I suggest determining that empirically, by testing. A rough sketch of the idea:
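
The following is a minimal recursive sketch under stated assumptions: the `in_dictionary` callback and the `MAX_LINKING_CHARS` constant are hypothetical, and a real implementation under strategies/ would also need to handle casing and choose among competing splits:

```python
from typing import Callable, List, Optional

MAX_LINKING_CHARS = 3  # upper bound between components, to be tuned empirically

def split_compound(token: str,
                   in_dictionary: Callable[[str], bool]) -> Optional[List[str]]:
    word = token.lower()  # casing is ignored in this sketch
    if in_dictionary(word):
        return [word]
    # Look for the longest valid leading part first.
    for split in range(len(word) - 1, 0, -1):
        head = word[:split]
        if not in_dictionary(head):
            continue
        # Allow 0 to MAX_LINKING_CHARS linking characters after the head,
        # e.g. the "s" in "durchgang(s)prüfung".
        for skip in range(MAX_LINKING_CHARS + 1):
            rest = word[split + skip:]
            if not rest:
                break
            parts = split_compound(rest, in_dictionary)
            if parts is not None:
                return [head] + parts
    return None
```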

@juanjoDiaz
Collaborator Author

I see.
How could we try such a new strategy if these compound words are not present in the UD treebanks?
(If they were, the precision would improve when adding the affix search.)

@adbar
Owner

adbar commented Aug 12, 2024

Yes, evaluating performance on rare words is an issue. We can implement the additional strategy and let the users decide if they want to use it.

adbar changed the title from "Why does the affix decomposition strategy only work for certain languages?" to "Add a compound splitting strategy to improve on affix decomposition" Aug 12, 2024
adbar added the enhancement (New feature or request) label and removed the question (Further information is requested) label Aug 12, 2024