Add a compound splitting strategy to improve on affix decomposition #141
These are just German examples, so the affix search works; I wouldn't add it for other languages, or maybe I don't understand the question? Depending on the language, the UD data does not include a lot of variation; real-world scenarios are different. UD is just a way to run tests.
My questions are:
I see! There are already rules for German, I guess affixes are not included because it would harm precision, but I'll check again.
The performance is degraded in German for non-greedy searches, and the affixes don't improve the greedy mode. They also slow things down a bit. So I would be against it and in favor of adding more rules if necessary.
What dataset are you using to measure performance? What kind of rules do you think would help with my examples?
I checked again: your examples rather hint at a failing decomposition into subwords. "Distanz" and "Stück" are in the language data, so it should be possible to break the token apart and find the right ending. I'm not sure why it fails here.
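For reference, this kind of check can be reproduced with Simplemma's public `lemmatize` and `is_known` functions. The compound below is only an assumed example built from the two words mentioned above, not necessarily the original failing token:

```python
import simplemma

# hypothetical compound built from the two words mentioned above;
# the actual failing token from the report may differ
token = "Distanzstücke"

# both components are reported to be present in the German language data
print(simplemma.is_known("Distanz", lang="de"))  # expected: True
print(simplemma.is_known("Stück", lang="de"))    # expected: True

# the compound itself may not be reduced to its lemma
print(simplemma.lemmatize(token, lang="de"))
print(simplemma.lemmatize(token, lang="de", greedy=True))
```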
My best guess is that it's just because the decomposition strategy is missing; there can be several ways to break words apart. That being said, it could be worth implementing this.
Isn't that how the affix decomposition strategy works already? The problem is that the strategy is not applied to German. The question is also: why not enable the strategy for all languages?
Because it degrades performance on the benchmark, at least in my experiments. Currently the script only evaluates accuracy; it could look different with an f-score, but I'm not sure.

In German, the morphology of compound words can be complex, with (often 0 or 1, but) up to 3 characters between the parts of a compound. I guess it's even trickier for other morphologically rich languages. So the approach used for affixes reaches its limits there and does not entail error-free decomposition.

My best guess is that the method needed to solve your cases is a compound splitting strategy. It is not explicitly included in Simplemma (only indirectly through the affix decomposition), but it would be a nice addition. It would be the same idea as for the affixes, but with a further search until two or more valid parts (i.e. dictionary words) are found. We would need to specify how many characters are allowed between the components; I suggest determining that empirically, by testing.
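A minimal sketch of what such a compound splitting strategy could look like, assuming a plain `dictionary` mapping of known word forms to lemmas and a hypothetical `max_infix` parameter for the linking characters; this is not Simplemma's actual implementation, just an illustration of the idea:

```python
from typing import Dict, List, Optional

def split_compound(
    token: str,
    dictionary: Dict[str, str],
    max_infix: int = 3,    # hypothetical limit on linking characters ("Fugenelemente")
    min_part_len: int = 3, # avoid spurious one- or two-letter "parts"
) -> Optional[List[str]]:
    """Try to split a token into two or more dictionary words,
    allowing up to `max_infix` linking characters between the parts."""
    word = token.lower()
    # candidate end positions for the first component
    for i in range(min_part_len, len(word) - min_part_len + 1):
        head = word[:i]
        if head not in dictionary:
            continue
        # skip 0..max_infix linking characters, then test the remainder
        for skip in range(max_infix + 1):
            rest = word[i + skip:]
            if len(rest) < min_part_len:
                break
            if rest in dictionary:
                return [head, rest]
            # allow compounds with more than two parts
            deeper = split_compound(rest, dictionary, max_infix, min_part_len)
            if deeper is not None:
                return [head] + deeper
    return None

# toy dictionary standing in for the language data (hypothetical)
toy_dict = {"sicherheit": "Sicherheit", "abstand": "Abstand"}
print(split_compound("Sicherheitsabstand", toy_dict))  # ['sicherheit', 'abstand']
```

The lemma of the final part would then supply the ending for the whole compound, and the `max_infix` threshold is exactly the parameter that would need to be tuned empirically.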
I see. |
Yes, evaluating performance on rare words is an issue. We can implement the additional strategy and let the users decide if they want to use it.
Hi @adbar,
Recently I started noticing that some inflected words are not correctly lemmatized.
However, when adding German to the list of languages that are processed by the affix decomposition strategy, most of these are solved.
Here are some examples:
Adding the affix strategy for German does increase the execution time a bit, but doesn't change the precision numbers when running the evaluation script against the latest UD treebanks.
But, to be honest, even removing all the rules and only keeping the dictionary lookup barely changes the evaluation results 😅
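For context, a minimal sketch of the kind of accuracy comparison described above, using Simplemma's public API on a small hand-made list of (token, gold lemma) pairs rather than the project's actual evaluation script or the UD treebanks:

```python
import simplemma

# hypothetical (token, gold lemma) pairs; the real evaluation runs on UD treebanks
pairs = [
    ("Häuser", "Haus"),
    ("ging", "gehen"),
    ("Bücher", "Buch"),
]

def accuracy(pairs, greedy):
    hits = sum(
        simplemma.lemmatize(token, lang="de", greedy=greedy) == lemma
        for token, lemma in pairs
    )
    return hits / len(pairs)

print("non-greedy:", accuracy(pairs, greedy=False))
print("greedy:    ", accuracy(pairs, greedy=True))
```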
So, the questions are:
- Why is German not included in the affix search?
- Should we include it?