Lemmatization performance on Universal Dependency Treebanks #7164
abdullah-alnahas started this conversation in General
Replies: 1 comment
-
Thanks for sharing your code and results here. One thing that is very important is adjusting the Tokenizer so that it produces tokens matching the original (gold-standard) datasets, or bypassing the Tokenizer and using the tokens from the test dataset directly; that way you can align token by token, which gives you a lemma-by-lemma comparison. I am going to work on the alignments and keep this thread up to date. Once we complete this, I will include it in the documentation for future reference, so thanks again for your contribution.
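To make the token-by-token alignment concrete, here is a minimal sketch (not the code attached in this thread) of feeding the gold-standard tokens from a CoNLL-U test file into a Spark NLP lemmatization pipeline: the gold tokens are joined with single spaces and the pipeline splits only on whitespace, so its tokens are exactly the gold tokens. The file path and pretrained model name are placeholders, and it assumes a Spark NLP version that ships RegexTokenizer with a configurable split pattern; adjust to your setup.

```python
# Sketch: run a Spark NLP lemmatizer over gold-standard tokens so predictions
# align one-to-one with the treebank. Path and model name are illustrative.
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import RegexTokenizer, LemmatizerModel
from pyspark.ml import Pipeline

spark = sparknlp.start()

def read_conllu_sentences(path):
    """Yield (gold_tokens, gold_lemmas) per sentence, skipping comments,
    multi-word token ranges (1-2) and empty nodes (3.1)."""
    tokens, lemmas = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                if tokens:
                    yield tokens, lemmas
                tokens, lemmas = [], []
                continue
            if line.startswith("#"):
                continue
            cols = line.split("\t")
            if "-" in cols[0] or "." in cols[0]:
                continue
            tokens.append(cols[1])   # FORM column
            lemmas.append(cols[2])   # LEMMA column
    if tokens:
        yield tokens, lemmas

sentences = list(read_conllu_sentences("en_ewt-ud-test.conllu"))  # placeholder path

# Join gold tokens with a single space; note this assumes no gold token
# itself contains whitespace (true for most UD treebanks).
data = spark.createDataFrame(
    [(i, " ".join(toks)) for i, (toks, _) in enumerate(sentences)], ["id", "text"]
)

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = (RegexTokenizer()            # split on whitespace only -> gold tokens
             .setInputCols(["document"]).setOutputCol("token").setPattern("\\s+"))
lemmatizer = (LemmatizerModel.pretrained("lemma_antbnc", "en")  # example English model
              .setInputCols(["token"]).setOutputCol("lemma"))

pipeline = Pipeline(stages=[document, tokenizer, lemmatizer])
result = pipeline.fit(data).transform(data)

rows = result.select("id", "lemma").orderBy("id").collect()
pred_lemmas = [[a.result for a in row["lemma"]] for row in rows]
```

With the tokens fixed to the gold segmentation, the predicted lemmas can be compared position by position against the LEMMA column of the treebank.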
-
I am comparing the performance of the most popular lemmatization tools. I have found benchmark results for Stanza, Trankit, and spaCy on Universal Dependencies version 2.5. However, I couldn't find anything related to Spark NLP.
Could you please point me to it if such a benchmark has already been done?
I have tried to do it myself, and I got an aligned accuracy of ~78% (I am attaching the code and results below).
Questions:
Appreciate your input.
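For reference, the kind of "aligned accuracy" mentioned above can be computed by aligning system tokens to gold tokens and comparing lemmas only on aligned pairs, normalised by the number of gold tokens. The sketch below is illustrative (not the code attached to this post) and is a simplified stand-in for the official CoNLL 2018 evaluation script (conll18_ud_eval.py), which aligns on character spans; here difflib alignment over the token strings is used instead.

```python
# Illustrative sketch of an aligned lemma-accuracy computation.
from difflib import SequenceMatcher

def aligned_lemma_accuracy(gold, system):
    """gold/system: lists of (tokens, lemmas) pairs, one pair per sentence."""
    correct, total_gold = 0, 0
    for (g_toks, g_lems), (s_toks, s_lems) in zip(gold, system):
        total_gold += len(g_toks)
        # Align on the token strings; only matched 1:1 tokens are scored.
        matcher = SequenceMatcher(a=g_toks, b=s_toks, autojunk=False)
        for block in matcher.get_matching_blocks():
            for k in range(block.size):
                if g_lems[block.a + k] == s_lems[block.b + k]:
                    correct += 1
    return correct / total_gold if total_gold else 0.0

# Example usage, given sentence-level gold (tokens, lemmas) and predicted
# lemmas such as those produced by the pipeline sketched earlier in the thread:
# gold = sentences
# system = list(zip([toks for toks, _ in sentences], pred_lemmas))
# print(f"aligned lemma accuracy: {aligned_lemma_accuracy(gold, system):.2%}")
```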