OpenNLP chunker Biomed data #10

Open
khituras opened this issue Oct 10, 2017 · 0 comments

Our chunker data is derived from the GENIA treebank corpus. However, this corpus contains complete nested constituents rather than just chunks, so we use an algorithm to create the chunks from the treebank. The jcore-base version of the OpenNLP chunker currently contains two such algorithms. I think the newer one works better than the old one, but it is still not perfect.
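To illustrate the kind of conversion meant here, the following is a minimal Python sketch of one simple heuristic. It is not either of the jcore-base algorithms. It assumes a hypothetical tree encoding of nested `(label, children)` tuples with `(pos, word)` leaves, and it only turns phrases whose children are all POS-tagged words into chunks; every other token is tagged `O`.

```python
def is_preterminal(node):
    """A preterminal is a (pos, word) pair whose second element is a string."""
    return isinstance(node[1], str)


def tree_to_iob(tree):
    """Flatten a nested constituency tree into (word, pos, iob_tag) triples.

    Heuristic (an assumption for illustration): a phrase becomes a chunk
    only if all of its children are preterminals. Tokens attached directly
    to a phrase that also has phrasal children are tagged "O".
    """
    result = []

    def visit(node):
        label, children = node
        if is_preterminal(node):
            # A word that is not inside a lowest-level phrase.
            result.append((children, label, "O"))
            return
        if all(is_preterminal(child) for child in children):
            # Lowest-level phrase: emit it as one chunk.
            for index, (pos, word) in enumerate(children):
                prefix = "B-" if index == 0 else "I-"
                result.append((word, pos, prefix + label))
            return
        for child in children:
            visit(child)

    visit(tree)
    return result


# Hypothetical example sentence, not taken from GENIA.
example_tree = (
    "S",
    [
        ("NP", [("DT", "The"), ("NN", "protein")]),
        ("VP", [("VBZ", "binds"), ("NP", [("NN", "DNA")])]),
        (".", "."),
    ],
)

for word, pos, tag in tree_to_iob(example_tree):
    print(word, pos, tag)
```

Note that the verb `binds` ends up outside any chunk with this heuristic, which is exactly the kind of case where different conversion algorithms can disagree.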
I have now found the following data on our internal file system: /archives/alumni_homes/tomanek/coling/corpora/Genia/chunks/genia_new.chunks.gz

This appears to be the GENIA conversion originally used within the JULIE Lab. We should run cross-evaluations on both corpora to see whether there are tagging differences, and also do a plain comparison of the two data sets. Perhaps the old data is better.
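As a starting point for the plain comparison, a token-level tag diff could look like the Python sketch below. It assumes both conversions have already been loaded as lists of `(token, chunk_tag)` pairs with identical tokenization. The on-disk format of `genia_new.chunks.gz` is not known here, so loading is left out, and the function name and sample data are hypothetical.

```python
def compare_chunk_tags(first, second):
    """Compare two chunk taggings of the same token sequence.

    Returns (agreement, differences), where agreement is the fraction of
    tokens with identical tags and differences lists
    (position, token, tag_in_first, tag_in_second) for each mismatch.
    """
    if len(first) != len(second):
        raise ValueError("sequences differ in length; tokenization must match")

    differences = []
    pairs = zip(first, second)
    for position, ((token_a, tag_a), (token_b, tag_b)) in enumerate(pairs):
        if token_a != token_b:
            raise ValueError(
                f"token mismatch at {position}: {token_a!r} vs {token_b!r}"
            )
        if tag_a != tag_b:
            differences.append((position, token_a, tag_a, tag_b))

    agreement = 1.0 - len(differences) / len(first) if first else 1.0
    return agreement, differences


# Hypothetical sample data standing in for the two conversions.
new_conversion = [("The", "B-NP"), ("protein", "I-NP"),
                  ("binds", "B-VP"), ("DNA", "B-NP")]
old_conversion = [("The", "B-NP"), ("protein", "I-NP"),
                  ("binds", "O"), ("DNA", "B-NP")]

agreement, differences = compare_chunk_tags(new_conversion, old_conversion)
print(f"agreement: {agreement:.2f}")
for position, token, tag_new, tag_old in differences:
    print(position, token, tag_new, tag_old)
```

The cross-evaluation itself (training on one conversion and testing on the other) would still need the actual chunker training pipeline. This diff only shows where the two data sets disagree.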
