
Improve Nerd standoff #37

Open
Aazhar opened this issue Jul 4, 2016 · 3 comments

Comments

@Aazhar
Member

Aazhar commented Jul 4, 2016

Actually, we select the first paragraphs, but it would be more fruitful to compute the most significant concepts rather than picking them arbitrarily.

@kermitt2
Member

kermitt2 commented Jul 4, 2016

Each tool has its dedicated usage and should not be used for another purpose:

  • the keyterm extractor extracts the most significant/discriminant key terms, key concepts and wikipedia categories from an article as compared to the background collection. E.g. for building facets of the most interesting concepts, this tool (or the wikipedia categories) has to be used.
  • the NERD is dedicated to the exhaustive annotation of the concepts in a document for enabling semantic search - so it has to be used for search in the same way as the usual terms (the stems). The fact that only the abstract and the first paragraph were used before was simply to cut processing time given the deadline of the senate demo in February 2015 ;) The idea is to run it on the whole textual content in order to combine structural search, term search and semantic search.

@Aazhar
Member Author

Aazhar commented Jul 4, 2016

Sure, but so far we're taking the first paragraphs (not necessarily the title and the abstract),
and what I meant is that, knowing that improvements have to be made on NERD, we can set a threshold (for instance the per-article average) on the nerd_score and conf_score to avoid badly disambiguated contexts.
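The per-article average threshold suggested above could be sketched as follows. This is a hypothetical illustration, not code from the project: the annotation structure and the field names `nerd_score` and `conf_score` are taken from the comment, but how NERD actually serializes them is assumed.

```python
def prune_annotations(annotations):
    """Keep only annotations whose nerd_score and conf_score are at or
    above the per-article averages, dropping likely bad disambiguations.

    `annotations` is assumed to be a list of dicts, each carrying the
    hypothetical keys "nerd_score" and "conf_score".
    """
    if not annotations:
        return []
    n = len(annotations)
    # Per-article average scores serve as the pruning thresholds.
    avg_nerd = sum(a["nerd_score"] for a in annotations) / n
    avg_conf = sum(a["conf_score"] for a in annotations) / n
    return [
        a for a in annotations
        if a["nerd_score"] >= avg_nerd and a["conf_score"] >= avg_conf
    ]
```

Using the article average makes the threshold adapt to each document instead of requiring a single global cut-off, at the cost of always discarding roughly the weaker half of the annotations.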

@kermitt2
Member

kermitt2 commented Jul 4, 2016

We were taking the first paragraphs just because of time constraints for the demo last year! We should take the whole document for the NERD… I thought I changed it at some point to take the whole document.

NERD is not weighting the concepts in terms of significance; it's grobid-keyterm which is doing that, using various distributional information. NERD disambiguates locally and tries to disambiguate all mentions. We can set a different threshold while indexing NERD annotations, for instance, if we want to improve precision, but there will always be some noise at this level. The point is that for semantic search it's the accumulation of the matches that sets the scores (tf/idf or BM25), so it should be robust to noise from a ranking perspective.
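The accumulation argument above can be made concrete with the standard BM25 term-scoring formula: a concept annotated many times in a document contributes much more to the ranking score than an isolated spurious annotation. This is a generic textbook sketch of BM25, not the scoring code of the search engine used here, and the default parameters k1 and b are conventional values, not project settings.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, df, n_docs, k1=1.2, b=0.75):
    """Standard BM25 score contribution of one term in one document.

    tf          -- term (or annotated concept) frequency in the document
    doc_len     -- length of the document
    avg_doc_len -- average document length in the collection
    df          -- number of documents containing the term
    n_docs      -- total number of documents in the collection
    """
    # Inverse document frequency: rare terms weigh more.
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # Saturating term-frequency component with length normalization.
    tf_component = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_component
```

Because the tf component grows with repeated matches (while saturating), a handful of wrongly disambiguated mentions scattered across documents barely moves the ranking, which is the robustness-to-noise point made in the comment.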

It is maybe a bit different for query disambiguation - less context and more sensitivity to noise. Currently the pruning thresholds are the same, but they could be refined based on experiments depending on the mode of usage…

For the facets, concepts and categories from the keyterm annotator make more sense than NERD annotations, because they are already a selection of the key aspects of a document.

kermitt2 added a commit that referenced this issue Nov 26, 2016