
Improve Nerd standoff #37

Open
Aazhar opened this issue Jul 4, 2016 · 3 comments

Comments

@Aazhar
Member

Aazhar commented Jul 4, 2016

Actually, we select the first paragraphs, but it would be more fruitful to compute the most significant concepts rather than picking them arbitrarily.

@kermitt2
Member

kermitt2 commented Jul 4, 2016

Each tool has its dedicated usage and should not be used for another purpose:

  • the keyterm extractor extracts the most significant/discriminant key terms, key concepts and wikipedia categories from an article as compared to the background collection. E.g. for building facets of the most interesting concepts, this tool (or the wikipedia categories) has to be used.
  • the NERD is dedicated to the exhaustive annotation of the concepts in a document for enabling semantic search - so it has to be used for search in the same way as the usual terms (the stems). The fact that only the abstract and the first paragraph were used before was simply to cut processing time given the deadline of the senate demo in February 2015 ;) The idea is to run it on the whole textual content in order to combine structural search, term search and semantic search.

@Aazhar
Member Author

Aazhar commented Jul 4, 2016

Sure, but so far we're taking the first paragraphs (not necessarily the title and the abstract),
and what I meant is that, knowing that improvements have to be made on NERD, we can set a threshold (for instance the per-article average) on the nerd_score and conf_score to avoid badly disambiguated contexts.
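The per-article average threshold suggested above could be sketched as follows. This is a hypothetical illustration, not code from the project: the annotation structure and the field names `nerd_score` and `conf_score` are taken from the comment, but how NERD actually serializes them is assumed.

```python
def prune_annotations(annotations):
    """Keep only annotations whose nerd_score and conf_score are at or
    above the per-article averages, dropping likely bad disambiguations.

    `annotations` is assumed to be a list of dicts, each carrying the
    hypothetical keys "nerd_score" and "conf_score".
    """
    if not annotations:
        return []
    n = len(annotations)
    # Per-article average scores serve as the pruning thresholds.
    avg_nerd = sum(a["nerd_score"] for a in annotations) / n
    avg_conf = sum(a["conf_score"] for a in annotations) / n
    return [
        a for a in annotations
        if a["nerd_score"] >= avg_nerd and a["conf_score"] >= avg_conf
    ]
```

Using the article average makes the threshold adapt to each document instead of requiring a single global cut-off, at the cost of always discarding roughly the weaker half of the annotations.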

@kermitt2
Member

kermitt2 commented Jul 4, 2016

We were taking the first paragraphs just because of time constraints for the demo last year! We should take the whole document for the NERD… I thought I changed it at some point to take the whole document.

NERD is not weighting the concepts in terms of significance; it's grobid-keyterm which is doing that, using various distributional information. NERD disambiguates locally and tries to disambiguate all mentions. We can set a different threshold while indexing NERD annotations, for instance, if we want to improve precision, but there will always be some noise at this level. The point is that for semantic search it's the accumulation of the matches that sets the scores (tf/idf or BM25), so it should be robust to noise from a ranking perspective.
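The accumulation argument above can be made concrete with the standard BM25 term-scoring formula: a concept annotated many times in a document contributes much more to the ranking score than an isolated spurious annotation. This is a generic textbook sketch of BM25, not the scoring code of the search engine used here, and the default parameters k1 and b are conventional values, not project settings.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, df, n_docs, k1=1.2, b=0.75):
    """Standard BM25 score contribution of one term in one document.

    tf          -- term (or annotated concept) frequency in the document
    doc_len     -- length of the document
    avg_doc_len -- average document length in the collection
    df          -- number of documents containing the term
    n_docs      -- total number of documents in the collection
    """
    # Inverse document frequency: rare terms weigh more.
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # Saturating term-frequency component with length normalization.
    tf_component = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_component
```

Because the tf component grows with repeated matches (while saturating), a handful of wrongly disambiguated mentions scattered across documents barely moves the ranking, which is the robustness-to-noise point made in the comment.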

It is maybe a bit different for query disambiguation - less context and more sensitivity to noise. Currently the pruning thresholds are the same, but they could be refined based on experiments depending on the mode of usage…

For the facets, concepts and categories from the keyterm annotator make more sense than NERD annotations, because they are already a selection of the key aspects of a document.

kermitt2 added a commit that referenced this issue Nov 26, 2016