Issue on overlap entities in the task-1 training set #8

Open
jplu opened this issue Mar 27, 2015 · 6 comments

@jplu
Collaborator

jplu commented Mar 27, 2015

Hi,

I found a new bug in the training set; this one concerns two overlapping entities:

<http://www.ontologydesignpatterns.org/data/oke-challenge/task-1/sentence-15#char=12,28>
        a                     nif:String , nif:RFC5147String ;
        nif:anchorOf          "Auburn, New York"@en ;
        nif:beginIndex        "12"^^xsd:int ;
        nif:endIndex          "28"^^xsd:int ;
        nif:referenceContext  <http://www.ontologydesignpatterns.org/data/oke-challenge/task-1/sentence-15#char=0,145> ;
        itsrdf:taIdentRef     <http://www.ontologydesignpatterns.org/data/oke-challenge/task-1/Auburn,_New_York> .

And

<http://www.ontologydesignpatterns.org/data/oke-challenge/task-1/sentence-15#char=2,27>
        a                     nif:String , nif:RFC5147String ;
        nif:anchorOf          "native of Auburn, New Yor"@en ;
        nif:beginIndex        "2"^^xsd:int ;
        nif:endIndex          "27"^^xsd:int ;
        nif:referenceContext  <http://www.ontologydesignpatterns.org/data/oke-challenge/task-1/sentence-15#char=0,145> ;
        itsrdf:taIdentRef     <http://www.ontologydesignpatterns.org/data/oke-challenge/task-1/Native_of_Auburn,_New_York_1> .

I think the second one is wrong.
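
A minimal Python sketch of how such overlaps could be flagged automatically. It assumes `nif:endIndex` is exclusive (consistent with "Auburn, New York" being 16 characters long and spanning 12 to 28); `Span` and `relation` are hypothetical names, not part of any NIF tooling:

```python
from typing import NamedTuple


class Span(NamedTuple):
    begin: int  # value of nif:beginIndex
    end: int    # value of nif:endIndex (assumed exclusive)
    text: str   # value of nif:anchorOf


def relation(a: Span, b: Span) -> str:
    """Classify two character spans as 'disjoint', 'nested' or 'crossing'."""
    if a.end <= b.begin or b.end <= a.begin:
        return "disjoint"
    a_in_b = b.begin <= a.begin and a.end <= b.end
    b_in_a = a.begin <= b.begin and b.end <= a.end
    return "nested" if a_in_b or b_in_a else "crossing"


# The two annotations from sentence 15, copied from the snippets above.
place = Span(12, 28, "Auburn, New York")
person = Span(2, 27, "native of Auburn, New Yor")

print(relation(place, person))
# Sanity check: does each anchor text have the length its offsets imply?
for span in (place, person):
    print(repr(span.text), len(span.text) == span.end - span.begin)
```

With these offsets the two spans cross rather than nest, because the second annotation ends at 27 while the first ends at 28.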

Cheers.

@anuzzolese
Owner

Hi Julien,

this overlap identifies two different entities:

  • oke:Auburn,_New_York, which is a place;
  • oke:Native_of_Auburn,_New_York_1, which is a person.

This distinction is correct, hence the overlap is correct as well.

@giusepperizzo

Well, if so, why not tag/link New_York as well?

Would you mind detailing a bit more how you handled nested entities when creating the GS?

@rtroncy

rtroncy commented Mar 31, 2015

@anuzzolese Can we please re-open this issue? This is serious, since nested entities are a very _hard_ problem for the community. It is fine that the organizers of the challenge want to consider them, but then you need to communicate the clear guidelines that were provided to the annotators. For example, @giusepperizzo just pointed to a case suggesting that not all possible nested entities have been annotated. You also need to guarantee that the training and test sets are annotated consistently.

Warning: you are really opening a can of worms by considering nested entities. You are likely to face a long adjudication phase in which all participating systems come back, complain about the inconsistencies they discover, and ask for the figures to be recomputed.
Are you sure you want this?

@anuzzolese
Copy link
Owner

@giusepperizzo and @rtroncy I see your point, and I agree that overlapping entities are very hard to address.

I asked annotators to report distinct entities in case of overlaps. In this case the annotator found two distinct entities and considered New_York a characterisation (a way of disambiguating) of Auburn.
However, the comment is highly pertinent, and this way of generating entities might cause problems in the evaluation. In fact, someone could argue that New York is a mention of another entity.

Hence, in my opinion there are two possibilities:

  • take into account all the nested entities (I will personally take care of updating the training set accordingly);
  • remove the identification of nested entities.

The issue is reopened.
WDYT?

@anuzzolese anuzzolese reopened this Mar 31, 2015
@rtroncy

rtroncy commented Mar 31, 2015

Thanks for re-opening the issue. For the purposes of the challenge, I think you should go for your second option, i.e. remove all identification of nested entities in both the training and test datasets, and only consider the "largest" entity (this is often the one with the longest surface form).
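
One possible reading of "keep only the largest entity", sketched in Python. This is an assumption about how the filtering could work, not the organizers' procedure: it treats "largest" as "longest span", resolves overlaps greedily, and uses a hypothetical helper name `keep_largest`. The span `(40, 52)` is made up to show that non-overlapping mentions survive.

```python
def keep_largest(spans):
    """Greedily keep the longest spans, dropping any span that overlaps a kept one.

    Each span is a (begin, end) pair of character offsets, end exclusive.
    """
    kept = []
    # Visit the longest spans first; ties are broken by the earliest start.
    for begin, end in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        # Keep the span only if it is disjoint from everything kept so far.
        if all(end <= kb or ke <= begin for kb, ke in kept):
            kept.append((begin, end))
    return sorted(kept)


# (12, 28) and (2, 27) are the offsets from sentence 15; (40, 52) is hypothetical.
print(keep_largest([(12, 28), (2, 27), (40, 52)]))
```

On this input the shorter span `(12, 28)` is dropped in favour of `(2, 27)`, which illustrates that longest-span filtering would keep the person mention and discard the place mention.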

Annotating the dataset with nested entities is also a very valuable effort and, if you're willing to do it, it could be of great benefit to the community. This resource would be useful post-challenge for additional experiments. For example, TAC 2014 treated nested entities as optional (for systems that wanted to run some trials), but this was not part of the official competition, since the community is still learning how this complex problem should be scored and evaluated.

@jplu
Collaborator Author

jplu commented May 28, 2015

Related to this issue, there are two more cases:

  • In sentence 64: the two entities "Methodist Episcopal" and "clergyman" directly follow each other. Both are extracted and typed as a "Role", whereas the entity could be "Methodist Episcopal clergyman".
  • In sentence 31: the entity "Wiveliscombe" is extracted, whereas the correct entity might be "a native of Wiveliscombe".
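
Cases like the one in sentence 64 could be surfaced with a check along these lines. This is only a sketch: the sentence text and offsets below are hypothetical (the real sentence 64 is not quoted in this thread), and `merge_candidates` is an illustrative name.

```python
def merge_candidates(text, mentions):
    """Return merged surface forms for consecutive mentions that share a type
    and are separated by nothing but whitespace.

    Each mention is a (begin, end, type) triple, end exclusive.
    """
    ordered = sorted(mentions)
    merged = []
    for (b1, e1, t1), (b2, e2, t2) in zip(ordered, ordered[1:]):
        if t1 == t2 and e1 <= b2 and text[e1:b2].strip() == "":
            merged.append(text[b1:e2])
    return merged


# Hypothetical sentence and offsets, for illustration only.
text = "He was a Methodist Episcopal clergyman."
mentions = [(9, 28, "Role"), (29, 38, "Role")]
print(merge_candidates(text, mentions))
```

Any candidate it reports would still need a human decision, since two adjacent mentions of the same type are not always one entity.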
