Different label categories than expected in a spark-nlp NER model #13167

ag-din · 2022-11-29T14:58:26Z


Hello everyone,

I am a beginner with spark-nlp, and I want to train a NER model that recognises two types of entities in text, labelled SPECSKILL and HUMANSKILL. I'm using Python 3.7.12 and spark-nlp 4.2.3. The training and test datasets are in CoNLL 2003 format. I ran a small first training and got the following results:

The output reports labels=9. However, I expected labels=2 (SPECSKILL and HUMANSKILL only). Since this caught my attention, I tried the conll2003 data here, and got similar results with respect to the labels:

Again the output reports labels=9. However, I expected labels=4 (LOC, ORG, MISC, and PER only).

Questions:

1. Is it normal to get these results in terms of number of labels?
2. What exactly does the trained NER model recognise that generates other categories than the ones I expect (I expect labels=2, from SPECSKILL & HUMANSKILL)?
3. I am following CoNLL 2003 documentation, to get the 4th item (entity), but I see this is causing “harm” (as far as I am concerned), as theoretically equal entities are getting recognized as different (i.e., “B-SPECSKILL” is being considered different to “I-SPECSKILL” when both should be recognized as “SPECSKILL” alone)... Does Spark NLP work differently in this regard (i.e., does it mandatorily need IOB format for the entities), compared to other frameworks?
4. In general, should we use “IOB schema” or “BILUO schema” (source) for our named entities, or is it OK to leave our named entities without it (i.e., as seen here)?

I look forward to your comments. Thank you.

Answered by maziyarpanahi

maziyarpanahi · 2022-11-29T15:12:27Z

Maintainer

Hi,

Yes! I am not sure about your own CoNLL file, but there is no LABELONE or LABELTWO inside that file. Spark NLP does not make labels up, so either you are reading the wrong CoNLL column index or these are what the file actually contains. The labels it found are:

U_SPECSKILL
I-SPECSKILL
I-HUMANSKILL
B-SPECSKILL
B-HUMANSKILL
U-HUMANSKILL
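
One way to check what is actually inside a CoNLL file is to tally the last column yourself. The snippet below is a minimal plain-Python sketch, not Spark NLP code: `count_conll_labels` and the sample text are hypothetical, and it assumes whitespace-separated columns with the NER tag in the last position.

```python
from collections import Counter


def count_conll_labels(conll_text: str) -> Counter:
    """Tally the NER tag (last column) of every token line in CoNLL-style text."""
    counts = Counter()
    for line in conll_text.splitlines():
        line = line.strip()
        # Skip blank sentence separators and document markers.
        if not line or line.startswith("-DOCSTART-"):
            continue
        counts[line.split()[-1]] += 1
    return counts


# Hypothetical sample; tokens and tags are made up for illustration.
sample = """-DOCSTART- -X- -X- O

She NNP NNP O
knows NNP NNP O
Apache NNP NNP B-SPECSKILL
Spark NNP NNP I-SPECSKILL
and NNP NNP O
teamwork NNP NNP B-HUMANSKILL
"""

label_counts = count_conll_labels(sample)
print(sorted(label_counts))
```

Running the same function over the real training file (read with `open(path).read()`) lists every distinct tag the model will be asked to learn.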

The label count includes the B- and I- prefixed variants plus O. That is why the CoNLL2003 file you tested, which has 4 entities (not labels), gives 9 labels: 8 distinct labels starting with B- or I-, plus O, make 9 unique labels to learn during training. You have confused entities with labels. (We teach the model how labels work with a schema that marks the beginning and inside of each span, the model predicts those labels back, and then we merge them into entities via NerConverter.)

If you count them, you get 9 (8 + O):

B-LOC
I-ORG
I-MISC
I-LOC
I-PER
B-MISC
B-ORG
B-PER
O
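
The arithmetic can be checked with a short plain-Python sketch. The helper names `iob_labels` and `entity_of` are hypothetical and only illustrate the entity-versus-label distinction; they are not part of Spark NLP.

```python
def iob_labels(entities):
    """Return the label set an IOB-style schema yields: B-/I- per entity, plus O."""
    labels = {"O"}
    for entity in entities:
        labels.add(f"B-{entity}")
        labels.add(f"I-{entity}")
    return labels


def entity_of(label):
    """Strip the B-/I- prefix; O maps to None, unprefixed labels are returned as-is."""
    if label == "O":
        return None
    _prefix, sep, name = label.partition("-")
    return name if sep else label


conll_labels = iob_labels(["LOC", "ORG", "MISC", "PER"])
print(len(conll_labels))  # 4 entities * 2 prefixes + O = 9

# Collapsing the labels again recovers the 4 entities.
entities = {entity_of(label) for label in conll_labels} - {None}
print(sorted(entities))
```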

As for the rest of your questions: NerDLApproach in Spark NLP accepts the IOB or IOB2 schema. However, I have seen people whose datasets have no B- or I- prefixes, only full entities like PERSON or DATE, and the model still learns. I have also seen people train models with other schemas. The schema helps you get more fine-grained NER models with span definitions, that's all. (I recommend IOB or IOB2.)
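
To see what the span definition buys you, here is a minimal plain-Python sketch of merging prefixed tags back into entity chunks. It only illustrates the idea behind the merging step and is not how NerConverter is implemented; `merge_iob` and the example sentence are hypothetical, and the function assumes every non-O tag carries a B- or I- prefix.

```python
def merge_iob(tokens, tags):
    """Group IOB/IOB2-tagged tokens into (entity, text) chunks."""
    chunks = []
    current_entity, current_tokens = None, []

    def flush():
        # Close the chunk being built, if any, and reset the state.
        nonlocal current_entity, current_tokens
        if current_tokens:
            chunks.append((current_entity, " ".join(current_tokens)))
        current_entity, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag == "O":
            flush()
            continue
        prefix, _, entity = tag.partition("-")
        # A B- tag, or a change of entity type, starts a new chunk.
        if prefix == "B" or entity != current_entity:
            flush()
            current_entity = entity
        current_tokens.append(token)
    flush()
    return chunks


tokens = ["She", "knows", "Apache", "Spark", "and", "teamwork"]
tags = ["O", "O", "B-SPECSKILL", "I-SPECSKILL", "O", "B-HUMANSKILL"]
chunks = merge_iob(tokens, tags)
print(chunks)
```

Because B- marks where a span starts, two adjacent entities of the same type stay separate, which unprefixed labels cannot express.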

Here is a very complete tutorial explaining everything I said in detail (this will answer all your questions): https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/4.NERDL_Training.ipynb

1 reply

ag-din Nov 30, 2022
Author

Hi @maziyarpanahi. Thank you very much for your answers, they were clarifying, and the notebook is also helpful.
