Modern approaches to Named Entity Recognition (NER) use neural networks (NN) to automatically extract features from text and integrate them with sequence taggers in an end-to-end fashion. Word embeddings, which are a side product of pretrained neural language models (LMs), are key ingredients for boosting the performance of NER systems. More recently, contextual word embeddings, which adapt to the context in which a word appears, have proved to be an invaluable resource for improving NER systems. In this work, we assess how different combinations of (shallow) word embeddings and contextual embeddings impact NER for the Portuguese language. We present a comparative study of 16 different combinations of shallow and contextual embeddings and explore how the textual diversity and size of the corpora used to train the LMs impact our NER results. We evaluate NER performance using the HAREM corpus. Our best NER system outperforms the state of the art in Portuguese NER by 5.99 absolute percentage points.
State-of-the-art results, evaluated with the CoNLL-2002 script:
Results for the Total Scenario (HAREM)
Approach | Precision | Recall | F1 |
---|---|---|---|
BiLSTM-CRF+FlairBBP | 74.91% | 74.37% | 74.64% |
BiLSTM-CRF (Castro, et al.) | 72.28% | 68.03% | 70.33% |
CharWNN (dos Santos, et al.) | 67.16% | 63.74% | 65.41% |
Results for the Selective Scenario (HAREM)
Approach | Precision | Recall | F1 |
---|---|---|---|
BiLSTM-CRF+FlairBBP | 83.38% | 81.17% | 82.26% |
BiLSTM-CRF (Castro, et al.) | 78.26% | 74.39% | 76.27% |
CharWNN (dos Santos, et al.) | 73.98% | 68.68% | 71.23% |
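As a quick sanity check on the tables above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded percentages reported for BiLSTM-CRF+FlairBBP; the function name `f1_score` is only illustrative. The CoNLL-2002 script works from raw entity counts, so recomputing from rounded values is an approximation and may differ in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# BiLSTM-CRF+FlairBBP, values taken from the tables above (in percent).
total_f1 = f1_score(74.91, 74.37)
selective_f1 = f1_score(83.38, 81.17)

print(round(total_f1, 2))      # 74.64
print(round(selective_f1, 2))  # 82.26

# Absolute gain over the previous best F1 in the Selective Scenario.
print(round(82.26 - 76.27, 2))  # 5.99
```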
Before you begin, install the Flair library. Flair is an NLP library developed by Zalando Research that achieves state-of-the-art results in sequence labeling. You can see all details in its GitHub repository.
- Paper: Contextual String Embeddings for Sequence Labeling (Akbik, et al.)
STEP 1: Download our language model FlairBBP (backward and forward);
STEP 2: Clone this repository;
STEP 3: Install Flair. See how to install here;
STEP 4: Download NILC's word embeddings. You must download the Word2Vec Skip-Gram model with 300 dimensions and put the file inside the cloned folder;
STEP 5: Run our script: python3.6 ner_flair.py
Tag your text using our best NER model. The model combines FlairBBP with NILC-Word2Vec-Skpg-300d. It recognizes the following categories: PERSON, LOCATION, ORGANIZATION, TIME and VALUE. You need to install the latest version of Flair.
STEP 1: Download our NER model Download Here!;
STEP 2: Use pToolNER to label your text.
pToolNER = PortugueseToolNER()
pToolNER.loadNamedEntityModel('best-model.pt')
pToolNER.sequenceTaggingOnText(
    rootFolderPath='./PredictablesFiles',
    fileExtension='.txt',
    useTokenizer=True,
    maskNamedEntity=False,
    createOutputFile=True,
    outputFilePath='./TaggedTexts',
    outputFormat='plain',
    createOutputListSpans=True
)
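For orientation, the sketch below shows roughly what tagging a single string looks like when calling Flair directly. It is not the pToolNER implementation. It assumes that the downloaded `best-model.pt` loads as a Flair `SequenceTagger` in your installed Flair version and that the model's tag type is `ner`; the helper name `tag_text` is hypothetical.

```python
from typing import List, Tuple


def tag_text(text: str, model_path: str = "best-model.pt") -> List[Tuple[str, str]]:
    """Return (entity text, label) pairs predicted for `text`."""
    # Flair is imported lazily so this module can be imported without it.
    from flair.data import Sentence
    from flair.models import SequenceTagger

    tagger = SequenceTagger.load(model_path)       # load the downloaded model
    sentence = Sentence(text, use_tokenizer=True)  # tokenize the raw text
    tagger.predict(sentence)                       # annotates the sentence in place

    # Assumption: entities were stored under the "ner" tag type.
    return [(span.text, span.tag) for span in sentence.get_spans("ner")]
```

Calling `tag_text("...")` with a Portuguese sentence would return a list of entity strings paired with their predicted categories. Loading the model is slow, so for many texts it is better to load the tagger once and reuse it.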
Alternative use (we strongly recommend using pToolNER instead):
STEP 1: Download our NER model Download Here!;
STEP 2: Clone this repository;
STEP 3: Run our script python3.6 tagging_ner.py [input_file_name.txt] [output_file_name.txt] [mode]
modes:
- conll - input text in CoNLL format
- plain - input text in plain format
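To illustrate the difference between the two modes, the sketch below converts plain text into a simplified CoNLL-style layout with one token per line and a blank line between sentences. This is an assumption about the general shape of CoNLL input, not a specification of what tagging_ner.py expects: the function name `plain_to_conll`, the one-sentence-per-line convention, and the naive regex tokenizer are all illustrative.

```python
import re
from typing import List


def plain_to_conll(text: str) -> str:
    """Convert plain text (one sentence per line) to one token per line."""
    lines: List[str] = []
    for sentence in text.splitlines():
        # Naive tokenizer: runs of word characters, or single punctuation marks.
        tokens = re.findall(r"\w+|[^\w\s]", sentence)
        if not tokens:
            continue  # skip empty lines
        lines.extend(tokens)
        lines.append("")  # blank line marks the end of a sentence
    return "\n".join(lines)


print(plain_to_conll("Maria mora em Lisboa."))
```

In this simplified layout each token sits on its own line; real CoNLL files usually carry additional columns (for example, the gold NER tag) next to each token.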
You can download our Flair Embeddings models (FlairBBP) in the following links:
- Backward: FlairBBP-Backward
- Forward: FlairBBP-Forward
You can download our Word Embedding models from the following links. All models were trained with 300 dimensions:
Algorithm | Architecture | Downloads |
---|---|---|
Word2Vec | Skip-Gram | Word2Vec_skpg_300d |
Word2Vec | CBOW | Word2Vec_cbow_300d |
FastText | Skip-Gram | Fasttext_skpg_300d |
FastText | CBOW | Fasttext_cbow_300d |
You can download the Word Embeddings provided by NILC in the following link: http://nilc.icmc.usp.br/embeddings
- Paper: Portuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks (Hartmann, et al.)
BlogSet-BR is a large corpus built from millions of sentences taken from Brazilian Portuguese web blogs.
- Paper: BlogSet-BR: A Brazilian Portuguese Blog Corpus (Santos, et al.)
- Download Here!
brWaC is another large Portuguese corpus.
- Paper: The brWaC Corpus: A New Open Resource for Brazilian Portuguese (Filho, et al.)
- Download Here!
ptwiki-20190301 is a corpus formed by texts from the Portuguese Wikipedia.
Language Model Corpora Size Details (after pre-processing):
Corpus | Sentences | Tokens |
---|---|---|
brWaC | 127,272,109 | 2,930,573,938 |
BlogSet-BR | 58,494,090 | 1,807,669,068 |
ptwiki-20190301 | 7,053,954 | 162,109,057 |
All Corpora | 192,820,153 | 4,900,352,063 |
@inproceedings{santos2019assessing,
author = {Joaquim Santos and
Bernardo Consoli and
Cicero dos Santos and
Juliano Terra and
Sandra Collonini and
Renata Vieira},
title = {Assessing the Impact of Contextual Embeddings for Portuguese Named Entity Recognition},
booktitle = {Proceedings of the 8th Brazilian Conference on Intelligent Systems},
pages = {437--442},
year = {2019}
}