Indic Tagger (Indian Language Tagger)

In this project, we build part-of-speech (POS) taggers and chunkers for Indian Languages.

Languages supported: Telugu (te), Hindi (hi), Tamil (ta), Marathi (mr), Punjabi (pa), Kannada (kn), Malayalam (ml), Urdu (ur), Bengali (bn)

If you reuse this software, please use the following citation:

@inproceedings{PVS:SPSAL2007,
  editor    = {P.V.S., Avinesh and Gali, Karthik},
  title     = {Part of Speech Tagging and Chunking using Conditional Random Fields and Transformation Based Learning}
  booktitle = {Proceedings of the  Shallow Parsing for South Asian Languages (SPSAL) Workshop, held at IJCAI-07, Hyderabad, India},
  series    = {{SPSAL} Workshop Proceedings},
  month     = {January},
  year      = {2007},
  pages     = {21--24},
}

Training Data Statistics and System Performances (F1 macro)

Languages	# Words	# Sents	CRF POS	CRF Chunk	BI-LSTM-CRF POS	BI-LSTM CRF Chunk
te	347k	30k	93%	96%	92%	92%
hi	350k	16.3k	93%	97%	94%	93%
bn	298.3k	14.6k	84%	95%	85%	88%
pa	152.5k	5.6k	92%	98%	94%	96%
mr	207.9k	8.5k	89%	95%	88%	90%
ur	158.9k	7.6k	90%	96%	92%	89%
ta	337k	14.2k	88%	92%	87%	85%
ml	192k	11.4k	96%	95%	98%	98%
kn	294.3k	16.5k	90%	98%	88%	87%

Training Data Statistics and System Performances (F1 macro) for NER

Languages	# Words	# Sents	CRF NER	BI-LSTM-CRF NER
te	347k	30k	69%	65%
hi	503k	19k	62%	63%
bn	120k	6k	54%	48%
ur	35k	1.5k	65%	56%
or	93k	1.8k	68%	43%

Install using Anaconda

    # INSTALL python environment
    conda create -n tagger3.6 anaconda python=3.6
    source activate tagger3.6
    
    # Install the tokenizer
    cd polyglot-tokenizer
    python setup.py install
    
    # Install requirements
    pip install -r requirements.txt

Run

    python pipeline.py -p predict -l te -t pos -m crf -f txt -e utf -i input_file -o output_file

    -l, --languages       select language (2 letter ISO-639 code) 
                          {hi, be, ml, pu, te, ta, ka, mr, ur}
    -t, --tag_type      	pos, chunk, parse, ner
    -m, --model_type    	crf, hmm, lstm
    -f, --data_format   	ssf, txt, conll
    -e, --encoding      	utf8, wx   (default: utf8)
    -i, --input_file      <input-file>
    -o, --output_file     <output-file>
    -s, --sent_split      True/False (default: True)
	
    python pipeline.py --help

Train the POS tagger:

    # CRF model
    python pipeline.py -p train -o outputs -l te -t pos -m crf -e utf -f ssf
    
    # BI-LSTM-CRF model
    python pipeline.py -p train -t pos -f conll -m lstm -e utf -l te

Predict on text:

    # CRF models 
    python pipeline.py -p predict -l te -t pos -m crf -f txt -e utf -i data/test/te/test.utf.txt
    
    # BI-LSTM-CRF models
    python pipeline.py -p predict -l te -t pos -m lstm -f txt -e utf -i data/test/te/test.utf.txt
    
    # SpaCy models
    python spacy_tagger_test.py -l te -t pos

Train the NER tagger:

    # CRF model
    python pipeline.py -p train -o outputs -l te -t ner -m crf -e utf -f conll
    
    # BI-LSTM-CRF model
    python pipeline.py -p train -t ner -f conll -m lstm -e utf -l te

Predict NER on text:

    # CRF model
    python pipeline.py -p predict -l hi -t ner -m crf -f txt -e utf -i data/test/hi/test.utf.txt
    
    # BI-LSTM-CRF model
    python pipeline.py -p predict -l hi -t ner -m lstm -f txt -e utf -i data/test/hi/test.utf.txt

ToDo List

Telugu, Hindi trained CRF models
Bengali, Punjabi, Marathi, Urdu, Tamil trained CRF models
Bug: Utf-8 error Malayalam, Kannada trained CRF models
Deep learning (BI-LSTM-CRF)
Analysis Comparision w.r.t other ML algorithms
Bug: Punjabi & Urdu training file doesn't have "|" (or) end of sentence marker.
NER for Indian Languages
Feature addition to BI-LSTM-CRF models
Active Learning based sampling strategies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Indic Tagger (Indian Language Tagger)

Training Data Statistics and System Performances (F1 macro)

Training Data Statistics and System Performances (F1 macro) for NER

Install using Anaconda

Run

ToDo List

Files

README.md

Latest commit

History

README.md

File metadata and controls

Indic Tagger (Indian Language Tagger)

Training Data Statistics and System Performances (F1 macro)

Training Data Statistics and System Performances (F1 macro) for NER

Install using Anaconda

Run

ToDo List