Skip to content

Latest commit

 

History

History
90 lines (63 loc) · 3.38 KB

README.md

File metadata and controls

90 lines (63 loc) · 3.38 KB

nounType

A type classifier that determines whether an entity is a Proper Name or Other.

Table of Contents

Installation

The easiest way to install the library is

pip install git+git://github.com/trifacta/nounType

Usage

Using this parser:

There are two main methods: parse and tag.

  • parse breaks the string into components and classifies each one.
  • tag merges consecutive components and strips punctuation and classifies the string as a whole.
>>> import properName as pn

# Proper Name
>>> name = "Charles Smith"

>>> pn.parse(name)
[('Charles', 'Name'), ('Smith', 'Name')]

>>> pn.tag(name)
(OrderedDict([('Name', 'Charles Smith')]), 'Proper Name')


# Other
>>> item = 'apples and oranges'

>>> pn.parse(item)
[('apples', 'Other'), ('and', 'Other'), ('oranges', 'Other')]

>>> pn.tag(item)
(OrderedDict([('Other', 'apples and oranges')]), 'Other')

(Re-)Training

Follow the below steps to re-train this model (as there is already trained crfsuite file):

  1. Install parserator onto your computer.
pip install parserator
  1. Prepare the XML file for the training data.
  • For help formatting and creating the XML file, there is a scripts/convert.py file that will create the file based on the tag you wish to use. Compilation of the seperate XML files will need to be done separately.
  • Another way to create the training file is to use parserator's command line interface to manually label tokens. It uses values in first column, and it ignores all other columns. To start labeling, run
parserator label [infile] [outfile] properName

For example, parserator label training/training.csv training/combined.xml properName.

  1. Once the XML File is created, to re-train the model, simply run the command
parserator train [traindata] properName

Substitute [traindata] for the filepath to the XML file. For example, parserator train training/totalcomb_v2.xml properName.

Functionality

Proper Name & Training

For the sake of preventing overlap, a name excludes places, company names, addresses, phone numbers, common nouns and other similar categories, even if there are cross-cases (i.e. a noun like Paris can be both a place and a noun).

The name parser was created using parserator to create domain-specific probabilistic parsers using python-crfsuite's implementation of conditional random fields. The model itself was trained using an xml file where depending on how the entity was tokenized and classified could either be labeled as 'Proper Name' or 'Other'.

The classifer can deal with a variety of cases and forms of names including Proper Case, Upper Case, First Last and names including prefixes and/or suffixes.

Classifier

A few of the features that the parser uses when tagging an entity include length of the entity, occurrence of special characters, such as 'at' sign (@), hyphens (-) and punctuation (, or .) and the double metaphone phonetic encoding algorithm.