Character n-grams #103

vboton · 2018-07-06T11:39:52Z

I have training data where each token is a word, and I've already extracted a few features like NER, POS and CHUNK for each token. But I have a problem when I try to extract character n-gram features. Since these features are computed at the character level, I don't know how to represent their values in the attribute-value format. For example, if the current token is "food", then its character bigram feature will be something like "fo, oo, od". So how should I format that feature? By writing something like bigram[0]=fo, oo, od?

kaushikacharya · 2018-07-06T14:31:11Z

One way of doing this is to use each bigram as a key and the boolean True as its value:

features['fo'] = True
features['oo'] = True
features['od'] = True

If you also want to consider the position of the bigram, it would be something like:

features['fo_word_prefix'] = True
features['oo_word_middle'] = True
features['od_word_suffix'] = True

For reference, have a look at features['BOS'] = True in the function word2features in https://github.com/TeamHG-Memex/sklearn-crfsuite/blob/master/docs/CoNLL2002.ipynb
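The boolean-key idea above could be generated automatically along these lines. This is only a sketch: the helper names `char_ngrams` and `ngram_features`, and the `char2gram:` key prefix (added so the n-gram keys cannot collide with other feature names), are assumptions for illustration and not part of sklearn-crfsuite.

```python
def char_ngrams(token, n=2):
    """Return all contiguous character n-grams of token, left to right."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def ngram_features(token, n=2, with_position=False):
    """Build a feature dict with one boolean entry per character n-gram.

    Key names are hypothetical; any unique string works as a key.
    """
    features = {}
    grams = char_ngrams(token, n)
    last = len(grams) - 1
    for i, gram in enumerate(grams):
        key = f"char{n}gram:{gram}"
        if with_position:
            # First n-gram is tagged as prefix, last as suffix, the rest as middle.
            if i == 0:
                key += "_word_prefix"
            elif i == last:
                key += "_word_suffix"
            else:
                key += "_word_middle"
        features[key] = True
    return features


# Example: merge into the per-token feature dict built elsewhere.
features = {"bias": 1.0}
features.update(ngram_features("food", n=2, with_position=True))
```

Tokens shorter than `n` simply produce no n-gram features, so the dict stays valid for one-character tokens.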


usptact · 2018-07-06T15:41:03Z

Take a look at the Stanford NLP NER features. These features are quite useful in morphologically rich languages like Finnish, Turkish, Russian and others.

You can write the prefixes of the word "food" like:

^f
^fo
^foo

And the suffixes:

d$#
od$#
ood$#

I don't remember the exact start and end flags but you get the idea.
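A rough sketch of how such prefix and suffix features might be produced as boolean keys. The function name `affix_features` is hypothetical, and the `^` and `$` boundary markers are assumptions chosen for illustration, not necessarily the flags Stanford NER uses.

```python
def affix_features(token, max_len=3, start="^", end="$"):
    """Return boolean features for the token's prefixes and suffixes.

    Prefixes are marked with `start` and suffixes with `end`, so the
    same character sequence yields different keys at each word edge.
    """
    features = {}
    # Never ask for an affix longer than the token itself.
    for k in range(1, min(max_len, len(token)) + 1):
        features[start + token[:k]] = True   # e.g. "^fo"
        features[token[-k:] + end] = True    # e.g. "od$"
    return features


affixes = affix_features("food")
```

For "food" with the default `max_len=3`, this yields the keys `^f`, `^fo`, `^foo`, `d$`, `od$` and `ood$`, which can be merged into the per-token feature dict with `dict.update`.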
