Character n-grams #103

Open
vboton opened this issue Jul 6, 2018 · 2 comments

Comments


vboton commented Jul 6, 2018

I have training data where each token is a word, and I've already extracted a few features such as NER, POS and CHUNK for each token. But I run into a problem when I try to extract character n-gram features. Since these features are computed at the character level, I don't know how to represent their values in the attribute-value format. For example, if the current token is "food", then its character bigram feature will be something like "fo, oo, od". So how should I format that feature? By writing something like bigram[0]=fo, oo, od?


kaushikacharya commented Jul 6, 2018 via email


usptact commented Jul 6, 2018

Take a look at the Stanford NLP NER features. These features are quite useful in morphologically rich languages like Finnish, Turkish, Russian and others.

For the word "food", you can write the prefixes like:

  • ^f
  • ^fo
  • ^foo

And the suffixes:

  • d$#
  • od$#
  • ood$#

I don't remember the exact start and end flags but you get the idea.
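
For what it's worth, here is a minimal Python sketch of the idea, not a definitive implementation. The function names, the `affix=` and `bigram=` attribute names, and the `^` / `$#` boundary markers are all assumptions for illustration (the markers just mirror the examples above). The point is that each prefix, suffix, or n-gram becomes its own attribute rather than one comma-separated value.

```python
def affix_features(token, max_len=3, start="^", end="$#"):
    """Return prefixes and suffixes of `token`, up to `max_len` characters.

    `start` and `end` are hypothetical boundary markers, not a fixed convention.
    """
    n = min(max_len, len(token))
    prefixes = [start + token[:i] for i in range(1, n + 1)]
    suffixes = [token[-i:] + end for i in range(1, n + 1)]
    return prefixes + suffixes


def char_ngrams(token, n=2):
    """Return all contiguous character n-grams of `token`."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def token_attributes(token):
    """Emit one attribute string per affix and per bigram (names are illustrative)."""
    attrs = ["affix=" + a for a in affix_features(token)]
    attrs += ["bigram=" + g for g in char_ngrams(token)]
    return attrs


print(affix_features("food"))
# ['^f', '^fo', '^foo', 'd$#', 'od$#', 'ood$#']
print(char_ngrams("food"))
# ['fo', 'oo', 'od']
```

With this, `token_attributes("food")` yields separate entries such as `bigram=fo`, `bigram=oo` and `bigram=od`, which can then be added to the token's other attributes (NER, POS, CHUNK) in whatever format your trainer expects.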
