GitHub - taranjeet/hindi-tokenizer: This is a package in Python which implements a tokenizer, stemmer for Hindi language

Tokenizer for Hindi

This package implements a tokenizer and a stemmer for the Hindi language.

To import the package:

from HindiTokenizer import Tokenizer

This package implements the following functions:

read_from_file
generate_sentences
tokenize
generate_freq_dict
generate_stem_word
generate_stem_dict
remove_stopwords
clean_text
print_sentences
print_tokens
print_freq_dict
print_stem_dict
len_text
sentence_count
tokens_count
concordance

The Tokenizer can be created in two ways, either directly from a string:

t=Tokenizer("यह वाक्य हिन्दी में है।")

Or from a file:

t=Tokenizer()
t.read_from_file('filename_here')

A brief description of each function

read_from_file

This function takes the name of a file in the current directory and reads its contents into the tokenizer.

t.read_from_file('hindi_file.txt')
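As a rough illustration of what reading a Hindi text file involves, the sketch below decodes the file as UTF-8 so that Devanagari characters survive intact. It is not the package's own code: the helper name `read_hindi_file` is hypothetical, and UTF-8 is an assumed encoding.

```python
import io


def read_hindi_file(filename):
    """Return the contents of a text file, decoded as UTF-8 (assumed encoding)."""
    # io.open accepts an explicit encoding on both Python 2 and Python 3.
    with io.open(filename, encoding="utf-8") as handle:
        return handle.read()
```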

generate_sentences

This will split the tokenizer's text into a list of sentences.

t.generate_sentences()
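Hindi sentences usually end with the danda (`।`) rather than a full stop. A minimal Python 3 sketch of sentence splitting on that basis is shown here; the function `split_sentences` is hypothetical and the package may split differently.

```python
import re


def split_sentences(text):
    """Split text on the danda, '?' and '!' and drop empty pieces."""
    parts = re.split(r"[।?!]", text)
    return [part.strip() for part in parts if part.strip()]


sentences = split_sentences("यह वाक्य हिन्दी में है। यह दूसरा वाक्य है।")
```

Each returned sentence has its terminating mark removed and surrounding whitespace stripped.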

print_sentences

This will print the sentences generated by generate_sentences.

t.generate_sentences()
t.print_sentences()

tokenize

This will generate a list of tokens from the given text.

t.tokenize()
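One simple way to tokenize Hindi text is to replace punctuation with spaces and split on whitespace. The Python 3 sketch below assumes that approach; `simple_tokenize` and its punctuation set are illustrative and are not taken from the package.

```python
import re


def simple_tokenize(text):
    """Return whitespace-separated tokens with common punctuation removed."""
    # Turn the danda and common punctuation marks into spaces before splitting.
    cleaned = re.sub(r"[।,.?!;:()\-]", " ", text)
    return cleaned.split()


tokens = simple_tokenize("यह वाक्य हिन्दी में है।")
```

Splitting on whitespace is used here instead of a `\w+` pattern because the standard `re` module's `\w` does not match Devanagari vowel signs, so `\w+` would break words apart.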

print_tokens

This will print the tokens generated by tokenize.

t.tokenize()
t.print_tokens()

generate_freq_dict

This will generate and return a dictionary that maps each word to its frequency.

freq_dict=t.generate_freq_dict()
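A frequency dictionary maps each token to the number of times it occurs. The sketch below builds one with `collections.Counter`; the function name `word_frequencies` is hypothetical, and the package's own dictionary may be built another way.

```python
from collections import Counter


def word_frequencies(tokens):
    """Count how often each token occurs and return a plain dict."""
    return dict(Counter(tokens))


freq = word_frequencies(["यह", "वाक्य", "है", "यह"])
```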

print_freq_dict

This will print the word-frequency dictionary generated by generate_freq_dict.

freq_dict=t.generate_freq_dict()
t.print_freq_dict(freq_dict)

generate_stem_word

Given a word, this will return its stem.

word=t.generate_stem_word("भारतीय")
print(word)
भारत
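The example above is consistent with suffix stripping, a common lightweight approach to stemming. The Python 3 sketch below assumes that technique; the suffix list, the `min_stem_len` parameter, and the name `strip_suffix` are all hypothetical and do not come from the package.

```python
# Hypothetical, deliberately tiny suffix list for illustration only.
SUFFIXES = ["ियों", "ीय", "ों", "ें", "ी", "े", "ा"]


def strip_suffix(word, suffixes=SUFFIXES, min_stem_len=2):
    """Remove the longest matching suffix, keeping at least min_stem_len characters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    # No suffix matched, so the word is returned unchanged.
    return word


stem = strip_suffix("भारतीय")
```

Trying longer suffixes first prevents a short suffix from matching when a longer one also applies.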

generate_stem_dict

This will return a dictionary of stemmed words.

stem_dict=t.generate_stem_dict()

print_stem_dict

This will print the dictionary of stemmed words generated by generate_stem_dict.

stem_dict=t.generate_stem_dict()
t.print_stem_dict(stem_dict)

remove_stopwords

This will remove all stopwords from the given text.

t.remove_stopwords()
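Stopword removal amounts to filtering tokens against a set of very common words. The sketch below shows the idea with a three-word sample set; `STOPWORDS` and `drop_stopwords` are hypothetical names, and the sample is not the package's stopword list.

```python
# Hypothetical sample; a real stopword list would be much longer.
STOPWORDS = {"यह", "में", "है"}


def drop_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens that are not in the stopword set, preserving order."""
    return [token for token in tokens if token not in stopwords]


content_words = drop_stopwords(["यह", "वाक्य", "हिन्दी", "में", "है"])
```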

clean_text

This will remove all punctuation symbols occurring in the given text.

t.clean_text()
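Punctuation removal for Hindi has to cover the danda (`।`) and double danda (`॥`) as well as ASCII marks. A minimal Python 3 sketch using `str.translate` follows; the character set and the name `strip_punctuation` are assumptions for illustration, not the package's definition.

```python
# Assumed punctuation set: Devanagari dandas plus common ASCII marks.
PUNCTUATION = "।॥,.?!;:()[]{}-\"'"

# Map every punctuation code point to None so that translate() deletes it.
_DELETE_TABLE = {ord(ch): None for ch in PUNCTUATION}


def strip_punctuation(text):
    """Return text with every character in PUNCTUATION removed."""
    return text.translate(_DELETE_TABLE)


cleaned = strip_punctuation("यह वाक्य, हिन्दी में है।")
```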

len_text

Given a text, this will return its length.

print(t.len_text())

sentence_count

Given a text, this will return the number of sentences in it.

print(t.sentence_count())

tokens_count

Given a text, this will return the number of tokens in it.

print(t.tokens_count())

concordance

Given a text and a word, this will return all the sentences in which that word occurs.

sentences=t.concordance("हिन्दी")
t.print_sentences(sentences)
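A concordance lookup can be sketched as a filter over a list of sentences. The version below matches whole whitespace-separated tokens rather than substrings, which is an assumption; `find_sentences` is a hypothetical name and the package may match differently.

```python
def find_sentences(sentences, word):
    """Return the sentences whose whitespace-separated tokens include word."""
    return [sentence for sentence in sentences if word in sentence.split()]


corpus = ["यह वाक्य हिन्दी में है", "यह दूसरा वाक्य है"]
matches = find_sentences(corpus, "हिन्दी")
```

Whole-token matching avoids reporting a sentence merely because the search word appears inside a longer word.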

