TextNorm (Text Normalizer)

A Python package for text normalization. The goal of this project is to provide a standalone text normalizer for text pre-processing. Version 0.3 includes:

Spell correction: Corrects spelling based on Peter Norvig's algorithm.
Text sanitization: Merges repeated punctuation or characters and handles contractions.
Text tagging: Places tags on patterns (URL, USER, ACRONYM, and EMOJIS, among others).
Text tokenization: Tokenizes text by matching known patterns.
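
The sanitization, tagging, and tokenization steps listed above can be illustrated with a few regular expressions. This is only a rough sketch of the general idea, not TextNorm's actual code: the pattern names, the `<URL>` and `<USER>` tag strings, and the functions `sanitize`, `tag`, and `tokenize` are all hypothetical and chosen for illustration.

```python
import re

# Hypothetical patterns for illustration; TextNorm's own rules may differ.
URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")
REPEAT_PUNCT_RE = re.compile(r"([!?.,])\1+")        # "!!!" or "..."
REPEAT_CHAR_RE = re.compile(r"([^\W\d_])\1{2,}")    # a letter repeated 3+ times
TOKEN_RE = re.compile(r"<[A-Z]+>|\w+(?:'\w+)?|[^\w\s]")


def tag(text: str) -> str:
    """Replace URLs and user mentions with placeholder tags."""
    text = URL_RE.sub("<URL>", text)
    return USER_RE.sub("<USER>", text)


def sanitize(text: str) -> str:
    """Collapse repeated punctuation to one mark and long letter runs to two."""
    text = REPEAT_PUNCT_RE.sub(r"\1", text)
    return REPEAT_CHAR_RE.sub(r"\1\1", text)


def tokenize(text: str) -> list[str]:
    """Split into tags, words (keeping contractions whole), and punctuation."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    raw = "soooo good!!! see https://example.com/a @bob"
    # Tag before sanitizing so that URLs such as "www..." are not altered.
    print(tokenize(sanitize(tag(raw))))
```

Tagging runs before sanitization in this sketch so that repeated characters inside a URL are left alone; whether TextNorm orders the steps this way is an assumption.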

The TextNorm spell corrector is based on Peter Norvig's algorithm [1] for word editing. The tagger and tokenizer are based on ekphrasis [2]. Some features from textacy [3] are used in text pre-processing (as copied code snippets only; textacy is not imported).

[1] http://norvig.com/spell-correct.html

[2] https://github.com/cbaziotis/ekphrasis

[3] https://github.com/chartbeat-labs/textacy
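
For readers unfamiliar with the approach in [1], the following is a condensed sketch of Norvig-style spelling correction: generate every string one or two edits away from the input and pick the known word with the highest count. The tiny `WORDS` table is an assumption made for illustration (a real corrector would count words in a large corpus), and the function names follow Norvig's essay rather than TextNorm's own API, which may differ.

```python
from collections import Counter

# Toy word counts, assumed for illustration only.
WORDS = Counter({"the": 10, "spelling": 5, "correct": 4, "text": 3})
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings that are one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right
                for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """The subset of `words` that appears in the word-count table."""
    return {w for w in words if w in WORDS}


def candidates(word):
    """Prefer the word itself, then one-edit matches, then two-edit matches."""
    return (known([word])
            or known(edits1(word))
            or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
            or {word})


def correct(word):
    """Return the most frequent candidate; unknown words are returned as-is."""
    return max(candidates(word), key=lambda w: WORDS[w])


if __name__ == "__main__":
    print(correct("speling"), correct("korrect"), correct("teh"))
```

Words with no known candidate within two edits are returned unchanged, which keeps the corrector from mangling names or out-of-vocabulary terms.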

Installation

git clone https://github.com/kkorovesis/textnorm
cd textnorm
pip install -r requirements.txt
pip install .

Test

from textnorm.components import text_normalizer
text_normalizer.test()
