This is a minor release with bug fixes. Running zemberek-full.jar now lists available applications correctly.
(April 17th 2019)
This is a small release with breaking changes.
The tokenization module can now be configured to ignore text in double quotes.
Zemberek now uses its own Token class instead of Antlr's Token class. This is a breaking change.
The behavior of two methods in WordAnalysisSurfaceFormatter is changed, and two methods are added. Now, if an apostrophe is provided, it is used even in regular words.
Model compression for NER models is added.
Add Builder mechanism to TurkishSentenceExtractor.
TurkishTextToNumberConverter is now thread safe and stateless.
There is a NER postprocessor NEPostProcessor (Provided by Ayça Müge Sevinç).
Use the shade plugin so that, when the single jar library is used, dependent libraries will not conflict with other versions.
RootLexicon.DEFAULT is now inaccessible. Use RootLexicon.getDefault() for accessing the default lexicon.
RootLexicon.builder().addDefaultLexicon() is renamed as RootLexicon.builder().addDefaultLexicon()
Zemberek now uses its own Token class instead of Antlr's Token class.
normalizedLemma() in DictionaryItem now removes -mek -mak suffixes and converts to lowercase.
Some method names in TurkishAlphabet are changed, such as asciiTolerantEquals -> equalsIgnoreDiacritics and asciiTolerantStartsWith -> startsWithIgnoreDiacritics.
A possible small memory leak caused by analysis of unidentified tokens is fixed.
We thank everybody for their contributions. Special thanks to Müge for her feedback on morphology and NER modules.
(October 29th 2018)
This is a major release with breaking changes and new features.
Initial release of grpc remote procedure call module. This is an experimental module that allows fast access to some functions of the project from other programming languages. We provide initial Python access codes for experimentation. Remote API is also subject to change until Version 1.0.0. Refer to the documentation for more information.
There is now a sentence normalization functionality. Before this, the normalization module only provided a simple word-based, distance-1 spell check suggestion mechanism. Now the system offers a best effort text normalization functionality. This may be useful for pre-processing noisy text inputs before applying other functions.
Candidate correct words for noisy words are collected using several heuristics, informal morphotactics, distance matching and lookup tables that are generated with an offline contextual graph random walk algorithm. After that, the best correct sequence is found with Viterbi search on candidate words using n-gram language model scores.
Note that this is our first attempt, expect many errors.
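The last step described above, picking the best sequence among per-word candidates, can be sketched in plain Java. This is only an illustration and not Zemberek's implementation: the class name ViterbiSketch, the "<s>" start marker and the caller-supplied bigram scoring function are assumptions, candidate generation is not shown, and every candidate list is assumed to be non-empty.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Illustrative sketch only; names and the bigram scorer are hypothetical.
final class ViterbiSketch {
  static final String START = "<s>";

  // candidates.get(t) holds the candidate words for position t (must be non-empty).
  // bigramLogProb.applyAsDouble(prev, word) returns a log score; higher is better.
  static List<String> bestSequence(List<List<String>> candidates,
                                   ToDoubleBiFunction<String, String> bigramLogProb) {
    int n = candidates.size();
    if (n == 0) {
      return new ArrayList<>();
    }
    double[][] score = new double[n][];
    int[][] back = new int[n][];
    for (int t = 0; t < n; t++) {
      List<String> current = candidates.get(t);
      score[t] = new double[current.size()];
      back[t] = new int[current.size()];
      for (int j = 0; j < current.size(); j++) {
        if (t == 0) {
          score[t][j] = bigramLogProb.applyAsDouble(START, current.get(j));
          back[t][j] = -1;
          continue;
        }
        // Keep only the best predecessor for each candidate.
        List<String> previous = candidates.get(t - 1);
        double best = Double.NEGATIVE_INFINITY;
        int bestPrev = 0;
        for (int i = 0; i < previous.size(); i++) {
          double s = score[t - 1][i]
              + bigramLogProb.applyAsDouble(previous.get(i), current.get(j));
          if (s > best) {
            best = s;
            bestPrev = i;
          }
        }
        score[t][j] = best;
        back[t][j] = bestPrev;
      }
    }
    // Pick the best final candidate, then follow the back pointers.
    int last = 0;
    for (int j = 1; j < score[n - 1].length; j++) {
      if (score[n - 1][j] > score[n - 1][last]) {
        last = j;
      }
    }
    String[] out = new String[n];
    for (int t = n - 1, j = last; t >= 0; t--) {
      out[t] = candidates.get(t).get(j);
      j = back[t][j];
    }
    return Arrays.asList(out);
  }
}
```

The search keeps one best path per candidate at each position, so a word that looks best in isolation can still lose to one that fits its neighbours better.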
We introduce a mechanism for analyzing Turkish informal words. For example, the analysis of the word okuycam may be:
[okumak:Verb] oku:Verb+yca:Fut_Informal+m:A1sg
Informal morpheme names (like Fut_Informal) have the _Informal suffix.
For enabling informal morphological analysis, TurkishMorphology class should be initialized like this:
TurkishMorphology morphology = TurkishMorphology.builder()
    .setLexicon(RootLexicon.DEFAULT)
    .useInformalAnalysis()
    .build();
morphology.analyzeAndDisambiguate("vurucam kırbacı")
    .bestAnalysis()
    .forEach(System.out::println);
Output:
[vurmak:Verb] vur:Verb+uca:Fut_Informal+m:A1sg
[kırbaç:Noun] kırbac:Noun+A3sg+ı:P3sg
Ambiguity resolution mechanism may not work well if sentence contains informal morphemes.
There is also a simple informal to formal conversion mechanism, InformalAnalysisConverter, that generates the formal surface form of an informal word analysis.
Morphological analysis can be configured to ignore Turkish diacritic marks as used in characters [ç,ğ,i,ö,ü,ş]. For that purpose, the ignoreDiacriticsInAnalysis() method is used. For example:
TurkishMorphology morphology = TurkishMorphology.builder()
    .setLexicon(RootLexicon.DEFAULT)
    .ignoreDiacriticsInAnalysis()
    .build();
morphology.analyze("kisi").forEach(System.out::println);
Output will be:
[kış:Noun,Time] kış:Noun+A3sg+ı:Acc
[kış:Noun,Time] kış:Noun+A3sg+ı:P3sg
[kişi:Noun] kişi:Noun+A3sg
Same output will be generated for inputs "kısı, kışi, kişi, kışı" etc.
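The idea of diacritics-insensitive matching can be illustrated with a small standalone sketch. It is not Zemberek's implementation: the class name DiacriticsSketch and its mapping table are assumptions made for illustration. It folds Turkish-specific letters to ASCII look-alikes before comparing, so all of the spellings listed above compare equal.

```java
// Illustrative sketch only; not the TurkishAlphabet implementation.
final class DiacriticsSketch {
  // Turkish-specific letters (c-cedilla, soft g, dotless i, o-umlaut,
  // s-cedilla, u-umlaut; lower case, then upper case with dotted capital I).
  private static final String TURKISH =
      "\u00e7\u011f\u0131\u00f6\u015f\u00fc\u00c7\u011e\u0130\u00d6\u015e\u00dc";
  // ASCII look-alikes, index-aligned with TURKISH.
  private static final String ASCII = "cgiosuCGIOSU";

  // Replaces every Turkish-specific letter with its ASCII look-alike.
  static String toAscii(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      int idx = TURKISH.indexOf(c);
      sb.append(idx >= 0 ? ASCII.charAt(idx) : c);
    }
    return sb.toString();
  }

  // True when the two strings differ only in diacritics.
  static boolean equalsIgnoreDiacritics(String a, String b) {
    return toAscii(a).equals(toAscii(b));
  }
}
```

Folding both sides to a common form keeps the comparison symmetric, which is why "kisi" can match dictionary items with or without diacritics.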
There are several new command line applications.
- GenerateWordVectors: Generates word vectors using a text corpus.
- StartGrpcServer: Starts Zemberek gRPC Server.
- TrainNerModel: Generates Turkish Named Entity Recognition model.
- EvaluateNer: Evaluates an annotated NER data set.
- FindNamedEntities: Finds named entities from a Turkish text file.
- Most lexicon building methods in TurkishMorphology are now moved to RootLexicon's Builder mechanism. We did not go through a deprecation stage for this because there were too many changes. RootLexicon.DEFAULT now contains the default dictionary items. So if a user wants to create a custom dictionary and add it to the default one, or remove items during instantiation, the RootLexicon builder mechanism needs to be used. Example:
    RootLexicon myLexicon = RootLexicon.builder()
        .setLexicon(RootLexicon.DEFAULT) // start with default
        .addDictionaryLines("foo", "rar") // add two new nouns
        .addTextDictionaries(Paths.get("my-own-dictionary-file")) // add from file
        .build();
- InterpretingAnalyzer is now RuleBasedAnalyzer.
- locations-tr.dict (contains mostly village and district names) is removed from the default binary dictionary because it was causing a lot of confusion. Users can add it manually.
- Instead of using the Ability+Negative suffix couple, there is now a new morpheme called Unable. The new suffix does not cause a Verb to Verb derivation. For example, for the word okuyamadım:
    Before: oku:Verb|ya:Abil→Verb+ma:Neg+dı:Past+m:A1sg
    After : oku:Verb+yama:Unable+dı:Past+m:A1sg
- Deprecated createWithTextDictionaries() method in TurkishMorphology is now removed.
#188 Cannot analyze sendeki, bendeki etc
#184 Cannot analyze abimsin or any Noun+..+P1sg+..+Verb+..+A2sg
#183 Cannot analyze "Tübitak'a"
#178 Analysis fails with herkeste, gibime, gibimize
#175 Lemmatization may give incorrect results in Zero morpheme derivations
#174 Add ®™©℠ symbols as punctuation.
#167 Redundant Adj -> Noun -> Noun conversion
#170 Justlike morpheme should not appear in some cases.
#171 Cannot analyze "kendimle"
#172 Cannot analyze "kendimde, kendimden, kendimce"
#173 "gelebilme" should not have an analysis with "Neg"
We thank everybody for their contributions. Special thanks to Müge for finding numerous morphology bugs and @bojie for fixing Language Model compression problem.
August 2nd 2018
Initial release of classification module. Classification module is based on our Java port of fastText project. Refer to the documentation for more information. There are two usage examples in examples module. There is also a Turkish wiki page on generating and evaluating classifier for Turkish news headline categories.
There is now a single jar containing all dependencies named zemberek-full.jar.
Initial release of apps module. All console applications will be in this module. When zemberek-full.jar is run directly, a catalogue of available applications is listed. They can be run directly by the class name without the package name. For example:
java -jar zemberek-full.jar TrainClassifier
will run the TrainFastTextClassifier application and list available parameters.
OpenNlp NER annotation style is added.
toAscii method is added to TurkishAlphabet.
Common foreign words with diacritic letters (like José) will not generate an exception warning.
Tokenizer will handle ellipsis … character correctly.
createWithTextDictionaries() method in TurkishMorphology and addDefaultDictionaries() in TurkishMorphology.Builder are deprecated.
parseEntries() in SentenceAnalysis is deprecated.
Deprecated analyzeAndResolveAmbiguity(String sentence) in TurkishMorphology is removed.
Dictionary Rules and Morphemes wiki pages are added.
There is also word vector generation functionality but it is not yet documented.
There are more grpc services available but module is not yet ready for release.
This is a major release with improved Morphological disambiguation and initial release of NER module.
- Morphological disambiguation is improved and is now more usable.
- There is now a Named Entity Recognition (NER) module for Turkish. This is our initial implementation and no pretrained NER model is provided yet. Users can train their own models. Please refer to the documentation for details.
- Around 12 thousand lines of code are removed from the project. Previous morphology and disambiguation code is gone.
- Behavior for proper noun inputs with an apostrophe is changed. For example, there will no longer be an analysis with root oba for input Obama'ydı. However, handling proper nouns with suffixes and apostrophes is still under development.
- After some cleaning, person names dictionary is added.
- "…" is now considered a sentence boundary character.
- Morphotactics improvements.
- A gRPC server module is under development. Remote server functions can be accessed with multiple client libraries (Python, C#, Javascript etc.). Initial version will probably be available in the next release.
Breaking Changes:
- analysis parameter in SentenceWordAnalysis class is now bestAnalysis.
- For consistency, analyzeAndResolveAmbiguity method in TurkishMorphology is changed to analyzeAndDisambiguate. Old method still works but is deprecated.
- addDefaultDictionaries() in TurkishMorphology.Builder is deprecated. Use addDefaultBinaryDictionary instead.
- Maven distributions now include sources (via Ali Ok)
This is a bug fix and small changes release.
- A problem in ambiguity resolution is fixed. It was throwing NPE for unknown words. (Issue #157)
- Abbreviations are not allowed to have possessive suffixes.
- Morphotactics improvements.
- We started experimenting with server side modules.
This release is the result of some major refactoring of the Morphology module. There are many breaking changes.
Maven artifact id names now have the zemberek- prefix.
Morphology module is re-written almost from scratch. Turkish morphotactics are now expressed in a simpler and more readable way in the code. New analyzer handles pronouns better and probably it generates more accurate results. But because this is a complete re-write, there might be new bugs and regressions.
Ambiguity resolution mechanism is changed. It now uses the old but popular Averaged Perceptron algorithm. For now, it is trained with the data generated from some corpora using simple rules. Therefore in this version disambiguation may not work so accurately. But it will improve in the upcoming releases quickly. Nevertheless, new module is probably working better than previous releases. Previous language model based algorithm is retired for now, but in future we may use a hybrid approach.
Default analysis representation is changed. Some examples:
kitap ("book", Noun, Singular.)
[kitap:Noun] kitap:Noun+A3sg
kitabımda ("in my book" Noun, Singular, First person possession, Locative)
[kitap:Noun] kitab:Noun+A3sg+ım:P1sg+da:Loc
dedim ("I told" Verb, past tense, first person singular)
[demek:Verb] de:Verb+di:Past+m:A1sg
diyerek ("By telling" Verb, derived to an adverb)
[demek:Verb] di:Verb|yerek:ByDoingSo→Adv
We decided to omit displaying implicit Pnon and Nom suffixes from nouns to make it more readable.
This format is probably not final. We are considering changing some morpheme names and refining the representation.
We now use Caffeine for caching analysis results. There are static and dynamic caches for speeding up the word analysis.
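One plausible reading of the static and dynamic caches is a fixed table of precomputed results backed by a bounded cache for everything else. The sketch below illustrates that layering in plain Java. It does not use Caffeine and does not mirror Zemberek's internals: the class name TwoLevelCacheSketch, the LRU eviction policy and the analyzerCalls counter are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch only; a LinkedHashMap stands in for a real cache library.
final class TwoLevelCacheSketch<V> {
  private final Map<String, V> staticCache;   // filled once, never evicted
  private final Map<String, V> dynamicCache;  // bounded, least recently used out
  private final Function<String, V> analyzer; // the expensive computation
  int analyzerCalls = 0;                      // exposed for demonstration

  TwoLevelCacheSketch(Map<String, V> precomputed, int maxDynamicSize,
                      Function<String, V> analyzer) {
    this.staticCache = new HashMap<>(precomputed);
    this.analyzer = analyzer;
    // accessOrder = true makes iteration order follow recency of access.
    this.dynamicCache = new LinkedHashMap<String, V>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
        return size() > maxDynamicSize;
      }
    };
  }

  synchronized V get(String word) {
    V hit = staticCache.get(word);
    if (hit != null) {
      return hit;
    }
    hit = dynamicCache.get(word);
    if (hit == null) {
      analyzerCalls++;
      hit = analyzer.apply(word);
      dynamicCache.put(word, hit);
    }
    return hit;
  }
}
```

The static level never evicts, so very frequent words always hit, while the dynamic level adapts to the text being processed.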
Word generation mechanism is also re-written.
Dictionary serialization mechanism is written using protocol-buffers.
Initialization of the TurkishMorphology class is now faster.
There are Email, Url, Mention, HashTag, Emoticon, RomanNumeral, RegularAbbreviation, Abbreviation secondary POS information.
We added an examples module. This is like a copy of the turkish-nlp-examples project. Users can see high level usage examples there.
FAQ and front page Readme files are updated.
Z3AbstractDisambiguator, TurkishMorphDisambiguator, Z3MarkovModelDisambiguator, Z3ModelA are removed from morphology module.
TurkishSuffixes, TurkishSentenceAnalyzer, WordAnalyzer, SimpleGenerator, DynamicLexiconGraph, DynamicSuffixProvider, SuffixData, SuffixSurfaceNode, StemNode, StemNodeGenerator, Suffix, SuffixForm, SuffixProvider, SuffixSurfaceNodeGenerator are removed from morphology module.
TurkishMorphology analysis methods now return a WordAnalysis object instead of List<WordAnalysis>. WordAnalysis contains a List<SingleAnalysis> where analysis details can be reached. Methods like getStem() or getLemmas() are moved from WordAnalysis to SingleAnalysis.
Generation is now handled by the WordGenerator class. Generation rules are changed so that if the user does not provide empty surface morphemes, the system searches through them anyway. Check the GenerateWords example class.
System memory footprint is reduced. Analysis performance may be a bit slower but with cache, impact should be small. We will provide measurements later.
We wrote a port of Facebook's FastText library in Java. It can be used for word embeddings and classification tasks. However it is not yet ready for release.
There is an experimental Named Entity Recognition module. But it is not yet ready for release.
TurkishSpellChecker now can load internal resources correctly.
We made a lot of changes in Tokenization module. Some of them are breaking changes.
Package name is changed from zemberek.tokenizer to zemberek.tokenization for consistency.
Now there is a better sentence extraction class called TurkishSentenceExtractor. This class can split documents and paragraphs into sentences. It uses rules and a simple binary averaged perceptron algorithm for finding sentence boundaries. LexerSentenceExtractor is removed.
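For readers unfamiliar with the technique, a binary averaged perceptron over sparse string features can be sketched as follows. This is a generic illustration, not the TurkishSentenceExtractor code: the class name AveragedPerceptronSketch and any feature names passed to it are hypothetical, and the averaging is done naively by summing the weight vector after every example.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; features are arbitrary strings chosen by the caller.
final class AveragedPerceptronSketch {
  private final Map<String, Double> weights = new HashMap<>();
  // Sum of the weight vector over all training steps. Its sign agrees with
  // the sign of the average, so it can be used directly for prediction.
  private final Map<String, Double> totals = new HashMap<>();

  private static double score(Map<String, Double> w, Collection<String> features) {
    double sum = 0;
    for (String f : features) {
      sum += w.getOrDefault(f, 0.0);
    }
    return sum;
  }

  // labels.get(i) is true when example i is a positive case (e.g. a boundary).
  void train(List<List<String>> examples, List<Boolean> labels, int epochs) {
    for (int e = 0; e < epochs; e++) {
      for (int i = 0; i < examples.size(); i++) {
        List<String> features = examples.get(i);
        int expected = labels.get(i) ? 1 : -1;
        int predicted = score(weights, features) > 0 ? 1 : -1;
        if (predicted != expected) {
          // Standard perceptron update: move the active weights toward the label.
          for (String f : features) {
            weights.merge(f, (double) expected, Double::sum);
          }
        }
        // Accumulate the current weights after every example.
        for (Map.Entry<String, Double> en : weights.entrySet()) {
          totals.merge(en.getKey(), en.getValue(), Double::sum);
        }
      }
    }
  }

  boolean predict(Collection<String> features) {
    return score(totals, features) > 0;
  }
}
```

Averaging damps the influence of late, noisy updates, which is the usual reason to prefer it over the final weight vector.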
Tokenization is also improved. More token types are introduced, including Date, Hashtag, Mention, URL, Email and Emoticon. However, these tokens are not included as a POS type in morphological analysis yet. This will be done in upcoming versions. We changed the ZemberekTokenizer name to TurkishTokenizer. There are some low-level breaking changes. Token TurkishWord is now Word and TurkishWordWithApos is WordWithApostrophe.
Please refer to documentation for both sentence extraction and tokenization usage examples and test results.
Zemberek now includes an alpha level spell-checker. This spell checker is intended for well formed documents. It is not for automatic spell correction / normalization. Basically it can check if an individual word is written correctly and give suggestions for a word. Suggestions are ranked against a small unigram language model for now. We will improve the spell checker in later versions.
Please refer to documentation for usage examples.
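The check-and-suggest behaviour described above can be approximated with a short standalone sketch. It is not the TurkishSpellChecker implementation: the class name SpellSketch, the plain word-count map standing in for a unigram language model, and the linear scan over the vocabulary are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; a word-count map stands in for a unigram model.
final class SpellSketch {
  private final Map<String, Integer> unigramCounts;

  SpellSketch(Map<String, Integer> unigramCounts) {
    this.unigramCounts = new HashMap<>(unigramCounts);
  }

  // A word is considered correct when it is in the vocabulary.
  boolean check(String word) {
    return unigramCounts.containsKey(word);
  }

  // Classic Levenshtein distance using two rolling rows.
  static int editDistance(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j;
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
      }
      int[] tmp = prev;
      prev = curr;
      curr = tmp;
    }
    return prev[b.length()];
  }

  // Vocabulary words within distance 1, most frequent first, ties alphabetical.
  List<String> suggest(String word) {
    List<String> result = new ArrayList<>();
    for (String candidate : unigramCounts.keySet()) {
      if (!candidate.equals(word) && editDistance(word, candidate) <= 1) {
        result.add(candidate);
      }
    }
    result.sort((x, y) -> {
      int byCount = Integer.compare(unigramCounts.get(y), unigramCounts.get(x));
      return byCount != 0 ? byCount : x.compareTo(y);
    });
    return result;
  }
}
```

Scanning the whole vocabulary is only practical for small word lists; a real checker would need a faster candidate lookup.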
There are no big changes in morphology in this version. However, several additions and fixes were made in dictionaries, including around 100 more Adverbs and several Question type fixes. Suffix morphotactics for Pronouns are still mostly broken. Please refer to these commits for changes: 1 2 3 4 5
There was yet another memory leak fixed on use of HashMaps in RootLexicon.
We have started a new morphotactics work in experiment module. But it is not yet usable.
This module is removed and content is moved to core module for now. It was not holding much weight and code requires an overhaul. Before 1.0.0 we may move it to normalization module.
- Fixed a memory leak. Previous versions may suffer from this under heavy load. 87
- Re-introduce lang-id module. This provides a simple language identification API.
- Added normalization module. So far it only provides fast Levenshtein distance dictionary matching.
- Added an experiment module. This module will be used for experimental features.
- Dictionary fixes.
- Added city, village and district names.
- System can generate full jar containing all zemberek modules.
- Speed up Antlr based tokenizer. Now it is three times faster again. 89
- TurkishMorphology can be configured for not using cache and UnidentifiedTokenAnalyzer.
- Eliminate static cache from TurkishMorphology 86
- Fix: Some inputs may cause excessive hypothesis generation during analysis 88
- Fix: Proper Nouns ending -nk or -og should not have Voicing attribute automatically. 83
- Fix: "foo \nabc" should be tokenized as "foo \n abc" 69
- There are some name changes: TurkishMorphology.TurkishMorphParserBuilder -> TurkishMorphology.Builder, UnidentifiedTokenAnalyzer parse -> analyze
- Improved morphological analysis coverage by cross checking with Oflazer-Analyzer. For this, a list of more than 7 million words is extracted from a 2 billion word corpus. Then a list of words that can be analyzed only by Oflazer-Analyzer is generated and Zemberek is fixed as much as possible.
- Breaking change: zemberek.morphology.parse package is now zemberek.morphology.analysis 67
- Breaking change: Several classes are renamed.
TurkishWordParserGenerator -> TurkishMorphology
SentenceMorphParse -> SentenceAnalysis
MorphParse -> WordAnalysis
WordParser -> WordAnalyzer 67
- Breaking change: Methods with name "parse" are renamed to "analyze". 67
- Custom Antlr dependency is removed. We now use latest stable Antlr version. We decided this because maintaining a patched version of Antlr was time consuming. We were using such a fork because original version was around 3 times slower. But current speed is good enough. We may remove Antlr dependency altogether in future releases because tokenization should be less strict and it should not do detailed classification. 68
- Add Oflazer compatible secondary POS information for Postp. 65
- Tokenization problem after capital letters after apostrophe. 64
- Cannot parse diyebil-, diyecek-, diyen-. 61
- Proper nouns should not have Voicing attribute automatically. 57
- Can parse reçelsi but not reçelimsi. 54
- Cannot parse maviceydi, yeşilcedir. 53
- Cannot parse "soyadları" 55
- Wrong start and stop indexes for abbreviation words on tokenization. 51
- Fixes in caching mechanism in TurkishMorphology.
- Added a dependency module. It does not perform parsing yet.
A lot of internal code changes. Added static and dynamic cache mechanisms for word parsing.
- Can parse [abdye ABDye] but not [abd'ye] [ABD'ye] #44
- Cannot parse words : [ cevaplandırmak çeşitlendirmek ] #42
- System can parse [ankaraya] but not [ankara'ya] #40
- Add ability to add a new Dictionary Item in run-time. #37
- resource test-lexicon-nouns.txt not found #36 (elifkus)
- A strange tokenization and stem problem #30
- Cannot parse the word: yiyen #25 (volkanagun)
- First unstable public release.
- Removed language identification and spelling modules. They are different applications now.