A Japanese tokenizer and morphological analysis engine written in Kotlin
```kotlin
import com.github.wanasit.kotori.Tokenizer

fun main(args: Array<String>) {
    val tokenizer = Tokenizer.createDefaultTokenizer()
    val words = tokenizer.tokenize("お寿司が食べたい。").map { it.text }
    println(words) // [お, 寿司, が, 食べ, たい, 。]
}
```
Kotori packages are hosted on Bintray/JCenter. You can download and install them via Gradle or Maven.
Gradle:

```groovy
repositories {
    jcenter()
}

dependencies {
    ...
    implementation 'com.github.wanasit.kotori:kotori:0.0.3'
}
```
Maven:

```xml
<dependency>
    <groupId>com.github.wanasit.kotori</groupId>
    <artifactId>kotori</artifactId>
    <version>0.0.3</version>
    <type>pom</type>
</dependency>
```
You can also install Kotori via JitPack.
Kotori has a built-in dictionary based on mecab-ipadic-2.7.0-20070801.
```kotlin
val dictionary = Dictionary.readDefaultFromResource()
val tokenizer = Tokenizer.create(dictionary)
tokenizer.tokenize("お寿司が食べたい。")
```
However, it also works out of the box with any MeCab dictionary. For example:
- IPADIC (2.7.0-20070801)
- UniDic (2.1.2)
- JUMANDIC (7.0-20130310)
```kotlin
val dictionary = MeCabDictionary.readFromDirectory("~/Download/mecab-ipadic-2.7.0-20070801")
val tokenizer = Tokenizer.create(dictionary)
tokenizer.tokenize("お寿司が食べたい。")
```
Note: support for Sudachi dictionaries and plugins is under development.
Kotori is heavily inspired by Kuromoji and Sudachi, but its tokenization is even faster than other JVM-based tokenizers (based on our probably unfair benchmark).
The following are statistics from tokenizing Japanese sentences from Tatoeba (193,898 sentence entries, 3,561,854 characters in total) on a MacBook Pro 2020 (2.4 GHz 8-core Intel Core i9).
| Tokenizer (Dictionary) | Token Count | Time (ns per document) | Time (ns per token) |
|---|---:|---:|---:|
| Kuromoji (IPADIC) | 2,264,560 | 10,095 | 864 |
| Kotori (IPADIC) | 2,264,705 | 8,190 | 701 |
| Sudachi (sudachi-dictionary-20200330-small) | 2,308,873 | 27,352 | 2,296 |
| Kotori (sudachi-dictionary-20200330-small) | 2,157,820 | 13,079 | 1,175 |
Some notes on Kotori's performance:

- Minimal `String.substring()` usage. Since JDK 7, the function copies the string's characters and therefore has O(n) overhead. Some tokenizers designed before that change (e.g. Kuromoji) still use substrings heavily (see the first sketch after this list).
- A customized Trie data structure. `TransitionArrayTrie` can be built quickly, just-in-time, when a tokenizer is created, while still performing well on Japanese text in UTF-16.
- Kotori doesn't rely on any pre-built data structure (e.g. `DoubleArrayTrie`). It reads a dictionary in a list-of-terms format and builds the Trie just-in-time. This is a design decision that keeps Kotori open to multiple dictionary formats in exchange for some boot-up time (see the trie sketch after this list).
- Kotlin (written by the inexperienced library author) is slower than Java, mostly because Kotlin's `Array<T?>` has some overhead compared to Java's native `T[]` (see the boxing sketch after this list).
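To make the `substring()` point concrete, here is a minimal sketch, not Kotori's actual code; `Span`, `scanWithCopies`, and `scanWithSpans` are names invented for this example. It contrasts copy-heavy slicing with carrying plain (start, end) indices into the original text:

```kotlin
// Hypothetical sketch: since JDK 7, String.substring() copies characters,
// so calling it for every candidate token costs O(length) per call.
fun scanWithCopies(text: String): List<String> {
    val results = mutableListOf<String>()
    for (start in text.indices) {
        for (end in (start + 1)..minOf(start + 4, text.length)) {
            results.add(text.substring(start, end)) // copies end - start chars each call
        }
    }
    return results
}

// Index-based alternative: carry (start, end) offsets into the original text
// and defer copying until a token is actually emitted.
data class Span(val start: Int, val end: Int)

fun scanWithSpans(text: String): List<Span> {
    val results = mutableListOf<Span>()
    for (start in text.indices) {
        for (end in (start + 1)..minOf(start + 4, text.length)) {
            results.add(Span(start, end)) // O(1): no characters copied
        }
    }
    return results
}

fun main() {
    println(scanWithCopies("お寿司").take(3)) // [お, お寿, お寿司]
    println(scanWithSpans("お寿司").take(3))  // [Span(start=0, end=1), ...]
}
```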
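A simplified sketch of the just-in-time trie idea follows. This is not the actual `TransitionArrayTrie`; `SimpleTrie`, `TrieNode`, and `prefixesOf` are hypothetical names, and per-node hash maps are used here only for brevity. The point is the construction shape: the trie is built from a plain term list when the tokenizer is created, rather than loaded from a pre-compiled artifact.

```kotlin
// Hypothetical sketch: build a trie directly from a list of dictionary terms
// at tokenizer-creation time, instead of loading a pre-compiled DoubleArrayTrie.
class TrieNode {
    val children: MutableMap<Char, TrieNode> = HashMap()
    var isTerm: Boolean = false
}

class SimpleTrie(terms: List<String>) {
    private val root = TrieNode()

    init {
        // Just-in-time construction: any dictionary readable as a term list
        // works, at the cost of some boot-up time.
        for (term in terms) {
            var node = root
            for (ch in term) {
                node = node.children.getOrPut(ch) { TrieNode() }
            }
            node.isTerm = true
        }
    }

    // Enumerate all dictionary terms that start at `offset` in `text`,
    // which is what a lattice-based tokenizer needs at each position.
    fun prefixesOf(text: String, offset: Int): List<String> {
        val matches = mutableListOf<String>()
        var node = root
        for (i in offset until text.length) {
            node = node.children[text[i]] ?: break
            if (node.isTerm) matches.add(text.substring(offset, i + 1)) // copy only on a match
        }
        return matches
    }
}

fun main() {
    val trie = SimpleTrie(listOf("寿司", "寿", "食べ"))
    println(trie.prefixesOf("お寿司が食べたい。", 1)) // [寿, 寿司]
}
```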
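Finally, a tiny sketch of the `Array<T?>` overhead from the last bullet. This illustrates the general JVM behavior, not Kotori's internals: Kotlin's `Array<Int?>` compiles to `Integer[]` (a boxed object per element), while the specialized `IntArray` compiles to the primitive `int[]`.

```kotlin
fun main() {
    // Array<Int?> maps to Java's Integer[]: each slot holds a reference to a
    // heap-allocated (boxed) Integer, adding indirection and GC pressure.
    val boxed: Array<Int?> = arrayOfNulls(1_000)
    for (i in boxed.indices) boxed[i] = i // boxes an Integer on most assignments

    // IntArray maps to Java's primitive int[]: flat, contiguous storage.
    val primitive = IntArray(1_000)
    for (i in primitive.indices) primitive[i] = i // plain primitive writes

    println("${boxed[500]} ${primitive[500]}")
}
```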
The benchmark can be run as a Gradle task:

```shell
./gradlew benchmark
./gradlew benchmark --args='--tokenizer=kuromoji'
./gradlew benchmark --args='--tokenizer=kotori --dictionary=sudachi-small'
```
Check the source code in the `kotori-benchmark` project for more details.