Kotori

A Japanese tokenizer and morphological analysis engine written in Kotlin

Usage

import com.github.wanasit.kotori.Tokenizer

fun main(args: Array<String>) {
    val tokenizer = Tokenizer.createDefaultTokenizer()
    val words = tokenizer.tokenize("お寿司が食べたい。").map { it.text }

    println(words) // [お, 寿司, が, 食べ, たい, 。]
}

Installation

Kotori packages are hosted on Bintray and JCenter. You can install them via Gradle or Maven.

Gradle:

repositories {
    jcenter()
}

dependencies {
    ...
    implementation 'com.github.wanasit.kotori:kotori:0.0.3'
}

Maven:

<dependency>
  <groupId>com.github.wanasit.kotori</groupId>
  <artifactId>kotori</artifactId>
  <version>VERSION_NUMBER</version>
  <type>pom</type>
</dependency>

You can also install Kotori via JitPack.

Dictionary

Kotori has a built-in dictionary based on mecab-ipadic-2.7.0-20070801.

val dictionary = Dictionary.readDefaultFromResource()
val tokenizer = Tokenizer.create(dictionary)

tokenizer.tokenize("お寿司が食べたい。")

However, it also works out of the box with any MeCab dictionary. For example:

val dictionary = MeCabDictionary.readFromDirectory("~/Download/mecab-ipadic-2.7.0-20070801")
val tokenizer = Tokenizer.create(dictionary)

tokenizer.tokenize("お寿司が食べたい。")

Note: Support for Sudachi dictionaries and plugins is under development.

Performance

Kotori is heavily inspired by Kuromoji and Sudachi, but its tokenization is faster than other JVM-based tokenizers (based on our probably unfair benchmark).

The following statistics come from tokenizing Japanese sentences from Tatoeba (193,898 sentences, 3,561,854 total characters) on a MacBook Pro 2020 (2.4 GHz 8-core Intel Core i9).

| Tokenizer (dictionary) | Token count | Time (ns per document) | Time (ns per token) |
|---|---|---|---|
| Kuromoji (IPADIC) | 2,264,560 | 10,095 | 864 |
| Kotori (IPADIC) | 2,264,705 | 8,190 | 701 |
| Sudachi (sudachi-dictionary-20200330-small) | 2,308,873 | 27,352 | 2,296 |
| Kotori (sudachi-dictionary-20200330-small) | 2,157,820 | 13,079 | 1,175 |
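To make the figures easier to compare, here is a small sketch that recomputes the per-token time and the relative speed-up from the numbers reported above. It assumes that "ns per token" is the total time (ns per document times the number of sentences) divided by the token count; that relation is an assumption of this sketch, not something the benchmark code has been checked against. The names Row, nsPerToken and speedup are hypothetical helpers.

```kotlin
// Figures copied from the benchmark results reported above.
data class Row(val name: String, val tokens: Long, val nsPerDocument: Long)

// Number of Tatoeba sentences used in the benchmark.
const val SENTENCES = 193_898L

// Assumption: ns per token = (ns per document * number of documents) / token count.
fun nsPerToken(row: Row): Double =
    row.nsPerDocument.toDouble() * SENTENCES / row.tokens

// How many times faster `candidate` is than `baseline`, per document.
fun speedup(baseline: Row, candidate: Row): Double =
    baseline.nsPerDocument.toDouble() / candidate.nsPerDocument

fun main() {
    val kuromoji = Row("Kuromoji (IPADIC)", 2_264_560, 10_095)
    val kotoriIpadic = Row("Kotori (IPADIC)", 2_264_705, 8_190)
    val sudachi = Row("Sudachi (small)", 2_308_873, 27_352)
    val kotoriSudachi = Row("Kotori (Sudachi small)", 2_157_820, 13_079)

    for (row in listOf(kuromoji, kotoriIpadic, sudachi, kotoriSudachi)) {
        println("%s: %.0f ns per token".format(row.name, nsPerToken(row)))
    }
    println("Kotori vs Kuromoji: %.2fx".format(speedup(kuromoji, kotoriIpadic)))
    println("Kotori vs Sudachi: %.2fx".format(speedup(sudachi, kotoriSudachi)))
}
```

Under that assumption the recomputed per-token times land within about 1 ns of the reported ones, and the per-document ratios come out to roughly 1.23x against Kuromoji and 2.09x against Sudachi.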

(Speculative) What makes Kotori fast

  • Minimal String.substring() usage. Since JDK 7 (update 6), the function copies the characters and has O(n) overhead. Some tokenizers that were designed before the change (e.g. Kuromoji) still create a lot of substrings.

  • A customized Trie data structure. TransitionArrayTrie can be built quickly, just in time, when creating a tokenizer, and it still performs well on Japanese text in UTF-16.
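The two points above can be illustrated with a minimal sketch: a trie built at startup from a plain list of terms, queried by offset so that no substring is allocated during lookup. This is not Kotori's TransitionArrayTrie. The classes TrieNode and SimpleTrie and the method commonPrefixEnds are hypothetical names, and the map-based node layout is chosen for brevity rather than speed.

```kotlin
// Hypothetical sketch: a simple map-based trie, NOT Kotori's TransitionArrayTrie.
class TrieNode {
    val children = HashMap<Char, TrieNode>()
    var isTerm = false
}

class SimpleTrie(terms: List<String>) {
    private val root = TrieNode()

    init {
        // Build the trie just in time from a plain list of terms.
        for (term in terms) {
            var node = root
            for (ch in term) {
                node = node.children.getOrPut(ch) { TrieNode() }
            }
            node.isTerm = true
        }
    }

    /**
     * Returns the end offsets (exclusive) of every term that starts at [start] in [text].
     * Characters are read in place via text[i], so no substring is created.
     */
    fun commonPrefixEnds(text: String, start: Int): List<Int> {
        val ends = ArrayList<Int>()
        var node = root
        var i = start
        while (i < text.length) {
            node = node.children[text[i]] ?: break
            i++
            if (node.isTerm) ends.add(i)
        }
        return ends
    }
}

fun main() {
    val trie = SimpleTrie(listOf("お", "寿司", "が", "食べ", "食べる", "たい"))
    val text = "お寿司が食べたい。"

    println(trie.commonPrefixEnds(text, 1)) // [3]  ("寿司" spans offsets 1..3)
    println(trie.commonPrefixEnds(text, 4)) // [6]  ("食べ" spans offsets 4..6)
}
```

A tokenizer built this way can pass (start, end) offset pairs through its lattice and materialize a token's text only once, for the tokens that end up on the chosen path.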

(Speculative) What makes Kotori slow

  • Kotori doesn't rely on any pre-built data structure (e.g. DoubleArrayTrie). It reads a dictionary in a list-of-terms format and builds the trie just in time. This design decision keeps Kotori open to multiple dictionary formats at the cost of some startup time.

  • Kotlin (as written by the inexperienced library author) is slower than Java, mostly because Kotlin's Array<T?> has some overhead compared to Java's native T[].

Benchmark

The benchmark can be run as a Gradle task.

./gradlew benchmark
./gradlew benchmark --args='--tokenizer=kuromoji'
./gradlew benchmark --args='--tokenizer=kotori --dictionary=sudachi-small'

Check the source code in the kotori-benchmark project for more details.
