Pre-processing text and tokenization for UTH-BERT

This repository provides source code for pre-processing and tokenizing Japanese text for use with UTH-BERT.

  1. BERT: Bidirectional Encoder Representations from Transformers.
    https://github.com/google-research/bert

  2. UTH-BERT
    https://ai-health.m.u-tokyo.ac.jp/uth-bert

  3. Pre-print (medRxiv)
    A clinical specific BERT developed with huge size of Japanese clinical narrative
    https://doi.org/10.1101/2020.07.07.20148585

1. Quick setup

1-1. Install Mecab (Japanese morphological analyzer) on Ubuntu

sudo apt install mecab
sudo apt install libmecab-dev
sudo apt install mecab-ipadic-utf8

1-2. Install mecab-ipadic-neologd (general dictionary for Mecab)

git clone https://github.com/neologd/mecab-ipadic-neologd.git
cd mecab-ipadic-neologd
sudo bin/install-mecab-ipadic-neologd -n -a

Edit /etc/mecabrc so that dicdir points to the new dictionary:

dicdir = /usr/lib/mecab/dic/mecab-ipadic-neologd
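
To check that the new dictionary is picked up, parse a sample phrase. With mecab-ipadic-neologd active, a compound such as 自然言語処理 ("natural language processing") is typically kept as a single token:

echo "自然言語処理" | mecab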

1-3. Download J-Medic (medical dictionary for Mecab)

You can download MANBYO_201907_Dic-utf8.dic from the URL below.
http://sociocom.jp/~data/2018-manbyo/index.html
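
Once downloaded, the dictionary can also be passed to Mecab on the command line as a user dictionary with the -u option (the path below is a placeholder for wherever you saved the file):

echo "心筋梗塞の既往がある" | mecab -u /path/to/MANBYO_201907_Dic-utf8.dic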

2. Pre-processing text

Japanese text mixes full-width (two-byte) characters, mainly Kanji, Hiragana, and Katakana, with half-width (one-byte) characters, mainly ASCII. As a pre-processing step, we apply Unicode Normalization Form Compatibility Composition (NFKC) to all characters and then convert half-width characters to their full-width forms.
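
The snippet below is a minimal sketch of this idea in plain Python, not the repository's exact code; preprocess_text.py additionally handles some punctuation specially (comparing the texts in section 4, the full-width comma "，" becomes "、"), which this sketch does not:

import unicodedata

# Map half-width ASCII (U+0021-U+007E) to the corresponding full-width
# forms (U+FF01-U+FF5E); the space maps to an ideographic space (U+3000).
TO_FULL_WIDTH = {c: c + 0xFEE0 for c in range(0x21, 0x7F)}
TO_FULL_WIDTH[0x20] = 0x3000

def preprocess(text):
    # NFKC unifies compatibility characters, e.g. half-width Katakana
    # become full-width while full-width digits become ASCII.
    text = unicodedata.normalize("NFKC", text)
    # Convert the remaining half-width ASCII to full-width forms.
    return text.translate(TO_FULL_WIDTH)

print(preprocess("ＡＢＣ 123 ｶﾅ"))  # -> ＡＢＣ　１２３　カナ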

See preprocess_text.py for details

3. Tokenization

In non-segmented languages such as Japanese and Chinese, a tokenizer must accurately identify every word in a sentence before attempting to parse it, which requires finding word boundaries without the aid of explicit delimiters. To handle this, we provide MecabTokenizer and FullTokenizerForMecab, which segment text into words with Mecab and then split each word into sub-word tokens included in the BERT vocabulary.
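
As a usage sketch, the flow looks roughly as follows. The class names come from tokenization_mod.py, but the constructor arguments, the vocabulary path, and the preprocess import are illustrative assumptions; example_main.py contains the working version:

from preprocess_text import preprocess              # assumed helper name
from tokenization_mod import MecabTokenizer, FullTokenizerForMecab

# The argument names and vocabulary path below are assumptions
# for illustration only; see example_main.py for the real ones.
tokenizer = FullTokenizerForMecab(
    sub_tokenizer=MecabTokenizer(),                 # dictionary paths omitted
    vocab_file="./uth_bert_vocab.txt",
)

text = preprocess("2002年夏より重い物の持ち上げが困難になった。")
print(tokenizer.tokenize(text))
# e.g. ['2002年', '夏', 'より', '重い', '物', 'の', '持ち上げ', 'が', '困難', 'に', 'なっ', 'た', '。']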

See tokenization_mod.py for details

4. Example

Original text

2002 年夏より重い物の持ち上げが困難になり,階段の昇りが遅くなるなど四肢の筋力低下が緩徐に進行した.2005 年 2 月頃より鼻声となりろれつが回りにくくなった.また,食事中にむせるようになり,同年 12 月に当院に精査入院した。

(English) Since the summer of 2002, muscle weakness in the extremities progressed slowly: lifting heavy objects became difficult and climbing stairs became slow. From around February 2005, the patient's voice became nasal and his speech became slurred. He also began to choke during meals, and in December of the same year he was admitted to our hospital for a thorough examination.

After pre-processing

2002年夏より重い物の持ち上げが困難になり、階段の昇りが遅くなるなど四肢の筋力低下が緩徐に進行した.2005年2月頃より鼻声となりろれつが回りにくくなった.また、食事中にむせるようになり、同年12月に当院に精査入院した。

After tokenization

['2002年', '夏', 'より', '重い', '物', 'の', '持ち上げ', 'が', '困難', 'に', 'なり', '、', '階段', 'の', '[UNK]', 'が', '遅く', 'なる', 'など', '四肢', 'の', '筋力低下', 'が', '緩徐', 'に', '進行', 'し', 'た', '.', '2005年', '2', '月頃', 'より', '鼻', '##声', 'と', 'なり', 'ろ', '##れ', '##つ', 'が', '回り', '##にく', '##く', 'なっ', 'た', '.', 'また', '、', '食事', '中', 'に', 'むせる', 'よう', 'に', 'なり', '、', '同年', '12月', 'に', '当', '院', 'に', '精査', '入院', 'し', 'た', '。']

See example_main.py for details
