This site provides source code for pre-processing text and tokenization for use of UTH-BERT.
-
BERT: Bidirectional Encoder Representations from Transformers.
https://github.com/google-research/bert -
Pre-print (medRxiv)
A clinical specific BERT developed with huge size of Japanese clinical narrative
https://doi.org/10.1101/2020.07.07.20148585
1-1. Install Mecab (Japanese morphological analyzer) on Ubuntu
sudo apt install mecab
sudo apt install libmecab-dev
sudo apt install mecab-ipadic-utf8
1-2. Install mecab-ipadic-neologd (general dictionary for Mecab)
git clone https://github.com/neologd/mecab-ipadic-neologd.git
cd mecab-ipadic-neologd
sudo bin/install-mecab-ipadic-neologd -n -a
dicdir = /usr/lib/mecab/dic/mecab-ipadic-neologd
1-3. Download J-Medic (medical dictionary for Mecab)
You can download MANBYO_201907_Dic-utf8.dic
from below URL.
http://sociocom.jp/~data/2018-manbyo/index.html
Japanese text includes two-byte full-width characters (mainly Kanji, Hiragana, or Katakana) and one-byte half-width characters (mainly ASCII characters). We applied the Normalization Form Compatibility Composition (NFKC) followed by full-width characterization to all characters as a pre-processing.
See preprocess_text.py for details
In non-segmented languages such as Japanese or Chinese, a tokenizer must accurately identify every word in a sentence before attempt to parse it and to do that requires a method of finding word boundaries without the aid of word delimiters. MecabTokenizer and FullTokenizerForMecab that segment a word unit into several pieces of tokens included in BERT vocabulary.
See tokenization_mod.py for details
2002 年夏より重い物の持ち上げが困難になり,階段の昇りが遅くなるなど四肢の筋力低下が緩徐に進行した.2005 年 2 月頃より鼻声となりろれつが回りにくくなった.また,食事中にむせるようになり,同年 12 月に当院に精査入院した。
(English) Since the summer of 2002, there has been difficulty in lifting heavy objects and muscle weakness in the extremities, such as slow climbing of stairs. In February 2005, the patient's voice became nasal, and he had difficulty in turning his tongue. In December of the same year, he was admitted to our hospital for a thorough examination after becoming lethargic while eating.
2002年夏より重い物の持ち上げが困難になり、階段の昇りが遅くなるなど四肢の筋力低下が緩徐に進行した.2005年2月頃より鼻声となりろれつが回りにくくなった.また、食事中にむせるようになり、同年12月に当院に精査入院した。
['2002年', '夏', 'より', '重い', '物', 'の', '持ち上げ', 'が', '困難', 'に', 'なり', '、', '階段', 'の', '[UNK]', 'が', '遅く', 'なる', 'など', '四肢', 'の', '筋力低下', 'が', '緩徐', 'に', '進行', 'し', 'た', '.', '2005年', '2', '月頃', 'より', '鼻', '##声', 'と', 'なり', 'ろ', '##れ', '##つ', 'が', '回り', '##にく', '##く', 'なっ', 'た', '.', 'また', '、', '食事', '中', 'に', 'むせる', 'よう', 'に', 'なり', '、', '同年', '12月', 'に', '当', '院', 'に', '精査', '入院', 'し', 'た', '。']
See example_main.py for details