Add fine-tuning
tomohideshibata committed Jun 1, 2022
1 parent 3518e15 commit 70d93e4
Showing 4 changed files with 571 additions and 5 deletions.
11 changes: 6 additions & 5 deletions README.md
@@ -195,8 +195,9 @@
Please refer to [preprocess/morphological-analysis/README.md](/preprocess/morphological-analysis/README.md).

The fine-tuning was performed using [the transformers
-library](https://github.com/huggingface/transformers) provided by Hugging Face. The performance along
-with human scores on the JGLUE dev set is shown below.
+library](https://github.com/huggingface/transformers) provided by Hugging Face. See [fine-tuning/README.md](/fine-tuning/README.md) for details.
+
+The performance along with human scores on the JGLUE dev set is shown below.

|Model|MARC-ja|JSTS|JNLI|JSQuAD|JCommonsenseQA|
|-----|-------|-------|-------|-------|-------|
@@ -209,10 +210,10 @@
|NICT BERT base|0.958|0.903/0.867|0.902|**0.897**/**0.947**|0.823|
|Waseda RoBERTa base|0.962|0.901/0.865|0.895|0.864/0.927|0.840|
|Waseda RoBERTa large|0.954|**0.923**/**0.891**|**0.924**|0.884/0.940|**0.901**|
-|XLM RoBERTa base|0.961|0.870/0.825|0.893|-/-|0.687|
-|XLM RoBERTa large|**0.964**|0.915/0.882|0.919|-/-|0.840|
+|XLM RoBERTa base|0.961|0.870/0.825|0.893|-/-†|0.687|
+|XLM RoBERTa large|**0.964**|0.915/0.882|0.919|-/-†|0.840|

+†The XLM RoBERTa base/large models use a unigram language model as the tokenizer and are excluded from the JSQuAD evaluation because the token boundaries often do not match the start/end of the answer span, resulting in poor performance.

## Leaderboard

145 changes: 145 additions & 0 deletions fine-tuning/README.md
@@ -0,0 +1,145 @@
# Fine-tuning

We used [the transformers library](https://github.com/huggingface/transformers) for our fine-tuning experiments and modified its example scripts to fit our datasets and experimental settings. We used v4.9.2, but other versions may work.

```bash
# Install transformers in its own directory, preferably outside the JGLUE repository.
$ cd /somewhere/
$ git clone https://github.com/huggingface/transformers.git -b v4.9.2 transformers-4.9.2
$ cd transformers-4.9.2
$ patch -p1 < /somewhere2/JGLUE/fine-tuning/patch/transformers-4.9.2_jglue-1.0.0.patch
$ pip install .
$ pip install -r examples/pytorch/text-classification/requirements.txt
$ pip install protobuf==3.19.1 tensorboard
# For the cl-tohoku/bert-base-japanese-v2 model
$ pip install fugashi unidic-lite
```
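
To confirm that the patched library is the one being imported, a quick version check (it should print 4.9.2):

```bash
$ python -c "import transformers; print(transformers.__version__)"
4.9.2
```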

## Hyperparameters

The following table lists the hyperparameters used in our experiments. The numbers in curly brackets represent the search space; the best hyperparameters were selected on the dev set.

|Name|Value(s)|
|----|-------|
|learning rate|{5e-5, 3e-5, 2e-5}|
|epoch|{3, 4}|
|warmup ratio|0.1|
|max seq length|512 (MARC-ja), 128 (JSTS, JNLI), 384 (JSQuAD), 64 (JCommonsenseQA)|

## Text classification and sentence pair classification tasks

This section covers `MARC-ja`, `JSTS` and `JNLI`. To fine-tune the `cl-tohoku/bert-base-japanese-v2` model on the `MARC-ja` dataset, run the following command:

```bash
$ export OUTPUT_DIR=/path/to/output_marc
$ python /somewhere/transformers-4.9.2/examples/pytorch/text-classification/run_glue.py \
--model_name_or_path cl-tohoku/bert-base-japanese-v2 \
--metric_name sst2 \
--do_train --do_eval --do_predict \
--max_seq_length 512 \
--per_device_train_batch_size 32 \
--learning_rate 5e-05 \
--num_train_epochs 4 \
--output_dir $OUTPUT_DIR \
--train_file ../datasets/marc_ja-v1.0/train-v1.0.json \
--validation_file ../datasets/marc_ja-v1.0/valid-v1.0.json \
--test_file ../datasets/marc_ja-v1.0/valid-v1.0.json \
--use_fast_tokenizer False \
--evaluation_strategy epoch \
--save_steps 5000 \
--warmup_ratio 0.1
```
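
To search the hyperparameter grid above, the same command can be wrapped in a loop (a minimal sketch; the per-run subdirectories of `$OUTPUT_DIR` are hypothetical, and the best run is picked by its dev score afterwards):

```bash
for lr in 5e-05 3e-05 2e-05; do
  for ep in 3 4; do
    python /somewhere/transformers-4.9.2/examples/pytorch/text-classification/run_glue.py \
      --model_name_or_path cl-tohoku/bert-base-japanese-v2 \
      --metric_name sst2 \
      --do_train --do_eval \
      --max_seq_length 512 \
      --per_device_train_batch_size 32 \
      --learning_rate $lr \
      --num_train_epochs $ep \
      --output_dir $OUTPUT_DIR/lr${lr}_ep${ep} \
      --train_file ../datasets/marc_ja-v1.0/train-v1.0.json \
      --validation_file ../datasets/marc_ja-v1.0/valid-v1.0.json \
      --use_fast_tokenizer False \
      --evaluation_strategy epoch \
      --save_steps 5000 \
      --warmup_ratio 0.1
  done
done
```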

The `--metric_name` option should be set according to the dataset as follows:
- MARC-ja: sst2
- JSTS: stsb
- JNLI: wnli
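
For example, for `JSTS` the relevant options change as follows (a sketch, with unchanged options elided as `..`; the train-file name is assumed to follow the same `train-v1.0.json` pattern as the other datasets):

```bash
..
--metric_name stsb \
--max_seq_length 128 \
--train_file ../datasets/jsts-v1.0/train-v1.0.json \
--validation_file ../datasets/jsts-v1.0/valid-v1.0.json \
--test_file ../datasets/jsts-v1.0/valid-v1.0.json \
..
```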

When you fine-tune the NICT BERT base or Waseda RoBERTa base/large models, specify the word-segmented files with the `--train_file`, `--validation_file` and `--test_file` options (see [preprocess/morphological-analysis/README.md](/preprocess/morphological-analysis/README.md) for how to perform word segmentation). For example, when you use the Waseda RoBERTa base/large models, these options are as follows:

```bash
..
--train_file ../datasets/marc_ja-v1.0_jumanpp/train-v1.0.json \
--validation_file ../datasets/marc_ja-v1.0_jumanpp/valid-v1.0.json \
--test_file ../datasets/marc_ja-v1.0_jumanpp/valid-v1.0.json \
..
```

The system prediction for the validation set is output to `$OUTPUT_DIR/predict_results_${metric_name}.txt`.
To examine it, the prediction together with the evaluation result can be output as follows:

- MARC-ja
```bash
$ python scripts/generate_results.py \
--system-predict-txt $OUTPUT_DIR/predict_results_sst2.txt \
--input-file ../datasets/marc_ja-v1.0/valid-v1.0.json \
--task-type single-sentence \
--additional-column-name-string review_id > $OUTPUT_DIR/predict_eval_results.tsv
```
- JSTS
```bash
$ python scripts/generate_results.py \
--system-predict-txt $OUTPUT_DIR/predict_results_stsb.txt \
--input-file ../datasets/jsts-v1.0/valid-v1.0.json \
--task-type sentence-pair \
--classification-type regression \
--additional-column-name-string sentence_pair_id,yjcaptions_id > $OUTPUT_DIR/predict_eval_results.tsv
```

- JNLI
```bash
$ python scripts/generate_results.py \
--system-predict-txt $OUTPUT_DIR/predict_results_wnli.txt \
--input-file ../datasets/jnli-v1.0/valid-v1.0.json \
--task-type sentence-pair \
--additional-column-name-string sentence_pair_id,yjcaptions_id > $OUTPUT_DIR/predict_eval_results.tsv
```
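
The resulting TSV can then be inspected directly, for example:

```bash
$ column -t -s$'\t' $OUTPUT_DIR/predict_eval_results.tsv | head
```
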
## QA: JSQuAD

```bash
$ export OUTPUT_DIR=/path/to/output_jsquad
$ python /somewhere/transformers-4.9.2/examples/legacy/question-answering/run_squad.py \
--model_type bert \
--model_name_or_path cl-tohoku/bert-base-japanese-v2 \
--do_train --do_eval \
--max_seq_length 384 \
--learning_rate 5e-05 \
--num_train_epochs 3 \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--output_dir $OUTPUT_DIR \
--train_file ../datasets/jsquad-v1.0/train-v1.0.json \
--predict_file ../datasets/jsquad-v1.0/valid-v1.0.json \
--save_steps 5000 \
--warmup_ratio 0.1 \
--evaluate_prefix eval
```
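
If a batch of 32 with 384-token sequences does not fit in GPU memory, the effective batch size can be kept at 32 with gradient accumulation (a sketch, options elided as `..`; `--gradient_accumulation_steps` is a standard option of this legacy script, assuming the patch leaves it in place):

```bash
..
--per_gpu_train_batch_size 8 \
--gradient_accumulation_steps 4 \
..
```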

## QA: JCommonsenseQA

```bash
$ export OUTPUT_DIR=/path/to/output_jcommonsenseqa
$ python /somewhere/transformers-4.9.2/examples/pytorch/multiple-choice/run_swag.py \
--model_name_or_path cl-tohoku/bert-base-japanese-v2 \
--do_train --do_eval --do_predict \
--max_seq_length 64 \
--per_device_train_batch_size 64 \
--learning_rate 5e-05 \
--num_train_epochs 4 \
--output_dir $OUTPUT_DIR \
--train_file ../datasets/jcommonsenseqa-v1.0/train-v1.0.json \
--validation_file ../datasets/jcommonsenseqa-v1.0/valid-v1.0.json \
--test_file ../datasets/jcommonsenseqa-v1.0/valid-v1.0.json \
--use_fast_tokenizer False \
--evaluation_strategy epoch \
--warmup_ratio 0.1
```
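
The dev-set accuracy is written by the Trainer to metric files in the output directory (standard behavior of the v4.9 example scripts; we assume the patch does not change this):

```bash
$ cat $OUTPUT_DIR/eval_results.json   # the dev accuracy appears as "eval_accuracy"
```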

To examine the system prediction, the prediction together with the evaluation result can be output as follows:
```bash
$ python scripts/generate_results.py \
--system-predict-txt $OUTPUT_DIR/predict_results_valid.txt \
--input-file ../datasets/jcommonsenseqa-v1.0/valid-v1.0.json \
--task-type swag \
--additional-column-name-string q_id > $OUTPUT_DIR/predict_eval_results.tsv
```