Is MultiWOZ a Solved Task? An Interactive TOD Evaluation Framework with User Simulator

Accepted by the findings of EMNLP2022.

Data Preprocess

Our experiments mainly focus on MultiWOZ 2.0. You should unzip the data.zip at first:

cd data
unzip data.zip

and then run this script:

python preprocess.py -version 2.0

Supervised Learning

Seq-to-seq training based on t5 models (simplified version of MTTOD).

User Simulator Supvervised Training

python main.py -version 2.0 -agent_type us -run_type train -ururu -backbone t5-small -model_dir simulator_t5_small -epoch 20

Dialogue System Supervised Training

python main.py -version 2.0 -agent_type ds -run_type train -ururu -backbone t5-small -model_dir dialogue_t5_small -epoch 20

Interaction

Conduct interactions between a user simulator and a dialogue system (either SL-based models or RL-based models). Generate dialogue sessions based on user goals from test or dev set. This script can be used for different dialogue models(mttod, ubar, pptod).

python interact.py -simulator_path ./your_simulator_model_dir/checkpoint -dialog_sys_path ./your_dialogue_model_dir/your_checkpoint -model_name mttod -generate_results_path output.json

Reinforcement Learning

Using success rates as rewards

python interact.py -do_rl_training -seed 1998 -simulator_save_path simulator_rl -dialog_save_path dialog_rl

Using success rates and sentence-score as rewards

python interact.py -do_rl_training -seed 1998 -simulator_save_path simulator_rl -dialog_save_path dialog_rl -use_gpt_score_as_reward -gpt_score_coef 0.1

Using success rates and sessions-score as rewards

python interact.py -do_rl_training -seed 1998 -simulator_save_path simulator_rl -dialog_save_path dialog_rl -use_nsp_score_as_reward -nsp_coef 0.1

Score Training

Sentence Score Training

python train_lm.py -model_dir sentence_score_model -task ppl -ppl_level bart_score -backbone gpt2

Sentence Score Evaluation

python train_lm.py -ckpt ./sentence_score_model/checkpoint -run_type predict -task ppl -ppl_level bart_score

Session Score Training

python train_lm.py -model_dir session_score_model -task nsp -backbone bert-base-uncased

Session Score Evaluation

python train_lm.py -ckpt ./session_score_model/checkpoint -run_type predict -task nsp

Evaluation

Traditional Evaluation

Computing Inform, Success and BLEU Score.

python main.py -run_type predict -predict_agent_type ds -ckpt ./dialogue_t5_small/checkpoint -output inference.json -batch_size 16

Computing BLEU Score(For evaluation of simulators).

python main.py -run_type predict -predict_agent_type us -ckpt ./simulator_t5_small/checkpoint -output inference.json -batch_size 16

Interactive Evaluation

First generate dialogue by interactions between a user simulator and a dialogue system. Then computing Inform, Success, Sentence-Score and Session-Score.

python compute_all_scores.py -output_result_path output.json -config_dir dialogue_t5_small -eval_type online -lm_ckpt ./your_sentence_score_model/checkpoint -nsp_ckpt ./your_session_score_model/checkpoint

If the result is generated by traditional evaluation, you should convert its format to online format at first using:

python convert_offline_to_online_format.py -offline_path traditional_results.json -online_path traditional_results_online_format.json

Different Models

UBAR Training

python ubar.py --backbone distilgpt2 --run_type train --model_dir ubar_model

UBAR Evaluation

python ubar.py --ckpt ./ubar_model/checkpoint --run_type predict --pred_data_type test

PPTOD Training

python pptod.py --backbone pptod_small --run_type train --model_dir pptod_model

PPTOD Evaluation

python pptod.py --ckpt ./pptod_model/checkpoint --run_type predict --pred_data_type test

Name		Name	Last commit message	Last commit date
Latest commit History 55 Commits
data		data
galaxy		galaxy
utils		utils
.gitignore		.gitignore
README.md		README.md
add_pointer.py		add_pointer.py
compute_all_scores.py		compute_all_scores.py
config.py		config.py
convert_offline_to_online_format.py		convert_offline_to_online_format.py
evaluator.py		evaluator.py
external_knowledges.py		external_knowledges.py
galaxy_finetune.py		galaxy_finetune.py
generate_galaxy_data.py		generate_galaxy_data.py
interact.py		interact.py
lm_dataset.py		lm_dataset.py
main.py		main.py
pptod.py		pptod.py
preprocess.py		preprocess.py
reader.py		reader.py
runner.py		runner.py
temp.py		temp.py
tokenize_generate_text.py		tokenize_generate_text.py
train_lm.py		train_lm.py
ubar.py		ubar.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Is MultiWOZ a Solved Task? An Interactive TOD Evaluation Framework with User Simulator

Data Preprocess

Supervised Learning

User Simulator Supvervised Training

Dialogue System Supervised Training

Interaction

Reinforcement Learning

Using success rates as rewards

Using success rates and sentence-score as rewards

Using success rates and sessions-score as rewards

Score Training

Sentence Score Training

Sentence Score Evaluation

Session Score Training

Session Score Evaluation

Evaluation

Traditional Evaluation

Interactive Evaluation

Different Models

UBAR Training

UBAR Evaluation

PPTOD Training

PPTOD Evaluation

About

Releases

Packages

Languages

xiami2019/User-Simulator

Folders and files

Latest commit

History

Repository files navigation

Is MultiWOZ a Solved Task? An Interactive TOD Evaluation Framework with User Simulator

Data Preprocess

Supervised Learning

User Simulator Supvervised Training

Dialogue System Supervised Training

Interaction

Reinforcement Learning

Using success rates as rewards

Using success rates and sentence-score as rewards

Using success rates and sessions-score as rewards

Score Training

Sentence Score Training

Sentence Score Evaluation

Session Score Training

Session Score Evaluation

Evaluation

Traditional Evaluation

Interactive Evaluation

Different Models

UBAR Training

UBAR Evaluation

PPTOD Training

PPTOD Evaluation

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages