Accepted to Findings of EMNLP 2022.
Our experiments mainly focus on MultiWOZ 2.0. First, unzip data.zip:
cd data
unzip data.zip
Then run the preprocessing script:
python preprocess.py -version 2.0
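Sequence-to-sequence training based on T5 models (a simplified version of MTTOD). The first command below trains the user simulator (-agent_type us) and the second trains the dialogue system (-agent_type ds):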
python main.py -version 2.0 -agent_type us -run_type train -ururu -backbone t5-small -model_dir simulator_t5_small -epoch 20
python main.py -version 2.0 -agent_type ds -run_type train -ururu -backbone t5-small -model_dir dialogue_t5_small -epoch 20
Run interactions between a user simulator and a dialogue system (either SL-based or RL-based models), generating dialogue sessions from the user goals in the test or dev set. This script works with several dialogue models (mttod, ubar, pptod).
python interact.py -simulator_path ./your_simulator_model_dir/checkpoint -dialog_sys_path ./your_dialogue_model_dir/your_checkpoint -model_name mttod -generate_results_path output.json
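The interaction described above can be pictured as a simple turn-taking loop. The sketch below is not the code in interact.py: simulate_session, the two callables, the max_turns cutoff, and the [end] token are hypothetical names chosen for illustration, and the toy agents stand in for trained models.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces: the user sees its goal and the history,
# the system sees only the history (including the latest user turn).
UserFn = Callable[[Dict, List[Dict]], str]
SystemFn = Callable[[List[Dict]], str]


def simulate_session(goal: Dict, user: UserFn, system: SystemFn,
                     max_turns: int = 20, end_token: str = "[end]") -> List[Dict]:
    """Alternate user and system turns until the user ends or max_turns is hit."""
    history: List[Dict] = []
    for _ in range(max_turns):
        user_utt = user(goal, history)
        if user_utt == end_token:
            break
        sys_utt = system(history + [{"user": user_utt}])
        history.append({"user": user_utt, "resp": sys_utt})
    return history


# Toy stand-ins for trained models (assumptions, for illustration only).
def toy_user(goal: Dict, history: List[Dict]) -> str:
    pending = goal["requests"][len(history):]
    return f"i need the {pending[0]}" if pending else "[end]"


def toy_system(history: List[Dict]) -> str:
    return "here is what you asked for: " + history[-1]["user"]


session = simulate_session({"requests": ["address", "phone"]}, toy_user, toy_system)
```

With the toy agents, the session ends once every item in the goal has been requested. In practice each callable would wrap a trained checkpoint.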
python interact.py -do_rl_training -seed 1998 -simulator_save_path simulator_rl -dialog_save_path dialog_rl
python interact.py -do_rl_training -seed 1998 -simulator_save_path simulator_rl -dialog_save_path dialog_rl -use_gpt_score_as_reward -gpt_score_coef 0.1
python interact.py -do_rl_training -seed 1998 -simulator_save_path simulator_rl -dialog_save_path dialog_rl -use_nsp_score_as_reward -nsp_coef 0.1
python train_lm.py -model_dir sentence_score_model -task ppl -ppl_level bart_score -backbone gpt2
python train_lm.py -ckpt ./sentence_score_model/checkpoint -run_type predict -task ppl -ppl_level bart_score
python train_lm.py -model_dir session_score_model -task nsp -backbone bert-base-uncased
python train_lm.py -ckpt ./session_score_model/checkpoint -run_type predict -task nsp
Compute the Inform, Success, and BLEU scores of the dialogue system:
python main.py -run_type predict -predict_agent_type ds -ckpt ./dialogue_t5_small/checkpoint -output inference.json -batch_size 16
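MultiWOZ work commonly summarizes these three metrics with a combined score, (Inform + Success) * 0.5 + BLEU. The helper below is a sketch of that convention only. combined_score is a hypothetical name, the numbers are made up, and whether this repository's scripts report a combined score is not stated here.

```python
def combined_score(inform: float, success: float, bleu: float) -> float:
    """Combined score convention: 0.5 * (Inform + Success) + BLEU.

    All three inputs are expected on the same 0-100 scale.
    """
    return 0.5 * (inform + success) + bleu


# Made-up numbers, purely to show the arithmetic.
example = combined_score(inform=80.0, success=70.0, bleu=20.0)
```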
Compute the BLEU score (for evaluating user simulators):
python main.py -run_type predict -predict_agent_type us -ckpt ./simulator_t5_small/checkpoint -output inference.json -batch_size 16
First, generate dialogues through interactions between a user simulator and a dialogue system. Then compute the Inform, Success, Sentence-Score, and Session-Score:
python compute_all_scores.py -output_result_path output.json -config_dir dialogue_t5_small -eval_type online -lm_ckpt ./your_sentence_score_model/checkpoint -nsp_ckpt ./your_session_score_model/checkpoint
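The two steps (generate, then score) can be chained from Python. This is a minimal sketch that reuses only the flags shown in this README. The function names are hypothetical, and it assumes the file written via -generate_results_path is the one read via -output_result_path.

```python
import subprocess
from typing import List


def build_online_eval_commands(simulator_ckpt: str, dialog_ckpt: str,
                               config_dir: str, lm_ckpt: str, nsp_ckpt: str,
                               results_path: str = "output.json") -> List[List[str]]:
    """Return the two commands (generate, then score) as argv lists."""
    generate = [
        "python", "interact.py",
        "-simulator_path", simulator_ckpt,
        "-dialog_sys_path", dialog_ckpt,
        "-model_name", "mttod",
        "-generate_results_path", results_path,
    ]
    score = [
        "python", "compute_all_scores.py",
        "-output_result_path", results_path,
        "-config_dir", config_dir,
        "-eval_type", "online",
        "-lm_ckpt", lm_ckpt,
        "-nsp_ckpt", nsp_ckpt,
    ]
    return [generate, score]


def run_online_eval(commands: List[List[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


# Placeholder paths copied from the commands above; replace with real checkpoints.
commands = build_online_eval_commands(
    simulator_ckpt="./your_simulator_model_dir/checkpoint",
    dialog_ckpt="./your_dialogue_model_dir/your_checkpoint",
    config_dir="dialogue_t5_small",
    lm_ckpt="./your_sentence_score_model/checkpoint",
    nsp_ckpt="./your_session_score_model/checkpoint",
)
```

Calling run_online_eval(commands) from the repository root would execute both steps. Passing argv lists rather than a shell string avoids quoting problems with checkpoint paths.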
If the results were generated by traditional (offline) evaluation, first convert them to the online format:
python convert_offline_to_online_format.py -offline_path traditional_results.json -online_path traditional_results_online_format.json
python ubar.py --backbone distilgpt2 --run_type train --model_dir ubar_model
python ubar.py --ckpt ./ubar_model/checkpoint --run_type predict --pred_data_type test
python pptod.py --backbone pptod_small --run_type train --model_dir pptod_model
python pptod.py --ckpt ./pptod_model/checkpoint --run_type predict --pred_data_type test