# G2R: Distilling the Knowledge of Large-Scale Generative Models into Retrieval Models for Efficient Open-domain Conversation
This is the codebase for the EMNLP 2021 Findings paper, "Distilling the Knowledge of Large-scale Generative Models into Retrieval Models for Efficient Open-domain Conversation".
We provide a link for downloading the dataset used in the paper: the augmented dialogue dataset generated by data-level G2R, along with the model scores produced by model-level G2R.
- Extract the dataset zipfile into the `datasets/` folder.
- Activate a virtualenv/conda Python environment and install the requirements.
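Once the zipfile is extracted, it can be helpful to look at one record of the released data before converting it. The sketch below is a minimal, assumption-laden example: the file path matches the commands further down, but the commented expectations about its fields (dialogue context, augmented responses, `ll`/`mi` scores) are guesses to verify against the actual keys, not a documented schema.

```python
import json

# Peek at the first record of the data-level G2R dataset.
# NOTE: the expected contents described below are assumptions; print the keys
# and check how contexts, responses, and model scores are actually stored.
path = "datasets/emnlp_2021_g2r_dataset/bst_data_level_g2r_dialogue.jsonl"

with open(path, encoding="utf-8") as f:
    record = json.loads(next(f))

print(sorted(record.keys()))
# Presumably each record holds a dialogue context, responses generated by the
# large-scale generative model (data-level G2R), and model-level G2R scores
# such as "ll" and "mi", which score_result_to_parlai.py selects via --score-name.
```

The commands below then convert the released dataset into ParlAI-format training files, one per score type: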
# Assuming that the Blended Skill Talk dataset is already built in ParlAI
PARLAI_DIR=/your/parlai/library/dir
mkdir -p ${PARLAI_DIR}/data/bst_distill
ln -s ${PARLAI_DIR}/data/blended_skill_talk/valid.txt ${PARLAI_DIR}/data/bst_distill/valid.txt
ln -s ${PARLAI_DIR}/data/blended_skill_talk/test.txt ${PARLAI_DIR}/data/bst_distill/test.txt
python3 score_result_to_parlai.py \
--input-path ./datasets/emnlp_2021_g2r_dataset/bst_data_level_g2r_dialogue.jsonl \
--output-parlai-path ${PARLAI_DIR}/data/bst_distill/train-g2r-ll.txt \
--score-name ll
python3 score_result_to_parlai.py \
--input-path ./datasets/emnlp_2021_g2r_dataset/bst_data_level_g2r_dialogue.jsonl \
--output-parlai-path ${PARLAI_DIR}/data/bst_distill/train-g2r-mi.txt \
--score-name mi
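As a quick sanity check of the conversion, you can print the first line of one of the generated training files. The snippet assumes the output follows ParlAI's plain-text dialogue format (tab-separated `field:value` pairs such as `text:` and `labels:`); whether `score_result_to_parlai.py` emits exactly those fields should be verified against the real output.

```python
import os

# Print the fields of the first converted training example.
# The tab-separated "field:value" layout is ParlAI's standard text format;
# the exact fields written by score_result_to_parlai.py are an assumption.
parlai_dir = os.environ.get("PARLAI_DIR", "/your/parlai/library/dir")
train_file = os.path.join(parlai_dir, "data", "bst_distill", "train-g2r-ll.txt")

with open(train_file, encoding="utf-8") as f:
    first_line = next(f).rstrip("\n")

for field in first_line.split("\t"):
    print(field[:100])
```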
- Check `scripts/training` for training the data-level G2R and model-level G2R (LL score / MI score) models.
- We assume that `INIT_MODEL_PATH` contains the ParlAI model path used to initialize the model; otherwise, training starts from the model trained on the Pushshift dataset. (A hypothetical sketch of the kind of ParlAI training call such scripts wrap is shown below.)
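The authoritative training configuration is in `scripts/training`; purely as an illustration of what a ParlAI bi-encoder training call of this kind can look like, here is a rough sketch using ParlAI's Python API. All paths and hyperparameter values are placeholders, not the paper's settings, and the task setup is an assumption.

```python
from parlai.scripts.train_model import TrainModel

# Hypothetical sketch only: see scripts/training for the real configuration.
TrainModel.main(
    task="fromfile:parlaiformat",          # assumed way of loading the converted file
    fromfile_datapath="/your/parlai/library/dir/data/bst_distill/train-g2r-ll.txt",
    model="transformer/biencoder",         # retrieval (bi-encoder) model
    init_model="/path/to/init_model",      # i.e., INIT_MODEL_PATH
    model_file="/tmp/g2r_biencoder_sketch",
    candidates="batch",
    eval_candidates="batch",
    batchsize=16,
    num_epochs=1,
)
```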
- Check `scripts/inference` for generating responses with the G2R models and other baselines.
# Inference of G2R-based models
./scripts/generate/generate_g2r.sh trained_biencoder_model_path
Automatic evaluation (Dist-2, Dist-3, and length calculation) of the generated results:
python3 auto_evaluation.py --result-paths /path/for/generation/result
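For reference, Dist-n is commonly computed as the number of distinct n-grams divided by the total number of n-grams across all generated responses. The snippet below is a minimal standalone illustration of that metric and of average length; it is not the implementation used by `auto_evaluation.py`.

```python
from collections import Counter

def distinct_n(responses, n):
    """Ratio of unique n-grams to total n-grams across all responses."""
    ngrams = Counter()
    for resp in responses:
        tokens = resp.split()
        ngrams.update(zip(*[tokens[i:] for i in range(n)]))
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

responses = ["i like hiking in the mountains", "i like cooking at home"]
print(distinct_n(responses, 2))   # Dist-2
print(distinct_n(responses, 3))   # Dist-3
print(sum(len(r.split()) for r in responses) / len(responses))  # average length
```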
If you find our paper or this project helpful for your research, please consider citing our paper in your publications.
@article{kim2021distilling,
  title={Distilling the Knowledge of Large-scale Generative Models into Retrieval Models for Efficient Open-domain Conversation},
  author={Kim, Beomsu and Seo, Seokjun and Han, Seungju and Erdenee, Enkhbayar and Chang, Buru},
  journal={arXiv preprint arXiv:2108.12582},
  year={2021}
}