PsyEval is a comprehensive task suite designed to evaluate the performance of language models in the domain of mental health. This repository contains the necessary resources and documentation for understanding and replicating our experiments. For more details, please refer to our paper "PsyEval: A Suite of Mental Health Related Tasks for Evaluating Large Language Models".
The datasets used in PsyEval include both external and internally constructed data. Below is a detailed description:
- MedQA: A medical question answering dataset, available from the MedQA GitHub repository.
- SMHD: The Self-reported Mental Health Diagnoses dataset, available from the Georgetown University IR Lab.
- D4: A dataset for disease detection, diagnosis, and description, available from the D4 website.
- PsyQA: A dataset for psychological question answering, available from the PsyQA GitHub repository.
Please review the specific usage policies of each dataset as specified in their respective repositories.
- USMLE-mental: To construct the USMLE-mental dataset, we extracted USMLE questions from MedQA and identified a list of keywords specific to the mental health domain. Candidate questions were selected by keyword matching and then manually reviewed to ensure strong relevance to mental health, yielding 727 labeled questions focused on mental health knowledge (see the keyword-filtering sketch after this list).
- Crisis Response QA: This dataset contains questions on crisis response, expanding PsyEval's coverage to mental health crises. It comprises 153 questions curated from authoritative sources such as the "Responding to Mental Health Crisis" manual and the "Navigating a Mental Health Crisis" manual. Key passages were extracted from these materials and turned into question-answer pairs, after which GPT-4 generated three plausible but incorrect (distractor) answers for each question (see the distractor-generation sketch after this list). Medical students reviewed the generated answers to ensure the quality of the dataset.
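The keyword-matching step used for USMLE-mental can be pictured with the minimal sketch below. The keyword list, file name, and JSONL layout are illustrative assumptions rather than the exact ones used to build the dataset, and keyword-matched questions would still need the manual relevance review described above.

```python
# Sketch: filter MedQA-style questions by mental-health keywords.
# The keyword list and file layout are assumptions for illustration only.
import json

# A few example keywords; the actual list used for USMLE-mental is longer.
MENTAL_HEALTH_KEYWORDS = [
    "depression", "anxiety", "schizophrenia", "bipolar",
    "psychosis", "suicide", "antidepressant", "psychiatric",
]


def is_mental_health_question(question: str) -> bool:
    """Return True if the question text contains any mental-health keyword."""
    text = question.lower()
    return any(keyword in text for keyword in MENTAL_HEALTH_KEYWORDS)


def filter_medqa(path: str) -> list[dict]:
    """Load a JSONL file with one question object per line and keep matches.

    Each line is assumed to be a JSON object with a "question" field;
    matched items are only candidates and still require manual review.
    """
    matched = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            if is_mental_health_question(item["question"]):
                matched.append(item)
    return matched


if __name__ == "__main__":
    candidates = filter_medqa("medqa_usmle.jsonl")  # hypothetical file name
    print(f"{len(candidates)} candidate questions for manual review")
```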
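Similarly, the distractor-generation step for Crisis Response QA could look roughly like the sketch below, which uses the OpenAI chat completions API. The prompt wording, model parameters, and post-processing are assumptions, not the exact procedure used for the dataset; in practice the generated options were additionally reviewed by medical students.

```python
# Sketch: ask GPT-4 for three plausible but incorrect answer options
# for a given question-answer pair (illustrative prompt and parsing).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_distractors(question: str, correct_answer: str) -> list[str]:
    """Return three model-generated distractor options for one QA pair."""
    prompt = (
        "You are helping build a multiple-choice dataset on mental health "
        "crisis response.\n"
        f"Question: {question}\n"
        f"Correct answer: {correct_answer}\n"
        "Write three plausible but incorrect answer options, one per line."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    text = response.choices[0].message.content
    # Keep the first three non-empty lines as distractors.
    options = [line.strip("- ").strip() for line in text.splitlines() if line.strip()]
    return options[:3]
```

A moderately high temperature is used here so the three options differ from one another while remaining plausible; the exact setting is an assumption.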
Please review our data usage policy before using any datasets.
We conducted a series of experiments to evaluate various language models on mental health tasks. Detailed instructions for replicating these experiments can be found in the Experiments directory.
For the evaluation results of the twelve models benchmarked in PsyEval, together with detailed analysis, please refer to the paper ("PsyEval: A Suite of Mental Health Related Tasks for Evaluating Large Language Models").
If you use any of the source code or datasets included in this repository in your work, please cite the corresponding papers. The BibTeX entry for PsyEval is listed below:
@article{jin2023psyeval,
title={PsyEval: A Suite of Mental Health Related Tasks for Evaluating Large Language Models},
author={Haoan Jin and Siyuan Chen and Dilawaier Dilixiati and Yewei Jiang and Mengyue Wu and Kenny Q. Zhu},
journal={arXiv preprint arXiv:2311.09189},
year={2023}
}