Merge pull request #4 from instadeepai/feat/update_readme
Feature: Update the README to include GPT-4 results
DriesSmit authored Apr 18, 2024
2 parents 6af602b + 62171c1 commit 1386095
Showing 30 changed files with 26 additions and 25 deletions.
11 changes: 2 additions & 9 deletions .gitignore
```diff
@@ -149,9 +149,8 @@ dmypy.json
 # cache
 3.8/
 
-# OpenAI API key
-api_key.txt
-api_keys_human_read_only.txt
+# API keys
+*api_key.txt
 
 # Experiment results
 outputs/
@@ -181,11 +180,5 @@ llama
 # Model parameters
 params
 
-# docs
-api_keys_human_read_only.txt
 
-# VScode env
-.env
 
-# API
-api_keys_human_read_only.txt
```
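The consolidated pattern relies on gitignore glob matching: `*api_key.txt` ignores any file whose name ends in `api_key.txt`, at any depth in the repository. As a rough sanity check, the matching behaviour can be approximated with Python's `fnmatch` (whose globbing is close to, though not identical to, gitignore's):

```python
# Approximate gitignore-style glob matching with fnmatch (illustrative only;
# gitignore has extra rules, e.g. around '/' and '**', that fnmatch lacks).
from fnmatch import fnmatch

pattern = "*api_key.txt"
for name in ["api_key.txt", "openai_api_key.txt", "api_keys_human_read_only.txt"]:
    print(f"{name}: {fnmatch(name, pattern)}")
# api_key.txt: True   ('*' may also match the empty string)
# openai_api_key.txt: True
# api_keys_human_read_only.txt: False
```

Note that the wildcard does not cover `api_keys_human_read_only.txt`, whose explicit entries the diff above removes.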
40 changes: 24 additions & 16 deletions README.md
```diff
@@ -74,52 +74,60 @@ To visualize the results with Neptune:
 
 ## 📊 Benchmarks
 
-Our benchmarks showcase DebateLLM's performance on MedQA, PubMedQA, and MMLU datasets, focusing on accuracy versus cost, time efficiency, token economy, and agent agreement impact. These visualizations illustrate the balance between accuracy and computational cost, the speed and quality of responses, linguistic efficiency, and the effects of consensus strategies in medical Q&A contexts. Each dataset highlights the varied capabilities of DebateLLM's strategies.
+Our benchmarks showcase DebateLLM's performance on MedQA, PubMedQA, and MMLU datasets, focusing on accuracy versus cost, time efficiency, token economy, and agent agreement impact. For all our experiments, we use GPT-3.5 unless specified otherwise. These visualizations illustrate the balance between accuracy and computational cost, the speed and quality of responses, linguistic efficiency, and the effects of consensus strategies in medical Q&A contexts. Each dataset highlights the varied capabilities of DebateLLM's strategies.
 
-### MedQA Dataset
+#### MedQA Dataset
 
 <div>
-<img src="./imgs/results/average_seconds_per_question_vs_accuracy_medqa_0.png" alt="Average Seconds per Question vs. Accuracy MedQA" width="54%"/>
-<img src="./imgs/results/average_tokens_per_question_vs_accuracy_medqa_0.png" alt="Average Tokens per Question vs. Accuracy MedQA" width="39.6%"/>
+<img src="./imgs/results/MedQA_Average seconds per question_scatter_plots.png" alt="Average Seconds per Question vs. Accuracy MedQA" width="46.5%"/>
+<img src="./imgs/results/MedQA_Average tokens per question_scatter_plots.png" alt="Average Tokens per Question vs. Accuracy MedQA" width="46.5%"/>
 </div>
 <div>
-<img src="./imgs/results/accuracy_vs_cost_0.png" alt="Accuracy vs. Cost MedQA" width="45%"/>
-<img src="./imgs/results/total_acc_box_medqa_0.png" alt="Total Accuracy Box MedQA" width="48.7%"/>
+<img src="./imgs/results/MedQA_Total cost_scatter_plots.png" alt="Accuracy vs. Cost MedQA" width="51.8%"/>
+<img src="./imgs/results/medqa_total_acc_box.png" alt="Total Accuracy Box MedQA" width="41.2%"/>
 </div>
 
 ### PubMedQA Dataset
 
 <div>
-<img src="./imgs/results/average_seconds_per_question_vs_accuracy_pubmedqa_0.png" alt="Average Seconds per Question vs. Accuracy PubMedQA" width="54%"/>
-<img src="./imgs/results/average_tokens_per_question_vs_accuracy_pubmedqa_0.png" alt="Average Tokens per Question vs. Accuracy PubMedQA" width="39.6%"/>
+<img src="./imgs/results/PubMedQA_Average seconds per question_scatter_plots.png" alt="Average Seconds per Question vs. Accuracy PubMedQA" width="46.5%"/>
+<img src="./imgs/results/PubMedQA_Average tokens per question_scatter_plots.png" alt="Average Tokens per Question vs. Accuracy PubMedQA" width="46.5%"/>
 </div>
 <div>
-<img src="./imgs/results/accuracy_vs_cost_corrected_0.png" alt="Accuracy vs. Cost PubMedQA" width="45%"/>
-<img src="./imgs/results/total_acc_box_pubmedqa_0.png" alt="Total Accuracy Box PubMedQA" width="48.7%"/>
+<img src="./imgs/results/PubMedQA_Total cost_scatter_plots.png" alt="Accuracy vs. Cost PubMedQA" width="51.8%"/>
+<img src="./imgs/results/pubmedqa_total_acc_box.png" alt="Total Accuracy Box PubMedQA" width="41.2%"/>
 </div>
 
 ### MMLU Dataset
 
 <div>
-<img src="./imgs/results/average_seconds_per_question_vs_accuracy_mmlu_0.png" alt="Average Seconds per Question vs. Accuracy MMLU" width="54%"/>
-<img src="./imgs/results/average_tokens_per_question_vs_accuracy_mmlu_0.png" alt="Average Tokens per Question vs. Accuracy MMLU" width="39.6%"/>
+<img src="./imgs/results/MMLU_Average seconds per question_scatter_plots.png" alt="Average Seconds per Question vs. Accuracy MMLU" width="46.5%"/>
+<img src="./imgs/results/MMLU_Average tokens per question_scatter_plots.png" alt="Average Tokens per Question vs. Accuracy MMLU" width="46.5%"/>
 </div>
 <div>
-<img src="./imgs/results/total_cost_vs_accuracy_mmlu_0.png" alt="Accuracy vs. Cost MMLU" width="45%"/>
-<img src="./imgs/results/total_acc_box_mmlu_0.png" alt="Total Accuracy Box MMLU" width="48.7%"/>
+<img src="./imgs/results/MMLU_Total cost_scatter_plots.png" alt="Accuracy vs. Cost MMLU" width="51.8%"/>
+<img src="./imgs/results/mmlu_total_acc_box.png" alt="Total Accuracy Box MMLU" width="41.2%"/>
 </div>
 
 ### Agent Agreement Analysis
 
 Modulating the agreement intensity provides a substantial improvement in performance for various models. For Multi-Persona, there is an approximate 15% improvement, and for Society of Minds (SoM), an approximate 5% improvement on the USMLE dataset. The 90% agreement intensity prompts applied to Multi-Persona demonstrate a new high score on the MedQA dataset, highlighted in the MedQA dataset cost plot as a red cross.
 
 <div>
-<img src="./imgs/results/prompt_agreement_vs_accuracy_0.png" alt="Agreement Intensity" width="47%"/>
-<img src="./imgs/results/agreement_vs_accuracy_0.png" alt="Agreement vs Accuracy" width="47%"/>
+<img src="./imgs/results/prompt_agreement_vs_accuracy.png" alt="Agreement Intensity" width="47%"/>
+<img src="./imgs/results/agreement_vs_accuracy.png" alt="Agreement vs Accuracy" width="47%"/>
 </div>
 
 The benchmarks indicate the effectiveness of various strategies and models implemented within DebateLLM. For detailed analysis and discussion, refer to our [paper](https://arxiv.org/abs/2311.17371).
 
+### GPT-4 Results
+
+We also assessed GPT-4's capability on the MedQA dataset, applying the optimal agreement modulation value identified for Multi-Persona with GPT-3.5 on USMLE. The results, shown below, suggest that these hyperparameter settings transfer effectively to more advanced models:
+<div>
+<img src="./imgs/results/medqa_gpt4.png" alt="MedQA GPT-4" width="51.8%"/>
+<img src="./imgs/results/medqa_gpt4_total_acc_box.png" alt="Total Accuracy Box MedQA" width="41.2%"/>
+</div>
 ## Contributing 🤝
 Please read our [contributing docs](docs/CONTRIBUTING.md) for details on how to submit pull requests, our Contributor License Agreement and community guidelines.
```
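The agreement-intensity modulation referenced in the README changes above is a prompt-level knob: how strongly each agent is nudged to agree with its peers in a debate round. Below is a minimal sketch, assuming a 0–10 intensity scale and illustrative hint wording; the function and constant names are hypothetical, not DebateLLM's actual API:

```python
# Hypothetical sketch (not DebateLLM's actual API) of modulating agreement
# intensity in one round of a multi-agent debate. The 0-10 scale and the
# hint wording are illustrative assumptions.

AGREEMENT_HINTS = {
    0: "Challenge the other agents' answers and defend your own reasoning.",
    5: "Weigh the other agents' answers evenly against your own.",
    9: "Unless you have strong evidence otherwise, defer to the other agents' consensus.",
}


def build_debate_prompt(question: str, peer_answers: list[str], intensity: int) -> str:
    """Compose one debate-round prompt at the given agreement intensity (0-10)."""
    # Use the hint whose key is closest to the requested intensity.
    nearest = min(AGREEMENT_HINTS, key=lambda k: abs(k - intensity))
    peers = "\n".join(f"- {answer}" for answer in peer_answers)
    return (
        f"Question: {question}\n"
        f"Other agents answered:\n{peers}\n"
        f"{AGREEMENT_HINTS[nearest]}\n"
        "Reply with a single option letter."
    )


# Example: a highly agreeable round, in the spirit of the '90% agreement'
# prompts that scored highest for Multi-Persona on MedQA.
print(build_debate_prompt(
    question="Which drug is first-line for anaphylaxis?",
    peer_answers=["A) Epinephrine", "A) Epinephrine", "B) Diphenhydramine"],
    intensity=9,
))
```

In this framing, the GPT-4 experiment above simply reuses the intensity value tuned with GPT-3.5 while swapping the underlying model.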
Binary file added imgs/results/MMLU_Total cost_scatter_plots.png
Binary file added imgs/results/MedQA_Total cost_scatter_plots.png
Binary file removed imgs/results/accuracy_vs_cost_0.png
Binary file removed imgs/results/accuracy_vs_cost_corrected_0.png
File renamed without changes
Binary file added imgs/results/medqa_gpt4.png
Binary file added imgs/results/medqa_gpt4_total_acc_box.png
Binary file added imgs/results/medqa_total_acc_box.png
Binary file added imgs/results/mmlu_total_acc_box.png
Binary file added imgs/results/pubmedqa_total_acc_box.png
Binary file removed imgs/results/total_acc_box_medqa_0.png
Binary file removed imgs/results/total_acc_box_mmlu_0.png
Binary file removed imgs/results/total_acc_box_pubmedqa_0.png
Binary file removed imgs/results/total_cost_vs_accuracy_mmlu_0.png
