Open Medical-LLM Leaderboard benchmarks healthcare LLMs across QA datasets
AI Impact Summary
The Open Medical-LLM Leaderboard provides a standardized benchmark for healthcare LLMs, using accuracy as the primary metric across MedQA, MedMCQA, PubMedQA, and the medical subsets of MMLU. Commercial models such as GPT-4-base and Med-PaLM-2 lead on multiple datasets, while open-source options such as Starling-LM-7B and Mistral-7B variants post competitive results on select tasks; per-subject scores also expose domain gaps, for example in anatomy, cardiology, and dermatology, for some models. For engineering teams, participating requires converting model weights to the safetensors format, validating that the model loads with HuggingFace Transformers AutoClasses, and making the repository publicly accessible. Leaderboard results can directly inform model selection, compliance risk posture, and deployment strategy in patient-care scenarios.
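The submission prerequisites above can be sketched in a few lines of Python. This is a minimal illustration, not an official leaderboard script: the model identifiers are placeholders, and the only assumptions are the standard `transformers` APIs (`from_pretrained`, `save_pretrained` with `safe_serialization=True`).

```python
"""Sketch: re-save a checkpoint as safetensors, then confirm it
loads through the Transformers AutoClasses, mirroring the
leaderboard's loading check. Paths/model names are placeholders."""

from pathlib import Path


def convert_to_safetensors(src: str, dst: str) -> Path:
    """Re-save a checkpoint so weights are stored as model.safetensors."""
    # Deferred import: transformers is a heavy optional dependency.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(src)
    tokenizer = AutoTokenizer.from_pretrained(src)
    # safe_serialization=True writes safetensors instead of pytorch_model.bin.
    model.save_pretrained(dst, safe_serialization=True)
    tokenizer.save_pretrained(dst)
    return Path(dst)


def validate_with_autoclasses(path: str) -> bool:
    """Check that config, tokenizer, and model all load via AutoClasses."""
    from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

    AutoConfig.from_pretrained(path)
    AutoTokenizer.from_pretrained(path)
    AutoModelForCausalLM.from_pretrained(path)
    return True
```

A typical flow would be `convert_to_safetensors("my-org/my-medical-llm", "./out")` followed by `validate_with_autoclasses("./out")`; the resulting repository must then be pushed to the Hub and made public before submission.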
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info