Open Medical-LLM Leaderboard benchmarks healthcare LLMs across QA datasets
AI Impact Summary
The Open Medical-LLM Leaderboard provides a standardized benchmark for healthcare LLMs, using accuracy as the primary metric across MedQA, MedMCQA, PubMedQA, and the medical subsets of MMLU. Commercial models such as GPT-4-base and Med-PaLM-2 lead on multiple datasets, while open-source options such as Starling-LM-7B and Mistral-7B variants post competitive results on select tasks; per-subject scores also expose domain gaps, for example in anatomy, cardiology, and dermatology, for some models. For engineering teams, participating requires converting model weights to the safetensors format, validating that the model loads with HuggingFace Transformers AutoClasses, and making the repository publicly accessible. Leaderboard results can directly inform model selection, compliance risk posture, and deployment strategy in patient-care scenarios.
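The submission prerequisites above can be sketched in a few lines of Python. This is a minimal illustration, not an official leaderboard script: the model identifiers are placeholders, and the only assumptions are the standard `transformers` APIs (`from_pretrained`, `save_pretrained` with `safe_serialization=True`).

```python
"""Sketch: re-save a checkpoint as safetensors, then confirm it
loads through the Transformers AutoClasses, mirroring the
leaderboard's loading check. Paths/model names are placeholders."""

from pathlib import Path


def convert_to_safetensors(src: str, dst: str) -> Path:
    """Re-save a checkpoint so weights are stored as model.safetensors."""
    # Deferred import: transformers is a heavy optional dependency.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(src)
    tokenizer = AutoTokenizer.from_pretrained(src)
    # safe_serialization=True writes safetensors instead of pytorch_model.bin.
    model.save_pretrained(dst, safe_serialization=True)
    tokenizer.save_pretrained(dst)
    return Path(dst)


def validate_with_autoclasses(path: str) -> bool:
    """Check that config, tokenizer, and model all load via AutoClasses."""
    from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

    AutoConfig.from_pretrained(path)
    AutoTokenizer.from_pretrained(path)
    AutoModelForCausalLM.from_pretrained(path)
    return True
```

A typical flow would be `convert_to_safetensors("my-org/my-medical-llm", "./out")` followed by `validate_with_autoclasses("./out")`; the resulting repository must then be pushed to the Hub and made public before submission.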
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info