Medium | Capability

Our math word problem solver achieves 55% accuracy vs 60% human baseline

AI Impact Summary

The system reaches 55% accuracy on grade-school math word problems, 5 percentage points below the 60% human baseline on the same dataset (roughly 92% of human performance). This is a meaningful improvement over a fine-tuned GPT-3 model, but notable gaps remain on real-world problem types. The result suggests potential utility as an assistive tutor or automated practice tool, though reliability across diverse word problems, multi-step solutions, and reasoning patterns remains uncertain. To scale to production, plan for broad data collection across problem types, error-mode analysis, and targeted fine-tuning or model ensembling to close the gaps. Also consider guardrails for pedagogy, fairness, and student overreliance.
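The comparison between the two headline figures can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper name `compare_accuracy` is hypothetical, and the only inputs are the 55% and 60% figures reported above.

```python
def compare_accuracy(system_acc, baseline_acc):
    """Return (gap in percentage points, system accuracy as a fraction of the baseline).

    Hypothetical helper for illustration; it is not part of any published tooling.
    """
    gap_points = (baseline_acc - system_acc) * 100
    ratio = system_acc / baseline_acc
    return gap_points, ratio


# Figures taken from the summary above: 55% system accuracy, 60% human baseline.
gap, ratio = compare_accuracy(0.55, 0.60)
print(f"Gap: {gap:.1f} percentage points")                   # Gap: 5.0 percentage points
print(f"System reaches {ratio:.0%} of the human baseline")   # System reaches 92% of the human baseline
```

Reporting both the absolute gap and the ratio avoids the ambiguity of describing the system as a fraction of the baseline without stating the numbers.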

Affected Systems

Our math word problem solver
GPT-3 fine-tuned model

Date: not specified
Change type: capability
Severity: medium


