Foundation Models Can Label Data Like Humans — Elo Ranking Analysis
AI Impact Summary
Foundation models can label data in a way that mimics human preferences, as demonstrated through a blind test comparing models like Vicuna, Koala, and OpenAssistant against GPT-4. The use of a Likert scale and Elo ranking provides a quantifiable measure of model performance based on human judgments of helpfulness and truthfulness, revealing nuanced differences in model capabilities. This research highlights the potential for LLMs to be used as efficient, albeit imperfect, tools for data labeling and evaluation.
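The Elo ranking mentioned above turns pairwise "which response is better?" judgments into a single rating per model. A minimal sketch of the standard Elo update is below; the K-factor of 32 and starting ratings of 1000 are illustrative assumptions, not parameters from the original study.

```python
# Minimal sketch of Elo updates from pairwise model comparisons.
# Assumptions: standard Elo formula, K=32, initial rating 1000
# (the original study's exact parameters are not specified here).

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return updated ratings after one comparison.

    score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses.
    """
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# Example: two models start at 1000; model A wins one comparison,
# so it gains 16 points and model B loses 16.
r_a, r_b = update_elo(1000.0, 1000.0, score_a=1.0)
print(r_a, r_b)  # 1016.0 984.0
```

Aggregating many such updates over randomized human (or GPT-4) preference labels yields the leaderboard-style rankings the summary refers to.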
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info