Cosmopedia: Synthetic Data Generation for Phi-1.5 Pre-training
AI Impact Summary
The Cosmopedia project aims to replicate the training data used for Microsoft's Phi-1.5 model by generating synthetic text with Mixtral-8x7B-Instruct-v0.1. The effort highlights the challenges and techniques involved in creating large-scale synthetic datasets for pre-training LLMs, in particular curating diverse prompts so that the generated corpus does not collapse into near-duplicate content. The resulting dataset of roughly 25 billion tokens represents a significant investment in synthetic data creation and offers a valuable resource for the community.
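To make the prompt-diversity idea concrete, the sketch below shows one way such generation could be driven: prompts are composed from varying topics, target audiences, and writing styles, then sent to Mixtral-8x7B-Instruct-v0.1 through the Hugging Face Inference API. This is a minimal illustration, not the Cosmopedia pipeline itself; the seed topics, audiences, styles, and prompt wording are illustrative assumptions.

```python
# Minimal sketch of prompt-diversified synthetic generation (illustrative, not the
# Cosmopedia codebase). Seed lists and prompt wording are placeholder assumptions.
from itertools import product
from huggingface_hub import InferenceClient

MODEL = "mistralai/Mixtral-8x7B-Instruct-v0.1"
client = InferenceClient(MODEL)  # requires a Hugging Face token with Inference API access

# Diversity comes primarily from varying the prompt, not the decoding: each
# (topic, audience, style) combination requests a different document, which
# reduces near-duplicate outputs across the corpus.
seed_topics = ["photosynthesis", "binary search trees", "supply and demand"]
audiences = ["young children", "high school students", "college researchers"]
styles = ["textbook chapter", "blog post", "how-to guide"]

def build_prompt(topic: str, audience: str, style: str) -> str:
    # Mixtral-Instruct chat format: instruction wrapped in [INST] ... [/INST].
    return (
        f"<s>[INST] Write a {style} about {topic} aimed at {audience}. "
        "Be factual, self-contained, and roughly 800 words. [/INST]"
    )

synthetic_docs = []
for topic, audience, style in product(seed_topics, audiences, styles):
    text = client.text_generation(
        build_prompt(topic, audience, style),
        max_new_tokens=1024,
        temperature=0.8,  # mild sampling noise adds surface-level variety
    )
    synthetic_docs.append(
        {"topic": topic, "audience": audience, "style": style, "text": text}
    )

print(f"Generated {len(synthetic_docs)} synthetic documents")
```

At the scale described in the summary (billions of tokens), a real pipeline would batch these calls across many workers and deduplicate the outputs, but the core mechanism of expanding a small seed set into many distinct prompts is the same.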
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info