Cosmopedia: Synthetic Data Generation for Phi-1.5 Pre-training
AI Impact Summary
The Cosmopedia project aims to replicate the training data used for Microsoft's Phi-1.5 model by generating synthetic text with Mixtral-8x7B-Instruct-v0.1. The effort highlights the challenges and techniques involved in creating large-scale synthetic datasets for pre-training LLMs, in particular curating diverse prompts so that the generated corpus does not collapse into near-duplicate content. The resulting dataset of roughly 25 billion tokens represents a significant investment in synthetic data creation and offers a valuable resource for the community.
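To make the prompt-diversity idea concrete, the sketch below shows one way such generation could be driven: prompts are composed from varying topics, target audiences, and writing styles, then sent to Mixtral-8x7B-Instruct-v0.1 through the Hugging Face Inference API. This is a minimal illustration, not the Cosmopedia pipeline itself; the seed topics, audiences, styles, and prompt wording are illustrative assumptions.

```python
# Minimal sketch of prompt-diversified synthetic generation (illustrative, not the
# Cosmopedia codebase). Seed lists and prompt wording are placeholder assumptions.
from itertools import product
from huggingface_hub import InferenceClient

MODEL = "mistralai/Mixtral-8x7B-Instruct-v0.1"
client = InferenceClient(MODEL)  # requires a Hugging Face token with Inference API access

# Diversity comes primarily from varying the prompt, not the decoding: each
# (topic, audience, style) combination requests a different document, which
# reduces near-duplicate outputs across the corpus.
seed_topics = ["photosynthesis", "binary search trees", "supply and demand"]
audiences = ["young children", "high school students", "college researchers"]
styles = ["textbook chapter", "blog post", "how-to guide"]

def build_prompt(topic: str, audience: str, style: str) -> str:
    # Mixtral-Instruct chat format: instruction wrapped in [INST] ... [/INST].
    return (
        f"<s>[INST] Write a {style} about {topic} aimed at {audience}. "
        "Be factual, self-contained, and roughly 800 words. [/INST]"
    )

synthetic_docs = []
for topic, audience, style in product(seed_topics, audiences, styles):
    text = client.text_generation(
        build_prompt(topic, audience, style),
        max_new_tokens=1024,
        temperature=0.8,  # mild sampling noise adds surface-level variety
    )
    synthetic_docs.append(
        {"topic": topic, "audience": audience, "style": style, "text": text}
    )

print(f"Generated {len(synthetic_docs)} synthetic documents")
```

At the scale described in the summary (billions of tokens), a real pipeline would batch these calls across many workers and deduplicate the outputs, but the core mechanism of expanding a small seed set into many distinct prompts is the same.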
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info