Synthetic Data Generator: No-code dataset creation with LLMs via Hugging Face and Argilla
AI Impact Summary
The Synthetic Data Generator provides a no-code workflow for producing synthetic text-classification and chat datasets, driven by LLM prompts and a distilabel-based pipeline. It orchestrates the full path from dataset description to generated data to model training via AutoTrain, exporting artifacts to Argilla and the Hugging Face Hub, which enables rapid prototyping of ML features. The workflow uses the Hugging Face text-generation API and supports swapping in alternative models (e.g., meta-llama/Llama-3.1-8B-Instruct or OpenAI gpt-4o) and providers, with explicit throughput targets of 50 samples per minute for text classification and 20 per minute for chat. The result is a repeatable, auditable data-generation-and-training loop, but it depends on external services and may incur data-transfer costs and governance considerations.
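The model-swapping idea above can be sketched in a few lines. This is a minimal illustration, not the Synthetic Data Generator's actual internals: the prompt template and helper names (`build_prompt`, `generate_sample`) are assumptions, while `InferenceClient.text_generation` is the real Hugging Face Hub client call. The third-party import is deferred so the prompt-building step runs standalone.

```python
def build_prompt(description: str, label_names: list[str]) -> str:
    """Compose a generation prompt from a dataset description and candidate labels.

    Illustrative template only; the real tool derives prompts from the
    user's dataset description inside its distilabel pipeline.
    """
    labels = ", ".join(label_names)
    return (
        "Generate one short text sample for a classification dataset.\n"
        f"Dataset description: {description}\n"
        f"Choose exactly one label from: {labels}."
    )


def generate_sample(prompt: str, model: str = "meta-llama/Llama-3.1-8B-Instruct") -> str:
    """Call the Hugging Face text-generation API.

    Swapping `model` (e.g., to another hosted model id) changes the backing
    LLM without touching the rest of the loop. Requires `huggingface_hub`
    and network access, so the import is kept local to this function.
    """
    from huggingface_hub import InferenceClient

    client = InferenceClient(model=model)
    return client.text_generation(prompt, max_new_tokens=128)


if __name__ == "__main__":
    prompt = build_prompt("customer support emails", ["billing", "shipping", "other"])
    print(prompt)
```

A batch driver would call `generate_sample` in a loop and export the results; at the stated targets that loop yields roughly 50 classification samples or 20 chat samples per minute.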
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info