StarCoder2 open-code LLM family released (3B/7B/15B) trained on The Stack v2 with 16k context
AI Impact Summary
StarCoder2 is an open code LLM family in 3B, 7B, and 15B sizes, trained on The Stack v2 with Grouped Query Attention, a 16k-token context (with a 4,096-token sliding attention window), and a Fill-in-the-Middle training objective. The models, training data, and code are released publicly, and the 15B variant reportedly performs on par with substantially larger models on several benchmarks. Training was carried out by ServiceNow, Hugging Face, and NVIDIA, using the NVIDIA NeMo framework on NVIDIA infrastructure. The Stack v2 dataset is distributed via the Hugging Face Hub and builds on Software Heritage data, offering repository-context filtering and improved license detection, which has implications for the provenance and licensing of generated code. This release lowers barriers to experimenting with and deploying code-generation capabilities, but it also increases the need for governance around licensing and provenance across code-generation pipelines.
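The Fill-in-the-Middle objective mentioned above lets a model complete code given both the text before and after a gap, by rearranging the document around sentinel tokens. A minimal sketch of building such a prompt, assuming the prefix-suffix-middle sentinel convention used by the StarCoder model family (`<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`); the exact token strings should be verified against the released tokenizer before use:

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a prefix-suffix-middle (PSM) Fill-in-the-Middle prompt.

    The model is expected to generate the missing middle span after the
    final sentinel. Sentinel token names here follow the StarCoder family
    convention and are an assumption; check the model's tokenizer config.
    """
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"


# Ask the model to fill in the body of a function, given its signature
# (prefix) and the code that follows it (suffix).
prompt = build_fim_prompt(
    prefix="def add(a: int, b: int) -> int:\n    ",
    suffix="\n\nprint(add(1, 2))\n",
)
```

The resulting string would then be tokenized and passed to the model for generation; the completion up to the end-of-text token is the inferred middle.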
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info