StarCoder open-code LLMs: StarCoder and StarCoderBase with 8,000+ token context and OpenRAIL license updates
AI Impact Summary
StarCoder and StarCoderBase are open large language models for code (Code LLMs) trained on permissively licensed data from GitHub, including The Stack. Both support a context window of more than 8,000 tokens, which allows longer and more complex code interactions. StarCoder is StarCoderBase further fine-tuned on Python tokens. In the reported evaluations, the models outperform existing open Code LLMs and match or surpass some closed models on benchmarks such as HumanEval and MultiPL-E, which makes them relevant for enterprise code-assist use cases. The release also introduces an improved PII redaction pipeline, a code-attribution tool, and an improved OpenRAIL license intended to simplify integration into commercial products. This opens opportunities for IDEs, CI tooling, and code-generation features, though adopters still need to account for the license's use restrictions and for safety considerations.
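As a rough illustration of what the longer context window means for tooling, the sketch below trims a list of source files so that the prompt plus the requested completion stays inside a fixed token budget. It is a minimal sketch under stated assumptions, not how any particular integration works: the `CONTEXT_WINDOW` value of 8192 is an assumption (the summary only says "8,000+"), and `count_tokens` is a hypothetical whitespace-based stand-in for the model's real tokenizer, which would produce different counts.

```python
# Assumption: the summary only states "8,000+ tokens"; 8192 is used for illustration.
CONTEXT_WINDOW = 8192


def count_tokens(text: str) -> int:
    """Hypothetical stand-in tokenizer: counts whitespace-separated pieces.

    A real integration would use the model's own tokenizer instead.
    """
    return len(text.split())


def fit_prompt(files, max_new_tokens=256, context_window=CONTEXT_WINDOW):
    """Keep the most recent files whose combined token count fits the budget.

    `files` is ordered oldest to newest. Space for `max_new_tokens` is
    reserved so the completion also fits inside the context window.
    Returns the kept files (original order) and the tokens they use.
    """
    budget = context_window - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")

    kept = []
    used = 0
    # Walk from newest to oldest and stop at the first file that does not fit.
    for text in reversed(files):
        n = count_tokens(text)
        if used + n > budget:
            break
        kept.append(text)
        used += n

    kept.reverse()  # restore oldest-to-newest order
    return kept, used
```

Dropping the oldest files first is only one possible policy; an IDE plugin might instead rank files by relevance to the cursor position before applying the same budget check.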
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info