Arcee AI migrates SLM workloads from AWS EKS to Together Dedicated Endpoints
AI Impact Summary
Arcee AI transitioned seven specialized small language models from AWS EKS to Together Dedicated Endpoints, using serverless inference and a private Hugging Face repo to simplify deployment. The change removes the in-house Kubernetes and GPU-ops burden, enabling faster scaling and more cost-efficient runtimes. Early results show a roughly 94% latency reduction (485 ms down to 29 ms) and throughput of 41+ QPS at 32 concurrent requests, improving responsiveness across coding, general text, and tool-calling tasks. The migration also tightens integration with Conductor and Orchestra, enabling more flexible routing and agentic workflows while preserving the option to fall back to third-party models such as GPT-4.1, Claude 3.7 Sonnet, or DeepSeek-R1 when necessary.
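The fallback behavior described above can be sketched as a simple try-in-order router. This is a minimal illustration, not Arcee's actual Conductor/Orchestra implementation: the caller names and stub functions below are hypothetical stand-ins for real endpoint clients, which would issue HTTP requests to the dedicated endpoint or a third-party API.

```python
def call_with_fallback(callers, prompt):
    """Try each (name, callable) pair in order; return the first success.

    `callers` is an ordered list, primary model first. Any exception from a
    caller triggers the next one; if all fail, raise with the collected errors.
    """
    errors = []
    for name, call in callers:
        try:
            return name, call(prompt)
        except Exception as exc:
            errors.append((name, repr(exc)))
    raise RuntimeError(f"all models failed: {errors}")


# Hypothetical stand-ins: a dedicated-endpoint SLM that is unavailable,
# and a third-party model used as the fallback.
def primary_slm(prompt):
    raise TimeoutError("endpoint busy")  # simulate an overloaded endpoint


def third_party_fallback(prompt):
    return f"fallback answer to: {prompt}"


model, answer = call_with_fallback(
    [("arcee-slm", primary_slm), ("gpt-4.1", third_party_fallback)],
    "hello",
)
# model is "gpt-4.1"; the primary's failure was absorbed by the router
```

A production router would typically also distinguish retryable errors (timeouts, 429s) from permanent ones, and log which model served each request for cost tracking.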
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info