Dippy AI scales to 4M+ tokens/min with Together Dedicated Endpoints and HGX H100 GPUs
AI Impact Summary
Dippy AI migrated its inference workload to Together Dedicated Endpoints, deploying custom models on optimized NVIDIA HGX H100 GPU infrastructure to reach 4M+ tokens per minute. The move offloads infrastructure management from Dippy's team and provides predictable, auto-scaling capacity for its highly cyclical traffic. Production KPIs show a median Time to First Token of 0.4 seconds, 99th-percentile throughput of 4.1M tokens/min, and average latency of around 3.44 seconds, enabling reliable user interactions during peak periods. The arrangement also positions Dippy for future features such as voice conversations, while Together's optimizations deliver a lower cost per token.
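The latency KPIs above (Time to First Token, total latency) can be measured against any OpenAI-compatible streaming endpoint. Below is a minimal sketch using the Together Python SDK; the model ID is a placeholder for a dedicated endpoint, and counting one streamed chunk as one token is only a rough approximation.

```python
# Sketch: call a Together endpoint with streaming and measure TTFT.
# The model name is hypothetical; a real deployment would use the
# dedicated endpoint's own ID.
import time

from together import Together  # pip install together

client = Together()  # reads TOGETHER_API_KEY from the environment

start = time.perf_counter()
first_token_at = None
chunk_count = 0

stream = client.chat.completions.create(
    model="your-org/your-dedicated-model",  # placeholder endpoint ID
    messages=[{"role": "user", "content": "Hi! Tell me a short story."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first visible token
        chunk_count += 1  # rough proxy: one streamed chunk ~ one token

elapsed = time.perf_counter() - start
print(f"TTFT: {first_token_at - start:.2f}s")
print(f"Total latency: {elapsed:.2f}s for ~{chunk_count} chunks")
```

Running a script like this repeatedly under production-like load is how percentile figures such as the median TTFT and 99th-percentile throughput quoted above would typically be derived.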
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info