Dippy AI scales to 4M+ tokens/min with Together Dedicated Endpoints and HGX H100 GPUs
AI Impact Summary
Dippy AI migrated its inference workload to Together Dedicated Endpoints, deploying custom models on optimized NVIDIA HGX H100 GPU infrastructure to reach 4M+ tokens per minute. The move offloads infrastructure management from Dippy's team and provides predictable, auto-scaling capacity for its highly cyclical traffic. Production KPIs show a median Time to First Token of 0.4 seconds, 99th-percentile throughput of 4.1M tokens/min, and average latency of around 3.44 seconds, enabling reliable user interactions during peak periods. The arrangement also positions Dippy for future features such as voice conversations, while Together's optimizations deliver a lower cost per token.
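The latency KPIs above (Time to First Token, total latency) can be measured against any OpenAI-compatible streaming endpoint. Below is a minimal sketch using the Together Python SDK; the model ID is a placeholder for a dedicated endpoint, and counting one streamed chunk as one token is only a rough approximation.

```python
# Sketch: call a Together endpoint with streaming and measure TTFT.
# The model name is hypothetical; a real deployment would use the
# dedicated endpoint's own ID.
import time

from together import Together  # pip install together

client = Together()  # reads TOGETHER_API_KEY from the environment

start = time.perf_counter()
first_token_at = None
chunk_count = 0

stream = client.chat.completions.create(
    model="your-org/your-dedicated-model",  # placeholder endpoint ID
    messages=[{"role": "user", "content": "Hi! Tell me a short story."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first visible token
        chunk_count += 1  # rough proxy: one streamed chunk ~ one token

elapsed = time.perf_counter() - start
print(f"TTFT: {first_token_at - start:.2f}s")
print(f"Total latency: {elapsed:.2f}s for ~{chunk_count} chunks")
```

Running a script like this repeatedly under production-like load is how percentile figures such as the median TTFT and 99th-percentile throughput quoted above would typically be derived.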
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info