Together AI Dominates with GPU Clusters and Fine-Tuning Expansion: Week of 8 September
Together AI has emerged as this week's infrastructure powerhouse, launching instant GPU clusters and dramatically expanding its fine-tuning capabilities across 15 new large language models. Meanwhile, Google is forcing a breaking migration for Vertex AI Agent Engine users, highlighting the ongoing evolution of enterprise AI platforms.
The Big Moves
Together AI Reshapes GPU Access with Instant Clusters
Together AI's launch of Instant Clusters represents a fundamental shift in how AI teams access compute resources. The self-service GPU clusters, powered by NVIDIA H100 and B200 GPUs, are ready in minutes rather than the weeks typically required for traditional procurement cycles. This addresses one of the most persistent pain points in AI development: the gap between having a model ready to train and actually getting the infrastructure to do it.
The technical implementation is particularly noteworthy. Together AI supports both Kubernetes and Slurm orchestration, with full NVLink and NVLink Switch connectivity for optimal multi-node performance. For teams running large-scale training jobs or inference workloads that require consistent, high-performance compute, this eliminates the operational overhead of cluster management whilst providing enterprise-grade reliability.
The competitive implications are significant. AWS, Google Cloud, and Azure all offer GPU compute, but typically through more complex provisioning processes. Together AI's API-first approach to cluster deployment aligns with modern cloud expectations and could accelerate adoption among AI-native companies that need to move fast without infrastructure bottlenecks.
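Together AI doesn't publish a single canonical job template, but a multi-node training launch on a Slurm-orchestrated cluster of this kind typically starts from a batch script like the one sketched below. The partition name, node counts, and training command are hypothetical placeholders, not Together AI specifics:

```python
def render_sbatch(job_name: str, nodes: int, gpus_per_node: int,
                  train_cmd: str, partition: str = "gpu") -> str:
    """Render a minimal Slurm batch script for a multi-node GPU training job.

    The partition name, paths, and training command are placeholders --
    adapt them to the cluster's actual configuration.
    """
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --gres=gpu:{gpus_per_node}",
        "#SBATCH --ntasks-per-node=1",
        "",
        "# srun launches one task per node; torchrun spawns per-GPU workers",
        f"srun {train_cmd}",
    ])

script = render_sbatch(
    job_name="llama-finetune",
    nodes=4,
    gpus_per_node=8,
    train_cmd="torchrun --nnodes=4 --nproc-per-node=8 train.py",
)
print(script)
```

The same workload could equally be expressed as a Kubernetes Job manifest; the point is that either orchestrator is available out of the box rather than something the team has to stand up themselves.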
Together AI Expands Fine-Tuning to 15 New Models
The expansion of Together AI's fine-tuning platform to 15 new models from OpenAI, DeepSeek, Qwen, and Meta Llama widens the options for teams training on their own data. The increased context lengths, reaching up to 131K tokens, unlock applications that were previously impractical within shorter context windows.
This capability expansion is particularly relevant for enterprise use cases involving long-document processing, comprehensive code analysis, and multi-turn conversational agents. The integration with Hugging Face Hub streamlines the model experimentation workflow, allowing teams to iterate faster between training, evaluation, and deployment phases.
For organisations currently using OpenAI's fine-tuning services, this presents an alternative that may offer better cost control and customisation options. The ability to fine-tune larger models (100B+ parameters) on Together AI's infrastructure could be particularly attractive for teams that have hit the limitations of smaller fine-tuned models.
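As a rough illustration of why the longer context matters, the sketch below chunks a long document to fit a 131K-token window. It approximates token counts from whitespace-split words (the 0.75 words-per-token ratio is a heuristic assumption; in practice you would count with the target model's own tokeniser):

```python
def chunk_for_context(text: str, max_tokens: int = 131_072,
                      words_per_token: float = 0.75) -> list[str]:
    """Split text into chunks that fit a model's context window.

    Token counts are approximated from whitespace-split words
    (~0.75 words per token is a rough heuristic; use the model's
    real tokeniser in practice).
    """
    max_words = int(max_tokens * words_per_token)
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

doc = "word " * 250_000          # a synthetic long document
chunks = chunk_for_context(doc)
```

At a 131K-token limit, a 250,000-word document needs only three chunks; at the 8K or 32K limits common a year ago, the same document would fragment into dozens, with all the cross-chunk stitching that implies.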
Google Forces Vertex AI Agent Engine Migration
Google's release of a major Vertex AI Agent Engine update introduces code execution, the A2A protocol, streaming, and memory management, but requires migration to Python SDK v1.112.0. This is a breaking change: existing implementations will need planning and code changes before they can upgrade.
The new capabilities are substantial: code execution lets agents run Python dynamically, whilst the A2A protocol and streaming support make agent interactions more responsive. Memory management improvements should reduce latency for complex agent workflows. However, the breaking nature of this migration means teams must allocate development time to update their implementations.
Organisations using Vertex AI Agent Engine should prioritise migration planning now. Failure to update to the new SDK will result in incompatibility with the enhanced platform, effectively cutting off access to both existing functionality and new features. The migration timeline isn't specified, but Google's track record suggests a relatively short deprecation window.
Worth Watching
Amazon OpenSearch Service Supports OpenSearch 3.1
Amazon OpenSearch Service now supports OpenSearch 3.1, introducing GPU acceleration for vector indexes, auto-optimise features, and semantic highlighting. The GPU acceleration is particularly relevant for organisations running large-scale vector similarity searches, as it can significantly improve query performance for AI-powered search applications. Teams using OpenSearch for retrieval-augmented generation (RAG) implementations should evaluate the performance benefits of upgrading.
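The GPU acceleration applies server-side, so existing k-NN queries need no changes to benefit. For reference, a vector similarity query against the k-NN plugin looks like the sketch below; the field name and vector values are hypothetical:

```python
import json

def knn_query(field: str, vector: list[float], k: int = 5) -> dict:
    """Build an OpenSearch k-NN plugin query body.

    The field name is hypothetical; GPU acceleration in OpenSearch 3.1
    happens on the server and requires no change to this query shape.
    """
    return {
        "size": k,
        "query": {"knn": {field: {"vector": vector, "k": k}}},
    }

body = knn_query("doc_embedding", [0.12, -0.4, 0.98], k=3)
# Sent to the search API, e.g. POST /my-index/_search with this JSON body
payload = json.dumps(body)
```

For RAG workloads, the interesting benchmark after upgrading is p99 query latency at high `k` and high index cardinality, which is where GPU-backed indexes should separate from CPU baselines.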
Google Expands Vertex AI Model Garden
Google has added SEA-LION V4, EmbeddingGemma, and DeepSeek-V3.1 to Vertex AI Model Garden. SEA-LION V4's focus on Southeast Asian languages addresses a significant gap in regional language support, whilst EmbeddingGemma provides new options for embedding generation tasks. This expansion reflects Google's strategy of offering diverse model options rather than relying solely on Gemini variants.
Veo 3 Video Generation Goes GA
Google's Veo 3 video generation is now generally available on Vertex AI, enabling creation of 4, 6, or 8-second videos from text prompts or images. Whilst the video lengths are still limited, this represents Google's entry into the competitive video generation market currently dominated by companies like Runway and Pika Labs. Early adoption could provide competitive advantages for content creation workflows.
TwelveLabs Enhances Marengo Embed
TwelveLabs has released Marengo Embed 2.7 with InvokeModel API support, expanding integration options for video, text, audio, and image embedding generation. This multimodal approach is increasingly important as AI applications become more sophisticated and require understanding across different content types.
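The significance of InvokeModel support is that Marengo becomes callable through Bedrock's standard runtime pattern rather than a bespoke integration. The sketch below shows only that generic pattern; the payload fields and model ID are assumptions, not the documented Marengo Embed 2.7 schema:

```python
import json

# Hypothetical payload shape -- consult the TwelveLabs/Bedrock documentation
# for the real Marengo Embed 2.7 request schema.
request = {
    "inputType": "text",
    "inputText": "a short clip of a sunset over the ocean",
}
body = json.dumps(request)

# With boto3 (not executed here), the call would follow the standard
# Bedrock InvokeModel pattern:
#   client = boto3.client("bedrock-runtime")
#   resp = client.invoke_model(modelId="<marengo-embed-model-id>", body=body)
#   result = json.loads(resp["body"].read())
```

Because the same pattern covers text, image, audio, and video inputs, teams already on Bedrock can add multimodal embeddings without new infrastructure.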
Quick Hits
- Meta Llama introduces OpenAI Prompts API compatibility and RAG Tool updates with Files API integration
- Replicate adds invoicing for prepaid credits, torch compile caching for faster model startup, and web URLs for predictions
- Pinecone launches multimodal assistant support for PDF image processing and releases Terraform provider v2.0.0
- AWS Bedrock expands Guardrails to Asia Pacific (Jakarta) region for improved latency
- Together AI releases Qwen3-Next-80B models optimised for complex reasoning tasks
The Week Ahead
Watch for migration announcements from Google regarding Vertex AI Agent Engine sunset dates. The breaking SDK change suggests Google is moving quickly to deprecate older versions, so expect timeline clarity soon.
Together AI's infrastructure expansion may prompt competitive responses from established cloud providers. AWS re:Invent preparations could include announcements about simplified GPU provisioning to match Together AI's instant cluster approach.
OpenSearch 3.1's GPU acceleration features warrant performance testing for teams running vector-heavy workloads. Early benchmarks could influence broader adoption decisions across the search and RAG ecosystem.
The continued expansion of model catalogues across providers (Google's Vertex AI Model Garden, AWS Bedrock, Microsoft's Azure AI offerings) suggests we're moving towards a more diverse, multi-provider AI landscape rather than consolidation around a few dominant models.