Google Forces Vertex AI Migration as Gemini 2.5 Preview Endpoints Face October Sunset
Google dropped a deprecation bombshell this week: developers using Gemini 2.5 preview endpoints must migrate by October 2026. The change affects both the Flash and Pro model variants, marking the end of the preview phase for these widely adopted models. With seventeen signals this week, ranging from critical deprecations to new audio models, the AI landscape continues its relentless pace of change.
What's changing with Vertex AI endpoints?
Google's deprecation of the Gemini 2.5 preview endpoints is the most significant breaking change this week. Effective 15 July 2025, the preview versions of the Gemini 2.5 Flash and Pro models are deprecated, with final shutdown scheduled for 16 October 2026. That gives developers a fifteen-month migration window, which sounds generous until you consider the potential scope of affected applications.
The timing suggests Google is confident in the stability of its GA models and wants to consolidate its offering. Preview endpoints typically provide experimental features or early access to capabilities, but maintaining parallel infrastructure becomes costly as models mature. For developers, this means auditing current implementations to identify preview endpoint usage and planning migration paths to supported GA versions.
The migration isn't just a simple endpoint swap. Developers need to verify that GA models provide equivalent functionality to their current preview implementations, test performance characteristics, and potentially adjust prompt engineering. Applications with hard-coded preview endpoint references will fail completely after the sunset date, making this a critical action item for any team using Vertex AI.
How Replicate's billing changes affect new users
Replicate's introduction of prepaid credit billing for new accounts signals a broader industry shift towards consumption-based pricing models. Starting 16 July 2025, new users must purchase credits upfront rather than receiving monthly bills, fundamentally changing how developers budget for AI inference costs.
This change reflects the challenges AI providers face with unpredictable usage patterns and payment collection. Prepaid models reduce financial risk for providers whilst giving users clearer spending controls. However, it also creates a barrier to entry for developers who prefer pay-as-you-go models or those conducting initial experiments with uncertain usage patterns.
Existing monthly billing users aren't immediately affected, but Replicate's stated intention to migrate most accounts suggests this is a preview of broader changes. The company mentions users can contact support for early migration, indicating they're testing the waters before forcing existing customers to switch. Smart developers should monitor their usage patterns now to understand how prepaid billing might affect their budgeting and cash flow.
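One way to get ahead of that is a back-of-envelope runway estimate: divide a prepaid balance by trailing average daily spend. The helper below is a minimal sketch with hypothetical figures; it does not use any Replicate API, just the arithmetic you would apply to your own exported usage data.

```python
def estimate_runway_days(credit_balance: float, recent_daily_spend: list[float]) -> float:
    """Rough days of runway: prepaid balance divided by trailing average daily spend.

    `credit_balance` is in the billing currency; `recent_daily_spend` is a list
    of per-day totals (hypothetical numbers -- pull real ones from your usage export).
    """
    if not recent_daily_spend:
        raise ValueError("need at least one day of spend history")
    average = sum(recent_daily_spend) / len(recent_daily_spend)
    if average == 0:
        return float("inf")  # no spend: balance never depletes
    return credit_balance / average
```

For example, a $50 balance against an average of $2.50/day gives twenty days of runway, which tells you how far in advance top-ups need to land.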
Worth Watching
Google expands Gemma 3 fine-tuning capabilities: Vertex AI now supports fine-tuning across all Gemma 3 model variants (1B, 4B, 12B, and 27B parameters) through Axolotl Docker integration. This significantly expands customisation options for developers already committed to the Gemma ecosystem, potentially reducing inference costs through model optimisation for specific use cases.
OpenSearch Service integrates with S3 Vectors: Amazon's new integration promises up to 90% cost reduction for semantic search applications whilst maintaining sub-second query times. The capability leverages S3's cost-effective storage for vector data whilst preserving OpenSearch's query performance, making it particularly attractive for large-scale similarity search applications.
Groq adds JSON schema support: Structured outputs with guaranteed formatting eliminate the parsing errors that plague many AI applications. This seemingly simple addition addresses a real pain point for developers building production systems that require reliable data extraction from model responses.
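In practice this means constraining completions with a schema in the request body. The sketch below builds such a payload; the field names mirror OpenAI's structured-outputs format, which Groq's chat completions API is compatible with, and the model name and schema are illustrative assumptions rather than a definitive integration.

```python
# A JSON Schema describing the structure the model's output must satisfy
# (an illustrative invoice-extraction example, not from any vendor docs).
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["vendor", "total", "currency"],
    "additionalProperties": False,
}

def build_structured_request(user_text: str) -> dict:
    """Build an OpenAI-compatible chat completions payload with a schema constraint."""
    return {
        "model": "moonshotai/kimi-k2-instruct",  # illustrative model name
        "messages": [
            {"role": "system", "content": "Extract invoice fields as JSON."},
            {"role": "user", "content": user_text},
        ],
        # "response_format" with type "json_schema" asks the provider to
        # guarantee output conforming to the supplied schema.
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "invoice", "schema": INVOICE_SCHEMA},
        },
    }
```

Because the provider enforces the schema, the response body can be parsed with a plain `json.loads` instead of defensive regex and retry logic.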
Qdrant v1.15.0 introduces multilingual support: The vector database's latest release includes advanced quantisation methods (2-bit and 1.5-bit binary) alongside multilingual capabilities. Whilst not requiring immediate action, users should note deprecated parameters will be removed in the next release.
AWS Bedrock enables custom model deployment: On-demand inference for custom models eliminates the need for provisioned capacity, offering more flexible cost management for organisations with variable workloads. This particularly benefits teams with custom models that don't require consistent high-throughput inference.
Quick Hits
- Veo 3 gains 1080p upscaling: Google's video generation model now supports higher resolution output through a new resolution parameter
- Bedrock imports SageMaker Nova models: Amazon bridges its AI services by allowing SageMaker-trained models to deploy on Bedrock infrastructure
- Medical AI models arrive in Vertex: Three new multimodal healthcare models (MedGemma 27B IT, MedSigLIP, T5Gemma) expand Google's medical AI capabilities
- Mistral enters audio processing: First audio models (Voxtral Small, Mini, and Mini Transcribe) mark Mistral's expansion beyond text
- Kimi K2 Instruct launches: Both Groq and Together AI deploy this 1T parameter MoE model optimised for agentic AI and coding tasks
- Groq SDK improvements: Updated Python and TypeScript SDKs enhance OpenAI compatibility and fix message formatting issues
The Week Ahead
The October 2026 deadline for Gemini 2.5 preview endpoint migration might seem distant, but organisations should begin planning now. Large-scale migrations require thorough testing, stakeholder coordination, and potential budget adjustments for any performance differences between preview and GA models.
Replicate's prepaid billing rollout bears watching as an indicator of industry pricing trends. If successful, expect other providers to adopt similar models, particularly those serving developers with highly variable usage patterns.
The proliferation of multimodal models across providers (Google's medical AI, Mistral's audio capabilities) suggests we're entering a new phase where text-only models become the exception rather than the rule. Teams should evaluate whether their current single-modal approaches limit future capabilities.
With seventeen signals in a single week, the AI provider landscape shows no signs of slowing. The combination of deprecations, new capabilities, and billing changes reinforces the need for systematic monitoring of provider changes rather than reactive discovery when things break.