Large Reasoning Models Fail Instructions 75% of the Time: Week of 20 October 2025
The AI industry's dirty secret is out: large reasoning models consistently ignore user instructions during complex tasks. New research reveals that frontier models like GPT-OSS-120B and Qwen3-235B fail to follow basic constraints over 75% of the time, raising serious questions about deploying these systems in production environments.
The Big Moves
Large reasoning models can't follow orders
A bombshell study from Together AI has exposed a critical flaw in large reasoning models that threatens their reliability in real-world applications. The research introduces ReasonIF, a comprehensive benchmark that tests instruction-following across language, formatting, and length constraints during complex reasoning tasks.
The findings are stark: frontier LRMs including GPT-OSS-120B, Qwen3-235B, and DeepSeek-R1 consistently deviate from user instructions, with failure rates exceeding 75%. This isn't about occasional lapses in understanding; it's a systematic inability to maintain adherence to constraints whilst performing multi-step reasoning.
For organisations deploying these models in critical applications, this represents a significant operational risk. Legal document analysis, financial modelling, and technical specification generation all require strict adherence to formatting and content constraints. The research suggests that current reasoning models, despite their impressive capabilities, cannot be trusted to follow precise instructions reliably. Teams using these models for production workloads should implement additional validation layers and consider the implications for compliance and accuracy requirements.
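One way to build such a validation layer is to check model outputs against the stated constraints before accepting them. A minimal sketch, assuming two hypothetical constraints (a word limit and a required JSON format); the checks are illustrative and not tied to the ReasonIF benchmark itself:

```python
import json

def validate_output(text: str, max_words: int = 200, require_json: bool = False) -> list[str]:
    """Check a model response against simple instruction constraints.

    Returns a list of violation messages; an empty list means the
    output passed and can be forwarded downstream.
    """
    violations = []
    if len(text.split()) > max_words:
        violations.append(f"exceeds {max_words}-word limit")
    if require_json:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            violations.append("response is not valid JSON")
    return violations

# A response that breaks the JSON constraint is flagged, so the
# calling code can retry or fall back rather than ship bad output.
print(validate_output("Sure! Here is the answer.", require_json=True))
```

In production this pattern is typically paired with a retry loop: on violation, re-prompt the model with the failed constraint restated, and escalate to a human after a fixed number of attempts.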
Google forces Vertex AI endpoint migrations by June 2026
Google is pushing through major changes to Vertex AI Generative AI with the release of version 1 on 23 October. Whilst the new features, including RAG Cross Corpus Retrieval and Veo 3.1 Lite support, are welcome additions, the real story is the forced migration of image and video generation endpoints.
Developers have until 30 June 2026 to migrate to new recommended endpoints, after which legacy services will be discontinued. This roughly eight-month timeline might seem generous, but organisations with complex integrations or compliance requirements will need to start planning immediately. The migration affects all image and video generation workflows, requiring code changes, testing, and potentially architectural adjustments.
Google's timing here is telling. By bundling attractive new features with mandatory migrations, they're softening the blow whilst ensuring developers move to their preferred infrastructure. The Gemini 3.1 Flash-Lite model support suggests Google is positioning Vertex AI as a comprehensive platform rather than just another API endpoint. Organisations should audit their current usage patterns and begin migration planning now, particularly if they're running mission-critical applications that can't afford downtime.
Together AI consolidates the multimodal landscape
Together AI is making a bold play to simplify the fragmented multimodal AI landscape with the launch of 40+ new image and video models, including Sora 2 and Veo 3. More importantly, they're offering this through a unified OpenAI-compatible API with transparent pricing across text, image, and video generation.
This addresses a genuine pain point for developers who currently juggle multiple providers, each with different authentication schemes, pricing models, and API conventions. The promise of end-to-end multimodal app development through a single interface is compelling, particularly for teams building complex applications that span multiple content types.
The strategic implications are significant. By offering a consolidated platform, Together AI is positioning itself as the infrastructure layer for multimodal applications, potentially capturing developers who are tired of managing multiple vendor relationships. The OpenAI compatibility is particularly clever, allowing teams to switch without major code changes. However, the success of this approach will depend on model quality, reliability, and whether the unified pricing actually delivers cost savings compared to best-of-breed approaches.
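In practice, OpenAI compatibility means a provider switch is largely a configuration change rather than a rewrite. A hedged sketch using the standard `openai` Python client; the base URL reflects Together's documented endpoint, whilst the model identifier and placeholder key are illustrative assumptions:

```python
from openai import OpenAI

# Point the standard OpenAI client at Together's OpenAI-compatible
# endpoint; only the base URL, key, and model name change.
client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key="YOUR_TOGETHER_API_KEY",  # placeholder, not a real key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3-70b-chat-hf",  # illustrative model id
    messages=[{"role": "user", "content": "Summarise this release."}],
)
print(response.choices[0].message.content)
```

Because the request and response shapes are unchanged, existing retry logic, streaming handlers, and observability hooks written against the OpenAI client should carry over with no modification.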
Worth Watching
OpenAI launches ChatGPT Atlas browser
OpenAI's new ChatGPT Atlas browser represents a significant expansion beyond conversational AI into the browsing experience itself. Available across macOS, Windows, iOS, and Android, Atlas integrates ChatGPT capabilities directly into web navigation, with agent mode available for Plus, Pro, and Business subscribers. This isn't just another AI-powered browser; it's OpenAI's attempt to control more of the user experience and capture additional engagement time. The cross-platform availability suggests serious ambitions to compete with established browsers.
Groq introduces automatic prompt caching for cost reduction
Groq's automatic prompt caching for the gpt-oss-120b model delivers immediate value with 50% cost savings and reduced latency. The beauty of this implementation is that it requires no code changes, automatically optimising prompt reuse without manual configuration. For teams running high-volume inference workloads, this could represent significant operational savings. The automatic nature removes the complexity typically associated with caching strategies, making optimisation accessible to smaller teams.
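The savings scale with how often prompts repeat. A back-of-the-envelope model, assuming cached input tokens are billed at half the normal rate (the exact pricing mechanics and the $0.15/M figure below are illustrative assumptions, not Groq's published rates):

```python
def effective_input_cost(tokens: int, price_per_m: float,
                         cache_hit_rate: float, cached_discount: float = 0.5) -> float:
    """Blended input-token cost when a fraction of tokens hit the cache.

    cached_discount is the fraction of full price charged for cached
    tokens (0.5 models a 50% discount on cache hits).
    """
    full = tokens * (1 - cache_hit_rate) * price_per_m / 1e6
    cached = tokens * cache_hit_rate * cached_discount * price_per_m / 1e6
    return full + cached

# 10M input tokens at an assumed $0.15/M with an 80% cache hit rate:
# roughly $0.90 instead of $1.50 uncached
print(effective_input_cost(10_000_000, 0.15, 0.8))
```

The takeaway: workloads with highly repetitive prefixes (shared system prompts, few-shot examples) approach the full 50% saving, whilst workloads with unique prompts see little benefit.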
Vertex AI resolves critical streaming response misrouting
Google has resolved a concerning issue where streaming responses from third-party models via Vertex AI were being misrouted between recipients. Whilst Google's own models were unaffected, this highlights potential vulnerabilities in how multi-tenant API platforms handle diverse model integrations. The incident underscores the importance of thorough testing when using third-party models through platform providers, particularly for applications handling sensitive data.
Weaviate v1.34.0 introduces server-side dynamic batching
Weaviate's latest release candidate brings server-side dynamic batching and flat index RQ quantisation, representing significant performance improvements for query workloads. The server-side batching optimisation should improve throughput for applications with variable query patterns, whilst the quantisation enhancements reduce memory requirements with minimal impact on accuracy. Teams running large-scale vector search applications should test these capabilities in development environments.
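The idea behind dynamic batching, sketched in miniature: accumulate incoming items and flush them as one batch once a threshold is reached. This is a conceptual illustration only, not Weaviate's implementation, which additionally adapts batch sizes to server load and flushes on timeouts:

```python
class DynamicBatcher:
    """Toy batcher: groups submitted items and flushes when the
    pending batch reaches max_size.
    """

    def __init__(self, max_size: int, flush_fn):
        self.max_size = max_size
        self.flush_fn = flush_fn  # called with each completed batch
        self.pending = []

    def submit(self, item):
        self.pending.append(item)
        if len(self.pending) >= self.max_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.flush_fn(self.pending)
            self.pending = []

batches = []
b = DynamicBatcher(max_size=3, flush_fn=batches.append)
for i in range(7):
    b.submit(i)
b.flush()  # flush the partial remainder
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Moving this logic server-side means clients can submit items one at a time whilst the server still gets the throughput benefits of processing them in bulk.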
Colab Enterprise adds interactive visualisation cells
Google's addition of interactive visualisation cells to Colab Enterprise eliminates the need to export data to external tools for chart creation and editing. This streamlines the data analysis workflow for teams already using Colab Enterprise, reducing context switching and improving productivity. The feature represents Google's continued investment in making Colab Enterprise a comprehensive data science platform rather than just a notebook environment.
Quick Hits
- OpenSearch 3.3.1 patches critical timestamp upgrade bugs and core update issues
- Llama Stack APIs restructured with stable OpenAI-compatible endpoints and clearer versioning
- Replicate Python SDK enters public beta with full HTTP API support
- ChatGPT adds company knowledge integration from Slack, Drive, and GitHub
- Elasticsearch releases maintenance updates across versions 8.19.6, 9.1.6, and 9.2.0
- Weaviate ships stability improvements in v1.33.2 and v1.32.14
The Week Ahead
The instruction-following crisis in large reasoning models will likely prompt responses from other providers as teams reassess their deployment strategies. Watch for Google's Vertex AI migration communications as the 30 June 2026 deadline approaches; organisations will need concrete migration paths and timelines.
Together AI's unified multimodal platform launch will face its first real-world stress tests as developers begin integrating the new video generation capabilities. The success of this consolidation play could influence how other providers structure their multimodal offerings.
With OpenAI's browser launch and ChatGPT's knowledge integration features, expect competitive responses from other providers looking to expand beyond pure API services into more integrated user experiences.