Better Models: Worse Tools

TL;DR

Recent observations indicate that the newest AI language models from Anthropic are generating more malformed tool call requests, especially in complex multi-turn interactions. This challenges assumptions that larger, more advanced models automatically improve in all operational aspects.

In recent testing, Anthropic’s latest models, Opus 4.8 and Sonnet 5, produced malformed tool call requests containing extra, invented fields more often than their predecessors. The finding is notable because it suggests that more capable models are not necessarily better at following tool invocation schemas, which could affect their reliability in practical applications.

Over the past two days, users and researchers have observed that newer Anthropic models sometimes generate tool call requests with extraneous or nonsensical fields, such as ‘type’, ‘id’, ‘requireUnique’, and others, which do not conform to the expected schemas. This problem appears more prevalent in models like Opus 4.8 and Sonnet 5, compared to older versions, which rarely exhibited such issues.
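To make the failure mode concrete, here is a minimal sketch of checking a tool call's arguments against a declared schema. The `edit_file` tool, its parameter names, and the helper `find_unexpected_fields` are hypothetical and chosen only for illustration; the extra fields in the example are the ones reported above.

```python
# Hypothetical schema for an illustrative `edit_file` tool (not any real harness's API).
EDIT_FILE_SCHEMA = {
    "type": "object",
    "properties": {
        "path": {"type": "string"},
        "old_text": {"type": "string"},
        "new_text": {"type": "string"},
    },
    "required": ["path", "old_text", "new_text"],
    "additionalProperties": False,  # JSON Schema keyword that forbids undeclared fields
}


def find_unexpected_fields(arguments: dict, schema: dict) -> list[str]:
    """Return the argument names that the schema does not declare, sorted."""
    allowed = set(schema.get("properties", {}))
    return sorted(name for name in arguments if name not in allowed)


# A call shaped like the reported failures: valid fields plus invented ones.
malformed_call = {
    "path": "src/app.py",
    "old_text": "retries = 1",
    "new_text": "retries = 3",
    "type": "edit",
    "id": "call_1",
    "requireUnique": True,
}

unexpected = find_unexpected_fields(malformed_call, EDIT_FILE_SCHEMA)
print(unexpected)  # ['id', 'requireUnique', 'type']
```

A harness that rejects any call with a non-empty result would fail this request even though the three declared fields are all present and well-formed.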

The problem manifests primarily during multi-turn interactions where models read file contents, diagnose issues, and then produce complex, multi-line edits. In these contexts, malformed requests occur approximately 20% of the time, especially when the model has a history of reading and editing files. Researchers note that turning on strict tool invocation constraints reduces or eliminates these errors.
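A rough calculation shows why a per-call rate of about 20% matters in multi-turn sessions. The sketch below assumes each complex edit call fails independently at that rate, which is a simplification not stated in the reports; the function name is hypothetical.

```python
def prob_at_least_one_malformed(per_call_rate: float, num_calls: int) -> float:
    """Chance that at least one of `num_calls` independent calls is malformed."""
    return 1.0 - (1.0 - per_call_rate) ** num_calls


# Assumed 20% per-call rate, taken from the approximate figure above.
for n in (1, 3, 5, 10):
    print(n, round(prob_at_least_one_malformed(0.20, n), 3))
# 1 0.2
# 3 0.488
# 5 0.672
# 10 0.893
```

Under this assumption, a session with five complex edits would hit at least one malformed call roughly two times out of three.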

At a glance

When: ongoing; observations made from July 20…

The development: The latest Anthropic models, Opus 4.8 and Sonnet 5, are producing more frequent invalid tool calls with extra, invented fields, unlike older models, raising concerns about model reliability.

Implications for AI Reliability in Tool Usage

This development raises concerns about the reliability of advanced language models when performing precise, schema-dependent tasks such as code editing, data manipulation, or API interactions. If models produce invalid tool calls, it could lead to failures in automation workflows, reduce trust in AI-assisted programming, and complicate integration efforts for developers relying on these models.

Evolution of Model Tool-Calling Capabilities

Anthropic’s earlier models, trained before the widespread deployment of integrated coding tools, rarely exhibited such issues. The newer models, including Opus 4.8 and Sonnet 5, are likely trained with or fine-tuned on datasets that include code and tool-harness interactions, such as Claude Code. This training shift may have inadvertently introduced the tendency to generate malformed or overly complex tool call requests, especially in multi-turn, context-rich scenarios.

Observers note that this problem is not universal but appears more in specific interaction patterns, particularly those involving complex file edits and diagnostic reasoning. The issue seems to be linked to the models’ internal representation of tool invocation schemas and their learned conventions.

“The newer models are learning to call tools in increasingly complex ways, but it seems they’re also inventing extra fields that break the schema validation. It’s a worrying sign of overfitting to training patterns.”
— Researcher familiar with model training

Unclear Causes Behind the Increasing Malformed Calls

It is not yet confirmed whether the issue stems from training data artifacts, model architecture changes, or a combination of both. The exact internal mechanisms that lead to the generation of extra, nonsensical fields in tool calls remain unverified, and further investigation is ongoing.

Monitoring and Improving Tool Call Robustness

Researchers and developers plan to conduct controlled experiments to isolate the causes of malformed tool calls and develop mitigation strategies, such as enhanced schema validation, constrained decoding, or training adjustments. Expect updates from Anthropic and the broader AI community as they address these reliability concerns in upcoming model releases.
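One of the mitigations named above, enhanced schema validation, can be paired with a retry that feeds the validation errors back to the model. The following is a minimal sketch under stated assumptions: the field names, `validate_call`, `call_with_retry`, and the `fake_model` stand-in are all hypothetical, and a real harness would replace `fake_model` with an actual model request.

```python
from typing import Callable

ALLOWED_FIELDS = {"path", "old_text", "new_text"}  # hypothetical tool parameters


def validate_call(arguments: dict) -> list[str]:
    """Return human-readable problems; an empty list means the call is valid."""
    given = set(arguments)
    problems = [f"unexpected field: {name}" for name in sorted(given - ALLOWED_FIELDS)]
    problems += [f"missing field: {name}" for name in sorted(ALLOWED_FIELDS - given)]
    return problems


def call_with_retry(
    request_tool_call: Callable[[list[str]], dict], max_attempts: int = 3
) -> dict:
    """Request a tool call, passing validation errors back until it is valid."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        arguments = request_tool_call(feedback)
        feedback = validate_call(arguments)
        if not feedback:
            return arguments
    raise ValueError(f"tool call still invalid after {max_attempts} attempts: {feedback}")


def fake_model(feedback: list[str]) -> dict:
    """Stand-in for a model: adds an invented field until it receives feedback."""
    call = {"path": "src/app.py", "old_text": "a", "new_text": "b"}
    if not feedback:
        call["requireUnique"] = True
    return call


result = call_with_retry(fake_model)
print(result)  # {'path': 'src/app.py', 'old_text': 'a', 'new_text': 'b'}
```

Validation with retry only catches errors after generation, at the cost of extra round trips; constrained decoding, by contrast, would prevent the invalid fields from being emitted in the first place.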

Key Questions

Why are newer models producing more malformed tool calls?

It is likely related to training data artifacts or the models’ internal learned conventions, especially as they incorporate more complex code and tool interaction patterns. The exact cause is still under investigation.

Does this mean the models are less reliable overall?

Not necessarily. The issue is specific to tool invocation schemas and complex multi-turn interactions. The models still perform well on basic tasks and simpler prompts, but their reliability in schema-dependent tasks may be compromised.

Can this problem be fixed with current techniques?

Preliminary results suggest that constraining decoding or enforcing strict invocation rules can reduce or eliminate malformed calls. Further research is needed to develop comprehensive solutions.

Will this affect future AI model releases?

It is likely that model developers will prioritize addressing this issue in upcoming releases, especially as tool use becomes more central to AI capabilities and applications.

Source: Hacker News

Author

The Idea Magazine Team
