Anthropic Launches Claude Sonnet 4.5 with New Coding and Agentic State-of-the-Art Results

by CryptoExpert
Coinmama


Anthropic released Claude Sonnet 4.5 and sets a new benchmark for end-to-end software engineering and real-world computer use. The update also ships concrete product surface changes (Claude Code checkpoints, a native VS Code extension, API memory/context tools) and an Agent SDK that exposes the same scaffolding Anthropic uses internally. Pricing remains unchanged from Sonnet 4 ($3 input / $15 output per million tokens).

What’s actually new?

  • SWE-bench Verified record. Anthropic reports 77.2% accuracy on the 500-problem SWE-bench Verified dataset using a simple two-tool scaffold (bash + file edit), averaged over 10 runs, no test-time compute, 200K β€œthinking” budget. A 1M-context setting reaches 78.2%, and a higher-compute setting with parallel sampling and rejection raises this to 82.0%.
  • Computer-use SOTA. On OSWorld-Verified, Sonnet 4.5 leads at 61.4%, up from Sonnet 4’s 42.2%, reflecting stronger tool control and UI manipulation for browser/desktop tasks.
  • Long-horizon autonomy. The team observed >30 hours of uninterrupted focus on multi-step coding tasks β€” a practical jump over earlier limits and directly relevant to agent reliability.
  • Reasoning/math. The release notes β€œsubstantial gains” across common reasoning and math evals; exact per-bench numbers (e.g., AIME config). Safety posture is ASL-3 with strengthened defenses against prompt-injection.
https://www.anthropic.com/news/claude-sonnet-4-5

What’s there for agents?

Sonnet 4.5 targets the brittle parts of real agents: extended planning, memory, and reliable tool orchestration. Anthropic’s Claude Agent SDK exposes their production patterns (memory management for long-running tasks, permissioning, sub-agent coordination) rather than just a bare LLM endpoint. That means teams can reproduce the same scaffolding used by Claude Code (now with checkpoints, a refreshed terminal, and VS Code integration) to keep multi-hour jobs coherent and reversible.

On measured tasks that simulate β€œusing a computer,” the 19-point jump on OSWorld-Verified is notable; it tracks with the model’s ability to navigate, fill spreadsheets, and complete web flows in Anthropic’s browser demo. For enterprises experimenting with agentic RPA-style work, higher OSWorld scores usually correlate with lower intervention rates during execution.

Where you can run it?

  • Anthropic API & apps. Model ID claude-sonnet-4-5; price parity with Sonnet 4. File creation and code execution are now available directly in Claude apps for paid tiers.
  • AWS Bedrock. Available via Bedrock with integration paths to AgentCore; AWS highlights long-horizon agent sessions, memory/context features, and operational controls (observability, session isolation).
  • Google Cloud Vertex AI. GA on Vertex AI with support for multi-agent orchestration via ADK/Agent Engine, provisioned throughput, 1M-token analysis jobs, and prompt caching.
  • GitHub Copilot. Public preview rollout across Copilot Chat (VS Code, web, mobile) and Copilot CLI; organizations can enable via policy, and BYO key is supported in VS Code.

Summary

With a documented 77.2% SWE-bench Verified score under transparent constraints, a 61.4% OSWorld-Verified computer-use lead, and practical updates (checkpoints, SDK, Copilot/Bedrock/Vertex availability), Claude Sonnet 4.5 is developed for long-running, tool-heavy agent workloads rather than short demo prompts. Independent replication will determine how durable the β€œbest for coding” claim is, but the design targets (autonomy, scaffolding, and computer control) are aligned with real production pain points today.

Betfury

Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.

πŸ”₯[Recommended Read] NVIDIA AI Open-Sources ViPE (Video Pose Engine): A Powerful and Versatile 3D Video Annotation Tool for Spatial AI





Source link

You may also like