Anthropic has announced an upgraded AI model, Claude 3.5 Sonnet, and a new model, Claude 3.5 Haiku. The upgraded Claude 3.5 Sonnet shows significant improvements, especially in coding, and introduces a groundbreaking capability in public beta: computer use. This allows Claude to use computers the way people do, by moving a cursor, clicking buttons, and typing text.
Claude 3.5 Sonnet
The upgraded Claude 3.5 Sonnet demonstrates wide-ranging improvements on industry benchmarks, particularly in agentic coding and tool use tasks. Key performance metrics include:
- SWE-bench Verified: Improved from 33.4% to 49.0%.
- TAU-bench: Improved from 62.6% to 69.2% in retail and from 36.0% to 46.0% in the airline domain.
Early feedback indicates significant advances in AI-powered coding, with companies like GitLab, Cognition, and The Browser Company reporting substantial improvements. Prior to release, the model underwent joint pre-deployment testing by the US AI Safety Institute (US AISI) and the UK AI Safety Institute (UK AISI).
Claude 3.5 Haiku
Claude 3.5 Haiku, the next generation of Anthropic's fastest model, improves across every skill set while matching the cost and speed of its predecessor. On many intelligence benchmarks it outperforms Claude 3 Opus, previously Anthropic's largest model, particularly on coding tasks:
- SWE-bench Verified: Scores 40.6%.
Claude 3.5 Haiku is suitable for user-facing products, specialized sub-agent tasks, and generating personalized experiences from large data volumes. It will be available later this month on the Anthropic API, Amazon Bedrock, and Google Cloud’s Vertex AI.
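Once the model is live, calling it should look like any other request to Anthropic's Messages API. A minimal sketch in Python of building such a request body; the model identifier `claude-3-5-haiku-20241022` is an assumption based on Anthropic's dated-snapshot naming convention, so check the official docs for the final name:

```python
# Sketch of a Messages API request body for Claude 3.5 Haiku.
# The model identifier below is an assumption; Anthropic's docs
# list the exact model names available at launch.

def build_haiku_request(prompt: str, max_tokens: int = 1024) -> dict:
    """Build the JSON body for a POST to the /v1/messages endpoint."""
    return {
        "model": "claude-3-5-haiku-20241022",  # assumed identifier
        "max_tokens": max_tokens,
        "messages": [
            {"role": "user", "content": prompt},
        ],
    }

body = build_haiku_request("Summarize this changelog in one sentence.")
print(body["model"])
```

On Amazon Bedrock and Google Cloud's Vertex AI the endpoint, authentication, and model naming differ, but the message structure stays the same.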
Computer Use Capability
The new computer use capability allows Claude to perform tasks by interacting with computer interfaces: automating repetitive processes, building and testing software, and carrying out open-ended tasks like research. Key results on the OSWorld benchmark, which evaluates AI models' ability to use computers:
- Screenshot-only category: 14.9%, well ahead of the next-best AI system's 7.8%.
- When allowed more steps to complete tasks: 22.0%.
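In practice, computer use runs as an agent loop: the developer exposes a "computer" tool, Claude responds with discrete actions (move the cursor, click, type, take a screenshot), and the developer executes each action and sends the result back. A minimal dispatcher sketch in Python; the action names mirror the style of the public beta but should be treated as illustrative, and the handler bodies are placeholders rather than real OS automation:

```python
# Illustrative dispatcher for computer-use actions issued by the model.
# Action names and fields here are assumptions for the sketch; real
# handlers would drive an actual desktop (ideally inside a sandboxed VM).

def handle_action(action: dict) -> str:
    """Execute one model-issued action and return a result string."""
    kind = action.get("action")
    if kind == "mouse_move":
        x, y = action["coordinate"]
        return f"moved cursor to ({x}, {y})"  # placeholder: move the OS cursor
    if kind == "left_click":
        return "clicked"  # placeholder: click at the current position
    if kind == "type":
        return f"typed {len(action['text'])} characters"  # placeholder
    if kind == "screenshot":
        return "screenshot captured"  # placeholder: return an image to Claude
    return f"unsupported action: {kind}"

# Replaying a short, hypothetical sequence of model-issued actions:
for step in [
    {"action": "mouse_move", "coordinate": [640, 360]},
    {"action": "left_click"},
    {"action": "type", "text": "hello"},
]:
    print(handle_action(step))
```

Because the model only ever emits structured actions, the developer retains full control over what actually executes, which is where safety classifiers and sandboxing fit in.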
While the capability is still experimental and imperfect, it is expected to improve rapidly. Safety measures include new classifiers to identify misuse and prevent harm.
Anthropic aims to learn from initial deployments to better understand the potential and implications of increasingly capable AI systems. They encourage developers to explore the new models and provide feedback to help refine these capabilities.