OpenAI has launched a new series of models, GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano, which demonstrate significant improvements over previous versions, specifically in coding, instruction following, and long context comprehension. These models support up to 1 million tokens of context and have a knowledge cutoff of June 2024.
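For developers, the series is exposed in the OpenAI API under the model names gpt-4.1, gpt-4.1-mini, and gpt-4.1-nano. Below is a minimal sketch of a basic call with the official Python SDK; the system message and prompt are illustrative placeholders, not taken from the announcement.

```python
# Minimal sketch: calling GPT-4.1 via the official OpenAI Python SDK.
# Requires `pip install openai` and the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4.1",  # also available: "gpt-4.1-mini", "gpt-4.1-nano"
    messages=[
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "Explain the difference between a list and a tuple in Python."},
    ],
)

print(response.choices[0].message.content)
```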
Performance Improvements
- Coding: GPT-4.1 achieves 54.6% on the SWE-bench Verified benchmark, 21.4 percentage points above GPT-4o and 26.6 points above GPT-4.5. It is better at solving coding tasks, makes fewer extraneous edits, and follows diff formats more reliably (see the sketch after this list).
- Instruction Following: GPT-4.1 scores 38.3% on Scale's MultiChallenge benchmark, a 10.5 percentage-point improvement over GPT-4o, indicating a stronger ability to follow complex instructions.
- Long Context: On the Video-MME benchmark, GPT-4.1 scores 72.0%, 6.7 percentage points above GPT-4o, showing it can understand and use long contexts effectively.
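Because the coding results emphasize reliable diff-format output, one practical pattern is to ask the model to return only a patch rather than a rewritten file. The following is a rough sketch of that pattern, not an official recipe; the file utils.py, the edit request, and the system prompt are hypothetical.

```python
# Sketch: requesting a unified diff instead of a full-file rewrite.
# The file utils.py and the edit request are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()

with open("utils.py", "r", encoding="utf-8") as f:
    source = f.read()

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a code-editing assistant. Reply ONLY with a unified diff "
                "(---/+++/@@ hunks) against the provided file; do not restate "
                "unchanged code outside the hunks."
            ),
        },
        {
            "role": "user",
            "content": f"Add input validation to parse_config().\n\nFile: utils.py\n\n{source}",
        },
    ],
)

# The returned patch can then be applied with `git apply` or `patch`.
print(response.choices[0].message.content)
```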
Model Variants
- GPT-4.1 mini cuts latency roughly in half and cost by 83% relative to GPT-4o, while still outperforming it on many benchmarks.
- GPT-4.1 nano is the fastest and lowest-cost model in the series, well suited to low-latency tasks, scoring 80.1% on MMLU and 50.3% on GPQA. The sketch after this list shows how each variant is selected in the API.
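One way to choose between the variants is to run the same prompt through all three and compare latency and output quality for the task at hand. This is a rough sketch, not an official benchmarking method; measured latency will vary with load, prompt size, and output length.

```python
# Sketch: comparing the three GPT-4.1 variants on the same prompt to gauge
# latency and output quality for a given task.
import time
from openai import OpenAI

client = OpenAI()
PROMPT = "Classify the sentiment of: 'The battery died after two hours.'"

for model in ("gpt-4.1", "gpt-4.1-mini", "gpt-4.1-nano"):
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    elapsed = time.perf_counter() - start
    print(f"{model}: {elapsed:.2f}s -> {response.choices[0].message.content!r}")
```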
Real-World Applications
Developers have reported that GPT-4.1 models are more effective for real-world applications, including:
- Windsurf: Found GPT-4.1 scored 60% higher than GPT-4o on its internal coding benchmark.
- Qodo: Found GPT-4.1 produced better code review suggestions in 55% of cases.
- Blue J: Found GPT-4.1 to be 53% more accurate than GPT-4o on its tax scenario evaluations.
- Thomson Reuters: Improved multi-document review accuracy by 17%.
Long Context Capabilities
The ability to process 1 million tokens allows GPT-4.1 to handle extensive documents and complex tasks across various domains, such as legal and coding applications. It has shown improved performance in retrieving relevant information from large contexts and disambiguating between multiple requests.
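As a rough illustration of the long-context workflow, a large document can be passed directly in the prompt and queried in the same request, provided it fits within the 1-million-token window. The file name and question below are placeholders.

```python
# Sketch: asking a question over a very long document in a single request.
# The file name and question are placeholders; token counts are approximate
# (roughly 3-4 characters per token for English text).
from openai import OpenAI

client = OpenAI()

with open("contract_bundle.txt", "r", encoding="utf-8") as f:
    document = f.read()  # must fit within the 1M-token context window

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": "Answer strictly from the provided document and cite the relevant section."},
        {"role": "user", "content": f"Document:\n{document}\n\nQuestion: Which clauses mention early termination fees?"},
    ],
)

print(response.choices[0].message.content)
```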
Vision and Multimodal Performance
The GPT-4.1 models excel at image understanding and multimodal tasks; GPT-4.1 achieves state-of-the-art results on benchmarks such as Video-MME, where it scores 72.0%.
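For image inputs, the Chat Completions API accepts multimodal content parts that mix text and images in a single message. A minimal sketch follows; the image URL is a placeholder.

```python
# Sketch: sending an image plus a text question in one request.
# The image URL is a placeholder, not taken from the announcement.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize the chart in this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```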
Pricing
The GPT-4.1 series is now available in the API. GPT-4.1 is 26% less expensive than GPT-4o for median queries, and GPT-4.1 nano is the cheapest and fastest option in the series. The prompt caching discount has also been increased for queries that repeatedly pass the same context.
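The repeated-context discount corresponds to prompt caching, which the API applies automatically when consecutive requests share a long identical prefix. A minimal sketch under that assumption; the manual file and questions are placeholders, and the cached-token count is read from the response's usage details.

```python
# Sketch: reusing an identical prompt prefix across calls so the repeated
# context can be served from the prompt cache. Cached-token counts are
# reported in the usage details of each response.
from openai import OpenAI

client = OpenAI()

shared_prefix = open("product_manual.txt", encoding="utf-8").read()  # placeholder long context

for question in ("How do I reset the device?", "What does error code E42 mean?"):
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": shared_prefix},  # identical prefix on every call
            {"role": "user", "content": question},
        ],
    )
    details = response.usage.prompt_tokens_details
    cached = details.cached_tokens if details else 0
    print(f"{question} -> cached prompt tokens: {cached}")
    print(response.choices[0].message.content)
```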
In summary, the GPT-4.1 series represents a significant advancement in AI capabilities, focusing on real-world utility and performance enhancements across various applications.