OpenAI has launched a new series of models, GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano, which demonstrate significant improvements over previous versions, specifically in coding, instruction following, and long context comprehension. These models support up to 1 million tokens of context and have a knowledge cutoff of June 2024.
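For developers, the series is exposed in the OpenAI API under the model names gpt-4.1, gpt-4.1-mini, and gpt-4.1-nano. Below is a minimal sketch of a basic call with the official Python SDK; the system message and prompt are illustrative placeholders, not taken from the announcement.

```python
# Minimal sketch: calling GPT-4.1 via the official OpenAI Python SDK.
# Requires `pip install openai` and the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4.1",  # also available: "gpt-4.1-mini", "gpt-4.1-nano"
    messages=[
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "Explain the difference between a list and a tuple in Python."},
    ],
)

print(response.choices[0].message.content)
```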
Performance Improvements
- Coding: GPT-4.1 achieves 54.6% on the SWE-bench Verified benchmark, 21.4 percentage points above GPT-4o and 26.6 points above GPT-4.5. It is better at solving coding tasks, makes fewer extraneous edits, and follows diff formats more reliably (see the sketch after this list).
- Instruction Following: GPT-4.1 scores 38.3% on Scale's MultiChallenge benchmark, a 10.5 percentage-point improvement over GPT-4o, indicating a stronger ability to follow complex instructions.
- Long Context: On the Video-MME benchmark, GPT-4.1 scores 72.0%, 6.7 percentage points above GPT-4o, showing it can understand and use long contexts effectively.
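Because the coding results emphasize reliable diff-format output, one practical pattern is to ask the model to return only a patch rather than a rewritten file. The following is a rough sketch of that pattern, not an official recipe; the file utils.py, the edit request, and the system prompt are hypothetical.

```python
# Sketch: requesting a unified diff instead of a full-file rewrite.
# The file utils.py and the edit request are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()

with open("utils.py", "r", encoding="utf-8") as f:
    source = f.read()

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a code-editing assistant. Reply ONLY with a unified diff "
                "(---/+++/@@ hunks) against the provided file; do not restate "
                "unchanged code outside the hunks."
            ),
        },
        {
            "role": "user",
            "content": f"Add input validation to parse_config().\n\nFile: utils.py\n\n{source}",
        },
    ],
)

# The returned patch can then be applied with `git apply` or `patch`.
print(response.choices[0].message.content)
```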
Model Variants
- GPT-4.1 mini cuts latency roughly in half and cost by 83% relative to GPT-4o, while still outperforming it on many benchmarks.
- GPT-4.1 nano is the fastest and lowest-cost model in the series, well suited to low-latency tasks, scoring 80.1% on MMLU and 50.3% on GPQA. The sketch after this list shows how each variant is selected in the API.
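One way to choose between the variants is to run the same prompt through all three and compare latency and output quality for the task at hand. This is a rough sketch, not an official benchmarking method; measured latency will vary with load, prompt size, and output length.

```python
# Sketch: comparing the three GPT-4.1 variants on the same prompt to gauge
# latency and output quality for a given task.
import time
from openai import OpenAI

client = OpenAI()
PROMPT = "Classify the sentiment of: 'The battery died after two hours.'"

for model in ("gpt-4.1", "gpt-4.1-mini", "gpt-4.1-nano"):
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    elapsed = time.perf_counter() - start
    print(f"{model}: {elapsed:.2f}s -> {response.choices[0].message.content!r}")
```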
Real-World Applications
Developers have reported that GPT-4.1 models are more effective for real-world applications, including:
- Windsurf: Found GPT-4.1 scored 60% higher than GPT-4o on its internal coding benchmark.
- Qodo: Found GPT-4.1 produced better code review suggestions in 55% of cases.
- Blue J: Found GPT-4.1 to be 53% more accurate than GPT-4o on its tax scenario evaluations.
- Thomson Reuters: Improved multi-document review accuracy by 17%.
Long Context Capabilities
The ability to process 1 million tokens allows GPT-4.1 to handle extensive documents and complex tasks across various domains, such as legal and coding applications. It has shown improved performance in retrieving relevant information from large contexts and disambiguating between multiple requests.
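As a rough illustration of the long-context workflow, a large document can be passed directly in the prompt and queried in the same request, provided it fits within the 1-million-token window. The file name and question below are placeholders.

```python
# Sketch: asking a question over a very long document in a single request.
# The file name and question are placeholders; token counts are approximate
# (roughly 3-4 characters per token for English text).
from openai import OpenAI

client = OpenAI()

with open("contract_bundle.txt", "r", encoding="utf-8") as f:
    document = f.read()  # must fit within the 1M-token context window

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": "Answer strictly from the provided document and cite the relevant section."},
        {"role": "user", "content": f"Document:\n{document}\n\nQuestion: Which clauses mention early termination fees?"},
    ],
)

print(response.choices[0].message.content)
```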
Vision and Multimodal Performance
The GPT-4.1 models excel at image understanding and multimodal tasks; GPT-4.1 achieves state-of-the-art results on benchmarks such as Video-MME, where it scores 72.0%.
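For image inputs, the Chat Completions API accepts multimodal content parts that mix text and images in a single message. A minimal sketch follows; the image URL is a placeholder.

```python
# Sketch: sending an image plus a text question in one request.
# The image URL is a placeholder, not taken from the announcement.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize the chart in this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```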
Pricing
The GPT-4.1 series is now available in the API. GPT-4.1 is 26% less expensive than GPT-4o for median queries, and GPT-4.1 nano is the cheapest and fastest option in the series. The prompt caching discount has also been increased for queries that repeatedly pass the same context.
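The repeated-context discount corresponds to prompt caching, which the API applies automatically when consecutive requests share a long identical prefix. A minimal sketch under that assumption; the manual file and questions are placeholders, and the cached-token count is read from the response's usage details.

```python
# Sketch: reusing an identical prompt prefix across calls so the repeated
# context can be served from the prompt cache. Cached-token counts are
# reported in the usage details of each response.
from openai import OpenAI

client = OpenAI()

shared_prefix = open("product_manual.txt", encoding="utf-8").read()  # placeholder long context

for question in ("How do I reset the device?", "What does error code E42 mean?"):
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": shared_prefix},  # identical prefix on every call
            {"role": "user", "content": question},
        ],
    )
    details = response.usage.prompt_tokens_details
    cached = details.cached_tokens if details else 0
    print(f"{question} -> cached prompt tokens: {cached}")
    print(response.choices[0].message.content)
```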
In summary, the GPT-4.1 series represents a significant advancement in AI capabilities, focusing on real-world utility and performance enhancements across various applications.