xAI has launched Grok-2 and Grok-2 Mini, two advanced language models with state-of-the-art reasoning capabilities, available in beta on the 𝕏 platform. Grok-2, a significant upgrade from Grok-1.5, excels in chat, coding, and reasoning tasks, and outperforms models such as Claude 3.5 Sonnet and GPT-4 Turbo on the LMSYS leaderboard. Grok-2 Mini is a smaller but still capable sibling model.
Grok-2 Language Model and Chat Capabilities
An early version of Grok-2, tested under the name "sus-column-r" in the LMSYS chatbot arena, outperforms both Claude 3.5 Sonnet and GPT-4 Turbo in overall Elo score. xAI's internal evaluations focus on instruction-following and factual accuracy, and show significant improvements for Grok-2 in reasoning, content retrieval, and tool use.
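To give a sense of what a gap in Elo score means, the sketch below applies the standard Elo expected-score formula to two ratings. It is an interpretive aid only: the ratings are hypothetical placeholders (no actual leaderboard values are given here), and the arena computes its ratings with its own statistical fitting, which this does not reproduce.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo model.

    The result is between 0 and 1 and can be read as A's win probability,
    with a draw counted as half a win.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Hypothetical ratings for illustration only; not actual leaderboard values.
model_a, model_b = 1290.0, 1270.0
print(f"Expected score of A vs B: {elo_expected_score(model_a, model_b):.3f}")
```

Under this model, equal ratings give an expected score of 0.5, and a 400-point lead corresponds to an expected score of 10/11, so a gap of a few tens of points is only a modest preference in head-to-head votes.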
Benchmarks
Grok-2 and Grok-2 Mini have been evaluated across various academic benchmarks, demonstrating significant improvements over Grok-1.5. Key areas include:
- Graduate-level science knowledge (GPQA)
- General knowledge (MMLU, MMLU-Pro)
- Math competition problems (MATH)
- Visual math reasoning (MathVista)
- Document-based question answering (DocVQA)
Benchmark | Grok-1.5 | Grok-2 Mini | Grok-2 | GPT-4 Turbo | Claude 3 Opus | Gemini Pro 1.5 | Llama 3 405B | GPT-4o | Claude 3.5 Sonnet |
---|---|---|---|---|---|---|---|---|---|
GPQA | 35.9% | 51.0% | 56.0% | 48.0% | 50.4% | 46.2% | 51.1% | 53.6% | 59.6% |
MMLU | 81.3% | 86.2% | 87.5% | 86.5% | 85.7% | 85.9% | 88.6% | 88.7% | 88.3% |
MMLU-Pro | 51.0% | 72.0% | 75.5% | 63.7% | 68.5% | 69.0% | 73.3% | 72.6% | 76.1% |
MATH | 50.6% | 73.0% | 76.1% | 72.6% | 60.1% | 67.7% | 73.8% | 76.6% | 71.1% |
HumanEval | 74.1% | 85.7% | 88.4% | 87.1% | 84.9% | 71.9% | 89.0% | 90.2% | 92.0% |
MMMU | 53.6% | 63.2% | 66.1% | 63.1% | 59.4% | 62.2% | 64.5% | 69.1% | 68.3% |
MathVista | 52.8% | 68.1% | 69.0% | 58.1% | 50.5% | 63.9% | - | 63.8% | 67.7% |
DocVQA | 85.6% | 93.2% | 93.6% | 87.2% | 89.3% | 93.1% | 92.2% | 92.8% | 95.2% |
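The size of the improvement over Grok-1.5 can be read directly off the table. As a small sketch, the snippet below copies the Grok-1.5 and Grok-2 columns and computes the gain on each benchmark in percentage points; the variable and function names are illustrative, not part of any xAI tooling.

```python
# Scores (percent) copied from the Grok-1.5 and Grok-2 columns of the table above.
GROK_1_5 = {
    "GPQA": 35.9, "MMLU": 81.3, "MMLU-Pro": 51.0, "MATH": 50.6,
    "HumanEval": 74.1, "MMMU": 53.6, "MathVista": 52.8, "DocVQA": 85.6,
}
GROK_2 = {
    "GPQA": 56.0, "MMLU": 87.5, "MMLU-Pro": 75.5, "MATH": 76.1,
    "HumanEval": 88.4, "MMMU": 66.1, "MathVista": 69.0, "DocVQA": 93.6,
}


def gains_in_points(old: dict[str, float], new: dict[str, float]) -> dict[str, float]:
    """Return the per-benchmark improvement in percentage points, rounded to 0.1."""
    return {name: round(new[name] - old[name], 1) for name in old}


gains = gains_in_points(GROK_1_5, GROK_2)

# Print benchmarks from largest to smallest gain.
for name, delta in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<10} +{delta:.1f} pts")
```

On these figures the largest gain is on MATH (25.5 points) and the smallest on MMLU (6.2 points), where Grok-1.5 already scored above 80%.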
Experience Grok with Real-Time Information on 𝕏
Grok-2 and Grok-2 Mini are available to 𝕏 Premium and Premium+ users, featuring advanced text and vision understanding, real-time information integration, and a redesigned interface. Grok-2 offers enhanced capabilities across a range of tasks, while Grok-2 Mini balances speed and answer quality. A collaboration with Black Forest Labs aims to expand Grok's capabilities further.
Build with Grok Using the Enterprise API
Later this month, Grok-2 and Grok-2 Mini will be available through a new enterprise API platform, offering multi-region inference deployments, enhanced security features, rich traffic statistics, and advanced billing analytics. The management API will facilitate integration with existing in-house tools and services.
What is Next?
Grok-2 and Grok-2 Mini are being rolled out on 𝕏, with future applications including enhanced search capabilities, deeper insights on 𝕏 posts, and improved reply functions. A preview of multimodal understanding will also be released soon. xAI continues to advance AI development with a focus on core reasoning capabilities, driven by a small, highly talented team.