Google DeepMind Releases Gemini 3.1 Pro, Claims Top Spot on 12 of 18 Major AI Benchmarks
Google DeepMind's Gemini 3.1 Pro launches with roughly 2.5 times its predecessor's score on the ARC-AGI-2 reasoning benchmark, reaching 77.1% and ranking first across most tracked evaluations.
Google DeepMind released Gemini 3.1 Pro on February 19, 2026, positioning the model as its most capable general-purpose offering to date. The launch marks a significant step in the ongoing competition among frontier AI labs, with the new model topping 12 of 18 tracked benchmarks and debuting at the same price point as its predecessor.
A Substantial Reasoning Leap
The headline metric for Gemini 3.1 Pro is its performance on ARC-AGI-2, a reasoning benchmark designed to evaluate abstract problem-solving that resists rote memorization. According to Google DeepMind’s model card, the model scored 77.1%—more than double the 31.1% achieved by Gemini 3 Pro and a result that exceeds other frontier models on the same evaluation. On GPQA Diamond, a test of scientific knowledge requiring graduate-level expertise, the model reached 94.3%. On SWE-Bench Verified, which measures performance on real-world software engineering tasks, it scored 80.6%.
According to Artificial Analysis, Gemini 3.1 Pro achieves a score of 57 on the Artificial Analysis Intelligence Index, placing it first out of 115 evaluated models—well above the median of 27 for comparable reasoning systems.
On LiveCodeBench Pro, a competitive programming evaluation, the model reached an Elo rating of 2,887. On RE-Bench, which measures machine learning research and development tasks, it scored 1.27 (human-normalized), up from Gemini 3 Pro's 1.04.
Multimodal by Design
Gemini 3.1 Pro is built as a natively multimodal model. As described in the model card, it accepts text, images, audio, and video as input, with a context window of one million tokens and an expanded output limit of 64,000 tokens. This positions it for tasks such as full-codebase analysis, large document synthesis, and multi-step agentic workflows that require sustained reasoning over large volumes of information.
The model is available across Google's developer and enterprise platforms, including Google AI Studio, Vertex AI, the Gemini API, the Gemini app, the Antigravity IDE, and NotebookLM.
A New Middle Tier for Thinking
One of the structural changes in this release is the addition of a Medium thinking level, slotted between the existing Low and High modes. According to Digital Applied, this provides developers with a cost-conscious path to balanced performance, allowing tasks that do not require maximum reasoning depth to avoid the latency and compute expense of full High-mode inference. The Medium mode is aimed at agentic workflows where compute allocation needs to be tuned across multiple steps rather than maximized uniformly.
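A request selecting the new tier might look like the sketch below, written as a Python dict mirroring a Gemini API request body. The field names follow existing Gemini API conventions, but the model identifier and the `"medium"` value are assumptions based on this article rather than confirmed API surface.

```python
import json

# Hypothetical request payload selecting the new Medium thinking level.
# "gemini-3.1-pro" and "medium" are assumptions from this article,
# not confirmed API values; field names mirror Gemini API conventions.
request_body = {
    "model": "gemini-3.1-pro",
    "contents": [
        {"role": "user", "parts": [{"text": "Plan the refactor in three steps."}]}
    ],
    "generationConfig": {
        "thinkingConfig": {"thinkingLevel": "medium"}
    },
}

print(json.dumps(request_body, indent=2))
```

The point of the middle tier is that this one field becomes a per-request cost dial: agentic pipelines can send most steps at `"medium"` and reserve `"high"` for the steps that actually need maximum reasoning depth.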
Pricing Held Flat Despite Performance Gains
Google held pricing for Gemini 3.1 Pro at $2.00 per million input tokens and $12.00 per million output tokens—the same rate as Gemini 3 Pro. As noted by Artificial Analysis, this places the model in a somewhat expensive tier relative to peer models with comparable capabilities, though the flat pricing represents an effective cost reduction given the performance improvements. Context caching can reduce costs by up to 75% on repeated or structured inputs, according to Digital Applied.
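At the quoted rates, per-request cost reduces to simple arithmetic. The sketch below uses the prices and caching discount cited above; the token counts and the `request_cost` helper are illustrative, not part of any official SDK.

```python
# Illustrative cost estimate at the quoted Gemini 3.1 Pro rates.
INPUT_PER_M = 2.00     # USD per million input tokens
OUTPUT_PER_M = 12.00   # USD per million output tokens
CACHE_DISCOUNT = 0.75  # up to 75% off cached (repeated) input tokens

def request_cost(input_tokens, output_tokens, cached_fraction=0.0):
    """Estimate USD cost for one request; cached_fraction is the share
    of input tokens served from context cache at the discounted rate."""
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    input_cost = (fresh + cached * (1 - CACHE_DISCOUNT)) / 1e6 * INPUT_PER_M
    output_cost = output_tokens / 1e6 * OUTPUT_PER_M
    return input_cost + output_cost

# 100k input / 10k output, no caching: $0.20 + $0.12
print(round(request_cost(100_000, 10_000), 4))                      # → 0.32
# Same request with 80% of the input cached: input cost drops to $0.08
print(round(request_cost(100_000, 10_000, cached_fraction=0.8), 4))  # → 0.2
```

Under these assumptions, heavy caching on a large, repeated prompt prefix cuts the example request's cost by more than a third, which is why the caching figure matters for full-codebase and long-document workloads.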
Output speed sits at 92.9 tokens per second, above the 69.2 median for comparable reasoning models, though first-token latency remains relatively high at 34 seconds—a tradeoff common to models that perform extended internal reasoning before generating output.
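Those two figures combine into a rough wall-clock estimate for a full response: first-token latency plus output length divided by streaming speed. A back-of-the-envelope sketch using the numbers above (the `estimated_response_time` helper is illustrative):

```python
# Rough end-to-end latency estimate from the reported figures.
FIRST_TOKEN_S = 34.0  # time to first token, seconds
TOKENS_PER_S = 92.9   # sustained output speed, tokens per second

def estimated_response_time(output_tokens):
    """Seconds until the last token arrives, assuming all tokens after
    the first stream at the sustained rate."""
    return FIRST_TOKEN_S + output_tokens / TOKENS_PER_S

# A 2,000-token answer: 34 s of internal reasoning plus ~21.5 s of streaming
print(round(estimated_response_time(2_000), 1))  # → 55.5
```

The estimate makes the tradeoff concrete: for short answers the 34-second first-token wait dominates total latency, while for long outputs the above-median streaming rate does.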
Competitive Landscape
The release arrives as other frontier labs have also been pushing performance boundaries in early 2026. Digital Applied’s comparative analysis notes that Gemini 3.1 Pro leads on most general-purpose benchmarks against OpenAI’s GPT-5.3-Codex, though the latter maintains an advantage on specialized coding tasks such as Terminal-Bench 2.0, where it scored 77.3% versus Gemini 3.1 Pro’s 68.5%.
Google DeepMind also reported that the model remains below critical capability thresholds in CBRN, cybersecurity, and harmful manipulation domains under its internal safety assessment framework—a disclosure that has become increasingly standard practice as labs face regulatory scrutiny over the dual-use potential of high-capability models.
Broader Implications
The ARC-AGI-2 result has drawn particular attention from researchers because the benchmark was specifically designed to resist performance gains achieved purely through scale or data saturation. A score above 77% represents a level of abstract reasoning that was widely considered a near-term ceiling when Gemini 3 Pro scored 31.1% on the same test just months earlier. The jump suggests that architectural changes or training improvements, rather than raw scaling, are responsible for the gains, a distinction that could affect how labs and policymakers think about the pace of AI capability development.
Google’s official blog frames the release as designed for tasks where complex, iterative reasoning is necessary and a direct answer is insufficient—positioning Gemini 3.1 Pro as a tool for professional and research applications rather than general consumer use, even as it remains accessible through the Gemini app.