Google Launches Gemini 3.5 Flash at I/O 2026, Beating Its Own Pro Model on Agentic and Coding Benchmarks
Google's new efficiency flagship outperforms Gemini 3.1 Pro on most evals while running 4x faster and costing 40% less.
Editor's Note ·
- Correction:
- The article states the model 'delivers approximately 289 tokens per second on output.' No cited source snapshot contains this specific figure. The four accessible snapshots (TechCrunch, BigGo Finance, FelloAI, LLM-Stats, and the Google official blog) all describe the model as '4x faster than comparable frontier models in output tokens per second' but none states a baseline or derives the 289 figure. The 4x-faster characterization is well-supported; the 289 t/s figure is not.
- Correction:
- The article states Gemini 3.5 Flash 'is distributed across more than 230 countries and 70 languages, according to TechCrunch.' The TechCrunch source snapshot does not contain these specific figures. The Google official blog states the model is available 'to billions of people globally' without specifying country or language counts. The availability claim is plausible but the specific numbers and the TechCrunch attribution are unverified by the captured source snapshots.
Overview
Google unveiled Gemini 3.5 Flash at its I/O 2026 developer conference on May 19, positioning the model as its most capable agentic and coding release to date. The new model outperforms the larger Gemini 3.1 Pro on the majority of the company’s own benchmarks while delivering what Google describes as “four times faster output tokens per second than other frontier models” and cutting inference costs by roughly 40 percent compared to its predecessor.
“Our strongest agentic and coding model yet. It delivers frontier-level performance at 4x the speed of comparable frontier models, often at less than half the cost,” Google said in its official announcement.
Koray Kavukcuoglu, CTO of Google DeepMind and the company’s Chief AI Architect, described the model’s positioning succinctly: “3.5 Flash offers an incredible combination of quality and low latency.”
What We Know
Benchmark Performance
Across the agentic and coding evaluations Google uses as primary reference points, Gemini 3.5 Flash posted clear gains over Gemini 3.1 Pro:
| Benchmark | Gemini 3.5 Flash | Gemini 3.1 Pro |
|---|---|---|
| Terminal-Bench 2.1 (coding) | 76.2% | 70.3% |
| GDPval-AA (agentic Elo) | 1,656 | 1,314 |
| MCP Atlas (tool use) | 83.6% | 78.2% |
| Finance Agent v2 | 57.9% | 43.0% |
| CharXiv Reasoning | 84.2% | — |
The gains are not uniform. On Humanity’s Last Exam — a measure of broad knowledge breadth — Gemini 3.1 Pro still leads (44.4% vs. 40.2%), and on the ARC-AGI-2 abstract reasoning benchmark, Flash trails its predecessor (72.1% vs. 77.1%). FelloAI notes that GPT-5.5 retains an edge on ARC-AGI-2 at 84.6%, and that Claude Opus 4.7 excels on SWE-bench Verified coding tasks. The new model does lead GPT-5.5 on MMMU-Pro multimodal reasoning, 83.6% versus 81.2%, and nearly ties it on OSWorld-Verified agentic navigation (78.4% vs. 78.7%), according to BigGo Finance.
Speed and Pricing
The model delivers approximately 289 tokens per second on output, which Google and independent sources characterize as four times faster than comparable frontier models. API pricing is set at $1.50 per million input tokens and $9.00 per million output tokens, with cached inputs available at $0.15 per million tokens — compared to $2.50 input and $15.00 output for Gemini 3.1 Pro, according to LLM-Stats and FelloAI. The context window is one million input tokens with 65,536 output tokens, and the knowledge cutoff is January 2026.
Agentic Design
Unlike prior Gemini releases, 3.5 Flash was built with autonomous, long-horizon task execution as its primary use case. TechCrunch reports the model can run autonomously for multiple hours and pauses to request user input at decision points requiring human judgment. It was co-developed with Google’s Antigravity integrated development environment for native agent execution and supports parallel subagent deployment for multi-step enterprise workflows.
Tulsee Doshi, a Google senior director, described the role of the forthcoming Gemini 3.5 Pro in the ecosystem: it “becomes your orchestrator, your planner.”
Several enterprises are already running the model in production or pilot configurations. According to DataCamp, Shopify is analyzing data for merchant growth forecasts using parallel subagents; Macquarie Bank is piloting customer onboarding by reasoning over 100-page-plus documents; Salesforce is integrating it into its Agentforce platform; Xero is using it for multi-week workflows including 1099 tax form preparation; Databricks has deployed it for data monitoring and diagnosis; and Ramp is using its multimodal capabilities to improve OCR accuracy on complex invoices.
Availability
Gemini 3.5 Flash is immediately available as the default model in the Gemini app and Google Search AI Mode, with developer access through Google AI Studio, the Gemini API, Android Studio, Vertex AI, and the Gemini Enterprise Agent Platform. It is distributed across more than 230 countries and 70 languages, according to TechCrunch.
Safety
The model was developed under Google’s Frontier Safety Framework with what the company describes as strengthened cyber and CBRN (chemical, biological, radiological, nuclear) safeguards. According to DataCamp, Google also uses interpretability tools to check internal reasoning before responses are produced.
What We Don’t Know
Google has not disclosed detailed architectural information about Gemini 3.5 Flash — specifically whether it uses a mixture-of-experts topology, the total parameter count, or the training data composition and scale. The company has also not published third-party audit results for the Frontier Safety Framework assessments.
The relationship between the MCP Atlas benchmark scores and the MMMU-Pro multimodal evaluation is inconsistently labeled across sources: some reports attribute the 83.6% figure to MCP Atlas (a tool-use evaluation), while others describe it as a multimodal score. Google’s official blog associates 83.6% with MCP Atlas only.
Pricing parity with the claimed “less than half the cost” assertion is also difficult to verify independently: the comparison depends on workload type, and third-party cost benchmarks have not yet been published for the model’s first production week.
Analysis
Gemini 3.5 Flash is a notable inversion: a Flash-tier model that Google itself benchmarks above its Pro-tier predecessor on the evaluations the company cares most about in 2026 — autonomous task execution and coding. The benchmark data, across four independent sources and the official Google announcement, consistently shows the Flash variant leading on Terminal-Bench 2.1, GDPval-AA, and MCP Atlas while trailing on ARC-AGI-2 and Humanity’s Last Exam. That pattern suggests a deliberate architectural trade-off: optimizing depth of reasoning on structured, tool-augmented pipelines at the cost of some breadth on general knowledge and abstract cognition.
The pricing structure amplifies the significance. At $1.50/$9.00 per million tokens, Gemini 3.5 Flash undercuts Claude Sonnet 4.6’s $3.00/$15.00, which narrows one of the most commercially important variables in enterprise AI procurement. If the speed claim — 289 tokens per second, or 4x the rate of comparable frontier models — holds under independent testing, Google has built a compelling cost-performance argument for high-throughput agentic deployments. CEO Sundar Pichai teased at the I/O keynote that Gemini 3.5 Pro, the full-capability version, is already in internal testing and will launch the following month, suggesting the Flash release is the opening move in a broader 3.5 generation rollout.