Gemini 3.1 Pro and the New Agentic Benchmarks

By James Aspinwall

Watch the full episode on YouTube


Google’s Gemini 3.1 Pro represents a significant step forward in what AI models can actually do — not what they know, but how well they perform real tasks over extended sessions. The old question-answering benchmarks are saturated. The new frontier is agentic: web research, office productivity, terminal operations, and multi-turn collaboration. Gemini 3.1 Pro either leads or matches the best on nearly all of them.

The Numbers

ARC-AGI 2 (abstract reasoning): Gemini 3 Pro scored 31.1% roughly three months ago. Gemini 3.1 Pro hits 77%. That’s not incremental improvement — it’s a capability jump in core reasoning that happened in about 90 days.

BrowseComp (web research): Gemini 3.1 Pro leads at 85.9, slightly ahead of Claude Opus 4.6 at 84 and the earlier GPT-5.2. Humans score around 29%. This benchmark asks agents to find extremely obscure, verifiable facts across the live web: the kind of research that takes humans hours and often ends in failure.

Apex Agents (office productivity): Both Gemini 3.1 Pro and Opus 4.6 score 33.5, roughly doubling Gemini 3 Pro’s 18.4 from three months earlier. Still far from 100%, but the trajectory is steep.

TerminalBench 2.0 (command-line operations): Gemini 3.1 Pro leads at 68.5, ahead of Opus 4.6 at 65.4 and GPT-5.2 at 64.7. Gemini 3 Pro was at 56.2.

Tau-2 (conversational collaboration): Opus 4.6 holds the overall lead at 91.9, but Gemini 3.1 Pro hits 99.3 in the telecom subcategory and 90.8 in retail — essentially flawless at telecom support while slightly behind Opus on the aggregate score.

What These Benchmarks Actually Measure

The interesting story isn’t just the scores. It’s what’s being tested.

BrowseComp drops an agent onto the live web and asks it to find “truffle facts” — very short, verifiable answers that are extremely hard to discover. One example: identify a specific fictional character from a convoluted description. The answer is Plastic Man, but getting there requires navigating multiple sites, cross-referencing sources, and following threads across long sessions. Humans solve about 29% and often give up after hours of searching.
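To make the shape of the task concrete, here is a minimal sketch of the search-and-verify loop such an agent runs. It is illustrative only: `web_search`, `fetch_page`, and `llm` are hypothetical stand-ins supplied by the caller, not BrowseComp's actual harness.

```python
from typing import Callable

# Illustrative sketch of a BrowseComp-style research loop. The helpers
# (web_search, fetch_page, llm) are hypothetical stand-ins passed in by
# the caller; this is not the benchmark's actual harness.
def research(question: str,
             web_search: Callable[[str], list[str]],
             fetch_page: Callable[[str], str],
             llm: Callable[[str], str],
             max_steps: int = 50) -> str | None:
    notes: list[str] = []  # evidence accumulated across the long session
    for _ in range(max_steps):
        # Model picks the next query based on everything gathered so far.
        query = llm(f"Question: {question}\nNotes so far: {notes}\nNext search query?")
        for url in web_search(query)[:3]:
            notes.append(f"{url}: {fetch_page(url)}")  # cross-reference sources
        # Answer only once the model judges the fact is verified.
        answer = llm(f"Question: {question}\nNotes: {notes}\n"
                     f"Reply with the short verified answer, or NOT_YET.")
        if answer != "NOT_YET":
            return answer  # e.g. "Plastic Man"
    return None  # session exhausted without a verified answer
```

The loop is simple; what the benchmark stresses is persistence across many iterations of it without losing the thread.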

Apex Agents simulates a real office environment with documents, spreadsheets, emails, and Slack. The model is asked to produce client-ready deliverables — things like market penetration scoring analyses that a human analyst would spend one to two hours on. At 100%, this benchmark would imply that much of white-collar knowledge work is fully automatable. Current models are at roughly a third of that. Far from replacement, but far enough from zero to be meaningful.
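For a sense of scale, one of those deliverables might reduce to something like the toy calculation below. The column names and formula are assumptions for illustration, not Apex Agents' actual rubric.

```python
import pandas as pd

# Toy illustration of an analyst-style deliverable (assumed columns and
# formula, not Apex Agents' rubric): market penetration per segment as
# active customers over total addressable accounts.
accounts = pd.DataFrame({
    "segment":     ["enterprise", "mid-market", "smb"],
    "customers":   [120, 340, 2100],
    "addressable": [800, 2500, 40000],
})
accounts["penetration_pct"] = (accounts["customers"]
                               / accounts["addressable"] * 100).round(1)
print(accounts.sort_values("penetration_pct", ascending=False))
```

The arithmetic is trivial; the benchmark's difficulty is in finding the right inputs scattered across documents, email, and chat, then packaging the result the way a client expects.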

TerminalBench 2.0 puts models in Docker sandboxes and asks them to operate via the command line: configuring web servers, processing data, training ML models. This plays to the strengths of language models — text-native interfaces, structured commands, clear feedback loops. It’s arguably the most natural environment for an LLM to work in.
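A harness for this kind of evaluation plausibly reduces to the loop sketched below: start a disposable container, let the model propose shell commands, run them with `docker exec`, and feed the output back. This is an assumed shape, not TerminalBench's actual code, and `propose_command` is a hypothetical stand-in for the model call.

```python
import subprocess
import uuid
from typing import Callable

# Sketch of a sandboxed terminal-agent loop (assumed shape, not the real
# TerminalBench harness); propose_command stands in for the model.
def run_terminal_task(task: str,
                      propose_command: Callable[[str], str],
                      max_turns: int = 30) -> str:
    name = f"sandbox-{uuid.uuid4().hex[:8]}"
    # Disposable container that idles until we exec commands into it.
    subprocess.run(["docker", "run", "-d", "--name", name,
                    "ubuntu:24.04", "sleep", "infinity"], check=True)
    try:
        transcript = f"Task: {task}\n"
        for _ in range(max_turns):
            cmd = propose_command(transcript)  # model chooses the next command
            if cmd == "DONE":
                break
            result = subprocess.run(["docker", "exec", name, "sh", "-c", cmd],
                                    capture_output=True, text=True, timeout=120)
            # The feedback loop: the model sees exactly what the terminal printed.
            transcript += f"$ {cmd}\n{result.stdout}{result.stderr}"
        return transcript
    finally:
        subprocess.run(["docker", "rm", "-f", name], check=False)
```

Everything in the loop is text in, text out, which is why this environment suits language models so well.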

Tau-2 is the most human benchmark of the set. It measures collaborative, stateful conversation where the model coordinates with a partner agent and a dynamic world state. One scenario: a telecom support agent helping a non-technical 64-year-old librarian troubleshoot a PC, where the human is inconsistent, confused, and doesn’t follow instructions cleanly. The model has to maintain context, adapt to confusion, and still reach a resolution. Gemini 3.1 Pro’s 99.3 in telecom means it essentially never fails at this.
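The word "stateful" is doing real work there. In the sketch below (an assumed shape, not Tau-2's code), the agent's tool calls mutate a shared world state, and success is judged on that final state rather than on the transcript; `agent_reply` and `user_sim` are hypothetical stand-ins for the model and the simulated human.

```python
from typing import Callable

# Hypothetical sketch of a Tau-2-style stateful dialogue. agent_reply and
# user_sim stand in for the model and the simulated user; not Tau-2's code.
def run_dialogue(world: dict,
                 tools: dict[str, Callable],
                 agent_reply: Callable,
                 user_sim: Callable,
                 max_turns: int = 40) -> dict:
    history: list[tuple[str, str]] = []
    user_msg = user_sim(history, world)  # the confused user speaks first
    for _ in range(max_turns):
        history.append(("user", user_msg))
        reply, tool_calls = agent_reply(history)  # model may talk and/or act
        for tool_name, args in tool_calls:
            world = tools[tool_name](world, **args)  # actions mutate the world
        history.append(("agent", reply))
        user_msg = user_sim(history, world)  # user reacts, often imperfectly
        if user_msg == "END":
            break
    return world  # graded against a goal state, not the transcript wording
```

Because grading targets the world state, a model that is polite but never actually fixes the account fails; near-perfect telecom scores mean the fix happens almost every time.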

Why the Old Benchmarks Don’t Matter Anymore

The video makes a point worth emphasizing: question-answering benchmarks are mostly saturated. When every frontier model scores above 90% on knowledge retrieval and reasoning puzzles, the numbers stop differentiating. The models all “know” roughly the same things.

The new benchmarks test something different — the ability to do work. Navigate the web for hours. Produce a deliverable from raw data. Operate a terminal. Handle a confused customer over multiple turns without losing context or patience. These are the capabilities that determine whether AI models become tools people actually rely on or remain impressive demos.

Google’s Framing

Google is positioning Gemini 3.1 Pro explicitly as a reasoning and agentic model. They describe it as having the same core intelligence as their DeepThink system but exposed for practical applications. The message is clear: this isn’t a chatbot upgrade, it’s a work-capable agent.

The competitive positioning is notable. On BrowseComp and TerminalBench, Gemini 3.1 Pro leads. On Apex Agents, it matches Opus 4.6. On Tau-2, it trails Opus overall but dominates specific verticals. Google is no longer trailing — on agentic tasks, they’re at or near the front.

The Caveat

Benchmarks are controlled environments. Launch-day API instability is real — the host notes that hands-on testing will be the true measure once things stabilize. Scores on curated tasks don’t always translate to reliable performance in messy, real-world workflows.

But the trajectory is what matters here. Doubling on Apex Agents in 90 days. Going from 31% to 77% on ARC-AGI 2 in the same period. These aren’t marginal gains. If this rate continues through even one more iteration, several of these benchmarks will be effectively solved — and the conversation shifts from “can AI do this work?” to “how do we integrate it?”

That shift is closer than most organizations are prepared for.