MLPerf Inference v6.0 Delivers Its Most Ambitious Benchmark Suite Yet as NVIDIA Nearly Triples DeepSeek-R1 Throughput in Six Months
MLCommons' April 2026 inference benchmark round adds text-to-video and vision-language tests while NVIDIA posts a 2.7x jump on DeepSeek-R1 through software alone.
Overview
MLCommons released results for MLPerf Inference v6.0 on April 1, 2026, marking the most significant expansion of the industry-standard benchmark suite since its inception. The round introduced six new tests covering text-to-video generation, multimodal vision-language tasks, and updated reasoning models — while drawing submissions from 24 organizations. The headline result: NVIDIA achieved a 2.7x improvement on DeepSeek-R1 throughput compared to its first submission just six months prior, driven entirely by software optimization rather than new hardware.
What’s New in v6.0
According to MLCommons, the v6.0 round added six benchmarks spanning data center and edge workloads:
- GPT-OSS 120B — an open-weight mixture-of-experts model targeting mathematics, scientific reasoning, and coding
- DeepSeek-R1 Interactive — an expanded reasoning benchmark with an interactive scenario designed to measure token delivery speed and reduced time-to-first-token, including support for speculative decoding
- DLRMv3 — the third-generation recommender system benchmark, featuring the first sequential recommendation test, developed in collaboration with Meta
- Text-to-video generation — the suite’s first generative video benchmark, using WAN-2.2
- Vision-Language Model (VLM) — a Shopify-powered test converting multimodal product data into structured metadata, evaluated on Qwen3-VL-235B
- YOLOv11 Large — an updated edge object detection benchmark from Ultralytics
The round also introduced a new LoadGen++ harness enabling LLM serving with production-style software stacks, alongside an updated interactive dashboard for benchmark result visualization.
NVIDIA’s Dominant Showing
NVIDIA’s performance across the new suite was comprehensive. As reported by The Next Platform, the company posted a 2.7x improvement on DeepSeek-R1 server throughput compared to its v5.1 debut submission — the same GB300 NVL72 hardware, optimized through software alone. In absolute terms, the system reached 2,494,310 tokens per second on the DeepSeek-R1 offline scenario and 250,634 tokens per second in the interactive scenario.
The key software techniques behind the gains, detailed in NVIDIA’s own blog post, included:
- Disaggregated serving via the Dynamo framework, which splits the prefill and decode stages of inference across separate GPU pools
- Multi-Token Prediction, generating up to three tokens per forward pass
- Kernel fusion and KV-aware routing for more efficient memory bandwidth utilization
- Wide Expert Parallel, sharding mixture-of-experts layers across GPUs to reduce inter-GPU communication overhead
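To give a rough sense of why Multi-Token Prediction raises throughput, the sketch below estimates how many tokens a forward pass yields on average. It is a simplified model, not NVIDIA's implementation. It assumes the first token is always kept and each further predicted token is kept only if all earlier ones were, each with the same acceptance probability. The function name and the 0.8 acceptance probability are hypothetical; only the cap of three tokens per pass comes from the reporting above.

```python
def expected_tokens_per_pass(max_tokens: int, accept_prob: float) -> float:
    """Expected tokens emitted per forward pass under a simplified model.

    Assumption: token 1 is always kept; token i+1 is kept only if tokens
    1..i were kept, each with independent probability `accept_prob`.
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be between 0 and 1")
    # Token i (0-based) is emitted with probability accept_prob ** i.
    return sum(accept_prob ** i for i in range(max_tokens))


# Up to three tokens per pass, with a hypothetical 80% acceptance probability.
print(f"{expected_tokens_per_pass(3, 0.8):.2f}")  # 2.44
```

Under this model the gain is capped at three tokens per pass and shrinks quickly as the acceptance probability falls, so the realized speedup depends on how well the extra predictions match the model's own output.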
NVIDIA also submitted scale-out results using four GB300 NVL72 systems interconnected with Quantum-X800 InfiniBand — 288 Blackwell Ultra GPUs in total — setting new system-level throughput records. The company claims 291 cumulative MLPerf wins since 2018, or roughly nine times the total of all other submitters combined.
For the text-to-video benchmark, NVIDIA posted 0.059 samples per second on WAN-2.2, and for the vision-language benchmark it reached 79 samples per second on Qwen3-VL-235B in the offline scenario. Cost efficiency also improved markedly: NVIDIA said it cut the cost of generating each token by more than 60% while using the same infrastructure and power footprint as in v5.1.
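The cost figure is roughly what the 2.7x throughput gain would imply on its own. The sketch below works through that arithmetic, assuming total cost (hardware and power) is unchanged and cost per token is simply total cost divided by tokens produced. NVIDIA's own calculation may use different inputs; the function name is illustrative.

```python
def cost_per_token_reduction(throughput_gain: float) -> float:
    """Fractional drop in cost per token when throughput rises by
    `throughput_gain` and total cost (hardware, power) stays fixed."""
    if throughput_gain <= 0:
        raise ValueError("throughput_gain must be positive")
    # Cost per token scales as 1 / throughput, so the new cost is
    # 1 / throughput_gain of the old one.
    return 1.0 - 1.0 / throughput_gain


print(f"{cost_per_token_reduction(2.7):.1%}")  # 63.0%
```

A 2.7x gain at fixed cost works out to about a 63% lower cost per token, which is consistent with the "more than 60%" figure.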
AMD Crosses One Million Tokens Per Second
AMD’s Instinct MI355X achieved a milestone in v6.0: more than one million tokens per second on GPT-OSS-120B at cluster scale, using up to 94 GPUs in multi-node configurations. As The Decoder reported, AMD also submitted its first-ever results on GPT-OSS-120B and WAN-2.2 text-to-video generation, expanding its benchmark coverage beyond previous rounds.
In single-node, eight-GPU comparisons against NVIDIA’s B200, AMD’s MI355X came close to or exceeded it on several tests, delivering 97% of the B200’s throughput in server mode on Llama 2 70B and 119% in interactive scenarios. However, AMD did not submit results for DeepSeek-R1, leaving the category’s headline results to NVIDIA alone.
Intel Targets the Edge
Intel took a differentiated approach in v6.0, focusing on workstations and edge deployments rather than data center competition. The company submitted results using Arc Pro B70 and B65 GPUs alongside Xeon 6 processors, positioning itself as the only server processor maker submitting standalone CPU inference results for MLPerf. The strategy signals Intel’s acknowledgment that competing with NVIDIA and AMD in large-scale data center inference benchmarks is not its near-term priority.
Participation and Competitive Dynamics
The MLCommons announcement noted a 30% increase in multi-node system submissions versus v5.1, with 10% of submitted systems now exceeding ten nodes — up from just 2% in the prior round. The largest configuration submitted involved 72 nodes with 288 accelerators, quadrupling the previous maximum. Twenty-four organizations participated in total.
Nebius, which submitted across HGX B200, HGX B300, and GB300 NVL72 configurations, observed consistent linear scaling as configurations expanded, with their GB300 NVL72 reaching 673,936 tokens per second on DeepSeek-R1 offline using 72 GPUs.
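Because submissions differ in size, comparing them usually starts by normalizing throughput per accelerator. The sketch below does that for the Nebius GB300 NVL72 figure and adds a simple scaling-efficiency helper. The function names are illustrative, and the inputs to the efficiency example are placeholders rather than reported results.

```python
def per_gpu_throughput(tokens_per_second: float, gpus: int) -> float:
    """Throughput normalized by accelerator count."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return tokens_per_second / gpus


def scaling_efficiency(small_tps: float, small_gpus: int,
                       large_tps: float, large_gpus: int) -> float:
    """Per-GPU throughput of the larger system relative to the smaller one.

    1.0 means perfectly linear scaling; below 1.0 means some throughput
    was lost as the configuration grew.
    """
    return (per_gpu_throughput(large_tps, large_gpus)
            / per_gpu_throughput(small_tps, small_gpus))


# Nebius GB300 NVL72, DeepSeek-R1 offline, 72 GPUs (figure from the article).
print(round(per_gpu_throughput(673_936, 72)))  # 9360

# Placeholder inputs only: 9x the GPUs delivering 9x the throughput.
print(scaling_efficiency(1_000.0, 8, 9_000.0, 72))  # 1.0
```

Per-GPU figures are only a first-order comparison: they ignore differences in software stack, interconnect, and accuracy or latency targets between submissions.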
Notably absent from v6.0 results were Google TPUs and Cerebras — a pattern that continues from prior rounds. Direct cross-vendor comparisons remain difficult given that each company selectively submits to benchmarks where it has the strongest showing and uses different system configurations.
What It Means
The v6.0 round marks a shift in what MLPerf measures. Adding text-to-video and multimodal benchmarks reflects how production AI inference workloads have evolved well beyond text generation. The introduction of the interactive scenario for DeepSeek-R1 — measuring latency alongside throughput — acknowledges that real-world deployments increasingly require responsiveness guarantees, not just raw token output.
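To make the latency-versus-throughput distinction concrete, the sketch below computes two commonly used responsiveness metrics, time-to-first-token (TTFT) and average time per output token (TPOT), from the arrival times of a single streamed response. The function name and the timestamps are hypothetical, and the sketch does not reproduce the thresholds or exact definitions in the MLPerf rules.

```python
from statistics import mean


def latency_metrics(request_sent: float, token_times: list[float]) -> dict[str, float]:
    """Summarize one streamed response.

    request_sent: time the request was issued, in seconds.
    token_times:  arrival time of each output token, in seconds, ascending.
    """
    if not token_times:
        raise ValueError("token_times must contain at least one token")
    ttft = token_times[0] - request_sent
    # Gaps between consecutive tokens; empty if only one token arrived.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    tpot = mean(gaps) if gaps else 0.0
    total = token_times[-1] - request_sent
    return {
        "ttft_s": ttft,
        "tpot_s": tpot,
        "tokens_per_s": len(token_times) / total,
    }


# Hypothetical timestamps: first token after 0.5 s, then one every 0.1 s.
print(latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8]))
```

Two systems with the same aggregate tokens per second can differ widely on TTFT and TPOT, which is why an interactive scenario constrains latency in addition to reporting throughput.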
The 2.7x gain NVIDIA achieved on DeepSeek-R1 through software optimization alone is a concrete demonstration that inference efficiency improvements can be substantial even without hardware upgrades — a relevant data point as organizations managing large GPU fleets consider upgrade timelines. NVIDIA is also developing MLPerf Endpoints, a forthcoming benchmark designed to measure real-world API performance more directly, which could shift how industry comparisons are framed in future rounds.