Z.ai Open-Sources GLM-5.1, a 744-Billion-Parameter Model That Tops SWE-Bench Pro Ahead of GPT-5.4 and Claude Opus 4.6

Overview

Z.ai, the Chinese AI lab formerly known as Zhipu AI, on April 8 released GLM-5.1, an open-source large language model that the company says outperforms both GPT-5.4 and Claude Opus 4.6 on the SWE-Bench Pro software engineering benchmark. The model weights are available on Hugging Face under the MIT license, according to The Decoder.

GLM-5.1 is a post-training upgrade to the GLM-5 foundation model released in February 2026. The architecture remains unchanged: a 744-billion-parameter mixture-of-experts (MoE) design with 40 billion active parameters per token, a 200,000-token context window, and a maximum output length of 131,072 tokens, as reported by The Decoder.
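
To make the mixture-of-experts figures concrete, the short sketch below computes the share of parameters that are active for each token. It uses only the two parameter counts reported above; the function name is illustrative and not taken from any Z.ai code.

```python
# Parameter counts as reported in the article.
TOTAL_PARAMS = 744e9   # total parameters across all experts
ACTIVE_PARAMS = 40e9   # parameters activated per token


def active_fraction(active: float, total: float) -> float:
    """Return the share of parameters used per token in an MoE model."""
    return active / total


fraction = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
print(f"Active per token: {fraction:.1%}")  # prints "Active per token: 5.4%"
```

In other words, only about one parameter in nineteen participates in any single forward pass, which is why per-token compute is far lower than the total parameter count suggests.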

Benchmark Results

On SWE-Bench Pro, a benchmark that measures a model’s ability to resolve real-world software engineering tasks, GLM-5.1 scored 58.4 percent, according to The Decoder. That places it ahead of GPT-5.4 at 57.7 percent and Claude Opus 4.6 at 57.3 percent. On CyberGym, a cybersecurity benchmark, the model scored 68.7 percent, exceeding Claude Opus 4.6’s 66.6 percent.
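
For a sense of how narrow these leads are, the sketch below computes GLM-5.1's margin over each competitor in percentage points. The scores are the ones quoted above; the dictionary layout and the `margins` helper are illustrative assumptions.

```python
# Scores in percent, as quoted in the article.
SCORES = {
    "SWE-Bench Pro": {"GLM-5.1": 58.4, "GPT-5.4": 57.7, "Claude Opus 4.6": 57.3},
    "CyberGym": {"GLM-5.1": 68.7, "Claude Opus 4.6": 66.6},
}


def margins(results: dict, model: str = "GLM-5.1") -> dict:
    """Return the lead of `model` over every other entry, in percentage points."""
    own = results[model]
    return {
        other: round(own - score, 1)
        for other, score in results.items()
        if other != model
    }


for benchmark, results in SCORES.items():
    print(benchmark, margins(results))
# SWE-Bench Pro {'GPT-5.4': 0.7, 'Claude Opus 4.6': 1.1}
# CyberGym {'Claude Opus 4.6': 2.1}
```

The SWE-Bench Pro lead is around one percentage point or less, so small differences in evaluation setup could plausibly change the ordering.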

The results are not uniformly strong. On the Humanity’s Last Exam (HLE) reasoning benchmark, GLM-5.1 scored 31 percent, trailing Gemini 3.1 Pro at 45 percent and GPT-5.4 at 39.8 percent, as reported by The Decoder. This gap suggests the model’s improvements are concentrated in coding and agentic tasks rather than general reasoning.

Agentic Capabilities

The primary advance in GLM-5.1 over its predecessor is the ability to sustain autonomous work sessions for extended periods. Where GLM-5 and competing models typically plateau after roughly 50 turns of interaction, GLM-5.1 continues refining its approach across hundreds of iterations, according to VentureBeat. Z.ai describes this as the model’s capacity for “self-directed optimization,” allowing it to recognize dead ends, revise strategy, and try new approaches without human intervention.

In one demonstration cited by The Decoder, the model made six distinct strategic shifts while optimizing a vector database over 600 iterations, ultimately achieving a throughput of 21,500 queries per second compared to a starting baseline of 3,547 QPS. Z.ai frames the upgrade as a shift “from vibe coding to agentic engineering,” positioning the model for tasks that require sustained multi-hour execution rather than single-turn responses.
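
The size of that improvement can be worked out directly from the two throughput figures. The sketch below is simple arithmetic on the reported numbers; the variable names are illustrative.

```python
# Throughput figures in queries per second, as reported in the article.
BASELINE_QPS = 3_547
FINAL_QPS = 21_500

speedup = FINAL_QPS / BASELINE_QPS
percent_increase = (speedup - 1) * 100

print(f"Speedup: {speedup:.2f}x")             # prints "Speedup: 6.06x"
print(f"Increase: {percent_increase:.0f}%")   # prints "Increase: 506%"
```

That is roughly a sixfold gain over the starting point, reached across the 600 iterations described.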

Architecture and Training

GLM-5.1’s improvements come from refined reinforcement learning and alignment rather than additional pre-training. The underlying GLM-5 model was trained on approximately 28.5 trillion tokens, about 24 percent more than the 23 trillion used for its predecessor GLM-4.5, according to The Decoder.

Notably, the entire GLM-5 family was trained on approximately 100,000 Huawei Ascend 910B chips using the MindSpore framework, with no Nvidia GPUs used at any point in the process. This makes GLM-5.1 one of the most capable models trained entirely on non-Nvidia hardware, a detail with implications for the ongoing US export controls on advanced AI chips to China.

Company Background

Z.ai, a Tsinghua University spinoff backed by Alibaba, Tencent, and Xiaomi, completed a Hong Kong IPO on January 8, 2026, raising approximately HKD 4.35 billion (roughly $558 million), according to The Decoder. The company has been valued at over $5 billion.

What Remains Unclear

All benchmark results cited above come from Z.ai’s own testing. Independent verification by third parties has not yet been published. The model’s weaker performance on abstract reasoning benchmarks like HLE also raises questions about whether the coding-focused improvements generalize to other domains.

Running GLM-5.1 locally remains computationally demanding: inference requires multiple high-end GPUs. While the MIT license removes legal barriers to commercial use, the hardware demands may limit adoption to organizations with significant compute budgets or those willing to use Z.ai’s hosted API.
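
A rough, back-of-the-envelope estimate illustrates the scale of the hardware demand. The sketch below counts memory for the model weights only, at a few common numeric precisions. The 80 GB per-GPU figure and the precision list are assumptions chosen for illustration, not Z.ai's published requirements, and the estimate ignores the KV cache, activations, and framework overhead, so real deployments would need more.

```python
import math

TOTAL_PARAMS = 744e9  # total parameter count reported in the article

# Assumed storage cost per parameter at common precisions (illustrative).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

GPU_MEMORY_GB = 80  # assumption: one 80 GB accelerator per device


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def min_gpus(params: float, bytes_per_param: float, gpu_gb: float = GPU_MEMORY_GB) -> int:
    """Lower bound on the number of GPUs needed just to store the weights."""
    return math.ceil(weight_memory_gb(params, bytes_per_param) / gpu_gb)


for precision, nbytes in BYTES_PER_PARAM.items():
    gb = weight_memory_gb(TOTAL_PARAMS, nbytes)
    print(f"{precision}: {gb:.0f} GB of weights, at least {min_gpus(TOTAL_PARAMS, nbytes)} GPUs")
# bf16: 1488 GB of weights, at least 19 GPUs
# fp8: 744 GB of weights, at least 10 GPUs
# int4: 372 GB of weights, at least 5 GPUs
```

Although only 40 billion parameters are active per token, a mixture-of-experts model generally keeps all experts resident in memory, so the total parameter count, not the active count, drives the memory footprint.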