Zyphra Releases ZAYA1-8B, an 8.4B-Parameter MoE Reasoning Model Trained End-to-End on 1,024 AMD MI300X GPUs
Zyphra's open-weight ZAYA1-8B uses 760M active parameters out of 8.4B total and was trained on a 1,024-GPU AMD Instinct MI300X cluster, narrowing the gap to frontier reasoning models on math benchmarks.
Overview
AI lab Zyphra on May 6 released ZAYA1-8B, an open-weight mixture-of-experts (MoE) reasoning model that the company says matches or exceeds substantially larger open-weight competitors on mathematics, coding, and reasoning benchmarks. According to Zyphra’s Hugging Face model card, the model has 760 million active parameters and 8.4 billion total parameters. The release is notable on two axes: it is one of the strongest reasoning models trained end-to-end on AMD silicon rather than NVIDIA, and it advances Zyphra’s bet on architectural efficiency over sheer parameter scale.
What We Know
Training stack. Zyphra pretrained, midtrained, and fine-tuned ZAYA1-8B on what it describes as a custom AMD-only cluster: per the press release, the model was “trained on custom AMD Instinct™ MI300X clusters with AMD Pensando Pollara networking on IBM Cloud infrastructure.” Yahoo Finance, republishing the announcement, specifies a cluster of 1,024 MI300X GPUs connected via AMD Pensando Pollara interconnect. The accompanying arXiv technical report, submitted the same day, describes the setup as a “full-stack AMD compute, networking, and software platform.”
Architecture. ZAYA1-8B is built on what Zyphra calls its MoE++ architecture, as described by VentureBeat, which introduces three departures from standard transformer MoE designs. The first is Compressed Convolutional Attention (CCA), which performs sequence mixing in a compressed latent space and yields what VentureBeat reports as an “8x reduction in KV-cache size compared to full multi-head attention.” The second is the ZAYA1 MLP router, which replaces the linear gating function that most MoE models use with a multi-layer MLP and a bias-balancing scheme inspired by PID controllers. The third is learned residual scaling, which Yahoo Finance describes as a mechanism for controlling residual-norm growth during training. The press release also describes a four-stage reinforcement learning cascade after pretraining, comprising reasoning warmup, an adaptive difficulty curriculum, math/code RL, and behavioral fine-tuning.
Test-time compute. The model ships alongside a new test-time-compute scheme Zyphra calls Markovian RSA. The arXiv report describes it as “a test-time compute method that recursively aggregates parallel reasoning traces.” VentureBeat expands the description: it combines parallel trace generation with the Markovian thinker pattern of performing reasoning in chunks of fixed duration.
Benchmarks. On the Hugging Face model card, Zyphra reports a score of 89.1 on AIME’26, 71.6 on HMMT Feb.‘26, and 71.0 on GPQA-Diamond. Under extended test-time compute, the arXiv technical report reports 91.9 percent on AIME’25 and 89.6 percent on HMMT’25, with the model matching or exceeding DeepSeek-R1-0528 on “several challenging mathematics and coding benchmarks.” VentureBeat notes that ZAYA1-8B retains competitive performance against GPT-5-High and DeepSeek-V3.2 on third-party evaluations.
Distribution and license. Per the Hugging Face model card, the model is released under Apache-2.0 and shipped in BF16 weights. It is also accessible through Zyphra Cloud, according to the press release.
Leadership framing. In the press release, Zyphra Founder and CEO Krithik Puthalath said: “ZAYA1-8B demonstrates what is possible when architecture, pretraining, and reinforcement learning are co-designed toward a single objective: maximizing the intelligence extracted per parameter and per FLOP.”
What We Don’t Know
Zyphra has not publicly disclosed the dollar cost of the training run, the wall-clock duration, or the size of the pretraining corpus. The company has also not detailed how it allocated MI300X capacity between pretraining, midtraining, and the four-stage RL cascade, nor how Markovian RSA’s compute budget compares to standard chain-of-thought decoding under matched FLOPs. Independent third-party benchmark replications had not appeared at publication time, and the arXiv submission notes the parameter count as “700M active and 8B total” — a slight rounding compared with the 760M / 8.4B figure on the Hugging Face card, which is the more granular of the two.
Analysis
The release sits at the intersection of two industry currents. The first is AMD’s push to be taken seriously as a training platform, not just an inference one: a 1,024-GPU MI300X cluster running an end-to-end pretraining, midtraining, and RL pipeline on Pensando Pollara networking is a more demanding existence proof than a single-node demo. The second is the growing argument that frontier reasoning is increasingly a function of architecture, post-training, and test-time compute rather than raw parameter count. VentureBeat frames the release as a counter-narrative to scaling-only roadmaps, arguing the next frontier is “smarter ‘thinking’ algorithms that can do more with less.” Whether ZAYA1-8B’s reported scores hold up under independent evaluation — particularly its claim, in the arXiv report, of narrowing the gap to Gemini-2.5 Pro, DeepSeek-V3.2 and GPT-5-High — will determine how much weight the broader research community gives the MoE++ recipe.