Alibaba Qwen 3.5 Small Models Pack Multimodal Intelligence Into Sub-10-Billion Parameters for Edge Devices

Overview

Alibaba’s Qwen team released the Qwen 3.5 Small model series in late February 2026 — four dense language models at 0.8 billion, 2 billion, 4 billion, and 9 billion parameters, all licensed under Apache 2.0. The release represents a deliberate bet that multimodal AI capable of processing text, images, and video can run locally on phones, laptops, and single-GPU workstations without cloud connectivity.

The headline result: according to the official model card, the 9B variant scores 70.1 on MMMU-Pro versus 57.2 for OpenAI’s GPT-5-Nano and 78.9 on MathVision versus 62.2 — outperforming a model from a company with vastly greater resources on vision-language tasks.

What We Know

Architecture and design choices

All four models share a unified architecture built on Gated DeltaNet hybrid attention, as detailed in the Qwen3.5-9B and Qwen3.5-0.8B model cards on HuggingFace. The design uses a 3:1 ratio of linear-to-full attention blocks, reducing the computational cost of long-context processing while preserving the quality of full attention where it matters most. All models support a native context window of 262,144 tokens, with the 9B variant extendable to over 1 million tokens via YaRN scaling. A 248,000-token vocabulary covers 201 languages, and multi-token prediction accelerates inference by generating several tokens per forward pass.

Text and visual data are processed through early fusion multimodal training rather than through separate encoders, meaning vision capabilities are integrated from the start of training rather than bolted on afterward. According to the model documentation, this unified vision-language foundation achieves cross-generational parity with the larger Qwen3 models while outperforming Qwen3-VL models on reasoning, coding, agents, and visual understanding.

Benchmark performance

The 9B model’s benchmarks show it scoring 82.5 on MMLU-Pro and 81.7 on GPQA Diamond. On instruction following (IFEval), it reaches 91.5, and on LongBench v2 it scores 55.2. As Awesome Agents notes, these numbers surpass the previous-generation Qwen3-30B — a model three times its size — which scored 77.2 on GPQA Diamond.

In vision tasks, the gap against OpenAI’s GPT-5-Nano is striking. According to the official benchmarks, the 9B achieves 70.1 on MMMU-Pro versus GPT-5-Nano’s 57.2, and 78.9 on MathVision versus 62.2. Even the smallest 0.8B model posts competitive scores: 62.2 on MathVista and 74.5 on OCRBench, according to its model card.

Hardware requirements and deployment

The models are designed to run on consumer hardware without cloud infrastructure. As Awesome Agents reports, the 0.8B variant requires roughly 1.6 GB of VRAM, suitable for phones and embedded devices. The 2B model fits in approximately 4 GB, running on any recent laptop GPU. The 4B targets mid-range GPUs like the RTX 3060 at about 8 GB, and the 9B needs around 18 GB, fitting an RTX 3090 or 4090. All models are available on HuggingFace with quantized GGUF variants for further memory reduction.

Strategic context

The release enters a competitive field of small-model offerings from Meta (Llama 3.2), Microsoft (Phi-3.5), and Google (Gemma 2). Alibaba’s approach differs in a key respect: while most competitors have treated text and vision as separate model families at small scales, the Qwen 3.5 Small series ships native multimodal capabilities across the entire lineup, from the 0.8B model up. The Apache 2.0 license removes commercial use restrictions, allowing immediate integration into products and services.

What We Don’t Know

Alibaba has not published detailed training data composition or the total compute budget used to develop the series. The company has also not disclosed whether the models underwent red-teaming or adversarial safety testing beyond standard benchmark evaluation, a gap that matters as on-device deployment removes the guardrails that cloud providers typically layer around hosted models.

While the benchmark numbers are strong, independent evaluations by third parties are still emerging. How well the models perform on real-world enterprise workloads — retrieval-augmented generation, multi-turn tool use, and long-document summarization in production settings — remains to be demonstrated at scale.