News · 4 min read

PrismML Exits Stealth With First Commercially Viable 1-Bit Large Language Models, Fitting an 8B-Parameter Model in 1.15 GB

A Caltech spinout backed by Khosla Ventures has released a family of fully 1-bit language models under Apache 2.0 that compress an 8-billion-parameter model to 1.15 GB while matching full-precision rivals on standard benchmarks.

Verified pipeline
Sources: 3 · Publisher: signed · Contributor: signed · Hash: 870ff3ffe8

Overview

PrismML, a startup founded by Caltech researchers, emerged from stealth on March 31 with what it calls the first commercially viable family of fully 1-bit large language models. The flagship 1-bit Bonsai 8B compresses an 8.19-billion-parameter model into 1.15 GB of memory — a 14x reduction from the 16.38 GB required by the same architecture at full FP16 precision — while scoring competitively against leading 8B-class models on standard benchmarks. All three variants ship under the Apache 2.0 license.

The release lands in the same efficiency-focused lane as Microsoft Research’s BitNet b1.58, which The Machine Herald covered in February, but PrismML’s models mark the first time a commercially licensed, production-ready 1-bit model family has been made available for general download.

How 1-Bit Quantization Works in Bonsai

Traditional large language models store each parameter as a 16-bit or 32-bit floating-point number. Bonsai reduces every weight to a single bit: each weight stores only its sign, and every group of 128 weights shares one FP16 scale factor. That works out to an effective 1.125 bits per weight (1 sign bit plus 16/128 of a bit for the shared scale) across the entire network, including embeddings, attention projections, MLP layers, and the language model head.
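The article's description of the scheme can be sketched in a few lines. This is an illustrative reconstruction, not PrismML's implementation: the function names, the use of the mean absolute value as the group scale, and the round-trip shown are assumptions based only on "sign plus one FP16 scale per 128 weights."

```python
# Sketch of per-group 1-bit quantization as described for Bonsai.
# Assumed scheme: sign weights + one FP16 scale per 128-weight group;
# names and the scale rule (mean |w| per group) are illustrative.
import numpy as np

GROUP_SIZE = 128

def quantize_1bit(weights: np.ndarray):
    """Map each weight to +/-1 and keep one FP16 scale per 128-weight group."""
    w = weights.reshape(-1, GROUP_SIZE)
    signs = np.where(w >= 0, 1, -1).astype(np.int8)    # 1 bit of information per weight
    scales = np.abs(w).mean(axis=1).astype(np.float16) # one FP16 scale per group
    return signs, scales

def dequantize(signs: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate weights as sign * group scale."""
    return (signs.astype(np.float32) * scales[:, None].astype(np.float32)).reshape(-1)

# Effective storage cost: 1 sign bit plus a shared 16-bit scale per 128 weights.
bits_per_weight = 1 + 16 / GROUP_SIZE                    # 1.125 bits
footprint_gb = 8.19e9 * bits_per_weight / 8 / 1e9        # ~1.15 GB for 8.19B params
```

The footprint arithmetic matches the reported numbers: 8.19 billion parameters at 1.125 bits each comes to about 1.15 GB, versus 16.38 GB at two bytes per weight in FP16.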

The architecture underneath is a Qwen3-8B dense transformer with 36 decoder blocks, grouped-query attention (32 query heads, 8 key-value heads), SwiGLU activation, and rotary position embeddings supporting a 65,536-token context window. PrismML trained the models on Google TPU v4 pods using proprietary techniques developed from research at Caltech, where CEO Babak Hassibi holds a professorship in electrical engineering and computing.

Benchmark Performance

PrismML published evaluation scores using EvalScope v1.4.2 across six benchmarks. The 1-bit Bonsai 8B achieved an average score of 70.5, compared with 79.3 for the full-precision Qwen 3 8B (16 GB), 71.0 for Mistral 3 8B (16 GB), and 67.1 for Meta’s Llama 3.1 8B (16 GB). The model scored 88 on GSM8K math reasoning, 65.7 on MMLU-Redux, and 79.8 on instruction-following (IFEval).

The company introduced an “intelligence density” metric, defined as the negative natural log of a model’s error rate divided by its size in gigabytes, on which Bonsai 8B scored 1.062 per GB, roughly 10.8x higher than Qwen 3 8B’s 0.098. The metric captures the core thesis: at one-fourteenth the memory, Bonsai retains most of a full-precision model’s capability.
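The published figures are easy to reproduce from the definition. Treating the benchmark average as an accuracy percentage and using the natural log (an assumption consistent with the reported values):

```python
# "Intelligence density": -ln(error rate) per GB of model size.
# Scores and sizes are the article's; the natural-log reading is inferred
# from the fact that it reproduces the published 1.062 and 0.098 figures.
import math

def intelligence_density(avg_score_pct: float, size_gb: float) -> float:
    error_rate = 1 - avg_score_pct / 100
    return -math.log(error_rate) / size_gb

bonsai = intelligence_density(70.5, 1.15)  # ~1.062 per GB
qwen3 = intelligence_density(79.3, 16.0)   # ~0.098 per GB
ratio = bonsai / qwen3                     # ~10.8x
```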

Speed and Energy Efficiency

The memory savings translate directly into inference speed because autoregressive decoding is typically memory-bandwidth bound: generating each token requires streaming the model’s weights from memory, so a smaller model moves less data per token. On an NVIDIA RTX 4090, Bonsai 8B generates 368 tokens per second — 6.2x faster than the FP16 baseline’s 59 tokens per second. On an Apple M4 Pro laptop, throughput reaches 85 tokens per second, a 5.4x improvement. An RTX 3060 laptop GPU sees the most dramatic gain at 23x, jumping from 3.5 to 81 tokens per second.
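A back-of-envelope model illustrates the bandwidth-bound claim. If every generated token must stream roughly the whole weight file from VRAM, the throughput ceiling is bandwidth divided by model size. The 1008 GB/s figure below is the RTX 4090’s spec-sheet memory bandwidth; the comparison to reported throughput is our own rough check, not PrismML’s analysis:

```python
# Bandwidth-bound ceiling for decode throughput: tokens/s ~= bandwidth / model size.
# 1008 GB/s is the RTX 4090's rated memory bandwidth (assumption: weights dominate traffic).

def bandwidth_bound_tps(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

fp16_ceiling = bandwidth_bound_tps(1008, 16.38)   # ~61.5 tok/s; article reports 59
bonsai_ceiling = bandwidth_bound_tps(1008, 1.15)  # ~876 tok/s; article reports 368
```

The FP16 baseline runs near its theoretical ceiling, which supports the bandwidth-bound framing; the 1-bit model lands well under its ceiling, suggesting that once weights shrink this far, dequantization and compute costs start to dominate instead.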

Energy consumption drops in proportion. PrismML measured 0.276 milliwatt-hours per token on the RTX 4090, compared with 1.134 mWh for the FP16 model — a 4.1x efficiency gain. On the M4 Pro, the advantage widens to 5.1x.

The Model Family

Alongside the 8B flagship, PrismML released two smaller variants. The 4B model fits in 0.57 GB and the 1.7B model in just 0.24 GB, targeting robotics, wearable devices, and real-time agent applications. The company claims the 1.7B model runs at 130 tokens per second on an iPhone 17 Pro Max.

All models are available for download on Hugging Face in GGUF format and run through a fork of llama.cpp maintained by PrismML. An MLX variant is also provided for Apple Silicon.

Backing and Team

PrismML was founded by Babak Hassibi, the Mose and Lillian S. Bohn Professor of Electrical Engineering and Computing and Mathematical Sciences at Caltech, along with co-founders Sahin Lale, Omead Pooladzandi, and Reza Sadri, all of whom hold PhDs and have ties to Caltech. The company has received investment from Khosla Ventures and Cerberus Ventures, with compute support from Google and Caltech grants.

“1-bit functions as a starting point” for achieving “maximum intelligence per unit of compute and energy,” Hassibi stated in the company’s announcement. Khosla Ventures founder Vinod Khosla added that “AI’s future” depends on delivering the “most intelligence per unit of energy and cost.”

What Comes Next

The release arrives at a moment when on-device inference is gaining strategic importance. Apple, Google, Samsung, and Qualcomm are all shipping phones with dedicated AI accelerators, and automotive manufacturers are embedding language models in vehicle systems. A model that fits an 8B-parameter architecture in just over a gigabyte could lower the hardware floor for deploying capable language models from data center GPUs to consumer laptops, phones, and embedded systems — provided the roughly 10-point accuracy gap to full-precision models proves acceptable for the target applications.