Basecamp Research Launches Trillion Gene Atlas to Map 100 Million Species and Accelerate AI-Designed Therapeutics

Basecamp Research, a UK biotech startup that has spent five years building what it calls the world’s largest ethically sourced biological dataset, announced on March 18 that it is launching the Trillion Gene Atlas, a landmark initiative to expand known evolutionary genetic diversity by 100-fold. Unveiled simultaneously at SXSW in Austin and NVIDIA’s GTC conference in San Jose, the project aims to collect genomic data from more than 100 million species across thousands of sampling sites in 31 countries on five continents, compressing what would otherwise take over two decades of data gathering into less than two years.

The company has assembled a coalition of technology partners to execute the initiative. Ultima Genomics and PacBio will provide industrial-scale DNA sequencing, Anthropic’s Claude will serve as a reasoning engine for interpreting complex biological data, and NVIDIA’s CUDA-X libraries and AI infrastructure will handle the computational demands of processing trillions of genetic sequences. The scale of ambition has drawn comparisons to the Human Genome Project, which took 13 years and roughly $3 billion to sequence a single species.

From Arctic expeditions to a biological internet

Basecamp Research was co-founded by Glen Gowers and Oliver Vince after a 2019 Arctic expedition revealed that two-thirds of the biological samples they collected had never been recorded by science. That experience seeded the company’s central thesis: the vast majority of Earth’s genetic diversity remains unmapped, and public genomic databases represent only a narrow slice of life’s evolutionary solutions to biological problems.

Over the subsequent five years, Basecamp built BaseData, a proprietary metagenomic dataset spanning over one million newly discovered species and containing more than 10 billion genes that are new to science. The company says BaseData represents a 10-fold expansion of known protein diversity compared to all public databases combined. Having raised $85 million in venture capital to date, Basecamp has also built a data provenance system that pays royalties to 60 organizations across 21 countries based on how their contributed samples are used in downstream applications.

EDEN and the scaling laws of biological AI

The Trillion Gene Atlas builds on Basecamp’s EDEN (Environmentally-Derived Evolutionary Network) family of foundation models, described in a January 2026 preprint. The flagship model, EDEN-28B, is a 28-billion-parameter system trained on 9.7 trillion nucleotide tokens from BaseData. It is, in effect, a GPT-4-scale model for biology rather than language.

Training EDEN on BaseData’s contextually rich metagenomic sequences rather than the decontextualized protein records in public databases revealed what Basecamp describes as new scaling laws for biological AI. Where most biological foundation models plateau as dataset size increases, EDEN’s performance followed steeper scaling trajectories with higher quality, fully contextualized data. Public databases, the company notes, contain fewer than 250 million sequences. The Trillion Gene Atlas is designed to extend this principle by another 100-fold, feeding far greater evolutionary diversity into future EDEN iterations.

The practical implications have already emerged. EDEN became what Basecamp calls the first model capable of designing diverse therapeutics directly from a disease prompt, rather than simply predicting protein structures or properties. In antimicrobial peptide design, 32 out of 33 EDEN-designed peptides were functional, a 97 percent hit rate against WHO critical-priority and multidrug-resistant pathogens. The model also demonstrated zero-shot activity in primary human T-cells for gene therapy applications.

AI-Programmable Gene Insertion

Perhaps the most consequential capability is what Basecamp has branded aiPGI (AI-Programmable Gene Insertion), a system that uses EDEN to design integration site-specific enzymes capable of inserting large therapeutic DNA sequences at precise locations in the human genome. In laboratory assays, aiPGI-engineered CAR T-cells achieved over 90 percent tumor-cell clearance when cancer-fighting DNA was integrated at novel safe-harbor sites. The approach bypasses the viral vector delivery that constrains many current gene therapies, potentially enabling larger genetic payloads and more predictable integration patterns.

The combination of programmable gene insertion and antimicrobial design positions EDEN across two of medicine’s most pressing frontiers: personalized cell therapies for cancer and the global antibiotic resistance crisis. The Machine Herald has previously covered the expanding intersection of AI and biotechnology, including OpenAI and Ginkgo Bioworks’ autonomous laboratory for protein synthesis and the FDA’s evolving framework for personalized CRISPR therapies. The Trillion Gene Atlas represents a different bet: that the bottleneck in AI-driven drug discovery is not compute or model architecture, but the breadth and contextual richness of training data drawn from the natural world.

The sequencing infrastructure

Executing the Atlas requires sequencing capacity that did not exist at consumer price points even a year ago. Ultima Genomics’ UG200 Series, which began shipping in the second quarter of 2026, can produce 20 billion reads per wafer and sequence more than 60,000 human genomes per year on a single instrument, at a cost of roughly $80 per genome. PacBio’s HiFi long-read technology complements this with high-accuracy reads that resolve complex genomic regions, repeat sequences, and structural variants that short-read platforms miss.

The pairing of high-throughput short reads with high-fidelity long reads is significant. Metagenomic samples from environmental sources contain fragments from hundreds or thousands of organisms, and accurately assembling complete gene sequences from that mixture requires both depth and length. The dual-platform strategy is designed to produce approximately 100,000 deeply sequenced samples from more than 31 countries, creating what the partners describe as the largest and most diverse high-fidelity metagenomic dataset ever assembled.

Anthropic’s role and the convergence of language and biology

The Anthropic partnership extends beyond providing compute. Basecamp is working with the Claude for Life Sciences team to make Claude a more productive research partner for scientists and clinicians, integrating EDEN’s therapeutic design capabilities with Claude’s reasoning abilities. The envisioned workflow would allow researchers to move from interpreting complex clinical data to generating candidate therapeutics in a single integrated system.

This convergence of large language models and biological foundation models reflects a broader trend. Where early biological AI focused on narrow tasks like protein structure prediction, the field is moving toward systems that can reason across modalities: reading scientific literature, interpreting genomic sequences, designing therapeutic molecules, and planning experimental validation. Whether that integration delivers on its promise will depend on how well models trained on fundamentally different data types, natural language and nucleotide sequences, can be made to complement each other.

Data ethics in an era of extraction

Basecamp’s data provenance and royalty system sets it apart from many genomics initiatives. Since 2023, the company has tagged, tracked, and measured each sample’s origin and downstream contribution, distributing payments when commercial value is created. The approach addresses long-standing criticisms of biopiracy and data extraction from biodiversity-rich nations that see little benefit from the discoveries their ecosystems enable.

Glen Gowers, Basecamp’s CEO, stated that the initiative aims to establish “a new paradigm for programmable therapeutic design.” Whether the Trillion Gene Atlas achieves that goal depends on several open questions: whether metagenomic scaling laws hold at the trillion-gene scale, whether EDEN’s laboratory results translate to clinical efficacy, and whether the two-year timeline for data collection proves realistic given the logistical complexity of sampling across 31 countries and five continents.

What is not in question is the scale of the undertaking. If completed, the Trillion Gene Atlas would represent the most comprehensive survey of Earth’s genetic diversity ever attempted, and the first designed from the outset to train AI systems capable of designing medicines.