Tahoe Therapeutics

Tahoe Therapeutics is building large perturbation datasets and perturbation-trained foundation models that aim to represent how cancer cells change when exposed to drugs or gene edits, and making them accessible through an analysis application that can be queried in plain language. Its open Tahoe-100M single-cell perturbation atlas and Tahoe-x1 (Tx1) model family are positioned as core inputs for virtual cell modelling, while TahoeDive provides an interface for scientists to query these resources without writing code.

Founding Date

Apr 1, 2022

Headquarters

South San Francisco, CA

Total Funding

$42M

Status

Private

Stage

Series A

Employees

32

Memo

Updated

January 8, 2026

Reading Time

41 min

Thesis

Drug development remains expensive and uncertain: analyses of large biopharma pipelines estimate that bringing a new drug to market costs an average of about $2.2 billion per asset in 2024, and composite success rates from first-in-human trials to approval were around 10.8% across therapy areas, with oncology among the lowest-performing categories. A growing response to this problem is to build cell-state representation models, digital descriptions of how individual cells behave, by measuring which genes are switched on or off in single cells under many conditions and then training computer models to predict how those cells will change when exposed to a drug or a gene edit.

Single-cell RNA-sequencing and CRISPR screening technologies now allow researchers to record gene-activity patterns for hundreds of thousands of cells in a single experiment, and industry data show that platforms such as 10x Genomics’ GEM-X chemistry have cut the cost per analysed cell by more than 50%, in some settings to below $0.01 per cell. At the same time, pooled CRISPR screens, which perturb many genes across a large cell population in a single experiment, have become a widely used and relatively cost-effective way to map gene function at scale.

Alongside these experimental advances, several academic groups have shown that machine-learning models can predict how single cells will respond to perturbations using existing data. Methods such as scGen and the Compositional Perturbation Autoencoder (CPA) use deep generative models to learn how gene-expression profiles shift when cells are treated or edited, and can generate in-silico predictions for new cell types, drug doses, and combinations.

Regulators are also formalising how computer models fit into drug development: the FDA’s model-informed drug development (MIDD) meeting program and the ICH M15 draft guideline set out principles for using quantitative models in regulatory decisions, and recent guidance emphasises a structured credibility framework rather than excluding such models outright.

In parallel, the US FDA and EMA have approved CRISPR-based and other gene-therapy products such as Casgevy and Lyfgenia for sickle-cell disease and related indications in December 2023, establishing that genome-editing interventions can meet regulatory standards for safety and efficacy. Together, cheaper single-cell and CRISPR experiments, emerging in-silico perturbation models, and clearer regulatory pathways for both modelling and gene editing define a distinct digital cell space: using large experimental datasets and machine learning to forecast how real cells will respond to interventions before running large, expensive wet-lab and clinical programmes.

Founding Story

Co-founders of Tahoe Therapeutics

Source: Forbes

Tahoe Therapeutics, originally launched as Vevo Therapeutics, was founded in 2022 by Nima Alidoust (CEO), Johnny Yu (CSO), Hani Goodarzi, and Kevan Shokat to commercialize a cell-based drug discovery platform developed at UCSF and licensed through UCSF Innovation Ventures. The company rebranded from Vevo Therapeutics to Tahoe Therapeutics in April 2025.

The scientific foundation of Tahoe’s platform traces back to Yu’s doctoral work at UCSF. Yu completed his PhD in Biomedical Sciences at UCSF, where his dissertation focused on a scalable drug discovery system that tracks how individual cells inside living tumors respond to many different drug candidates in a single experiment. Working in the Goodarzi laboratory and collaborating with Shokat, he co-developed what became the Mosaic platform, which uses mosaic tumors built from many different cell lines or patient-derived models to read out cell-by-cell drug responses using sequencing.

Before founding Tahoe, Yu gained industry experience at the Broad Institute and Biogen, focusing on oncology targets and high-throughput functional genomics. He serves as Chief Scientific Officer of Tahoe, leading the company’s experimental programs, including the Mosaic-based campaigns and the Tahoe-100M dataset.

Yu’s doctoral training took place in Goodarzi’s laboratory, the first link in the founding team. Goodarzi earned his PhD in quantitative and computational biology from Princeton University, followed by postdoctoral training in cancer systems biology at Rockefeller University. He joined UCSF in 2016 and is now an Associate Professor in the Department of Biochemistry and Biophysics, with a research program that combines computational and experimental genomics to study how complex diseases such as cancer progress.

In addition to his academic role, Goodarzi previously co-founded Exai Bio, a biotechnology company built around oncology RNA diagnostics. At UCSF, Goodarzi’s lab served as the main academic base for Mosaic, co-developed with Yu and collaborators; several of the early publications on mosaic tumor models list both Goodarzi and Yu as senior authors. This collaboration set up the subsequent involvement of Shokat, who appears as a co-author on the core Mosaic work.

Shokat, a long-time UCSF and UC Berkeley faculty member, provided the medicinal chemistry and drug-target perspective connected to this work. He is Professor of Cellular and Molecular Pharmacology at UCSF. Over the past two decades, Shokat has co-founded multiple biotechnology companies, including Intellikine (later acquired by Takeda), Revolution Medicines, Araxes/Wellspring, and Erasca, all centered on small-molecule drug discovery. Through his collaboration with Yu and Goodarzi on Mosaic and related tumor models, he joined the group as a co-founder when the UCSF platform was spun out into Tahoe in 2022.

Goodarzi’s Princeton connection provided the bridge to Alidoust, who became the founding CEO. Like Goodarzi, Alidoust completed his PhD at Princeton University, where he worked on computational and quantum chemistry. After Princeton, Alidoust spent two years as a Senior Associate at McKinsey & Company, before moving on to Rigetti Computing, where he led business development and strategic partnerships and later served as Vice President of Product. He subsequently became the CEO of the computational chemistry group that was spun out of 1QBit and later acquired by SandboxAQ, focusing on simulation software for chemistry and materials. In 2022, Alidoust joined Yu, Goodarzi, and Shokat as CEO and co-founder to build a company around Mosaic-generated data and large AI models of cell behavior.

From 2023 onward, Tahoe began adding senior staff to turn the founding group’s academic and computational work into a broader platform organization. In January 2023, computational biologist Daniele Merico joined Tahoe as Chief Data Officer and VP of Computational Biology after previously serving as Vice President and Head of Target Identification at Deep Genomics and holding roles at Toronto’s Hospital for Sick Children.

Product

Scientific Background

Every cell in the body carries essentially the same DNA, but different cell types use different parts of that DNA at different times. To carry out a specific function, a cell uses a gene by making an RNA copy of the DNA; those RNA molecules are then used to make proteins or to regulate other genes. Counting the RNA copies of each gene present therefore gives a snapshot of the cell’s activity.

RNA sequencing (RNA-seq) is a laboratory method that does this at scale. In a typical RNA-seq experiment, RNA is extracted from cells, converted into DNA that sequencing machines can read, and then counted to estimate expression levels across thousands of genes at once.

Even within a single tumor or tissue, cells are not identical. They can differ in their DNA mutations, the genes they have turned on or off, how fast they divide, and how they respond to stress or treatment. This heterogeneity, which is especially pronounced in tumors, is defined as the genetic and phenotypic diversity among cancer cells within one tumor and is driven by the ongoing evolution and mutation of multiple subclones.

Clinical and experimental studies consistently link this heterogeneity to treatment resistance and relapse. Understanding which distinct cell states exist inside a tumor is therefore central to predicting how long a treatment will keep the disease under control.

Gene-expression measurements give a direct readout of how a cell’s internal programs change when it is exposed to a drug or genetic change. In drug discovery, these readouts are used to answer three practical questions: whether the drug is targeting the correct pathways, what other pathways are being affected (on-target and off-target effects), and whether there are early molecular signs that some cells or patients will respond differently from others.

Historically, most gene-expression experiments in this context have used bulk RNA-seq. In a bulk experiment, RNA from many cells is pooled together before sequencing, and the result is a single average expression profile per sample. This average can show whether genes go up or down overall, but it cannot reveal how individual cells differ from each other.

Single-cell RNA sequencing (scRNA-seq) takes a different approach: each cell is labeled with a molecular barcode, and sequencing reads are assigned back to individual cells, producing a separate gene-expression profile for each one. This cell-by-cell view directly addresses the problem of intratumor heterogeneity: instead of inferring diversity from an average, the experiment measures the diversity explicitly at the level of individual cells.
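The difference between a bulk average and a cell-by-cell readout can be illustrated with a toy example. The sketch below uses made-up expression values for a single gene in six hypothetical cells; the cell names, values, and threshold are assumptions chosen for illustration and do not come from any real dataset.

```python
from statistics import mean

# Hypothetical expression of one gene (arbitrary units) in six cells
# from the same sample. All values are invented for illustration.
cells = {
    "cell_1": 0.0, "cell_2": 0.0, "cell_3": 0.0,     # gene switched off
    "cell_4": 10.0, "cell_5": 10.0, "cell_6": 10.0,  # gene switched on
}

def bulk_profile(per_cell):
    """Bulk-style readout: a single average for the pooled sample."""
    return mean(per_cell.values())

def single_cell_states(per_cell, threshold=5.0):
    """Single-cell-style readout: count cells above and below a threshold."""
    on = sum(1 for value in per_cell.values() if value > threshold)
    return {"on": on, "off": len(per_cell) - on}

bulk = bulk_profile(cells)
states = single_cell_states(cells)
print(bulk)    # 5.0
print(states)  # {'on': 3, 'off': 3}
```

The bulk value of 5.0 describes none of the six cells, while the per-cell view shows two distinct subpopulations, which is the kind of diversity that averaging conceals.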

Bulk RNA-seq became widespread first because it is technically simpler and cheaper per sample. Standard workflows, extracting RNA from a tissue, generating libraries, sequencing them as a pooled sample, and running established analysis pipelines, are now routine in many laboratories and contract research organizations.

By contrast, early scRNA-seq methods were technically demanding and limited in throughput. They often required manual or low-throughput isolation of single cells, generated smaller numbers of profiled cells per run, and produced large, sparse datasets that posed additional computational challenges. Over the past decade, advances have substantially increased throughput and lowered per-cell costs. Routine experiments now profile tens of thousands of cells, and specialised “atlas” or perturbation studies are designed to reach hundreds of thousands or even millions of cells, though these still typically require more complex infrastructure than bulk experiments.

Many experiments that use these technologies are perturbation experiments, in which cells are deliberately changed to see how they respond. A perturbation can be a small-molecule drug, a genetic change introduced by tools such as CRISPR, or another controlled stimulus. In bulk perturbation studies, RNA-seq is used to measure gene expression before and after the perturbation, allowing researchers to map which pathways are affected and to link those changes to observed effects on cell behavior.

Single-cell perturbation screens combine this idea with scRNA-seq. In these experiments, many perturbations (for example, a panel of drugs or a library of guide RNAs) are applied in a pooled format, and each cell carries a barcode that records which perturbation it received. Sequencing then yields individual expression profiles tagged with the perturbation, so that for each drug or genetic change, researchers can see which cell types are affected, how their gene programs shift, and whether rare resistant or tolerant cells persist or emerge.
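The core analysis step in such a screen, grouping cells by their perturbation barcode and comparing each group with untreated controls, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline used by Tahoe or any specific tool: the labels, counts, and pseudocount are hypothetical, and real analyses work across thousands of genes with normalisation and statistical testing.

```python
from collections import defaultdict
from math import log2
from statistics import mean

# Hypothetical pooled screen: each record is
# (perturbation barcode, read count for one gene in one cell).
records = [
    ("control", 7), ("control", 7),
    ("drug_A", 1), ("drug_A", 1),
    ("drug_B", 3), ("drug_B", 27),
]

def mean_by_perturbation(cell_records):
    """Average the gene's count over all cells sharing a barcode."""
    groups = defaultdict(list)
    for label, count in cell_records:
        groups[label].append(count)
    return {label: mean(values) for label, values in groups.items()}

def log2_fold_changes(cell_records, control="control", pseudocount=1.0):
    """Log2 ratio of each perturbation's mean to the control mean.

    The pseudocount avoids division by zero for unexpressed genes.
    """
    means = mean_by_perturbation(cell_records)
    baseline = means[control] + pseudocount
    return {
        label: log2((value + pseudocount) / baseline)
        for label, value in means.items()
        if label != control
    }

lfc = log2_fold_changes(records)
print(lfc)
```

In this toy data the two drug_B cells respond very differently (3 versus 27 counts), so a group mean alone would miss the kind of rare tolerant or resistant cell that single-cell screens are meant to expose.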

As datasets from bulk and single-cell gene-expression experiments have grown, several research groups have started to train large AI models, often called foundation models, directly on these data. In these models, each gene, each cell, and sometimes each drug is turned into a list of numbers (an embedding) that summarizes how it tends to behave across many experiments. Models such as Geneformer, scGPT, and other large cellular models follow this pattern and are trained on millions to tens of millions of single-cell profiles.
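One common way to use such embeddings is to compare them with cosine similarity, where vectors pointing in the same direction score close to 1 and unrelated vectors score close to 0. The sketch below uses invented three-dimensional vectors with placeholder gene names; it does not reflect the actual embeddings of Geneformer, scGPT, or any other model, which typically have hundreds of dimensions.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings; names and values are illustrative only.
embeddings = {
    "gene_A": [1.0, 0.0, 0.0],
    "gene_B": [2.0, 0.0, 0.0],  # same direction as gene_A
    "gene_C": [0.0, 1.0, 0.0],  # orthogonal to gene_A
}

sim_ab = cosine_similarity(embeddings["gene_A"], embeddings["gene_B"])
sim_ac = cosine_similarity(embeddings["gene_A"], embeddings["gene_C"])
print(sim_ab, sim_ac)
```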

This work aims to move towards a digital cell: a software model that can take a description of a starting cell and a defined change and then predict how the cell’s gene activity will shift. Current results suggest that foundation models are promising tools for working with complex single-cell and perturbation data, but they also have clear limits and often perform only similarly to simpler approaches on some benchmark tasks.

Tahoe Product Overview

Tahoe’s product stack consists of three linked components built on the single-cell perturbation science described earlier. Tahoe-100M is a large open dataset of single-cell drug-response measurements. Tx1 (Tahoe-x1) is a family of AI models trained on Tahoe-100M and other single-cell datasets to act as “foundation models” of cell state. TahoeDive is a web-based analysis application, developed with Kepler AI and TileDB, that allows users to interrogate Tahoe-100M via a browser using natural-language queries rather than custom code.

Tahoe-100M

Tahoe-100M is a dataset comprising a little over 100 million single-cell gene-expression profiles from 50 cancer cell lines exposed to roughly 1.1K small-molecule perturbations, yielding around 60K drug–cell combinations. The dataset was generated using Tahoe’s Mosaic high-throughput single-cell platform and is distributed via Hugging Face under a CC0 public-domain licence.
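A quick back-of-the-envelope calculation gives a sense of the sampling depth these headline figures imply. The sketch below uses only the rounded numbers quoted above and assumes cells are spread evenly across conditions, which is a simplification; actual coverage per condition will vary.

```python
# Rounded headline figures from the text above.
TOTAL_CELLS = 100_000_000
DRUG_CELL_COMBINATIONS = 60_000
CELL_LINES = 50

# Assumes an even split, which real experiments will not match exactly.
cells_per_combination = TOTAL_CELLS / DRUG_CELL_COMBINATIONS
cells_per_cell_line = TOTAL_CELLS / CELL_LINES

print(round(cells_per_combination))  # 1667
print(round(cells_per_cell_line))    # 2000000
```

On these assumptions, each drug–cell combination is represented by on the order of a couple of thousand cells.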

Tahoe and the Arc Institute state that Tahoe-100M is explicitly designed as perturbation-focused infrastructure for building and benchmarking AI models of cell behavior, and describe it as the largest publicly accessible single-cell drug-response dataset to date, “50x larger than all public drug-perturbed single-cell data combined.”

Tahoe’s materials highlight three main intended uses: training and comparing foundation models such as Tx1 on a common, large-scale perturbation resource; mapping how individual drugs affect diverse genetic backgrounds across the 50 cell lines; and relating drug perturbations to genetic perturbations to aid target deconvolution. A key caveat, noted in the manuscript and external coverage, is that Tahoe-100M is derived from cancer cell lines grown in vitro rather than primary human tumors, so results are positioned as preclinical discovery inputs that typically require validation in more complex models before being used for decisions closer to patients.

Tx1 (Tahoe-x1)

Tx1, or Tahoe-x1, is described as a family of perturbation-trained single-cell foundation models. The models are reported to be pretrained on about 266 million single-cell profiles, including the Tahoe-100M perturbation compendium, and are released in several sizes (up to 3 billion parameters) with open weights, training code, and evaluation workflows. The documentation explains that Tx1 learns numerical embeddings for cells and genes and is trained to predict post-perturbation gene-expression profiles and related labels from an initial cell state and a specified drug or genetic change.

Tahoe characterizes Tx1 as the modeling layer built on Tahoe-100M, and as a domain-specific counterpart to more general language-based foundation models. In its technical blog and model materials, the company claims three- to thirty-fold improvements in computational efficiency over earlier cell-state models on a suite of cancer-relevant benchmarks, while matching or exceeding their predictive accuracy.

Model training cost efficiency chart against competitors

Source: Tahoe

Tahoe also presents Tx1 as an off-the-shelf foundation model trained on perturbation data for research groups, as a source of reusable cell and gene embeddings for downstream tasks (such as clustering or response prediction), and as a reference point for benchmarking new model architectures on shared tasks. Tx1 is explicitly presented as a preclinical research tool; Tahoe notes that it is trained mainly on cell lines and related datasets, and that any hypotheses or candidate targets generated using Tx1 require experimental validation before further development.

TahoeDive

Example of how TahoeDive is used

Source: Tahoe

TahoeDive is described as a web-based analysis application that allows users to query and analyse Tahoe-100M through a natural-language interface. Users access TahoeDive through a browser, select the Tahoe-100M dataset, and pose questions in plain language; the system uses language and reasoning models to translate these questions into analysis steps, executes them against the underlying single-cell data, and returns results alongside the corresponding analysis code. TahoeDive runs on the Kepler AI platform with TileDB as the storage engine.

Tahoe and Kepler present TahoeDive as an access layer intended to lower the technical barrier to using Tahoe-100M. The company claims that bench scientists and non-specialist analysts can interact directly with the dataset, while more technical users can reuse or adapt the automatically generated code in their own environments. As with Tahoe-100M and Tx1, TahoeDive is framed as a research and analysis environment; the partnership materials do not position it as a clinical decision tool, and any findings derived from its use depend on the scope and limitations of the underlying datasets.

Market

Customer

Tahoe’s materials indicate that its near-term focus is on teams that already work with single-cell data and AI in oncology and drug discovery. The company describes TahoeDive as a way for cancer and computational biologists to query and analyse Tahoe-100M using natural-language questions, and notes that Tahoe-100M has been downloaded tens of thousands of times by bioinformaticians. The platform is described as enabling pharma and biotech teams to use Tahoe-100M alongside their own single-cell datasets to train foundation models and run agent-based queries.

On this basis, the primary ideal customers for Tahoe would be large academic cancer centres, major research hospitals, and biopharmaceutical R&D sites that already run substantial single-cell and gene-expression programmes. As a proxy for the size and concentration of this group, 10x Genomics reports that it had sold about 6K instruments globally as of December 2023, serving academic, translational, and biopharmaceutical researchers.

Meanwhile, the US National Cancer Institute lists 73 NCI-designated cancer centers supported by infrastructure grants for laboratory and clinical cancer research. This inference rests on the fact that these institutions already have the instruments, data, and specialised staff required to generate and interpret the types of measurements Tahoe’s products are built around; the fit is therefore primarily a matter of capability and workflow alignment rather than cost reduction alone.

Market Size

Worldwide prescription drug sales are projected to reach $1.7 trillion by 2030. In 2022, the 50 largest pharmaceutical companies invested an estimated $167 billion in R&D. Published cost breakdowns suggest that discovery and preclinical work account for roughly one-third of total per-drug development cost, with the remaining two-thirds concentrated in clinical phases.

Discovery and Preclinical Spending

The global drug discovery market was estimated at $72 billion in 2025 and is forecast to reach around $160.3 billion by 2034, implying a CAGR of about 9.3%. The United States segment is estimated at $25.2 billion in 2024, with projections of roughly $60.2 billion by 2034 (CAGR ~9.1%). Within this, discovery and preclinical services provided by contract research organizations (CROs) are estimated to be $27.4 billion in 2024, with forecasts of $70.3 billion by 2034 (CAGR 9.9%).
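The growth rates quoted above can be checked with the standard compound annual growth rate formula: the ratio of end value to start value, raised to the power of one over the number of years, minus one. The sketch below assumes annual compounding from the stated base year to the stated end year; market reports sometimes use a different base year, so small discrepancies with published CAGRs are expected.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate over a whole number of years."""
    return (end_value / start_value) ** (1 / years) - 1

# Figures in billions of US dollars, taken from the paragraph above.
global_discovery = cagr(72.0, 160.3, 2034 - 2025)
us_discovery = cagr(25.2, 60.2, 2034 - 2024)
cro_services = cagr(27.4, 70.3, 2034 - 2024)

print(f"{global_discovery:.1%}")  # 9.3%
print(f"{us_discovery:.1%}")      # 9.1%
print(f"{cro_services:.1%}")      # 9.9%
```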

Global preclinical CRO revenues are estimated at approximately $6.2 billion in 2025 and projected to reach about $12.4 billion by 2033 (CAGR roughly 8%). Markets directly connected to single-cell and high-content analysis are smaller but faster-growing. Recent estimates place the global single-cell analysis market at around $5.7 billion in 2024, with forecasts of approximately $22.8 billion by 2033 (CAGR ~16.1%).

Software, Data, and AI

Spending on software, data, and AI tools for discovery sits inside these budgets. The global bioinformatics market is estimated at $25.8 billion in 2024, with projections of $94.8 billion by 2032 (CAGR of 16.9%). Within bioinformatics, drug discovery informatics is estimated at about $3.6 billion in 2024 and projected to reach $8.4 billion by 2033. The adjacent segment of in-silico drug discovery tools (software that simulates or models discovery steps) is estimated at about $3.4 billion in 2024, with forecasts in the $12.8 billion range by 2034.

Tahoe’s TAM

On this basis, an effective software/data TAM relevant to Tahoe can be approximated by combining these two overlapping segments. Adding the 2024 estimates for drug discovery informatics (~$3.6 billion) and in-silico discovery tools (~$3.4 billion) yields an approximate $7 billion software/data TAM in 2024.

Using the same approach with the upper-end forecasts for 2033 – 2034 (drug discovery informatics ~$8.4 billion, in-silico tools ~$12.8 billion) implies a combined opportunity on the order of $21.2 billion by the early-to-mid 2030s. These combined figures are constructed estimates based on segment reports, not a single published number, but they serve as an order-of-magnitude view of budgets available for discovery-stage data, models, and analysis platforms.
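The arithmetic behind these combined figures is a simple sum of the two segment estimates. The sketch below reproduces it; because the segments overlap, the totals are best read as upper bounds, and the growth multiple at the end is a derived illustration rather than a published figure.

```python
# Segment estimates in billions of US dollars, from the text above.
segments_2024 = {"drug_discovery_informatics": 3.6, "in_silico_tools": 3.4}
segments_2030s = {"drug_discovery_informatics": 8.4, "in_silico_tools": 12.8}

def combined_tam(segments):
    """Naive sum of segment estimates; overlap is not removed."""
    return round(sum(segments.values()), 1)

tam_2024 = combined_tam(segments_2024)
tam_2030s = combined_tam(segments_2030s)
growth_multiple = tam_2030s / tam_2024

print(tam_2024)   # 7.0
print(tam_2030s)  # 21.2
print(round(growth_multiple, 2))
```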

Market studies on AI in drug discovery report overlapping but narrower figures. The AI-specific segment is estimated at roughly $1.7 billion in 2024, with projections of around $8.5 billion by 2030 (CAGR ~30.6%). This can be viewed as the AI-labelled subset of the broader software/data TAM above, rather than an independent pool of spend.

Customer Concentration

R&D and discovery software budgets are concentrated in a limited number of organizations. The top fifteen pharmaceutical companies account for most of the $138 billion (2022) in annual pharma R&D spend and are key purchasers of discovery informatics, in-silico tools, and AI platforms. On the academic and hospital side, the 73 NCI-designated cancer centers in the United States and a comparable set of major centers in Europe and East Asia represent a relatively small but high-spend cluster of potential buyers for single-cell–driven discovery tools.

Given this concentration, the growth in Tahoe’s addressable market is tied less to an increase in the number of eligible institutions and more to the growth of per-institution spend in the discovery software/data segments described above. The reported CAGRs, ranging from roughly 9% for drug discovery to about 16% to 17% for single-cell analysis and bioinformatics, indicate that the underlying budgets for platforms like Tahoe-100M, Tx1, and TahoeDive are expanding faster than the overall pharma market.

Competition

Competitive Landscape

The competitive landscape around Tahoe’s products spans five main categories: open-source single-cell resources, commercial single-cell analysis applications, enterprise omics data platforms, data-driven “platform biotech” companies, and literature/pathway tools used for target reasoning. These categories describe how potential customers currently access data and analysis for single-cell and perturbation work.

1. Open-Source Single-Cell Resources

Public (open-source) resources provide the lowest-cost way to work with single-cell data. General repositories such as GEO and ArrayExpress, and specialist portals such as the Human Cell Atlas data portal and the Single Cell Portal, host community-generated bulk and single-cell datasets that can be browsed online and downloaded for local analysis. Guides to public single-cell databases list resources like the Arc Virtual Cell Atlas, Human Cell Atlas, and cellxgene as key hubs, while noting that datasets are provided in formats intended for downstream analysis in packages such as Seurat (R) and Scanpy (Python).

These resources give users access to large numbers of experiments, but they are fragmented across portals and typically require coding or specialist bioinformatics tools to integrate studies, run perturbation analyses, and build their own models. Public documentation for these portals emphasizes dataset exploration, download, and basic visualisation; it does not describe a single, unified atlas that combines both drug and gene-editing (CRISPR) perturbation experiments with natural-language querying. This absence is an inference based on the capabilities documented by these portals.

2. Commercial Single-Cell Analysis Applications

Commercial single-cell analysis applications provide graphical interfaces for exploring and analyzing user-supplied single-cell datasets. Representative examples include desktop or browser-based tools that offer clustering, marker detection, differential expression, and basic integration workflows via point-and-click rather than code; vendor documentation presents them as ways to visually analyse single-cell data generated on common platforms.

These applications lower the barrier to working with individual datasets or small collections, but they are generally tied to user-uploaded or platform-specific data and focus on analysis pipelines rather than on curating large shared perturbation resources. Public feature descriptions highlight interactive plots and workflow steps; they do not describe bundling a pre-built, large-scale single-cell drug-plus-CRISPR atlas or offering natural language search across such a resource. That comparison is an inference drawn from vendor documentation rather than an explicit positioning statement.

3. Enterprise Omics Data Platforms

Enterprise omics data platforms provide cloud infrastructure for storing, managing, and analyzing very large genomic and multi-omics datasets. Cloud solutions in this category emphasize secure, scalable storage, role-based access control, and workflow engines for bioinformatics pipelines, positioning themselves as central hubs for teams to store, analyze, and jointly interpret biomedical data and to co-locate analysis with large reference datasets.

These platforms compete at the infrastructure layer: they enable organizations to bring together internal and external datasets and run pipelines at scale, but they are data-agnostic by default. Their public materials describe capabilities for managing arbitrary omics projects; they do not indicate that they ship with a specific, pre-integrated single-cell atlas of drug and gene-editing perturbations or that they provide natural-language querying over such a combined resource out of the box. This is an inference based on how these platforms present their role as general-purpose infrastructure.

4. Platform Biotechs and AI-Enabled Discovery Vendors

Platform biotech companies and AI-enabled discovery vendors use proprietary data and models to run in-house or partnered drug discovery programmes. Public descriptions of such companies emphasize integrated operating systems that span target identification through clinical development, built on large proprietary datasets (for example, petabyte-scale phenotypic and multi-omics collections) and bespoke machine-learning models, with revenue generated through internal pipelines, co-development deals, and milestone-based partnerships rather than software licences.

These firms therefore compete for the same “AI for discovery” budgets at large pharma and biotech organisations, but their offering is typically a service or partnership model rather than a general-purpose data or analysis product that external teams can adopt directly. Their own materials focus on advancing in-house candidates and entering collaboration agreements, not on providing open or off-the-shelf single-cell perturbation atlases or self-service analysis environments. This positioning difference is based on how they describe their business models and is not an explicit comparison to Tahoe.

5. Literature, Pathway, and Target-Reasoning Tools

Literature and pathway tools support the interpretation of gene lists and other omics results by linking them to curated biological knowledge. Ingenuity Pathway Analysis (IPA), for example, is described as a web-based application that enables analysis, integration, and understanding of gene-expression and other omics data by mapping them onto pathways and causal networks. Reviews of pathway-based analysis tools highlight their role in visualizing gene sets, identifying enriched pathways, and supporting hypothesis generation from high-throughput genomics studies.

These tools operate primarily at the level of annotated genes, pathways, and regulatory networks and often take bulk or summarised expression data as inputs. They are widely used for target reasoning and literature-based interpretation, but do not function as repositories of cell-level drug and gene-editing perturbation data, nor do they present themselves as natural-language interfaces to such datasets.

Competitors

10x Genomics: 10x Genomics was founded in 2012 and develops instruments, consumables, and software for single-cell and spatial analysis of gene activity and chromatin in tissues and cell suspensions. It raised just over $242 million in venture funding before going public, and its market cap was $2.4 billion as of January 2026. The company reported an installed base of about 6K instruments as of December 2023.

Among competitors in this space, 10x operates mainly as a life-science tools vendor: research groups buy its platforms (such as Chromium and Visium) and associated software like Loupe Browser to generate and analyse their own single-cell datasets. Tahoe, by contrast, does not sell instruments or wet-lab kits; its offering centers on Tahoe-100M, an open single-cell perturbation atlas covering 100 million cells, about 60K drug–model interactions and more than 1.1K drug treatments across 50 cancer models, together with the Tx1 foundation-model family and the TahoeDive analysis interface built on top of that shared dataset.

Cellarity: Cellarity, founded within Flagship Pioneering in 2017 and based in the Boston area, describes itself as a life-sciences company that uses computational modelling of cell behavior to discover medicines at the level of the whole cell rather than single molecular targets. The company raised $123 million in a Series B round in 2021 and $121 million in a Series C round in October 2022; its total funding was $294 million as of January 2025.

Cellarity applies its cell-state models internally to run drug-discovery programmes in areas such as metabolic and blood disorders and has announced collaborations with partners including Novo Nordisk; its business model is based on advancing proprietary and partnered pipelines rather than selling data or analysis software as stand-alone products. Tahoe’s approach is different: it has open-sourced Tahoe-100M as a general-purpose perturbation atlas and released Tx1 as a perturbation-trained foundation-model family with open weights and code, alongside TahoeDive as a natural-language analysis front end, all of which are designed to be used directly by outside groups rather than held exclusively for internal drug programmes.

Relation Therapeutics: Relation Therapeutics, which was founded in 2019, is a London-based TechBio company that combines machine learning with multi-omics data from human tissue to discover treatments for complex diseases, including osteoporosis, osteoarthritis, and fibrotic conditions. The company has raised a total of $95 million in funding as of January 2026. Relation has also signed a discovery collaboration with GSK that could pay up to $200 million per target in milestones, but the company’s standalone valuation has not been disclosed.

Relative to other competitors, Relation operates as a platform biotech: it builds disease-specific models on internal and partner datasets and then runs target discovery and preclinical work under collaboration agreements, rather than offering a broadly available data or software product. Tahoe differs in that it has made its core perturbation atlas and foundation models publicly accessible and has packaged TahoeDive as a general-purpose interface to that resource; Relation’s models and datasets remain proprietary to its own and its partners’ programmes.

Paradigm4: Paradigm4, founded in 2010, develops the SciDB database technology and the REVEAL analytics suite for large-scale scientific data, including REVEAL: SingleCell, which is designed to manage and analyze very large collections of single-cell datasets across studies and modalities. Third-party venture databases indicate that Paradigm4 has raised approximately $45 million across seed and Series A, led by investors such as Atlas Venture. However, individual round sizes and valuations are not consistently reported in public sources.

Within the competitive set, Paradigm4 is closest to an enterprise data-platform provider: customers bring their own and public single-cell datasets into REVEAL: SingleCell and use the platform for querying, annotating, and visualizing many experiments at once. Tahoe’s stack instead revolves around a specific, standardized perturbation atlas (Tahoe-100M) that the company itself generated, with Tx1 and TahoeDive built directly around that dataset; Paradigm4 does not ship a pre-built drug-plus-CRISPR perturbation atlas or a pre-trained cell-state foundation model as part of its product.

Partek Flow: Partek Inc., founded in 1993, is a bioinformatics software company that develops graphical tools such as Partek Flow and Partek Genomics Suite for the analysis of bulk and single-cell RNA sequencing, ATAC-seq, and other next-generation sequencing assays. In December 2023, Partek Inc. was acquired by Illumina for an undisclosed amount.

Partek Flow functions as a point-and-click workbench: it lets users configure standard analysis pipelines, perform quality control, and generate plots on their own sequencing data produced on platforms such as 10x Genomics and Illumina. Unlike Tahoe, Partek does not provide a large, pre-curated perturbation atlas or an associated foundation model; instead, it focuses on workflow management for user-generated datasets. Tahoe’s offering is centred on Tahoe-100M plus Tx1 and TahoeDive, which together provide a specific shared dataset, pre-trained representations of cell state, and a natural-language layer for exploration, rather than a general-purpose pipeline environment.

BioTuring: BioTuring, which was founded in 2016 and is headquartered in San Diego, develops software for interactive exploration of high-throughput biological data, including BBrowserX for visualising and analysing single-cell datasets and a Talk2Data feature for querying integrated single-cell resources in natural language. BioTuring raised approximately $1 million in a seed round in October 2017; no subsequent funding rounds or valuation figures are publicly reported.

Among competitors, BioTuring is closest to a specialized single-cell viewer and query tool built on top of many public datasets: its products re-curate and integrate existing single-cell studies so that users can search, annotate and run analyses without coding, but the underlying data are primarily observational rather than systematically perturbed across many drugs and models. Tahoe’s products, in contrast, are organised around a single, large perturbation atlas (Tahoe-100M) generated with controlled drug treatments, a perturbation-trained foundation-model family (Tx1) and the TahoeDive natural-language interface, all of which are released openly or via a freemium access model.

Business Model

As of January 2026, Tahoe’s three flagship products did not have public pricing. Tahoe-100M, a single-cell perturbation atlas of over 100 million profiles across 50 cancer cell lines and roughly 1.1K drug treatments, was released under a CC0-1.0 public-domain license and can be downloaded freely. The Tahoe-x1 (Tx1) family of perturbation-trained foundation models, with sizes up to 3 billion parameters, is similarly released with open weights, training code, and evaluation scripts on public repositories. TahoeDive, which is described in public materials as a natural-language agent that lets biologists query and analyse Tahoe-100M, is offered via a “try for free” web interface with login but no public pricing.

Tahoe presents large, perturbation-rich single-cell datasets as the core asset for future monetization. In August 2025, the company announced that it would use its new funding to build what it called the definitive foundational dataset for training Virtual Cell Models, specifying a plan to generate one billion single-cell datapoints mapping one million drug–patient interactions. In the same announcement, Tahoe stated that this new dataset would be shared with one partner, defined as a pharmaceutical or AI company that would collaborate with Tahoe to develop medicines based on these virtual cell models.

Potential Product Revenue Levers

No pricing has yet been announced for TahoeDive or for hosted access to Tx1, so likely product-led revenue levers can only be inferred by analogy to similar tools. Commercial single-cell analysis platforms such as Partek Flow are sold as subscription software with per-seat licensing for lab and enterprise editions, often via Illumina’s informatics catalogue, with optional add-ons for features like pathway analysis, single-cell toolkits, and extra storage.

Documentation for BioTuring’s Talk2Data and BBrowserX emphasises enterprise deployment with company single sign-on and, in some cases, private virtual private cloud (VPC) set-ups so that customer data stay behind the firewall. Given that TahoeDive is a web-based analysis app that requires login and is framed as an AI assistant over Tahoe-100M, a plausible path is a future tiered model in which individual users retain a free tier and organisations pay per seat for expanded capacity, team features or private deployments. This is an inference based on the way comparable single-cell tools are priced; Tahoe itself has not published any such plans.

Infrastructure and AI-platform vendors also provide a template for metered usage and API-based monetisation. TileDB, which partners with Tahoe and Kepler AI on the underlying multimodal database layer for TahoeDive, publishes pricing that separates user seats from vCPU capacity, charging annually per seat and per block of scalable compute.

Major AI-model providers like OpenAI and Mistral charge for hosted models on a usage basis, billing per million tokens processed, while still allowing self-hosting of some open-weight models. By analogy, Tahoe could in the future expose hosted Tx1 and virtual cell models via usage-metered APIs or metered large-batch analyses in TahoeDive, while keeping the base weights open for self-hosting. This is again an inference based on common patterns in adjacent markets, not a stated Tahoe policy.
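The seat-plus-usage pattern described above can be sketched in a few lines. This is a minimal illustration only: the `HypotheticalPlan` class, the `annual_bill` function, and every price point are assumptions invented for the example, since Tahoe has not published pricing for TahoeDive or hosted Tx1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HypotheticalPlan:
    """Illustrative price points only; none of these come from Tahoe."""
    seat_price_per_year: float       # annual fee per named user
    price_per_million_tokens: float  # metered fee for hosted model calls
    free_tokens_per_year: int        # usage included before metering starts


def annual_bill(plan: HypotheticalPlan, seats: int, tokens_used: int) -> float:
    """Return seat fees plus metered usage above the free allowance."""
    if seats < 0 or tokens_used < 0:
        raise ValueError("seats and tokens_used must be non-negative")
    billable_tokens = max(0, tokens_used - plan.free_tokens_per_year)
    usage_fee = billable_tokens / 1_000_000 * plan.price_per_million_tokens
    return seats * plan.seat_price_per_year + usage_fee


if __name__ == "__main__":
    # Made-up plan: $1,000 per seat, $2 per million tokens, 10M tokens free.
    plan = HypotheticalPlan(1_000.0, 2.0, 10_000_000)
    print(annual_bill(plan, seats=5, tokens_used=60_000_000))  # 5100.0
```

Separating the seat fee from the metered component mirrors the split between user seats and compute capacity in the TileDB pricing described above, and the free allowance stands in for a retained free tier.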

Cost Structure and Scalability

Tahoe’s cost structure combines data-generation costs typical of a wet-lab biotech platform with compute and software costs typical of an AI infrastructure company. On the biology side, the Tahoe-100M preprint and related announcements describe a giga-scale perturbation experiment spanning 100 million single cells, around 60K drug–condition combinations, and 50 cancer models, implemented with industrial-scale single-cell workflows from Parse Biosciences and sequencing from Ultima Genomics. The planned billion-cell dataset similarly implies substantial ongoing spend on reagents, sequencing, lab automation, and specialist staff, funded by the recent $30 million round.

On the software side, Tahoe bears the cost of training and serving large models such as Tx1 across hundreds of millions of single-cell profiles, alongside cloud storage and bandwidth for Tahoe-100M and TahoeDive. In mature computational-drug-discovery businesses, software built on similar infrastructure often achieves gross margins in the 70–80% range; for example, Schrödinger reported software gross margins of 80% as of 2024.

If Tahoe ultimately charges for hosted analysis, APIs, or enterprise data mirrors while keeping incremental compute and storage costs under control, the software components of its business are likely to be high-margin and scalable once the upfront investment in datasets and models has been made. The exact margin profile will depend on the eventual mix between partnership-driven dataset licensing, software access, and any downstream economics from therapeutic programmes, none of which have yet been disclosed.

Traction

Tahoe has not disclosed any paying customers or revenue as of January 2026. Public materials describing Tahoe-100M, Tx1, and TahoeDive focus on data releases, model availability, and partnerships, and do not include pricing information, customer lists, or revenue metrics; this memo therefore treats Tahoe as pre-revenue.

Tahoe-100M is the main observable traction signal. Tahoe’s open-sourcing announcement describes Tahoe-100M as an atlas of 100 million single-cell profiles from about 60K experiments, covering 1.2K drug treatments across 50 tumor models. Kepler AI’s launch post for TahoeDive states that Tahoe-100M has been downloaded more than 45K times, and Tahoe’s August 2025 funding announcement states that Tahoe-100M was downloaded around 100K times within a few months of release.
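As a rough sanity check on these headline figures, the arithmetic below works out the average number of cells per experiment and compares the experiment count with a full drug-by-model grid. The function names are illustrative, and the full-grid reading is an assumption: the quoted figures are consistent with it, but the sources cited here do not spell out the experimental design.

```python
# Headline figures quoted in the text above.
CELLS = 100_000_000
EXPERIMENTS = 60_000
TUMOR_MODELS = 50


def cells_per_experiment(cells: int, experiments: int) -> float:
    """Average number of profiled cells per experiment."""
    return cells / experiments


def full_grid_size(drug_treatments: int, models: int) -> int:
    """Number of drug-by-model pairs if every treatment met every model."""
    return drug_treatments * models


if __name__ == "__main__":
    print(round(cells_per_experiment(CELLS, EXPERIMENTS)))  # 1667
    # The memo quotes both ~1.1K and 1.2K drug treatments, so check both.
    for treatments in (1_100, 1_200):
        print(treatments, full_grid_size(treatments, TUMOR_MODELS))
```

At 1.2K treatments the grid comes to 60,000 pairs and at 1.1K to 55,000, so the "about 60K" experiment count fits either figure to within rounding.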

Tahoe-100M is also embedded in external benchmark and atlas efforts. The Arc Institute’s Virtual Cell Atlas lists Tahoe-100M as a core dataset generated on Tahoe’s Mosaic platform, with 100 million cells from roughly 60K drug perturbation experiments across 50 cancer models and more than 1.1K drug treatments, and notes that the Atlas as a whole is bootstrapped with Tahoe-100M and Arc’s scBaseCount dataset. The Virtual Cell Challenge site describes Tahoe-100M as the main perturbation dataset available to participants.

On the product side, TahoeDive is available as a web application that allows users to log in and query Tahoe-100M, with a “Try Now for Free” call-to-action and no public pricing tiers. Kepler AI describes TahoeDive as a joint product that brings Tahoe-100M into Kepler’s bioinformatics agent platform so that scientists can analyse it using natural-language prompts, and TileDB’s blog and press materials describe a three-way partnership in which Tahoe-100M is stored in TileDB’s multimodal database format and accessed via Kepler’s agent system.

Tahoe’s data-generation and modelling work is visible in vendor and scientific channels. Arc Institute and Parse Biosciences report that Tahoe-100M was generated using Tahoe’s Mosaic platform, with single-cell sample preparation performed via Parse’s GigaLab service and sequencing carried out on Ultima Genomics instruments. A bioRxiv preprint describes Tahoe-100M as a 100-million-cell single-cell perturbation atlas suitable for training predictive models of cell responses to small-molecule interventions, and Databricks lists a Tahoe session in its Data + AI Summit, indicating engagement with both academic and AI-infrastructure communities.

Valuation

Tahoe has raised $42 million in total funding to date, including a $30 million Series A in August 2025 led by Amplify Partners with participation from Databricks Ventures, General Catalyst, Mubadala Capital, and others, at a reported valuation of $120 million.
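The round terms imply some simple arithmetic, sketched below. It assumes the reported $120 million figure is a post-money valuation, which is how this memo's summary describes it; if the figure were pre-money, the results would differ. The function name and return keys are illustrative.

```python
def round_terms(raised_musd: float, post_money_musd: float) -> dict:
    """Derive pre-money valuation and new-investor stake from a priced round.

    Amounts are in millions of US dollars.
    """
    if not 0 < raised_musd < post_money_musd:
        raise ValueError("expected 0 < raised < post-money valuation")
    return {
        "pre_money_musd": post_money_musd - raised_musd,
        "new_investor_stake": raised_musd / post_money_musd,
    }


if __name__ == "__main__":
    terms = round_terms(raised_musd=30.0, post_money_musd=120.0)
    print(terms["pre_money_musd"])      # 90.0
    print(terms["new_investor_stake"])  # 0.25
```

Under that assumption, Series A investors would hold about a quarter of the company on a $90 million pre-money valuation, before accounting for option pools or other terms that have not been disclosed.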

Prior to this, in December 2022, Tahoe (then operating as Vevo Therapeutics) had closed an oversubscribed seed round of about $12 million, co-led by Wing Venture Capital and General Catalyst, with participation from Mubadala Capital, AIX Ventures, and Camford Capital.

Key Opportunities

Enterprise Deployments

Tahoe-100M is already integrated into the Arc Institute’s Virtual Cell Atlas as its inaugural perturbation dataset and is exposed via enterprise-grade platforms from Kepler AI and TileDB. Arc describes Tahoe-100M as a 100-million-cell dataset of around 60K drug perturbation experiments across 50 cancer cell lines and more than 1.1K drug treatments, generated on Tahoe’s Mosaic platform. Kepler AI and TileDB jointly describe TahoeDive as a public-facing platform where AI agents can query Tahoe-100M at full single-cell resolution on cloud infrastructure.

Given this setup, there is a clear opportunity for Tahoe to formalise TahoeDive as an enterprise analysis product: adding features such as organisation logins, audit trails, and private cloud or on-premises deployments would align with how regulated R&D teams already use Kepler and TileDB for omics workloads. This is a single-step inference based on the existing integrations and on standard enterprise requirements for data tools handling research data.

Licensing Datasets and Models

Tahoe-100M was released under a CC0 public-domain licence and had been downloaded roughly 100K times within months of launch. In August 2025, Tahoe announced a plan to generate a new dataset of one billion single-cell profiles covering around one million drug–patient interactions and stated that this dataset would be shared with a single pharmaceutical or AI partner.

The combination of an open introductory dataset (Tahoe-100M) and an exclusive, larger dataset earmarked for one partner creates an opportunity for Tahoe to make proprietary virtual-cell datasets and associated models the main paid product. The future one-billion-cell resource is already framed as a partnership asset rather than an open release; treating it and any follow-on datasets as licensed products for a small number of pharma or AI partners is a direct extension of this stated plan.

Hosted Models and APIs

The Tx1 model family is released with open weights and code, trained on Tahoe-100M to produce embeddings and predictions for genes, cells, and drugs in a perturbation setting. In parallel, cloud and AI providers commonly combine open model weights with paid, usage-based hosted endpoints, billing per unit of compute, while allowing self-hosting for users with their own infrastructure.

This pattern creates an opportunity for Tahoe to offer hosted Tx1 and later virtual-cell models as metered APIs or large-batch analysis services, while keeping the base models open. The inference is that Tahoe can follow the same model-hosting conventions that are already in use for other open foundation models, using the existing open-source Tx1 release as the entry point and charging only for managed, high-capacity use.

Regulatory Shifts

In December 2022, the FDA Modernization Act 2.0 removed the statutory requirement to rely on animal testing in certain submissions by expanding nonclinical tests to include validated in vitro and in silico approaches. In April 2025, the US FDA published a roadmap to reduce animal testing in preclinical safety studies, highlighting computational modelling and advanced in-vitro assays as new approach methodologies. Public statements and reporting in April 2025 and September 2025 described pilot programmes to replace some antibody and small-molecule safety tests with AI-based and human-cell-based methods.

Tahoe-100M and Tx1 are built entirely on human cancer cell models and computational models of cell response. This regulatory context creates an opportunity for Tahoe to package its perturbation data and model outputs in ways that fit emerging expectations for non-animal evidence in discovery and early development, for example by adding validation documentation and traceable outputs suitable for inclusion in regulatory briefing materials. This is a single-step inference from the alignment between the human-cell, model-based nature of Tahoe’s platform and the documented shift in regulatory guidance.

Key Risks

Technical Risks

Tahoe’s core assets – Tahoe-100M and the Tx1 model family – rely on large-scale single-cell RNA-seq perturbation data. Tahoe-100M has been described as a giga-scale atlas of about 100 million single-cell transcriptomic profiles measuring responses of 50 cancer cell lines to ~1.1K small-molecule perturbations and ~60K drug–cell interactions. Single-cell RNA-seq data are known to suffer from dropout events, batch effects, and protocol-dependent biases that can distort gene-expression measurements and require explicit correction. This means Tahoe is exposed to the risk that model performance depends strongly on how well these artifacts are handled, and that Tx1’s behavior may differ when customers apply it to data generated with alternative protocols or in different labs.

Benchmarks of foundation models for single-cell perturbation tasks show that current models, such as scGPT and scFoundation, do not consistently outperform simple baselines for predicting post-perturbation gene expression, with linear models using structured biological features often performing as well or better. Because Tx1 sits in the same class of large neural models trained on transcriptomic data, Tahoe faces the risk that, on some drug-discovery tasks, uplift over strong internal baselines at pharma companies may be modest until clear, independent benchmarks are published.
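To make the idea of a "simple baseline" concrete, the sketch below fits a closed-form ridge regression from perturbation descriptors to expression shifts and compares it with a baseline that predicts the mean training shift. Everything here is an assumption for illustration: the data are synthetic and linear by construction (so ridge wins by design), the function names are invented, and nothing in it uses Tahoe-100M or says anything about Tx1's actual performance.

```python
import numpy as np


def fit_ridge(X: np.ndarray, Y: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)


def mean_squared_error(Y_true: np.ndarray, Y_pred: np.ndarray) -> float:
    return float(np.mean((Y_true - Y_pred) ** 2))


def make_synthetic(n_perturbations=200, n_features=16, n_genes=50,
                   noise=0.1, seed=0):
    """Synthetic stand-in: descriptors X and expression shifts Y versus control."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_perturbations, n_features))
    W_true = rng.normal(size=(n_features, n_genes))
    Y = X @ W_true + noise * rng.normal(size=(n_perturbations, n_genes))
    return X, Y


def evaluate(seed: int = 0, n_train: int = 150):
    """Return (ridge_mse, mean_baseline_mse) on held-out perturbations."""
    X, Y = make_synthetic(seed=seed)
    X_train, X_test = X[:n_train], X[n_train:]
    Y_train, Y_test = Y[:n_train], Y[n_train:]
    W = fit_ridge(X_train, Y_train)
    ridge_mse = mean_squared_error(Y_test, X_test @ W)
    # Baseline: predict the average training shift for every held-out perturbation.
    mean_pred = np.tile(Y_train.mean(axis=0), (X_test.shape[0], 1))
    return ridge_mse, mean_squared_error(Y_test, mean_pred)


if __name__ == "__main__":
    ridge_mse, baseline_mse = evaluate()
    print(f"ridge MSE: {ridge_mse:.4f}, mean-shift baseline MSE: {baseline_mse:.4f}")
```

In a real benchmark, the held-out split would separate unseen drugs or cell lines, and a foundation model's predictions would be scored with the same metric against both of these references.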

Market and Buyer Risks

The FDA has created a dedicated AI-in-drug-development resource and issued guidance on using AI to support regulatory decision-making, emphasising a risk-based “credibility assessment” framework with a clearly defined context of use, verification, and validation. At the same time, a separate roadmap on reducing animal testing encourages the use of non-animal “New Approach Methodologies,” including computational models, but treats them as emerging tools that require supporting evidence. For Tahoe, this means enterprise customers will typically require fit-for-purpose validation studies, audit trails, and documentation aligned with these frameworks before using TahoeDive outputs or Tx1 scores in regulated decisions, which lengthens evaluation and procurement.

Pharma and large biotechs also operate under strict information-security, data-residency, and governance constraints for research data. AI-related regulatory programmes in pharmacovigilance and R&D highlight requirements around data lineage, model monitoring, and secure infrastructure for AI tools handling drug-development data. Since TahoeDive and hosted Tx1 variants depend on cloud compute over large biological datasets, Tahoe faces the risk that many buyers will insist on private-cloud or on-premises deployments with formal certifications, increasing implementation cost and extending sales cycles compared with lightweight desktop tools.

Competitive Risks

Vendors already embedded in single-cell and bioinformatics workflows provide overlapping functionality. 10x Genomics sells single-cell instruments and consumables and distributes Loupe Browser, a GUI tool for exploring 10x single-cell and spatial datasets, which is widely used in core facilities and translational labs. Paradigm4’s REVEAL: SingleCell platform is designed to store and query multi-study single-cell datasets at scale, providing interfaces for R, Python, and GUI-based analysis. These existing tools mean that many potential Tahoe customers already have established pipelines for single-cell data management and analysis, so Tahoe risks being treated as an optional overlay rather than a central platform.

At the same time, AI-driven drug-discovery companies are raising very large rounds to build proprietary models and data, including platform plays such as Isomorphic Labs and newer AI–biotech entrants with valuations in the hundreds of millions to billions of dollars. Open-source work on single-cell foundation models and perturbation benchmarks is also accelerating. Because Tahoe has open-sourced Tahoe-100M and Tx1, other groups can train and benchmark models on the same data, so Tahoe faces the risk that better-resourced competitors or open projects narrow the technical differentiation around perturbation-aware modelling.

User and Workflow Risks

TahoeDive is positioned as an AI agent that lets biologists query Tahoe-100M with natural-language prompts and automatically generates code and figures. Large language models are documented to produce hallucinations (fluent but incorrect outputs) in scientific and medical contexts, and recent surveys and clinical guides flag this as a primary risk for LLM use in research and healthcare. This means Tahoe must assume that some automatically generated narratives or analyses will require expert checking, and that enterprise users may initially restrict TahoeDive to exploratory analysis unless robust safeguards and review processes are in place.

Drug-discovery teams typically adopt new tools only when they show clear, quantitative improvement on internal datasets, such as better hit rates or reduced experimental burden, and industry reviews of AI in drug development emphasise this requirement for head-to-head comparisons. Until Tahoe can demonstrate performance gains from Tx1 and related models on proprietary pharma data, there is a risk that some prospective customers will limit pilots or delay broader roll-outs.

Macro and Operational Risks

Training and serving Tx1-class models on Tahoe-100M-scale data requires access to high-end GPUs and substantial storage. As of August 2025, pricing surveys report Nvidia H100 GPUs costing around $25K per card for direct purchase, with cloud rental rates ranging from roughly $2.80 to $10 per hour depending on provider, and note that GPU prices and availability remain volatile as demand surges. Tahoe, therefore, faces cost and capacity risk: changes in GPU pricing or supply can affect gross margins on any usage-based TahoeDive or API products and constrain how quickly new model versions are trained.
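The quoted price points allow a rough buy-versus-rent comparison, sketched below. It is deliberately simplified: it ignores power, hosting, networking, staffing, depreciation, and utilisation, all of which would matter in a real decision, and the function names are illustrative. Tahoe's actual GPU footprint has not been disclosed.

```python
HOURS_PER_DAY = 24


def break_even_hours(purchase_price_usd: float, rental_rate_usd_per_hour: float) -> float:
    """GPU-hours of rental that cost as much as buying one card outright."""
    if purchase_price_usd <= 0 or rental_rate_usd_per_hour <= 0:
        raise ValueError("prices must be positive")
    return purchase_price_usd / rental_rate_usd_per_hour


def rental_cost_usd(gpus: int, hours: float, rate_usd_per_hour: float) -> float:
    """Total rental cost for a job using `gpus` cards for `hours` each."""
    return gpus * hours * rate_usd_per_hour


if __name__ == "__main__":
    for rate in (2.80, 10.0):  # low and high ends of the quoted rental range
        hours = break_even_hours(25_000, rate)
        print(f"${rate:.2f}/h: {hours:,.0f} hours (~{hours / HOURS_PER_DAY:,.0f} days)")
```

At the low end of the rental range, break-even arrives after roughly 8,900 GPU-hours (about a year of continuous use); at the high end, after 2,500 hours (about 15 weeks). The spread shows how sensitive compute economics are to the rate a company can actually secure.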

Venture funding data show that global startup investment fell by about 38% between 2022 and 2023, with biotech IPO proceeds dropping from $16.0 billion in the first three quarters of 2021 to $3.4 billion in the same period of 2023, before a partial rebound driven by AI-focused deals. Reports in 2024–2025 describe ongoing layoffs and funding pressure in parts of the biotech sector, even as AI-biotech remains attractive to some investors. For Tahoe, which sells R&D tools and is still pre-revenue, this environment creates a risk that potential customers delay software purchases during downturns and that Tahoe may need to raise additional capital in a competitive “AI for drug discovery” funding landscape.

Summary

Drug development remains highly inefficient, with only ~7–8% of candidates reaching approval and particularly low success in oncology, often due to late-stage safety or efficacy failures that reflect limited understanding of drug effects across diverse human cell types. Tahoe Therapeutics addresses this gap by generating and modeling large-scale single-cell drug perturbation data, exemplified by its open-source Tahoe-100M dataset (~100 million single-cell profiles across ~60K experiments and 1.1K+ drugs), its Tx1 foundation models trained on these data, and TahoeDive, a no-code natural-language analysis tool.

While Tahoe has raised ~$42 million in funding (including a $30 million Series A at a reported ~$120 million post-money valuation) and operates in a competitive AI-driven drug discovery market, its strategy hinges on translating broad open adoption into monetizable proprietary assets, most notably a planned dataset of one billion cells and one million drug–patient interactions to be shared with a single partner. Against a backdrop of uneven biotech capital markets and GPU-intensive operating costs, Tahoe’s success depends on securing that partner, demonstrating clear performance gains over pharma baselines, and managing compute economics as it scales.

Important Disclosures

This material has been distributed solely for informational and educational purposes only and is not a solicitation or an offer to buy any security or to participate in any trading strategy. All material presented is compiled from sources believed to be reliable, but accuracy, adequacy, or completeness cannot be guaranteed, and Contrary LLC (Contrary LLC, together with its affiliates, “Contrary”) makes no representation as to its accuracy, adequacy, or completeness.

The information herein is based on Contrary beliefs, as well as certain assumptions regarding future events based on information available to Contrary on a formal and informal basis as of the date of this publication. The material may include projections or other forward-looking statements regarding future events, targets or expectations. Past performance of a company is no guarantee of future results. There is no guarantee that any opinions, forecasts, projections, risk assumptions, or commentary discussed herein will be realized. Actual experience may not reflect all of these opinions, forecasts, projections, risk assumptions, or commentary.

Contrary shall have no responsibility for: (i) determining that any opinions, forecasts, projections, risk assumptions, or commentary discussed herein is suitable for any particular reader; (ii) monitoring whether any opinions, forecasts, projections, risk assumptions, or commentary discussed herein continues to be suitable for any reader; or (iii) tailoring any opinions, forecasts, projections, risk assumptions, or commentary discussed herein to any particular reader’s objectives, guidelines, or restrictions. Receipt of this material does not, by itself, imply that Contrary has an advisory agreement, oral or otherwise, with any reader.

Contrary is registered with the Securities and Exchange Commission as an investment adviser under the Investment Advisers Act of 1940. The registration of Contrary in no way implies a certain level of skill or expertise or that the SEC has endorsed Contrary. Investment decisions for Contrary clients are made by Contrary. Please note that, although Contrary manages assets on behalf of Contrary clients, Contrary clients may take any position (whether positive or negative) with respect to the company described in this material. The information provided in this material does not represent any investment strategy that Contrary manages on behalf of, or recommends to, its clients.

Different types of investments involve varying degrees of risk, and there can be no assurance that the future performance of any specific investment, investment strategy, company or product made reference to directly or indirectly in this material, will be profitable, equal any corresponding indicated performance level(s), or be suitable for your portfolio. Due to rapidly changing market conditions and the complexity of investment decisions, supplemental information and other sources may be required to make informed investment decisions based on your individual investment objectives and suitability specifications. All expressions of opinions are subject to change without notice. Investors should seek financial advice regarding the appropriateness of investing in any security of the company discussed in this presentation.

Please see www.contrary.com/legal for additional important information.

Authors

Keerthikan Thirukkumar

Fellow

© 2026 Contrary Research · All rights reserved
