Since its founding in 2013, Recursion has focused on fundamentally shifting the way drugs are made with data and AI: moving from an emphasis on generating proprietary data through real-world experimentation toward an increasing reliance on in silico modeling and simulation. From those early days, Recursion has been building the components needed to virtualize key stages of the drug discovery process. Virtual cells, computational systems that can accurately simulate cellular and patient-level responses to therapeutic interventions, are core to this vision. They are built on top of Recursion’s massive, proprietary biological and chemical datasets, its AI models, and one of the industry’s most powerful supercomputers, BioHive-2.
Why are virtual cells needed? Because traditional drug discovery is still plagued by high uncertainty, high costs, long timelines, and 90% failure rates. Virtualizing key components of the drug discovery process would allow us to navigate biology’s enormous complexity – to better understand what is driving diseases and to identify completely new ways to treat them – with maximum speed and efficiency.
First, we needed better biological data to train patient-relevant AI models. There are a number of public datasets, of course, and many companies building AI models rely on them. These include The Cancer Genome Atlas, the UK Biobank, ChEMBL, and the Protein Data Bank. While all of these datasets provide important insights, they lack standardization and are often biased toward certain populations or protein types.
To train the best models able to generate novel biological insights that could meaningfully shift outcomes in drug discovery and create a true competitive moat, we needed to create our own datasets in an automated, standardized, and repeatable fashion.
Initially, Recursion’s efforts focused on phenomics – capturing images of different types of human cells in different states of perturbation.
Using automated labs, researchers at Recursion developed new ways to grow, freeze, thaw, and experiment with cells in large quantities. Over the course of more than a decade, we have produced hundreds of billions of cells across more than 50 different cell types.
We perturb these cells using CRISPR-Cas9 editing and by adding chemical compounds, and we run many replicates of each combination to create a more robust experimental signal. Each week, Recursion’s automated labs run up to 2.2 million of these experiments, collecting not just one data type but the multiple types of data needed to reconstruct what’s happening at a cellular level. In addition to Cell Painting and Brightfield imaging data, these layers include transcriptomics, ADME data, and patient data.
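As an illustration (not Recursion’s actual pipeline), here is a minimal sketch of how replicate-level readouts for a single perturbation might be collapsed into one robust signature. The function name, array shapes, and the choice of a median aggregate are all assumptions for this example.

```python
import numpy as np

def perturbation_signature(replicate_embeddings: np.ndarray) -> np.ndarray:
    """Collapse replicate-level readouts into one robust signature.

    replicate_embeddings: shape (n_replicates, embedding_dim), e.g. image
    embeddings from a phenomics model, one row per replicate well.
    The median is less sensitive to outlier wells than the mean.
    """
    return np.median(replicate_embeddings, axis=0)

# Hypothetical example: 8 replicate wells, 128-dimensional embeddings.
rng = np.random.default_rng(seed=0)
replicates = rng.normal(size=(8, 128))
signature = perturbation_signature(replicates)
print(signature.shape)  # (128,)
```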
To turn these data layers into real insights about diseases and how to treat them, we need AI models capable of processing and analyzing multiple data types at massive scale and finding connections among them. These models help us create cell-specific virtual maps for our partners in areas like neuroscience, maps that can quickly surface novel targets and hits. Ultimately, they are also helping us build a true virtual cell.
First, we use our deep resources of phenomics, transcriptomics, and other data to train models. Over time, those models learn to simulate the outcomes of phenomic and transcriptomic experiments across relevant cell types. Already, our models can accurately simulate phenomics experiments and the outcomes of large-scale drug screening. Teams at Recursion are now actively working on the next step in this evolution: the “explanation” function.
This explanation piece is a defining feature of how Recursion is approaching the virtual cell, as outlined in the recent paper Virtual Cells: Predict, Explain, Discover. We don’t want a virtual cell only to predict how human cells will respond to perturbations; we want it to explain to us, mechanistically, the reasoning behind its predictions. This will both expand our understanding of biology and build trust in the model.
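As a concrete sketch of what this predict-then-explain contract could look like in software, the hypothetical interface below separates the two steps. The class, its lookup table, and the example perturbation are invented stand-ins; a real virtual cell would wrap trained models, not a dictionary.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    response: dict[str, float]  # predicted per-gene expression change
    confidence: float           # the model's own uncertainty estimate

class ToyVirtualCell:
    """Hypothetical predict-then-explain interface for a virtual cell.

    A real system would wrap trained phenomics/transcriptomics models;
    this toy version uses a hard-coded lookup purely to show the contract.
    """

    _KNOWN = {
        # p53 knockout lowers expression of its targets CDKN1A and MDM2.
        ("hepatocyte", "knockout:TP53"): ({"CDKN1A": -1.2, "MDM2": -0.8}, 0.9),
    }

    def predict(self, cell_type: str, perturbation: str) -> Prediction:
        response, confidence = self._KNOWN.get((cell_type, perturbation), ({}, 0.1))
        return Prediction(response=response, confidence=confidence)

    def explain(self, prediction: Prediction) -> list[str]:
        # A real model might attribute the response to pathways or modules.
        if "CDKN1A" in prediction.response:
            return ["TP53 loss reduces transcription of p53 targets CDKN1A and MDM2"]
        return ["No mechanism recovered; prediction confidence too low"]

cell = ToyVirtualCell()
pred = cell.predict("hepatocyte", "knockout:TP53")
print(pred.confidence, cell.explain(pred))
```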
Recursion is using a number of AI models to lay the foundation for the virtual cell.
Lab experiments still play a critical role, but primarily in instances where Recursion needs additional confidence in the model’s predictions.
This active learning paradigm creates a constant, iterative feedback loop (sketched in code after the list):
- The models flag the predictions they are least certain about.
- Experiments are run specifically to resolve those uncertainties.
- The results are fed back to retrain the models, making them more accurate and reducing the need for future physical testing.
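Here is a minimal sketch of that loop under stated assumptions: the “model” is a toy lookup table, `run_lab_experiment` stands in for a physical assay, and the batch size and uncertainty threshold are arbitrary. The structure is what matters: predict, flag uncertainty, experiment, retrain.

```python
import random

def predict_with_uncertainty(model: dict, perturbation: str) -> tuple[float, float]:
    """Return (prediction, uncertainty); a stand-in for a trained model.

    Perturbations the model has already 'seen' get low uncertainty;
    unseen perturbations are maximally uncertain.
    """
    if perturbation in model:
        return model[perturbation], 0.05
    return 0.0, 1.0

def run_lab_experiment(perturbation: str) -> float:
    """Stand-in for a physical experiment returning a measured response."""
    return random.uniform(-1.0, 1.0)

model: dict[str, float] = {}  # toy "model": memorized measurements
candidates = [f"compound_{i}" for i in range(100)]
UNCERTAINTY_THRESHOLD = 0.5
BATCH_SIZE = 10

for round_number in range(3):  # the iterative feedback loop
    # 1. Predict everywhere; keep only cases the model is unsure about.
    uncertain = [c for c in candidates
                 if predict_with_uncertainty(model, c)[1] > UNCERTAINTY_THRESHOLD]
    # 2. Run experiments only where they resolve the most uncertainty.
    results = {c: run_lab_experiment(c) for c in uncertain[:BATCH_SIZE]}
    # 3. Feed results back to "retrain" (here: simply update) the model.
    model.update(results)
    print(f"round {round_number}: {len(uncertain)} uncertain, model knows {len(model)}")
```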
We apply the same resource-efficient approach to our predictive chemistry workflows. Our models are equipped with uncertainty estimation, and during closed-loop design-make-test-analyze (DMTA) cycles, the model autonomously suggests new compounds to test, prioritizing those that will most improve its understanding and performance.
Only the molecules likely to yield the greatest learning are selected for experimental validation, maximizing the value of each experiment.
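To illustrate what uncertainty-aware selection can look like, the sketch below scores candidate compounds with an upper-confidence-bound-style acquisition function, using the spread of an ensemble’s predictions as the uncertainty estimate. The weights, data, and ensemble approach are assumptions for this example, not the specifics of Recursion’s workflow.

```python
import statistics

def acquisition_score(ensemble_predictions: list[float], beta: float = 1.0) -> float:
    """Upper-confidence-bound-style score for one candidate compound.

    ensemble_predictions: predicted potency (e.g. pIC50) from each model in
    an ensemble; their spread serves as a simple uncertainty estimate.
    Higher scores favor compounds that look potent AND that the model is
    unsure about, i.e. the ones that would teach it the most.
    """
    mean = statistics.mean(ensemble_predictions)
    spread = statistics.stdev(ensemble_predictions)
    return mean + beta * spread

# Hypothetical ensemble predictions for three candidate compounds.
candidates = {
    "cmpd_A": [6.1, 6.0, 6.2],  # potent, and the model is confident
    "cmpd_B": [5.0, 7.5, 6.3],  # the ensemble disagrees: highly informative
    "cmpd_C": [4.2, 4.3, 4.1],  # weak and certain: low priority
}
ranked = sorted(candidates, key=lambda c: acquisition_score(candidates[c]),
                reverse=True)
print(ranked)  # order in which to synthesize and test
```

In a real DMTA cycle, the top-ranked compounds would be synthesized and assayed, and the results fed back into the models, closing the loop.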
“Ideally, our computational models are giving us enough confidence so that we can rule experiments in or out based on those predictions,” says Dan Cohen, President of Valence Labs, Recursion’s AI research engine.
Recursion sees virtual cells as the first step toward even more complex predictions, ultimately scaling up to virtual organs and virtual patients. It’s an ambitious goal, but by applying the same approach of predict, explain, discover, there are already signs that we can create simulations that not only help us better understand diseases and how they arise, but also point us to more effective ways to treat them.
------------------------
1. How does Recursion define the virtual cell? Recursion defines the virtual cell as a computational system that can accurately simulate cellular and patient-level responses to therapeutic interventions.
2. Why did Recursion choose to build its own datasets instead of relying on public ones? While public datasets provide valuable insights, they often lack standardization and can be biased toward certain populations or protein types. To train AI models capable of generating novel biological insights and to create a competitive moat, Recursion needed fit-for-purpose data generated in an automated, standardized, and repeatable fashion. This led to the creation of its own massive, proprietary datasets, starting with phenomics and expanding to other layers, including transcriptomics, ADME, and patient data.
3. What are the different data layers Recursion is using to build a virtual cell? Recursion collects multiple layers of data to reconstruct cellular activity. These include phenomics (Cell Painting and Brightfield imaging), transcriptomics, ADME data, and patient data.
4. How does Recursion define a successful virtual cell? According to their recent paper, a virtual cell shouldn't just predict how cells respond to perturbations; it must also explain the reasoning behind those predictions in mechanistic terms. This "Predict, Explain, Discover" framework is crucial for building trust in the model and expanding biological understanding, moving beyond simple black-box predictions.
5. What role do lab experiments play if the goal is virtual modeling? Lab experiments remain critical for validating predictions but are used more strategically in an "active learning" loop. Instead of testing everything, the AI models identify areas where their predictions are uncertain. Experiments are then run specifically to address those uncertainties. The results are fed back into the model to retrain it, making it smarter and reducing the need for future physical testing.
------------------------
Author: Brita Belli, Senior Communications Manager, Recursion