A Deep Dive into Screening 36 Billion Compounds: Q&A with Stephen MacKinnon

Written By:
Stephen MacKinnon

As an exciting milestone in Recursion’s drug discovery journey, we have successfully predicted the protein target interactions for approximately 36 billion chemical compounds in the Enamine REAL Space library, taking an important step to bridge the gap between the protein universe and the chemical universe. This was achieved using NVIDIA’s DGX Cloud supercomputing power and our own NVIDIA-based supercomputer, BioHive-1.

We sat down with Stephen MacKinnon, Recursion’s Vice President of Digital Chemistry and one of the chief inventors of the MatchMaker technology that we used to achieve this milestone following our acquisition of Cyclica earlier this year.

With this exciting scientific accomplishment, why are these findings valuable and what are the implications for Recursion’s drug discovery efforts?

Bridging the chemical universe and the protein universe is the beginning of a unique moment, one that enables us to explore protein targets of interest and starting points for potential new medicines. Other computational screening approaches have shown that the larger the library screened, the higher the hit rate. So, by virtually screening billions of molecules within the Enamine REAL Space (reported to be the world’s largest searchable chemical library), we’ll be able to efficiently search across massive chemical spaces and rapidly identify the most promising compounds for drug discovery profiling. It’s as if we’ve surveyed all rivers on Earth for mineral deposits using satellite imagery, such that we can prioritize the resource-intensive work of prospecting for gold.

We can also use this library to examine our drug discovery portfolio more widely, allowing us to look for biological signals that are specific to individual proteins and see how these compare across other targets of interest. Similar to Recursion’s phenomics platform, the scalability of MatchMaker enables a “high-dimensional” view of biochemistry: activity is predicted not just for a single target, but for many at the same time. This accomplishment positions us to advance our programs more efficiently going forward, filling in gaps to prioritize the most promising therapeutic compounds early in the research process.

This accomplishment was made possible by MatchMaker, a digital chemistry tool developed by Cyclica, which Recursion acquired in May. What are the capabilities of MatchMaker and how have these tools been integrated into Recursion’s datasets?

MatchMaker is a machine learning model trained to predict ligand-protein interactions, identifying potential binding between any small molecule and any protein in the human proteome. The model is particularly powerful in that it can make these predictions even with little or no data on a given protein target, because it relies on the structure of individual protein pockets rather than that of the entire protein. As an AI-enabled platform trained on millions of known ligand-protein interactions, MatchMaker can also be used to learn biophysical patterns of proteins and ligands, adding chemical insights and improving our ability to predict a drug’s mechanism of action. As announced, we used MatchMaker and NVIDIA’s DGX Cloud supercomputing power to successfully screen the Enamine REAL Space, predicting protein target interactions for approximately 36 billion compounds. In total, this screen digitally evaluated more than 2.8 quadrillion small molecule-target pairs.
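As a rough consistency check on the two figures above, dividing the number of scored pairs by the number of compounds gives the implied number of targets evaluated per compound. This is a back-of-the-envelope sketch only: it assumes every compound was scored against the same set of targets, which the figures suggest but do not state, and the variable names are illustrative.

```python
# Figures quoted above (both are approximate).
compounds = 36e9        # ~36 billion compounds in the Enamine REAL Space
scored_pairs = 2.8e15   # "more than 2.8 quadrillion" small molecule-target pairs

# Assumption: every compound is scored against the same set of targets.
implied_targets_per_compound = scored_pairs / compounds

print(f"~{implied_targets_per_compound:,.0f} targets per compound")
```

Under that assumption, each compound would have been scored against somewhere around 78,000 targets, which fits the description of a proteome-wide screen.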

What sets MatchMaker apart from other drug-target interaction (DTI) prediction models?

In contrast to other literature-reported DTI prediction models, MatchMaker introduces structural context to its protein representations, using 3D protein structures (both experimentally determined and modeled) as a way to extrapolate to novel protein-ligand systems across the proteome. In addition, both training and evaluation are significantly less computationally intensive and much more scalable with MatchMaker than with most other DTI models.

This enables three core advantages: First, this predicted data layer can be used to determine which wet-lab experiments should be executed to advance programs faster across a wide range of targets and chemical space. Second, this predicted data layer can be used as part of Recursion’s multi-modal dataset to better understand biological activity across programs quickly and at scale. Finally, this approach can pre-screen for more computationally expensive precision modeling techniques implemented by Recursion’s computational and digital chemistry teams, to more efficiently advance programs.
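The pre-screening role described in the last point can be pictured as a funnel: a cheap score is computed for every compound, and only a short list moves on to the expensive step. The sketch below is a minimal illustration of that selection step, not Recursion’s pipeline; `top_k_hits`, the compound IDs, and the scores are all made up for the example.

```python
import heapq
from typing import Iterable, List, Tuple

def top_k_hits(scored: Iterable[Tuple[str, float]], k: int) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (compound_id, score) pairs.

    heapq.nlargest consumes the input as a stream and holds only k items
    at a time, so the full scored library never has to sit in memory.
    """
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Hypothetical cheap scores for a handful of compounds (illustrative values).
cheap_scores = [
    ("cmpd-A", 0.12),
    ("cmpd-B", 0.91),
    ("cmpd-C", 0.47),
    ("cmpd-D", 0.88),
    ("cmpd-E", 0.05),
]

# Only this shortlist would be passed to a slower, more precise method.
shortlist = top_k_hits(cheap_scores, k=2)
print(shortlist)
```

Because the selection works on a stream, the same pattern applies whether the input holds five compounds or billions; only the value of `k` determines how much work is handed to the more expensive stage.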

What is the Enamine REAL Space, and how are libraries like this changing drug discovery?

Ultra-large virtual chemistry databases like the Enamine REAL Space are defined by collections of tens of thousands of chemical building blocks, along with the rules that define the different ways these blocks can be combined to create molecules. It’s a bit like a catalog of Lego parts with instructions: in the case of the Enamine REAL Space, if you put the right Legos together in the right order, you should get about 36 billion different combinations. Most importantly, almost all of these 36 billion can be made to order in a short period of time.
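To make the Lego analogy concrete, here is a toy sketch of how a space defined by building blocks and combination rules grows multiplicatively. The pools, block names, and the single three-component "rule" are hypothetical placeholders; this is not Enamine’s data or reaction set.

```python
from itertools import product
from math import prod

# Hypothetical building-block pools (real spaces hold tens of thousands of blocks).
pool_a = ["A1", "A2", "A3"]
pool_b = ["B1", "B2"]
pool_c = ["C1", "C2", "C3", "C4"]

def enumerate_products(*pools):
    """Yield one virtual product per combination: one block from each pool."""
    for combo in product(*pools):
        yield "+".join(combo)

def space_size(*pools):
    """Count the products of a rule without enumerating any of them."""
    return prod(len(pool) for pool in pools)

virtual_products = list(enumerate_products(pool_a, pool_b, pool_c))
print(virtual_products[:3])
print(space_size(pool_a, pool_b, pool_c))  # 3 * 2 * 4 = 24
```

The size is the product of the pool sizes, so modest catalogs of blocks can define billions of molecules. That is also why such a space can be stored compactly as blocks plus rules rather than as an explicit list of structures.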

These ultra-large chemistry spaces have become increasingly useful in recent years as the computational techniques used to navigate them have grown more sophisticated. At Recursion, we have the advantage of linking this very large chemical space to a very large protein space in a way that allows us to explore different molecules on the basis of their biological activity. We can do these searches at a massive scale very efficiently before ordering and testing the most promising compounds, which in turn feeds more data back into our operating system.

In addition to Recursion’s BioHive-1, why is it important that we now have access to NVIDIA’s supercomputing resources?

Having access to supercomputing hardware is a must for conducting large screens like this, and NVIDIA has been a phenomenal partner in helping us access the best-of-the-best of these capabilities.

But beyond the hardware itself, computations at this scale require every part of the process to be optimized to reduce any bottlenecks. Remember, we are talking about predicting 2.8 quadrillion interactions, which is a massive number. If we evenly distributed $2.8 quadrillion to everyone on Earth, it would amount to roughly $350,000 per person. It’s not just about putting it on a bigger computer; we must strategically reshape where and how calculations occur.
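The scale is easy to check with a few lines of arithmetic. The sketch below reproduces the per-person figure using a rounded world population of 8 billion (an assumption), and then shows the sustained throughput a screen of this size would need for a few hypothetical wall-clock budgets. The durations are illustrative only; they are not how long the actual screen took.

```python
pairs = 2.8e15            # predicted small molecule-target pairs
world_population = 8.0e9  # assumption: rounded to 8 billion people

per_person = pairs / world_population
print(f"{per_person:,.0f} per person")  # 350,000 per person

def required_rate(total_pairs: float, days: float) -> float:
    """Pairs per second needed to finish total_pairs within the given days."""
    seconds = days * 24 * 60 * 60
    return total_pairs / seconds

# Hypothetical time budgets, chosen only to illustrate the scale.
for days in (1, 7, 30):
    print(f"{days:>2} days -> {required_rate(pairs, days):.2e} pairs/second")
```

Even with a month-long budget, the required rate is on the order of a billion predictions per second, which is why every stage of the pipeline has to be tuned rather than simply moved to larger hardware.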

For example, one solution involved transitioning from CPUs (our primary tool at Cyclica) to GPUs (the primary tool at Recursion and with our partners at NVIDIA). Having done this, we were able to reconfigure the MatchMaker neural network and make use of special NVIDIA GPU features like mixed-precision Tensor Cores to accelerate the calculation. Addressing these kinds of bottlenecks dramatically increased our scaling ability, and now we’re able to run screens over 1 million times larger than Cyclica’s previous capacity.
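The numerical idea behind mixed precision can be emulated in plain Python: inputs are stored in 16-bit floats, while sums are accumulated at higher precision. The sketch below is a CPU-only illustration and has nothing to do with the MatchMaker code or with how Tensor Cores are programmed. The helper names are hypothetical, and rounding is emulated with the standard `struct` module’s half (`"e"`) and single (`"f"`) formats.

```python
import struct

def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (16-bit) value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_single(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision (32-bit) value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def dot(a, b, accumulate):
    """Dot product with half-precision inputs; `accumulate` rounds the running sum."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = accumulate(acc + to_half(x) * to_half(y))
    return acc

a = [0.1] * 4096
b = [0.1] * 4096
reference = sum(x * y for x, y in zip(a, b))  # double precision throughout

mixed = dot(a, b, accumulate=to_single)     # 16-bit inputs, 32-bit accumulator
all_half = dot(a, b, accumulate=to_half)    # 16-bit inputs, 16-bit accumulator

print(f"reference={reference:.4f} mixed={mixed:.4f} all_half={all_half:.4f}")
```

In this toy case the all-half version stalls once the running sum grows too large for small increments to register, while the mixed version stays close to the reference. Keeping the accumulator at higher precision is what lets low-precision inputs be used for speed without giving up most of the accuracy.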

This milestone occurred less than 90 days after Cyclica joined Recursion. How have the teams been working together to pull this off?

It was clear there was a shared mindset across the team that “nothing is impossible” – people embraced the challenge head-on. It was also impressive to watch folks come together who were deep experts in their respective fields – neural networks, digital chemistry, engineering, etc. As we tackled one challenge after the next, tapping into that expertise became essential to helping us optimize the computations that made this possible.

Recursion’s newly integrated teams of people, experience and abilities gave a fresh perspective on the way calculations happen and how we can optimize these processes, helping to accelerate our drug discovery programs. It was thrilling to see the combined team embodying Recursion’s values of “We Are One Recursion” and “We Act Boldly With Integrity” to bring to life another value – “We Deliver.”