The Problem with Drug Discovery
It’s no secret that developing drugs is a difficult process. Estimates of the cost of developing a new drug and bringing it to market range from $1.3 billion USD to nearly $2.9 billion USD, depending on who you ask. Much of the low-hanging fruit has already been picked, and the changing health landscape of modern society compounds the challenge. Our aging society is now more heavily affected by heart disease, dementia, and cancer than our progenitors were.
Simulated docking of the inhibitory ligand N3 and the COVID-19 coronavirus protease Mpro. The ligand shown in green is the actual structure (from Jin et al. 2020), and the light pink N3 ligand represents the docking simulation (using the open source software Smina, a fork of AutoDock Vina). Volumetric convolutional neural networks offer an efficient alternative to costly physics simulations of biomolecules, and deep learning drug discovery startups are taking full advantage of CNN architectures.
Thanks to vaccines and antibiotics, mortality attributed to infectious disease has decreased substantially over the last 100 years. Even so, as we are being reminded during the COVID-19 pandemic, the connectedness of modern society yields distinct challenges in the speed and ubiquity with which a novel pathogen can spread. Drug developers also have an increasingly strong placebo effect to deal with; many of the antidepressants and pain relief drugs currently in use might not meet the threshold of efficacy if they were to enter double-blinded clinical trials today, and even invasive cardiac interventions can be confounded by the placebo effect.
Deep learning, the use of many-layered artificial neural networks loosely based on the connectivity of the central nervous systems of animals, has found growing success thanks to the capability of deep networks to approximate and recognize complex patterns. As deep learning models continue to improve they can oftentimes deliver results faster than humans, and with greater (or, at least equal) accuracy. As a result there are plenty of options in deciding how to apply neural networks to a given problem.
In drug discovery, computer vision for microscopy is one area where deep learning can be used to create a vast representational space for studying cellular models of disease. But other drug discovery startups are taking this one step further, eschewing cells completely, and even skipping over biochemical assays to concentrate on virtual drug screening.
The physics at the scale of biochemistry, where small molecules interact with biomolecules, operates entirely differently from the world of our everyday experience. Thermal motion, electrostatic repulsion and attraction, and yes, a great many quantum effects all contribute to a counterintuitive environment, at least to humans used to the macro-scale world.
Not to mention that the complexity of biomolecular machinery surpasses that of humanity’s most complicated engineering projects. Just predicting the folded, static structures of proteins based on their sequence blueprints is a grand challenge of biology (also being disrupted by deep learning), intractable to a brute force solution within the span of our universe’s projected lifetime.
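To give a feel for why brute force is hopeless, here is a back-of-the-envelope calculation in the spirit of Levinthal's paradox. Every number in it is an assumption chosen for illustration (a 100-residue protein, 3 conformations per residue, and an optimistic sampling rate); none of them come from this post.

```python
# Assumed, illustrative numbers only; real proteins and sampling rates vary widely.
residues = 100                    # length of a smallish protein (assumption)
conformations_per_residue = 3     # coarse guess at backbone states per residue (assumption)
samples_per_second = 1e13         # optimistic rate of trying conformations (assumption)
seconds_per_year = 365.25 * 24 * 3600
age_of_universe_years = 1.38e10   # roughly 13.8 billion years

# Exhaustive search has to visit every combination of per-residue states.
total_conformations = conformations_per_residue ** residues
years_to_enumerate = total_conformations / samples_per_second / seconds_per_year
ratio_to_universe_age = years_to_enumerate / age_of_universe_years

print(f"conformations to try: {total_conformations:.2e}")
print(f"years to enumerate:   {years_to_enumerate:.2e}")
print(f"multiples of the universe's current age: {ratio_to_universe_age:.2e}")
```

Even with these generous assumptions, the enumeration time exceeds the current age of the universe by many orders of magnitude, which is why approximation methods are attractive.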
In his lecture “Simulating Physics with Computers” one of Feynman’s solutions to the exponential explosion in the computational requirements of simulating physics is to use a quantum computer. But for all practical purposes we are still waiting for quantum supremacy, the threshold at which a quantum computer can provably compute what a non-quantum computer cannot. Deep learning is likewise a technology with its own combinatorial explosions, and we’ve seen deep neural networks applied to approximating classic physics problems with massive speed-ups and varying degrees of success. As we’ll explore in this post, when handled properly deep learning is a good option for approximating the exponentially complex physics of molecular bio-activity.
Virtual Screening Glossary
VS/LBVS/SBVS: Virtual Screening/Ligand-Based Virtual Screening/Structure-Based Virtual Screening: Computational screening of drug candidates based on molecular characteristics or structure.
HTS: High Throughput Screening: The use of automation to perform highly parallel laboratory screening experiments in cell culture or biochemical assays.
Biomolecules: In this post biomolecules refer to biological macromolecules, which include lipids (fats), proteins, carbohydrates, and nucleic acid polymers. For our discussion a typical biomolecule drug target is a protein. Proteins can function as enzymes, mechanical actuators, and transport or signalling molecules in cells.
Small Molecule: Unlike biological macromolecules, small molecules are organic compounds with a low molecular weight, such as ligands, composed of a few tens to hundreds of atoms. These differ from protein drugs such as monoclonal antibodies and fusion proteins, which are much larger and have hundreds of residues, each made up of some tens of atoms.
Ligand: A small molecule that can bind to a specific biomolecule, potentially altering its activity or blocking its function.
QSAR: Quantitative Structure-Activity Relationship: The relationship between the structures of a biomolecule and small molecule drug candidate and their activity, e.g. the propensity for a small molecule to bind and block the normal activity of a viral protease.
The Approach to Using Deep Learning in Drug Discovery
Drug discovery is typically a matter of screening vast chemical libraries for activity against a specific target molecule or phenotype. The modern approach is a marked departure from the days of serendipitous drug discovery in bioprospecting, where a world-changing discovery was a matter of keeping a somewhat unkempt lab.
The conventional way to find promising drug candidates is via high-throughput screening (HTS), and we’ve already touched on how automation and deep learning data science is being used to revolutionize that approach. To speed things up further, even beyond what can be accomplished with automation in lab experiments, virtual screening (VS) offers a computational approach to finding drug candidates.
The workflow for virtual drug screening involves the familiar steps of training, evaluation, and deployment. A filtered dataset is split into training, validation, and test sets, and a model is trained and evaluated on those splits using supervised learning. The trained model is then deployed as a screen on a dataset of interest, replacing the arduous process of physically screening compounds in a laboratory. The positives found in the screen are then verified by lab work and, if successful, move on to clinical trials.
The virtual screening workflow will be familiar to data scientists and deep learning practitioners from other practical applications, with the ubiquitous “data-munging” or data cleaning step replaced by filtering a small molecule library to exclude likely false positives and unrealistic drug candidates (e.g. molecules that are too large to realistically fit in the biomolecule target’s binding pocket).
After filtering, familiar training, test, and validation splits are used to train and evaluate the model before deploying on an unseen virtual small molecule library. Hits found in the machine learning screen may then be verified by chemical, cellular, and/or model organism assays before entering clinical trials. A dataset for virtual screening may consist of either a library of chemical and molecular properties of small molecules with known activity for a given target, or in the more general case, structural information of both the target and drug candidates.
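The filter-then-split part of this workflow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: the toy library, the `mol_weight` field, the 500-unit weight cutoff, and the 80/10/10 split are all hypothetical choices made for the example.

```python
import random

def make_toy_library(n, seed=0):
    """Build a hypothetical library of records; real libraries hold structures or descriptors."""
    rng = random.Random(seed)
    return [
        {"id": f"mol{i}", "mol_weight": rng.uniform(100.0, 900.0), "active": rng.random() < 0.1}
        for i in range(n)
    ]

def filter_library(library, max_weight=500.0):
    """Drop molecules assumed too large to be realistic candidates (cutoff is an assumption)."""
    return [m for m in library if m["mol_weight"] <= max_weight]

def split_dataset(records, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle a copy of the records, then cut it into train / validation / test partitions."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

library = make_toy_library(1000)
filtered = filter_library(library)
train, val, test = split_dataset(filtered)
print(len(library), len(filtered), len(train), len(val), len(test))
```

A model would then be fit on `train`, tuned against `val`, and scored once on `test` before being run over an unseen library. A purely random split is the simplest choice; it may overstate performance when very similar molecules land on both sides of the split.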
The 2 Major Strategies for Virtual Screening
The two major strategies for virtual screening are called ligand-based and structure-based, or LBVS and SBVS, respectively. Ligand-based VS takes molecular and chemical properties of small molecules as inputs and predicts whether the compounds will be active against a target, based on their similarity to known active compounds for that target. Structure-based virtual screening relies on structure information for both the drug target and the small molecule, placing the two together in simulation and predicting whether they will bind.
While both approaches to virtual screening are vastly less resource intensive than high-throughput screening with laboratory assays, LBVS is typically an easier problem and faster to compute. On the other hand, LBVS requires a training dataset that includes known active compounds. This may seem counterintuitive because it means performing a drug screen for a target that already has known active compounds, but it may be desirable in order to find alternative treatments with decreased side effects, for example.
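One simple way to make "similarity to known actives" concrete is the Tanimoto (Jaccard) coefficient between molecular fingerprints. The sketch below is an illustrative baseline rather than the method of any company mentioned here; the fingerprints are made-up sets of "on" bit indices, and the candidate names are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints stored as sets of 'on' bit indices."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

def rank_by_similarity(candidates, known_actives):
    """Score each candidate by its best similarity to any known active, highest first."""
    scored = [
        (max(tanimoto(fp, active) for active in known_actives), name)
        for name, fp in candidates.items()
    ]
    return sorted(scored, reverse=True)

# Made-up fingerprints for illustration only.
known_actives = [{1, 4, 7, 9}, {1, 4, 8, 12}]
candidates = {
    "cand_a": {1, 4, 7, 10},
    "cand_b": {2, 3, 5, 6},
    "cand_c": {1, 4, 8, 12},
}
ranking = rank_by_similarity(candidates, known_actives)
for score, name in ranking:
    print(f"{name}: {score:.2f}")
```

Here `cand_c` matches a known active exactly and ranks first, while `cand_b` shares no bits with either active and ranks last. In practice the fingerprints would come from a cheminformatics toolkit, and a trained classifier would usually replace the nearest-neighbor ranking.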
Structure-based screening is more general, and SBVS is conventionally accomplished in physics simulations of small molecules situated in the binding pocket of a protein target. The goodness of fit is evaluated in terms of traits like distances between atoms and their electrostatic interactions. In terms that may be more familiar to deep learning practitioners, LBVS is analogous to training on a vector of input features, and presents a scaled up version of the problem of classification based on the petal widths and lengths of the iris dataset.
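To illustrate what a physics-style goodness-of-fit score for SBVS looks like, the sketch below sums a Lennard-Jones-style contact term and a Coulomb term over all ligand-pocket atom pairs. It is a toy in arbitrary units: the parameters `sigma`, `epsilon`, and `coulomb_k`, the atom dictionaries, and the coordinates are assumptions for the example, and real docking programs use far more elaborate, carefully calibrated scoring functions.

```python
import math

def pair_score(atom_a, atom_b, sigma=3.5, epsilon=0.1, coulomb_k=1.0):
    """Toy pairwise energy: Lennard-Jones-style term plus Coulomb term (arbitrary units)."""
    r = math.dist(atom_a["xyz"], atom_b["xyz"])
    lj = 4.0 * epsilon * ((sigma / r) ** 12 - (sigma / r) ** 6)  # steep repulsion up close
    coulomb = coulomb_k * atom_a["charge"] * atom_b["charge"] / r  # opposite charges attract
    return lj + coulomb

def pose_score(ligand_atoms, pocket_atoms):
    """Sum pair scores over every ligand-pocket atom pair; lower means a better fit."""
    return sum(pair_score(l, p) for l in ligand_atoms for p in pocket_atoms)

# One negatively charged pocket atom and one positively charged ligand atom (hypothetical).
pocket = [{"xyz": (0.0, 0.0, 0.0), "charge": -0.5}]
near = pose_score([{"xyz": (4.0, 0.0, 0.0), "charge": 0.5}], pocket)
far = pose_score([{"xyz": (8.0, 0.0, 0.0), "charge": 0.5}], pocket)
clash = pose_score([{"xyz": (2.0, 0.0, 0.0), "charge": 0.5}], pocket)
print(near, far, clash)
```

The ligand atom at a comfortable contact distance scores lower (more favorable) than the distant one, while the overlapping pose is heavily penalized. Evaluating many such pair terms over many candidate poses is what makes simulation-based screening expensive.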
SBVS is more closely related to an image classification task with raw pixels as inputs, such as the MNIST hand-written digits dataset. As a result, SBVS is well-suited to machine learning models that can learn salient features in an end-to-end fashion, so it is not surprising that deep artificial neural networks perform well on the task. While deep learning models can be used to fit an LBVS dataset, overfitting is a major concern, and shallower, more traditional machine learning techniques such as support vector machines tend to excel at LBVS.
Based on the analogy to computer vision from raw pixels, we can expect SBVS to benefit from the spatial invariance and locality characteristics of sharing weights in convolutional neural networks. Indeed, using 3D convolutional kernels to match the dimensionality of biomolecular structure data, convolutional neural networks like AtomNet from Atomwise and DeeplyTough from Benevolent AI can screen millions to billions of potential drug candidates by evaluating small molecules as they fit into the binding pockets of protein drug targets.
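Before a 3D convolutional network can consume structure data, atom coordinates have to be turned into a voxel grid, the 3D counterpart of an image. The sketch below is a minimal single-channel occupancy grid; the grid size, the resolution, and the coordinates are assumptions, and real systems typically use several channels (for example, one per atom type) and smoother atom densities.

```python
def voxelize(atoms, grid_size=8, resolution=1.0, origin=(0.0, 0.0, 0.0)):
    """Mark each voxel containing at least one atom centre; atoms outside the box are skipped."""
    grid = [[[0.0] * grid_size for _ in range(grid_size)] for _ in range(grid_size)]
    for x, y, z in atoms:
        # Floor division maps a coordinate to the index of the voxel that contains it.
        ix = int((x - origin[0]) // resolution)
        iy = int((y - origin[1]) // resolution)
        iz = int((z - origin[2]) // resolution)
        if all(0 <= i < grid_size for i in (ix, iy, iz)):
            grid[ix][iy][iz] = 1.0
    return grid

# Hypothetical atom centres; the last two fall outside the 8 x 8 x 8 box.
atoms = [(0.5, 0.5, 0.5), (3.2, 1.9, 7.99), (9.0, 0.0, 0.0), (-0.1, 2.0, 2.0)]
grid = voxelize(atoms)
occupied = sum(v for plane in grid for row in plane for v in row)
print(occupied)
```

The resulting grid plays the role that a pixel array plays in 2D computer vision: a fixed-size tensor that convolutional kernels can slide across.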
Like more common 2D convolutional networks that multiply input data with a sliding square window, 3D convolutional networks build up a hidden layer of features by multiplying input data and hidden layers with a sliding cube window.
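The sliding cube window can be written out explicitly in plain Python. This is a deliberately naive sketch (no padding, stride 1, one channel, a cubic kernel) meant only to show the arithmetic; in practice one would use an optimized layer such as `torch.nn.Conv3d` in PyTorch. The grid and kernel values are made up for the example.

```python
def conv3d_valid(volume, kernel):
    """Slide a cubic kernel over a 3D grid and sum the elementwise products at each position."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    k = len(kernel)  # kernel is assumed to be k x k x k
    out = []
    for z in range(depth - k + 1):
        plane = []
        for y in range(height - k + 1):
            row = []
            for x in range(width - k + 1):
                total = 0.0
                for dz in range(k):
                    for dy in range(k):
                        for dx in range(k):
                            total += volume[z + dz][y + dy][x + dx] * kernel[dz][dy][dx]
                row.append(total)
            plane.append(row)
        out.append(plane)
    return out

# A 4 x 4 x 4 grid with a single occupied voxel and a 2 x 2 x 2 averaging kernel.
grid = [[[0.0] * 4 for _ in range(4)] for _ in range(4)]
grid[1][1][1] = 1.0
kernel = [[[0.125] * 2 for _ in range(2)] for _ in range(2)]
features = conv3d_valid(grid, kernel)
print(len(features), len(features[0]), len(features[0][0]))
```

The same kernel weights are reused at every position, which is the weight sharing that gives convolutional networks their locality and translation-tolerant behavior. The output here is a 3 x 3 x 3 feature map in which the eight windows overlapping the occupied voxel each respond with 0.125.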
Supervised Learning is the Dominant Approach Right Now in Virtual Screening Drug Discovery
Convolution for structural data, in which the 2D pixel-wise convolutional kernels are replaced by voxel-wise 3D kernels.
Supervised learning is the dominant approach in industry ML, and it’s the paradigm most relevant to virtual screening drug discovery. In practice it’s not enough to develop a brilliant neural network architecture if you do not have a dataset of sufficient size and quality. To overcome the significant hurdle of overfitting, which arises from a mismatch between the fitting power of the model and the size and complexity of the training dataset, VS startups have turned to partnerships with established pharmaceutical companies. Public datasets like ZINC and PubChem can be used for VS, but pharmaceutical companies tend to have their own massive proprietary datasets, amplifying the value proposition of collaboration between traditional drug companies and machine learning startups.
For example, Atomwise fosters a multitude of pharma partnerships including Charles River Laboratories, Eli Lilly, and Bayer. Similarly, twoXAR partnerships include SK Biopharmaceuticals and Ono, while Benevolent AI has agreements with Novartis and AstraZeneca. Academic collaborations are also important, and Atomwise academic partners include a headline project with the lab of Professor Xinnan Wang at Stanford University, an endeavor that made major strides in 2019 by showing that a leading drug candidate predicted by a conv-net alleviated symptoms and improved biomarkers in a fruit fly model of Parkinson’s disease.
What’s in Store for the Future of AI-Enabled/Deep Learning in Drug Discovery?
| Company | Drug Discovery Strategy | ML Tools |
| Benevolent AI | SBVS | 3D CNNs |
| twoXAR | Proprietary VS | Likely CNNs for SBVS/LBVS |
| Recursion Pharmaceuticals | HTS in cell culture | CNNs and multivariate ML |
| InSitro | HTS and ML bioinformatics | CNNs and ML for bioinformatics |
The flexibility of modern machine learning, aka deep learning, means that there are many different areas to apply a deep conv-net. From 3D convolutional networks applied to structure data, to multi-layer perceptrons trained on LBVS datasets, and even models leveraging generative adversarial training and reinforcement learning, there’s no shortage of different approaches to machine learning based drug discovery.
For most of these applications, deep neural networks offer a predictive shortcut around what would otherwise entail costly laboratory assays or complex multi-threaded physics simulations. In terms of computational resources, neural networks not only speed up prediction but also shift the computational requirements to depend more heavily on the high-performance GPUs that neural network primitives have been optimized for.
This shift enables drug discovery and development to take advantage of the mature development of tensor-based deep learning libraries like PyTorch and TensorFlow. Like the exponentially complex problem of fitting molecules together, designing an AI-enabled workflow involves choosing from an overabundance of options.