Trends in modern hit discovery: How your ultra-large screens can benefit from machine learning


Matt Repasky
Senior Vice President


While traditional structure-based virtual screening has been successful in finding diverse hits to advance projects there is significant room for improvement of hit rates, diversity of hit chemotypes, available IP space explored, and the potency of unoptimized hits. Ultra-large, on-demand synthesizable libraries from vendors have enabled ~100x expansion of purchasable compound space, now billions of compounds, while DNA encoded libraries (DEL) can be even larger. In order to screen these much larger chemical spaces in the billions of compounds, results of two machine learning enabled approaches are described that make it easy and cost effective to find novel hits through virtual and DEL screens of billion compound plus libraries. DNA encoded libraries (DEL) enable screening billions of synthesized compounds but are limited due to high rates of experimental false negatives and positives. Employing machine learning trained to experimental DEL results we demonstrate significantly reduced false negative rates while identifying byproducts in a more favorable property space. To enable efficient, extrapolative chemical space exploration with an accurate docking scoring function, we have developed an active learning-based method employing AutoQSAR/DC machine learning and Glide SP docking as the learner. Results from Active Learning Glide screening of 100 million to billion compound screens show increased chemical diversity and GlideScore of hits relative to brute force screening of subsets of the libraries. Results and costs from these two new methods suggest billion compound library screens could replace smaller, traditional screens commonly employed today.