JUN 12, 2025

BitBIRCH: Efficiently Clustering 1 Billion Molecules

Abstract:

The widespread use of Machine Learning (ML) techniques in chemical applications has come with the pressing need to analyze extremely large molecular libraries. In particular, clustering remains one of the most common tools to dissect chemical space. Unfortunately, most current approaches have unfavorable time and memory scaling, which makes them unsuitable for million- and billion-sized sets. Here, we propose to bypass these problems with a time- and memory-efficient clustering algorithm, BitBIRCH. This method uses a tree structure similar to the one found in the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm to ensure O(N) time scaling. BitBIRCH leverages the instant similarity (iSIM) formalism to process binary fingerprints, allowing the use of Tanimoto similarity and reducing memory requirements. Our tests show that BitBIRCH is already more than 1,000 times faster than standard implementations of Taylor-Butina clustering for libraries with 1,500,000 molecules. Furthermore, we explore strategies to handle much larger sets, which we applied to cluster one billion molecules in under 5 hours using a parallel/iterative BitBIRCH approximation. To the best of our knowledge, this is the first time a billion molecules have been clustered. BitBIRCH increases efficiency without compromising the quality of the resulting clusters, as assessed with the Calinski-Harabasz and Davies-Bouldin indices. We also explore further applications of the BitBIRCH algorithm in chemical space exploration, segmentation, and drug-design pipelines. The problem of clustering billions of molecules was thought to be insurmountable with current algorithms; BitBIRCH not only makes it possible but accomplishes the task in just a few hours.
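
To make the role of Tanimoto similarity and iSIM more concrete, here is a minimal Python sketch. It is not the BitBIRCH implementation: the function names are hypothetical, fingerprints are plain lists of 0/1, and the iSIM expression follows one common reading of the formalism, in which a set-level Tanimoto value is computed from per-bit column sums rather than from all pairwise comparisons. The exact definitions used in the talk may differ.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity between two equal-length binary fingerprints."""
    both = sum(x & y for x, y in zip(a, b))    # bits on in both
    either = sum(x | y for x, y in zip(a, b))  # bits on in at least one
    return both / either if either else 0.0


def column_sums(fps):
    """Per-bit counts of 'on' bits across a set of fingerprints."""
    return [sum(col) for col in zip(*fps)]


def isim_tanimoto(sums, n):
    """Set-level Tanimoto from column sums (assumed iSIM-style formula).

    For each bit with k 'on' entries among n fingerprints, k*(k-1)/2 pairs
    share the bit and k*(n-k) pairs differ on it. Cost is linear in the
    number of bits and independent of the number of pairs.
    """
    shared = sum(k * (k - 1) / 2 for k in sums)
    differing = sum(k * (n - k) for k in sums)
    total = shared + differing
    return shared / total if total else 0.0


def mean_pairwise_tanimoto(fps):
    """Brute-force O(N^2) average, shown only for comparison."""
    pairs = list(combinations(fps, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Toy fingerprints (illustrative data, not from the talk).
fps = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
]
sums = column_sums(fps)
set_similarity = isim_tanimoto(sums, len(fps))
pairwise_average = mean_pairwise_tanimoto(fps)
```

In this sketch a set of fingerprints is summarized by its column sums and its size, so adding a molecule only requires adding its bits to the running sums. The set-level value and the brute-force pairwise average happen to coincide for this toy data, but they need not match in general.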

Speaker:

Vicky Jung, University of Florida

Vicky Jung is an undergraduate student at the University of Florida majoring in Data Science with a minor in Bioinformatics. She works in the Miranda-Quintana Lab, where she has mainly assisted with BitBIRCH, a novel approach to clustering in cheminformatics.