AutoQSAR: An Automated Tool for Best-Practice QSAR Modeling

The development of a drug candidate can be thought of as a multi-parameter optimization problem where ADME/Tox liabilities are balanced against the need for high protein affinity and desired target selectivity. One common approach for in silico modeling of many of these dependent properties is to employ Quantitative Structure-Activity Relationships (QSAR), where classifier or continuous models are trained to reproduce experimental data using the relationships between a number of independent descriptors for each compound.

QSAR models have traditionally been manually created and validated, requiring significant human time and effort for each property to be modeled. Creation of high-quality models has also generally been an art, often requiring QSAR expertise. To increase productivity and thus lower costs, there is a need to speed the creation and validation of high-quality, predictive QSAR models. For QSAR experts, one approach is to automatically explore descriptor and fitting methodology space first, enabling them to bring their experience to bear while minimizing time lost pursuing approaches that are unlikely to succeed. For those most familiar with the data to be modeled, generally not QSAR experts, there is the need to generate and apply high-quality, predictive QSAR models with confidence, avoiding common problems such as overfitting or high correlation among independent descriptors.

Several trends in the drug discovery and QSAR modeling fields have aligned to enable significant productivity gains from automation of QSAR model creation and validation. The process to be automated is now well characterized thanks to Organization for Economic Co-operation and Development (OECD) recommendations[1] for QSAR models used for regulatory purposes. These state that models must have:

  • A defined end point
  • An unambiguous algorithm
  • A defined domain of applicability
  • Appropriate measures for goodness of fit, robustness, and predictivity
  • If possible, a mechanistic interpretation

Ever larger amounts of public data have become available in recent years,[2] including data of pharmaceutical interest in repositories such as ChEMBL, PubChem, and ChemSpider. This enables the creation of models for more end points with broader applicability domains than ever before, though care must be taken to ensure data adequacy. Retrospective analysis of QSAR model performance over time in drug discovery projects has shown that updating QSAR models as more data become available leads to more accurate and useful predictions.[3,4] Finally, evidence suggests there is no single best descriptor set and fitting technology for all data sets. Automation is well suited to identifying the ideal combination of descriptors and fitting techniques for a given dataset via brute-force evaluation of many possibilities.

The AutoQSAR methodology was designed from the ground up to enable both QSAR experts and non-experts to efficiently and reliably create high-quality, predictive QSAR models. It was born out of the need to efficiently create QSAR models for use in Schrödinger's various research activities. We have demonstrated the effectiveness and accuracy of AutoQSAR models by comparing their performance against models from the literature, where the same datasets were used in developing models for both approaches.[5] The goal of that work was not to highlight the models themselves, but rather to demonstrate that an automated approach to QSAR model creation can reliably generate models with performance comparable to published QSAR models.

By default AutoQSAR systematically applies a consistent set of descriptors that the method knows how to generate and a consistent set of machine learning approaches that it knows how to apply. Thus, making predictions for new compounds given an AutoQSAR model requires no understanding of the descriptors needed to apply the model or of how to make the prediction with a given machine learning method. Only a QSAR model file and a ligand structure in 1D, 2D, or 3D format are required to make a prediction for that ligand. This makes integration of AutoQSAR models into existing informatics workflows very easy and cost effective. Furthermore, it facilitates automatic updating of QSAR models as more end point data become available during the project's lifecycle, which has been shown in lead optimization projects to improve QSAR model prediction quality.[3,4]

While AutoQSAR was developed with small molecules in mind, it can be used to create QSAR models for a broad range of endpoints including protein viscosity and solubility and properties of interest to material science. Users can provide their own chemical descriptors to use in addition to or instead of the default set applied by AutoQSAR. For example, QSAR modeling of polypeptides can be performed using AutoQSAR along with a set of protein-based fingerprints.

AutoQSAR Methodology

An overview of the AutoQSAR workflow is shown in Figure 1 and has been fully described in ref. [5]. Given a learning set of chemical structures and an activity property, or other dependent variable, 497 physicochemical and topological descriptors are computed, along with a variety of Canvas fingerprints,[6] yielding a large pool of independent variables from which to build models. Only 2D information is required to compute the descriptors and fingerprints; SMILES strings are a suitable source of input.

Figure 1. A schematic of the AutoQSAR workflow

It is strongly recommended that datasets be curated and standardized before performing QSAR. AutoQSAR does not modify input structures in the learning or external validation sets. We recommend performing these steps with a tool such as LigPrep followed by Canvas for duplicate elimination or by providing input structures in SMILES format.

Since the descriptors typically contain a high degree of redundancy, a feature selection procedure is carried out to identify a smaller subset of descriptors. Both this subset of descriptors and the fingerprints are then used independently with a variety of machine learning methods that train against either continuous or categorical dependent variables. For continuous data, QSAR models are created using kernel-based PLS, principal components regression, partial least squares regression, or best subsets multiple linear regression. For categorical data, naïve Bayes classification or ensemble recursive partitioning are employed. In either case, a large number of models are built and validated using many different random training/test set splits of the learning set. The performance of a given model on its training and test sets is used to compute a score (see Fig. 2) that assesses the quality of the model. We strongly recommend use of an external validation (hold out) set for model validation.
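
The split-and-score step can be illustrated with a minimal sketch. This is not AutoQSAR's implementation: a one-variable least-squares fit stands in for the real fitters, and the composite score here (a simple average of training- and test-set R2) is a hypothetical placeholder for the actual scoring function of Fig. 2.

```python
import random

def r_squared(y_true, y_pred):
    """Coefficient of determination."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b (stand-in for the real fitters)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return lambda x: a * x + b

def evaluate_splits(xs, ys, n_splits=10, train_frac=0.75, seed=0):
    """Build a model on each of many random training/test splits of the
    learning set and score it using both training- and test-set accuracy."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    scores = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        cut = int(train_frac * len(idx))
        tr, te = idx[:cut], idx[cut:]
        model = fit_line([xs[i] for i in tr], [ys[i] for i in tr])
        r2_train = r_squared([ys[i] for i in tr], [model(xs[i]) for i in tr])
        r2_test = r_squared([ys[i] for i in te], [model(xs[i]) for i in te])
        # Hypothetical composite score; the real scoring function differs.
        scores.append(0.5 * (r2_train + r2_test))
    return scores
```

Scoring against both the training and test sets penalizes models that fit the training data well but generalize poorly.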

Figure 2. Graphical representation of the scoring function used to rank AutoQSAR models. The z-axis (vertical) represents the AutoQSAR score based on the accuracy of the training and test set predictions.

Once the top n models have been collected, predictions for new structures are generated, with the necessary descriptors and fingerprints automatically created by the application. Predictions may be obtained from an individual model, or from a consensus of two or more models in the top n. For a continuously valued dependent variable, a consensus prediction is the arithmetic mean of the predictions from the different models. For a categorical dependent variable, the category receiving the most votes among all models is assigned, with ties being broken by the average probability per category.
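
The two consensus rules described above can be sketched as follows. This is an illustrative reimplementation, not AutoQSAR code; each categorical model is assumed to report a per-category probability dictionary.

```python
from collections import Counter
from statistics import mean

def consensus_continuous(predictions):
    """Consensus for a continuous end point: arithmetic mean over the top-n models."""
    return mean(predictions)

def consensus_categorical(model_outputs):
    """Consensus for a categorical end point. Each model votes for its
    highest-probability category; the category with the most votes wins,
    with ties broken by the average probability per category."""
    votes = Counter(max(probs, key=probs.get) for probs in model_outputs)
    top = max(votes.values())
    tied = [cat for cat, v in votes.items() if v == top]
    if len(tied) == 1:
        return tied[0]
    avg = {cat: mean(p.get(cat, 0.0) for p in model_outputs) for cat in tied}
    return max(avg, key=avg.get)
```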

To recognize potential outliers, each prediction includes an estimate of whether the predicted compound falls within the applicability domain of the AutoQSAR model. Compounds that fall outside the similarity range of 95% of training set compounds are flagged as potential outliers. Here the Tanimoto similarity to a modal dendritic fingerprint built from all training set structures (the logical OR of all on bits) is employed. Generally, this approach is most useful when developing local models and of less utility for global models, where the structural diversity of the training set compounds can be very large.
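
The modal-fingerprint domain check can be sketched as below. Fingerprints are represented here simply as sets of on-bit indices; the actual Canvas dendritic fingerprints and the exact thresholding used by AutoQSAR may differ from this illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between fingerprints given as sets of on-bit indices."""
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

def modal_fingerprint(training_fps):
    """Modal fingerprint: logical OR (set union) of all on bits in the training set."""
    modal = set()
    for fp in training_fps:
        modal |= fp
    return modal

def flag_outliers(training_fps, query_fps):
    """Flag queries less similar to the modal fingerprint than 95% of the
    training set compounds themselves (5th-percentile similarity cutoff)."""
    modal = modal_fingerprint(training_fps)
    sims = sorted(tanimoto(fp, modal) for fp in training_fps)
    threshold = sims[int(0.05 * len(sims))]
    return [tanimoto(fp, modal) < threshold for fp in query_fps]
```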

Performance of AutoQSAR Models

To evaluate the ability of AutoQSAR to generate predictive QSAR models for diverse properties, six data sets were taken directly from published QSAR models that satisfied the following criteria: (1) an end point of interest to the pharmaceutical industry, (2) all molecules provided or specified unambiguously, and (3) sufficient information available to compare model performance, preferably in the form of an external validation set. Data sets covering six end points from the literature have been identified: three for toxicity (bioconcentration factor in fish, mutagenicity, and carcinogenicity), two for traditional ADME (blood-brain barrier permeability (logBB) and solubility (logS)), and binding affinities for ten lead optimization series spanning seven protein targets.

As shown in Fig. 3, the average Q2 values for binding affinity prediction across the external validation sets for the data in the original paper,[7] regenerated in Canvas KPLS with the latest fingerprint and descriptor calculators, and with AutoQSAR, are quite consistent at 0.55, 0.54, and 0.54, respectively. Six of the 10 models yielded a Q2 greater than 0.5, which is a generally accepted threshold for satisfactory performance.
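
For reference, the external Q2 statistic reported here can be computed as 1 - PRESS/SS. Conventions vary in the literature (e.g., whether SS is taken about the training-set or validation-set mean); the sketch below uses the mean of the observed validation-set values and is an illustration rather than the exact formula from the cited works.

```python
def q_squared(y_obs, y_pred):
    """External Q2 = 1 - PRESS/SS for an external validation set,
    with SS taken about the mean of the observed values."""
    mean_obs = sum(y_obs) / len(y_obs)
    press = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))  # prediction error
    ss = sum((o - mean_obs) ** 2 for o in y_obs)              # total variance
    return 1.0 - press / ss
```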

Figure 3. Correlations to experimental binding data for AutoQSAR and literature KPLS-based models.

For the comparison of categorical data we use the accuracy, sensitivity, and specificity, defined for a two-class model in terms of correct predictions of class, i.e. true positives and true negatives (TP, TN), and incorrect predictions of class, i.e. false positives and false negatives (FP, FN), as follows.

Accuracy = (TP+TN)/(TP+TN+FP+FN)*100
Sensitivity = TP/(TP+FN)*100
Specificity = TN/(TN+FP)*100
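
These three statistics follow directly from the confusion-matrix counts, as in this short sketch:

```python
def cooper_statistics(tp, tn, fp, fn):
    """Accuracy, sensitivity, and specificity (in percent) for a two-class
    model, computed from true/false positive and negative counts."""
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    sensitivity = 100.0 * tp / (tp + fn)  # fraction of actives recovered
    specificity = 100.0 * tn / (tn + fp)  # fraction of inactives recovered
    return accuracy, sensitivity, specificity
```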

As was done in ref. [8], a categorical model was first created to predict blood-brain barrier permeability, and compounds were then classified as poor (logBB<-1.0) or good (logBB≥-1.0) penetrators based on the quantitative model predictions. Results for the external (held out) validation set are shown in Figure 4. In short, both the top-ranked and consensus models from AutoQSAR performed well in separating BBB-permeable compounds, with good sensitivity and specificity. While not quite as accurate as the best model reported in [8], the AutoQSAR models show an excellent balance of sensitivity and specificity, avoiding the poor performance in this aspect of the published SVM-MOE model.

Figure 4. Accuracy, sensitivity, and specificity for Blood Brain Barrier models. The accuracy for kNN-MOE is estimated from Figure 1a. in [8].

Performance of the consensus AutoQSAR solubility model for the learning and external validation sets is shown in Figure 5. Both the consensus and top-ranked AutoQSAR models perform well with R2 of 0.88 and 0.89 and RMSEs of 0.70 and 0.69, respectively, for the learning set. Performance of the consensus and top-ranked AutoQSAR models on the external validation set are similar to the learning set, with Q2 of 0.89 and 0.87 and RMSEs of 0.70 and 0.77, respectively. Both compare favorably to the models published in [9] where R2 of 0.88 for the ASMS model and 0.90 for the ASMS-LOGP model were reported when training against all 1708 compounds.

Figure 5. Plots of calculated versus experimental solubility of 1708 molecules for the AutoQSAR consensus model and the ASMS-LOGP model from the literature.

A comparison of the sensitivity, specificity, and accuracy of carcinogenicity models from [10] is shown in Figure 6. Using the default mixture of fingerprint and topological-based descriptors, consensus and top-ranked AutoQSAR models with good sensitivity were obtained. However, their specificities are low. A closer examination showed that the consensus model utilized the topological-based descriptors only. Better-performing models were obtained when generating models using only fingerprint descriptors. The two models from [10] differ in the descriptors employed, showing accuracies of 0.73 for MDL descriptors and 0.69 for Dragon descriptors, with better sensitivity and mixed specificity relative to AutoQSAR. The importance of sensitivity in models used for regulatory purposes has previously been reported.[11] In that context, the AutoQSAR models employing topological descriptors only are preferred; their performance matches that of the models from [10], although the latter show slightly superior accuracy to either the topological-descriptor-only or fingerprint-only AutoQSAR models.

Figure 6. Cooper statistics for AutoQSAR and published models of carcinogenicity.

The performance of AutoQSAR in classifying compounds as mutagenic or non-mutagenic is compared to models from [12] in Figure 7. Good performance in separating mutagenic from non-mutagenic compounds is obtained with the consensus models generated with all descriptors (topological-based and fingerprints) and with those excluding fingerprints. The performance of all models is similar to the roughly 85% reliability of the experimental Ames mutagenicity test.[13] The AutoQSAR models perform as well as or better than the models in [12] on this external validation set, though the published approaches are far more sophisticated, including multi-stage modeling techniques, and undoubtedly required more investment of human capital to create.

Figure 7. Cooper statistics for AutoQSAR and published models of mutagenicity.

A comparison of Q2 values for the external validation set for fish BCF is provided in Figure 8. Using topological descriptors and fingerprints, the consensus model identified one significant outlier for which the predicted logBCF was clearly off the scale (see ref. [5] for details). Once the outlier was removed, the consensus model achieved a very reasonable Q2 of 0.5, close to the performance of the CAESAR model (Q2 = 0.57). Although the AutoQSAR models performed somewhat worse than the published models, they were generated with minimal human intervention, while the published model employed more elaborate feature selection and required programming knowledge in R and MATLAB.

Figure 8. Comparison of Q2 values of different models between the model predictions and experimental log BCF.

In terms of predictive power, we find AutoQSAR models performed as well as models from the literature for binding affinity, solubility, blood-brain barrier permeability, and mutagenicity. For carcinogenicity and BCF the literature models were somewhat superior, though they required significantly more human capital to create than the AutoQSAR models.

Conclusions

AutoQSAR facilitates the creation of high-quality, predictive QSAR models and is of use to QSAR experts and non-experts alike. It encodes current QSAR best-practice methods in a fully automated workflow with demonstrated ability to create predictive models of continuously valued and categorical data for a diverse range of end points. It is easily integrated into existing informatics platforms and enables significant cost savings by boosting productivity in QSAR modeling applications.

References

  1. Organisation for Economic Co-Operation and Development. OECD Principles for the Validation, for Regulatory Purposes, of (Quantitative) Structure-Activity Relationship Models. Accessed Apr 21, 2016
  2. Cherkasov A, Muratov EN, Fourches D, Varnek A, Baskin, II, Cronin M, Dearden J, Gramatica P, Martin YC, Todeschini R, Consonni V, Kuz'min VE, Cramer R, Benigni R, Yang C, Rathman J, Terfloth L, Gasteiger J, Richard A, Tropsha A (2014) J Med Chem 57(12):4977 
  3. Cumming JG, Davis AM, Muresan S, Haeberlein M, Chen H (2013) Nat Rev Drug Discov 12(12):948 
  4. Rodgers SL, Davis AM, Tomkinson NP, van de Waterbeemd H (2011) Molecular Informatics 30(2-3):256 
  5. Dixon SL, Duan J, Smith E, Von Bargen CD, Sherman W, Repasky MP (2016) Future Med Chem 8(15):1825 
  6. Sastry M, Lowrie JF, Dixon SL, Sherman W (2010) J Chem Inf Model 50(5):771 
  7. An Y, Sherman W, Dixon SL (2013) J Chem Inf Model 53(9):2312 
  8. Zhang L, Zhu H, Oprea TI, Golbraikh A, Tropsha A (2008) Pharm Res 25(8):1902 
  9. Wang J, Krudy G, Hou T, Zhang W, Holland G, Xu X (2007) J Chem Inf Model 47(4):1395 
  10. Fjodorova N, Vracko M, Novic M, Roncaglioni A, Benfenati E (2010) Chem Cent J 4 Suppl 1:S3
  11. Benfenati E, Benigni R, Demarini DM, Helma C, Kirkland D, Martin TM, Mazzatorta P, Ouedraogo-Arras G, Richard AM, Schilter B, Schoonen WG, Snyder RD, Yang C (2009) J Environ Sci Health C Environ Carcinog Ecotoxicol Rev 27(2):57 
  12. Ferrari T, Gini G (2010) Chem Cent J 4 Suppl 1:S2
  13. Piegorsch WW, Zeiger E (1991) Measuring Intra-Assay Agreement for the Ames Salmonella Assay. Springer Berlin Heidelberg, Berlin, Heidelberg