Artificial neural networks & random forest classification of druggable molecules and disease targets via scoring functions (sfs)

I. L. Hudson, S. Y. Leemaqz, A. D. Abell

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-reviewed

Abstract

In recent years, machine learning has played an increasing role in identifying druggable molecules. In particular, research has shown that random forests (RFs), recursive partitioning (RP), support vector machines (SVMs) and artificial neural networks (ANNs) are commonly employed in this arena. Expanding the range of disease-modifying targets amenable to pharmacological manipulation is vital to human health. Modelling disease targets allows for their prediction and prioritisation based on molecular characteristics and druggability. The aim of this paper is twofold: (i) to propose a computational method to identify druggable disease targets using combinations of molecular parameters (MPs), and (ii) to establish which of the ANN or RF procedures, and which scoring functions, best partition molecular and disease target space. Classifications by ANNs and RFs based on 8 MPs were performed to classify disease targets with high or low violator scores (using cutpoints 3, 4 or 5). The 8 MPs comprise the 4 traditional parameters of Lipinski's rule of five (Ro5) plus 4 extra parameters (polar surface area (PSA), number of rotatable bonds, number of rings, and number of N and O atoms), with a choice between 2 alternatives for lipophilicity, the distribution coefficient (log D) and the partition coefficient (log P) (Hudson et al., 2017; Zafar et al., 2013, 2016). For the molecule parameter (MP) data, RF performed better than ANNs, and the log D model of either score 4 or score 5 was optimal compared to the log P model. ANNs, however, were superior to the RF models for MP sets containing both log D and log P. For the RF score 4 log D model the most important variables were log D, molecular weight (MW) and number of rotatable bonds (ROT). The next best model via RF was score 5 log D, with its most important variables being PSA, log D and MW, according to mean decrease in Gini scores.
Overall, for the target data the RF models performed better than ANNs, with inclusion of log D being important. For the RF target models the score 5 partition performed best, with an AUC (95% CI) of 0.88 (0.63, 1.0) for all 3 models, and with the higher mean decrease Gini values (MDGs) attributable to the MPs (MW, NATOM, ROT, PSA, Hacceptors, NRING). The MP variables then chosen with lower MDGs were (log D, NATOM, NRING, log P, Hdonors), indicating that log D is superior to log P (variable importances (VIs), 2.14 > 1.47). The RF score 4 log D and log P models also performed equally well, with an AUC (95% CI) of 0.85 (0.70, 1.00), closely followed by the RF score 3 target models, score 3 log D and score 3 (log D + log P), which both did well with an AUC (95% CI) of 0.84 (0.73, 0.94). The ANN target-based score 4 log D model achieved the best classification, with an AUC (95% CI) of 0.89 (0.77, 1.0). In contrast, the score 4 log D + log P model performed the worst, with an AUC (95% CI) of 0.69 (0.51, 0.86). Similarly, for the RF analysis the score 4 log D + log P model performed worse, with an AUC (95% CI) of 0.83 (0.68, 0.92), whilst the separate score 4 log D or log P models classified equally well (0.85 (0.707, 1.0)).
All 3 cutpoint 3 ANN target models showed PSA to be highly important compared to MW. In contrast, MW is the most important variable for all RF target models and all cutpoints. Log D has greater VI than MW in the score 3 log D + log P ANN model (17.31 > 12.60). Also, in the score 3 log P ANN model, MW has the least VI, 6.46, compared to log P's VI of 17.15. Log D is more important than log P in the score 3 log D + log P model. For the optimal score 4 log D model, the top VIs are attributable to (PSA, log D, NRING, Hacceptors, MW), showing the strong influence of PSA and log D compared to the traditional MW.
The RP and ANN rules to classify the high score violators from the low confirmed the value of log D in the scoring function, validating Zafar et al. (2013, 2016) and the original MC/DA cutpoints for each MP by Hudson et al. (2017). Score functions of violations and best cutpoints to identify druggable molecules and targets were confirmed and shown to be associated with specific diseases. Our simple scoring functions of counts of violations partitioned chemospace well, identifying both good/poor druggable molecules and targets.
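
The scoring functions described in the abstract are counts of MP violations, split into high and low violators at a cutpoint of 3, 4 or 5. A minimal sketch of that idea follows. The abstract does not state the per-parameter limits, so the values in `MP_LIMITS` are assumptions: the first four follow the commonly quoted Ro5 limits, and the remaining four are hypothetical placeholders, not the MC/DA cutpoints of Hudson et al. (2017). The names `violation_score` and `violator_class`, and the choice of `>=` at the cutpoint, are likewise illustrative.

```python
from typing import Dict

# Assumed upper limits for the 8 molecular parameters (MPs).
# The first four follow the commonly quoted Ro5 limits; the last four are
# hypothetical placeholders, NOT the published MC/DA cutpoints.
MP_LIMITS: Dict[str, float] = {
    "mw": 500.0,           # molecular weight (Da)
    "lipophilicity": 5.0,  # log D or log P, whichever model is chosen
    "h_donors": 5,         # hydrogen-bond donors
    "h_acceptors": 10,     # hydrogen-bond acceptors
    "psa": 140.0,          # polar surface area (assumed limit)
    "rot": 10,             # rotatable bonds (assumed limit)
    "n_ring": 4,           # number of rings (assumed limit)
    "n_atom": 10,          # number of N and O atoms (assumed limit)
}


def violation_score(mol: Dict[str, float],
                    limits: Dict[str, float] = MP_LIMITS) -> int:
    """Count how many MPs of a molecule exceed their limit."""
    return sum(1 for name, limit in limits.items() if mol[name] > limit)


def violator_class(score: int, cutpoint: int) -> str:
    """Label a score 'high' if it reaches the cutpoint, else 'low'.

    Whether the boundary is inclusive is an assumption of this sketch.
    """
    if cutpoint not in (3, 4, 5):
        raise ValueError("cutpoint must be 3, 4 or 5")
    return "high" if score >= cutpoint else "low"


# Made-up molecule, for illustration only.
example = {
    "mw": 620.0, "lipophilicity": 5.6, "h_donors": 3, "h_acceptors": 12,
    "psa": 150.0, "rot": 8, "n_ring": 3, "n_atom": 9,
}
score = violation_score(example)
labels = {c: violator_class(score, c) for c in (3, 4, 5)}
print(score, labels)
```

The made-up molecule violates four of the assumed limits (MW, lipophilicity, H-bond acceptors, PSA), so it is labelled a high violator at cutpoints 3 and 4 and a low violator at cutpoint 5. The resulting binary labels are the kind of target an RF or ANN classifier would then be trained to predict from the MPs.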

Original language: English
Title of host publication: MODSIM2019, 23rd International Congress on Modelling and Simulation
Editors: S. Elsawah
Publisher: Modelling and Simulation Society of Australia and New Zealand Inc. (MSSANZ)
Pages: 28-34
Number of pages: 7
ISBN (Electronic): 9780975840092
Publication status: Published - Dec 2019
Externally published: Yes
Event: 23rd International Congress on Modelling and Simulation - Supporting Evidence-Based Decision Making: The Role of Modelling and Simulation, MODSIM 2019 - Canberra, Australia
Duration: 1 Dec 2019 to 6 Dec 2019

Publication series

Name: 23rd International Congress on Modelling and Simulation - Supporting Evidence-Based Decision Making: The Role of Modelling and Simulation, MODSIM 2019

Conference

Conference: 23rd International Congress on Modelling and Simulation - Supporting Evidence-Based Decision Making: The Role of Modelling and Simulation, MODSIM 2019
Country/Territory: Australia
City: Canberra
Period: 1/12/19 to 6/12/19

Bibliographical note

These proceedings are licensed under the terms of the Creative Commons Attribution 4.0 International CC BY License (http://creativecommons.org/licenses/by/4.0), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you attribute MSSANZ and the original author(s) and source, provide a link to the Creative Commons licence and indicate if changes were made. Images or other third party material are included in this licence, unless otherwise indicated in a credit line to the material.

Keywords

  • Disease targets
  • Machine learning
  • Score function druggability rules
