Nathan Brown
nathan.brown@novartis.com
The Application of Consensus Modelling and Genetic Algorithms to Interpretable Discriminant Analysis
Workshop Chemoinformatics in Europe: Research and Teaching
30th May 2006

Discriminant Analysis Using a GA
- Predictive vs. Diagnostic Modelling
- Discriminant Analysis with a Genetic Algorithm
- Consensus and Splice Modelling
- Experimental Studies
  - MDDR: 1130 renin and 636 COX inhibitors [1]
  - Oral drugs: 1082 FDA-approved drugs [2]
[1] Hert, J.; Willett, P.; Wilton, D. J.; Acklin, P.; Azzaoui, K.; Jacoby, E.; Schuffenhauer, A. Comparison of Fingerprint-Based Methods for Virtual Screening Using Multiple Bioactive Reference Structures. J. Chem. Inf. Comput. Sci. 2004, 44, 1177-1185.
[2] Vieth, M.; Siegel, M. G.; Higgs, R. E.; Watson, I. A.; Robertson, D. H.; Savin, K. A.; Durst, G. L.; Hipskind, P. A. Characteristic Physical Properties and Structural Fragments of Marketed Oral Drugs. J. Med. Chem. 2004, 47, 224-232.
6 June 2006

Predictive versus Diagnostic Models
- Highly predictive models tend to obfuscate what is important for the property being modelled
- Highly interpretable models tend to have less predictive power
- However, both objectives are very important
- We want highly predictive models that can also guide our decision-making processes
* Adapted from a diagram by Richard Lewis

Discriminant Analysis [1]
- Supervised learning
  - Dependent variable is known for the dataset and used in training the model with the independent variables
- Optimize separation of classes
  - Evolve weights for binned descriptors
  - Score solutions according to ability to separate objects
- Discover descriptor ranges that are important for discrimination, which can then be applied to make informed decisions
[1] Gillet, V. J.; Willett, P.; Bradshaw, J. Identification of Biological Activity Profiles Using Substructural Analysis and Genetic Algorithms. J. Chem. Inf. Comput. Sci. 1998, 38, 165-179.

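The evolve-and-score loop above can be sketched as a very small genetic algorithm. This is an illustration only, not the implementation used in the study: the function names, tournament selection, one-point crossover, point mutation, elitism, the weight range 0 to 9 and the score-difference fitness are all assumptions made for this sketch. Each molecule is represented by the chromosome positions (bins) that its descriptor values fall into.

```python
import random


def score(weights, positions):
    """Score a molecule as the sum of the weights of the bins it occupies."""
    return sum(weights[p] for p in positions)


def fitness(weights, actives, inactives):
    """Assumed fitness: mean active score minus mean inactive score."""
    mean_active = sum(score(weights, m) for m in actives) / len(actives)
    mean_inactive = sum(score(weights, m) for m in inactives) / len(inactives)
    return mean_active - mean_inactive


def evolve(actives, inactives, length, max_weight=9, pop_size=20,
           generations=30, seed=0):
    """Evolve one weight per bin; returns the fittest chromosome found."""
    rng = random.Random(seed)

    def fit(chromosome):
        return fitness(chromosome, actives, inactives)

    population = [[rng.randint(0, max_weight) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Elitism: the current best chromosome always survives unchanged.
        offspring = [max(population, key=fit)[:]]
        while len(offspring) < pop_size:
            # Tournament selection of two parents (tournament size 3).
            parent_a = max(rng.sample(population, 3), key=fit)
            parent_b = max(rng.sample(population, 3), key=fit)
            # One-point crossover followed by a single point mutation.
            cut = rng.randrange(1, length)
            child = parent_a[:cut] + parent_b[cut:]
            child[rng.randrange(length)] = rng.randint(0, max_weight)
            offspring.append(child)
        population = offspring
    return max(population, key=fit)


# Toy data: two descriptors with two bins each (positions 0-1 and 2-3).
actives = [[0, 2], [0, 3]]
inactives = [[1, 3], [1, 2]]
best = evolve(actives, inactives, length=4)
```

In this toy data the actives always occupy position 0 and the inactives position 1, so a good chromosome gives position 0 a high weight and position 1 a low one.
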
Chromosome Encoding
- Selection of N descriptors
- Each descriptor partitioned into B_i bins
- Each bin can take any value in the range 0 to W
- Chromosome length is then $\sum_{i=1}^{N} B_i$
- Example with the descriptors PSA, MW and ClogP: N = 3, B = {4, 7, 5}

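A minimal sketch of this encoding, assuming the chromosome is a flat list in which each descriptor owns a contiguous block of bins. The helper names, the bin edges and the descriptor values below are hypothetical and chosen only to make the example concrete.

```python
from bisect import bisect_right
from itertools import accumulate


def chromosome_length(bins_per_descriptor):
    """Total number of weights: the sum of B_i over all descriptors."""
    return sum(bins_per_descriptor)


def bin_offsets(bins_per_descriptor):
    """Start position of each descriptor's block within the chromosome."""
    return [0] + list(accumulate(bins_per_descriptor))[:-1]


def bin_index(value, edges):
    """Bin containing value; edges are the interior bin boundaries."""
    return bisect_right(edges, value)


def score_molecule(chromosome, values, edges_per_descriptor):
    """Sum the weights of the bins that the descriptor values fall into."""
    bins = [len(edges) + 1 for edges in edges_per_descriptor]
    offsets = bin_offsets(bins)
    return sum(chromosome[offset + bin_index(value, edges)]
               for value, edges, offset
               in zip(values, edges_per_descriptor, offsets))


# B = {4, 7, 5} as on the slide; the edges themselves are made up.
edges = [
    [60, 90, 140],                   # 4 bins
    [200, 250, 300, 350, 400, 500],  # 7 bins
    [0, 1, 3, 5],                    # 5 bins
]
weights = list(range(16))  # placeholder weights: weight equals position
total = score_molecule(weights, (75, 320, 2.0), edges)
```

With B = {4, 7, 5} the chromosome holds 4 + 7 + 5 = 16 weights, and the three blocks start at positions 0, 4 and 11.
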
Descriptor Selection
- Calculate physicochemical descriptors
- Cluster descriptors (not objects)
- Select descriptors that are:
  - more orthogonal, and
  - more interpretable for the medicinal chemist
- Some dataset dependency

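One way to cluster descriptors rather than objects is to group descriptors whose values are strongly correlated across the dataset and then pick one interpretable representative per group. The slide does not name the clustering method, so the leader-style clustering on absolute Pearson correlation below, the 0.8 threshold and the toy values are all assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def cluster_descriptors(columns, threshold=0.8):
    """Leader clustering of descriptor columns.

    A descriptor joins the first cluster whose leader it correlates with
    at |r| >= threshold; otherwise it starts a new cluster.
    """
    clusters = []
    for name, values in columns.items():
        for cluster in clusters:
            leader = columns[cluster[0]]
            if abs(pearson(values, leader)) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Made-up descriptor values for four molecules.
columns = {
    "MW": [100, 200, 300, 400],
    "HeavyAtoms": [7, 14, 22, 28],
    "ClogP": [3, 1, 4, 2],
}
clusters = cluster_descriptors(columns)
```

Here MW and HeavyAtoms end up in one cluster and ClogP in another, so a chemist would keep one descriptor from each.
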
Fitness Functions 1
- Initial Enhancement (IE)
  - Emphasises enrichment in top NACT% of recalled molecules
  - i.e. mean rank of all actives recalled after NACT
- Global Enhancement (GE)
  - Emphasises enrichment of all actives in recalled molecules
  - i.e. mean rank of all actives
- Maximum Difference Enhancement (MDE)
  - Emphasises maximum difference in scores between the two classes

Fitness Functions 2
- Existing fitness function used a combination of evaluations:
  - Number of actives in the top N%
  - Average rank of actives over the entire ranking
- Maximised Difference Enhancement (MDE)
  - The difference of the average rank of the two classes being discriminated
- MDE tends to produce rankings in which the separation between the two classes is maximised globally, while also rewarding rankings that place more of the interesting molecules near the top

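The rank-based evaluations named on the two fitness-function slides can be sketched as follows. The slides give no normalisation, so these functions return the raw quantities only (mean rank, count in the top N%, difference of mean ranks) and are not the enhancement values plotted in the results. Function names and the toy scores are assumptions.

```python
from math import ceil


def ranked_labels(scores, labels):
    """Class labels (1 = active, 0 = inactive) ordered by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def mean_rank(ranked, target):
    """Mean 1-based rank of the molecules whose label equals target."""
    ranks = [pos for pos, label in enumerate(ranked, start=1)
             if label == target]
    return sum(ranks) / len(ranks)


def actives_in_top(ranked, percent):
    """Number of actives in the top percent% of the ranking."""
    cutoff = ceil(len(ranked) * percent / 100)
    return sum(ranked[:cutoff])


def mean_rank_difference(ranked):
    """Mean inactive rank minus mean active rank (larger is better)."""
    return mean_rank(ranked, 0) - mean_rank(ranked, 1)


# Toy ranking of six molecules.
scores = [9, 7, 6, 4, 3, 1]
labels = [1, 1, 0, 1, 0, 0]
ranked = ranked_labels(scores, labels)
```

For this toy ranking the actives sit at ranks 1, 2 and 4, so their mean rank is 7/3, two of them fall in the top 50%, and the inactives rank on average 7/3 places lower.
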
Consensus Models
- Aim to reduce stochastic effects of using a single chromosome

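The slide does not say how the consensus is formed. One plausible reading, assumed here purely for illustration, is to combine the chromosomes from several independent GA runs by averaging their weights position by position. When a molecule's score is the sum of its bin weights, this is equivalent to averaging the scores of the individual models.

```python
def consensus_chromosome(chromosomes):
    """Average the weights of several chromosomes position by position."""
    n = len(chromosomes)
    return [sum(weights) / n for weights in zip(*chromosomes)]


def score(chromosome, positions):
    """Score a molecule as the sum of the weights of the bins it occupies."""
    return sum(chromosome[p] for p in positions)


# Three hypothetical chromosomes from independent runs (4 bins each).
runs = [
    [4, 0, 2, 6],
    [2, 2, 4, 6],
    [0, 4, 0, 6],
]
consensus = consensus_chromosome(runs)

# A molecule occupying bins 0 and 3.
molecule = [0, 3]
consensus_score = score(consensus, molecule)
mean_of_scores = sum(score(c, molecule) for c in runs) / len(runs)
```

Positions on which the runs disagree are pulled towards the middle, while positions on which they agree (the last bin here) keep their weight.
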
Splice Models
- Essentially a manual recombination operator that produces a better model based on feedback and intuition

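A sketch of manual recombination, assuming that a splice copies whole descriptor blocks from different source chromosomes into one new chromosome. The function name, the block layout and the choice of sources are hypothetical; in practice the choice would come from the modeller's feedback and intuition.

```python
from itertools import accumulate


def splice(chromosomes, bins_per_descriptor, source_for_descriptor):
    """Build a chromosome by taking each descriptor block from a chosen source.

    source_for_descriptor[d] is the index of the chromosome that supplies
    the weights for descriptor d.
    """
    bounds = [0] + list(accumulate(bins_per_descriptor))
    spliced = []
    for d, source in enumerate(source_for_descriptor):
        spliced.extend(chromosomes[source][bounds[d]:bounds[d + 1]])
    return spliced


# Two hypothetical models over three descriptors with 2, 3 and 1 bins.
model_a = [10, 11, 12, 13, 14, 15]
model_b = [20, 21, 22, 23, 24, 25]
spliced = splice([model_a, model_b], [2, 3, 1], [0, 1, 0])
```

Here the first and third descriptor blocks come from model_a and the second from model_b.
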
Renin Consensus Discrimination Model
[Figure: Global Enhancement (y-axis, 0 to 6) by Model (x-axis: M1TR, M1TS1, M1TS2, CM1TS1, CM1TS2, M5TR, M5TS1, M5TS2, CM5TS1, CM5TS2, M6TR, M6TS1, M6TS2, CM6TS1, CM6TS2). Series: Single Model Result, Consensus Model Result.]

Renin Splice Discrimination Model
[Figure: Global Enhancement (y-axis, 0 to 7) by Model (x-axis: M1TS1, M1TS2, CM1TS1, CM1TS2, M5TS1, M5TS2, CM5TS1, CM5TS2, M6TS1, M6TS2, CM6TS1, CM6TS2, SM6TR, SM6TS1, SM6TS2). Series: Single Model Result, Consensus Model Result, Splice Model Result.]

COX Splice Discrimination Model
[Figure: Global Enhancement (y-axis, 0 to 3.5) by Model (x-axis: M1TR, M1TS1, M1TS2, CM1TS1, CM1TS2, M2TR, M2TS1, M2TS2, CM2TS1, CM2TS2). Series: Single Model Result, Consensus Model Result.]

Comparative Study: Oral vs. Non-Oral Drugs
- Oral vs. non-oral drugs dataset [1]
- GA model compared with models generated with
  - Naïve Bayes Classifier (NBC)
  - Support Vector Machines (SVM)
- Investigating:
  - Consistency of results
  - Interpretation of models
[1] Vieth, M.; Siegel, M. G.; Higgs, R. E.; Watson, I. A.; Robertson, D. H.; Savin, K. A.; Durst, G. L.; Hipskind, P. A. Characteristic Physical Properties and Structural Fragments of Marketed Oral Drugs. J. Med. Chem. 2004, 47, 224-232.

Oral Drug Discrimination Model
[Figure: Global Enhancement (y-axis, 0.0 to 1.4) for Training Set, Test Set 1 and Test Set 2. Series: GA, NBC, SVM.]

Model Interpretability
- Model weights indicate
  - Important descriptors
  - Important ranges
- Used to guide decision-making processes
  - Similarity searching
  - Filtering rules
- Rules are focused on domain of interest

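Reading filtering rules off the model weights could look like the sketch below: keep, for each descriptor, the value ranges whose bins carry a high weight, then test whether a molecule falls inside a kept range for every descriptor. The weight threshold, bin ranges, descriptor values and function names are all hypothetical.

```python
def important_ranges(chromosome, descriptors, min_weight):
    """Map each descriptor to the bin ranges weighted at least min_weight.

    descriptors is a list of (name, bin_ranges) in chromosome order, where
    bin_ranges is a list of (low, high) pairs, one per bin.
    """
    rules = {}
    position = 0
    for name, bin_ranges in descriptors:
        block = chromosome[position:position + len(bin_ranges)]
        kept = [rng for weight, rng in zip(block, bin_ranges)
                if weight >= min_weight]
        if kept:
            rules[name] = kept
        position += len(bin_ranges)
    return rules


def passes(rules, molecule):
    """True if every ruled descriptor value lies in one of its kept ranges."""
    return all(any(low <= molecule[name] < high for low, high in ranges)
               for name, ranges in rules.items())


# Hypothetical model: three MW bins followed by two ClogP bins.
descriptors = [
    ("MW", [(0, 200), (200, 400), (400, 600)]),
    ("ClogP", [(-2, 2), (2, 6)]),
]
chromosome = [1, 8, 9, 7, 0]
rules = important_ranges(chromosome, descriptors, min_weight=7)
```

For this made-up model the rules keep MW from 200 to 600 and ClogP from -2 to 2, so a molecule with MW 350 and ClogP 1.0 passes while one with MW 150 does not.
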
Conclusions [1]
- Consensus and splice models provide consistently improved results
- GA models provide interpretability greater than or similar to that of the other methods applied here
  - Models are transparent as to which descriptors and their ranges are of greatest importance in discriminating
- Indications that the GA and NBC methods could be applied in combination
  - Investigation of complementarity
[1] Ganguly, M.; Brown, N.; Schuffenhauer, A.; Ertl, P.; Gillet, V. J.; Greenidge, P. A. Introducing the Consensus Modeling Concept in Genetic Algorithms: Application to Interpretable Discrimination Analysis. Submitted to J. Chem. Inf. Model.

Areas the Student Covered
- Cluster analysis
- Druglikeness
- Discriminant analysis
- Variable selection
- Genetic algorithms
- Statistical learning methods
- Java programming
- Method development

What does the student gain?
- Coding and adapting software
- Tackling everyday challenges of research
- Performing research in industry
- Application-context drug research
- Empowered to pursue their own research

What do the mentors gain?
- Freedom to pursue an avenue of interest
- Developing skills in student mentoring
- A new viewpoint with new ideas
- Assisting in training the next generation of scientists

Acknowledgements
- University of Sheffield: Milan Ganguly, Val Gillet, Peter Willett
- UCSF: Jérôme Hert
- Cheminformatics: Peter Ertl, Stephen Jelfs
- Computer-Aided Drug Discovery: Paulette Greenidge, Richard Lewis, Nikolaus Stiefl
- Molecular & Library Informatics: Kamal Azzaoui, Edgar Jacoby, Ansgar Schuffenhauer