Nathan Brown. The Application of Consensus Modelling and Genetic Algorithms to Interpretable Discriminant Analysis. nathan.brown@novartis.



Transcription:

Nathan Brown (nathan.brown@novartis.com)
The Application of Consensus Modelling and Genetic Algorithms to Interpretable Discriminant Analysis
Workshop: Chemoinformatics in Europe: Research and Teaching, 30th May 2006

Discriminant Analysis Using a GA
- Predictive vs. Diagnostic Modelling
- Discriminant Analysis with a Genetic Algorithm
- Consensus and Splice Modelling
- Experimental Studies
  - MDDR: 1130 renin and 636 COX inhibitors [1]
  - Oral drugs: 1082 FDA-approved drugs [2]
1. Hert, J.; Willett, P.; Wilton, D. J.; Acklin, P.; Azzaoui, K.; Jacoby, E.; Schuffenhauer, A. Comparison of Fingerprint-Based Methods for Virtual Screening Using Multiple Bioactive Reference Structures. J. Chem. Inf. Comput. Sci. 2004, 44, 1177-1185.
2. Vieth, M.; Siegel, M. G.; Higgs, R. E.; Watson, I. A.; Robertson, D. H.; Savin, K. A.; Durst, G. L.; Hipskind, P. A. Characteristic Physical Properties and Structural Fragments of Marketed Oral Drugs. J. Med. Chem. 2004, 47, 224-232.
6 June 2006

Predictive versus Diagnostic Models
- Highly predictive models tend to obfuscate what is important for the property being modelled
- Highly interpretable models tend to have less predictive power
- However, both objectives are very important
- We want highly predictive models that can also guide our decision-making processes
* Adapted from a diagram by Richard Lewis

Discriminant Analysis
- Supervised learning
  - The dependent variable is known for the dataset and is used, together with the independent variables, to train the model
- Optimize separation of classes
  - Evolve weights for binned descriptors
  - Score solutions according to their ability to separate objects
- Discover descriptor ranges that are important for discrimination, which can then be applied to make informed decisions
1. Gillet, V. J.; Willett, P.; Bradshaw, J. Identification of Biological Activity Profiles Using Substructural Analysis and Genetic Algorithms. J. Chem. Inf. Comput. Sci. 1998, 38, 165-179.
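
The "evolve weights, score by class separation" loop can be sketched as a small generic genetic algorithm. This is only an illustration: the slides do not give the selection, crossover or mutation operators, so tournament selection, one-point crossover, per-gene mutation and elitism are assumptions here, as are the toy data and the score-difference fitness.

```python
import random

# Toy data (hypothetical): each molecule is reduced to the chromosome
# positions of the bins it falls into, one position per descriptor.
# Positions 0-1 belong to descriptor 1, positions 2-3 to descriptor 2.
ACTIVES = [(0, 3), (0, 3), (1, 3)]
INACTIVES = [(1, 2), (1, 2), (0, 2)]
LENGTH, W_MAX = 4, 5


def score(chromosome, molecule):
    """Assumed scoring: sum of the weights of the bins a molecule occupies."""
    return sum(chromosome[p] for p in molecule)


def fitness(chromosome):
    """Stand-in fitness: mean active score minus mean inactive score."""
    mean_active = sum(score(chromosome, m) for m in ACTIVES) / len(ACTIVES)
    mean_inactive = sum(score(chromosome, m) for m in INACTIVES) / len(INACTIVES)
    return mean_active - mean_inactive


def evolve(fitness, length, w_max, rng, pop_size=30, generations=40,
           mutation_rate=0.1):
    """Generic GA over integer weight vectors with values in 0..w_max."""
    population = [[rng.randint(0, w_max) for _ in range(length)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        next_population = [best[:]]  # elitism: carry the best one forward
        while len(next_population) < pop_size:
            # Tournament selection of two parents (tournament size 3).
            a = max(rng.sample(population, 3), key=fitness)
            b = max(rng.sample(population, 3), key=fitness)
            cut = rng.randrange(1, length)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Per-gene mutation: redraw the weight with a small probability.
            child = [rng.randint(0, w_max) if rng.random() < mutation_rate
                     else gene for gene in child]
            next_population.append(child)
        population = next_population
        best = max(population, key=fitness)
    return best


best = evolve(fitness, LENGTH, W_MAX, random.Random(0))
print(best, fitness(best))
```

Because the best chromosome is always copied into the next generation, the best fitness never decreases from one generation to the next.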

Chromosome Encoding
- Selection of N descriptors
- Each descriptor i is partitioned into $B_i$ bins
- Each bin can take any value in the range {0, ..., W}
- Chromosome length is then $\sum_{i=1}^{N} B_i$
[Figure: example chromosome for N = 3 descriptors (PSA, MW, ClogP) with B = {4, 7, 5}]
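
A minimal sketch of this encoding follows. It assumes that the chromosome length is the sum of the per-descriptor bin counts and that a molecule is scored by summing the weights of the bins it falls into. The bin boundaries, the value of W and the assignment of 4, 7 and 5 bins to PSA, MW and ClogP are all hypothetical, chosen only to reproduce B = {4, 7, 5}.

```python
import random
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Descriptor:
    name: str
    edges: list[float]  # interior bin boundaries, ascending

    @property
    def n_bins(self) -> int:
        return len(self.edges) + 1

    def bin_index(self, value: float) -> int:
        """Index of the bin containing value (bins are [low, high))."""
        return bisect_right(self.edges, value)


# Hypothetical boundaries; only the bin counts 4, 7, 5 come from the slide.
DESCRIPTORS = [
    Descriptor("PSA", [60.0, 90.0, 140.0]),                         # 4 bins
    Descriptor("MW", [200.0, 300.0, 350.0, 400.0, 450.0, 500.0]),   # 7 bins
    Descriptor("ClogP", [0.0, 2.0, 3.5, 5.0]),                      # 5 bins
]
W = 10  # hypothetical maximum bin weight


def chromosome_length(descriptors):
    """One gene per bin, descriptors laid out one after another."""
    return sum(d.n_bins for d in descriptors)


def random_chromosome(descriptors, w_max, rng):
    """Integer weights drawn uniformly from 0..w_max."""
    return [rng.randint(0, w_max) for _ in range(chromosome_length(descriptors))]


def score(chromosome, descriptors, molecule):
    """Assumed scoring: sum the weight of the occupied bin of each descriptor."""
    total, offset = 0, 0
    for d in descriptors:
        total += chromosome[offset + d.bin_index(molecule[d.name])]
        offset += d.n_bins
    return total


chromosome = random_chromosome(DESCRIPTORS, W, random.Random(1))
molecule = {"PSA": 75.0, "MW": 320.0, "ClogP": 2.5}
print(chromosome_length(DESCRIPTORS), score(chromosome, DESCRIPTORS, molecule))
```

With B = {4, 7, 5} the chromosome has 4 + 7 + 5 = 16 genes.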

Descriptor Selection
- Calculate physicochemical descriptors
- Cluster descriptors (not objects)
- Select descriptors that are:
  - more orthogonal, and
  - more interpretable for the medicinal chemist
- Some dataset dependency
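
One way to cluster descriptors rather than objects is to group descriptor columns by their pairwise correlation and then pick one interpretable representative per group. The slides do not name the clustering method or similarity measure, so the Pearson correlation, the greedy leader clustering, the 0.8 threshold and the toy values below are all assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def cluster_descriptors(columns, threshold=0.8):
    """Greedy leader clustering of descriptor columns.

    A descriptor joins the first cluster whose leader (first member) it
    correlates with at |r| >= threshold; otherwise it starts a new cluster.
    """
    clusters = []
    for name, values in columns.items():
        for cluster in clusters:
            if abs(pearson(values, columns[cluster[0]])) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Toy descriptor table (hypothetical values): one column per descriptor,
# one entry per molecule.
columns = {
    "MW": [180.0, 250.0, 320.0, 410.0, 500.0],
    "HeavyAtoms": [13.0, 18.0, 23.0, 29.0, 36.0],
    "ClogP": [3.1, 0.5, 4.2, 1.0, 2.8],
}
print(cluster_descriptors(columns))
```

Choosing one descriptor from each cluster keeps the selected set close to orthogonal, and the choice within a cluster can favour whichever descriptor is most meaningful to the medicinal chemist.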

Fitness Functions 1
- Initial Enhancement (IE): emphasises enrichment in the top NACT% of recalled molecules, i.e. the mean rank of all actives recalled after NACT
- Global Enhancement (GE): emphasises enrichment of all actives in the recalled molecules, i.e. the mean rank of all actives
- Maximum Difference Enhancement (MDE): emphasises the maximum difference in scores between the two classes

Fitness Functions 2
- The existing fitness function used a combination of evaluations:
  - Number of actives in the top N%
  - Average rank of actives over the entire ranking
- Maximised Difference Enhancement (MDE)
  - The difference between the average ranks of the two classes being discriminated
- MDE will tend to result in models where the separation between the two classes is maximised globally, while rewarding rankings with more interesting molecules in the initial part of the ranking
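
The rank-based quantities named on the fitness-function slides can be computed from a scored list as sketched below. The exact normalisation behind the enhancement values is not given in the slides, so these are the raw quantities only: the number of actives in the top fraction, the mean rank of the actives, and the difference between the mean ranks of the two classes. Function names and toy scores are hypothetical.

```python
def ranked_labels(scores, labels):
    """Class labels ordered by decreasing score (rank 1 = highest score).

    Ties keep their input order because Python's sort is stable.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def actives_in_top(ranked, fraction):
    """Number of actives (True labels) in the top fraction of the ranking."""
    cutoff = int(len(ranked) * fraction)
    return sum(ranked[:cutoff])


def mean_rank(ranked, wanted):
    """Mean 1-based rank of all molecules whose label equals wanted."""
    ranks = [rank for rank, label in enumerate(ranked, start=1) if label == wanted]
    return sum(ranks) / len(ranks)


def rank_difference(ranked):
    """Mean rank of inactives minus mean rank of actives.

    Larger values mean the actives sit nearer the top of the ranking.
    """
    return mean_rank(ranked, False) - mean_rank(ranked, True)


# Toy example: True marks an active molecule, False an inactive one.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [True, False, True, False, False, True]
ranked = ranked_labels(scores, labels)
print(actives_in_top(ranked, 0.5), mean_rank(ranked, True), rank_difference(ranked))
```

In this toy ranking the actives occupy ranks 1, 3 and 6, so two of them fall in the top half and their mean rank is 10/3.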

Consensus Models
- Aim to reduce the stochastic effects of using a single chromosome
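
The slides do not say how the consensus is formed. One plausible reading, assumed here, is to run the GA several times and average the resulting chromosomes position by position; the example weights are hypothetical.

```python
from statistics import fmean


def consensus_chromosome(chromosomes):
    """Position-wise mean of several chromosomes with the same encoding."""
    lengths = {len(c) for c in chromosomes}
    if len(lengths) != 1:
        raise ValueError("need at least one chromosome, all of the same length")
    return [fmean(position) for position in zip(*chromosomes)]


# Hypothetical weights from three independent GA runs on the same encoding.
runs = [
    [0, 4, 9],
    [2, 4, 7],
    [1, 7, 8],
]
print(consensus_chromosome(runs))
```

If a molecule's score is the sum of the weights of its bins, scoring with the averaged chromosome gives the same result as averaging the scores of the individual chromosomes, so the two views of a consensus coincide under that assumption.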

Splice Models
- Essentially a manual recombination operator to effect a more optimal solution model based on feedback and intuition
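
As an illustration of manual recombination, the sketch below builds a new chromosome by taking each descriptor's block of bin weights from a model chosen by the modeller. That the splice operates on whole descriptor blocks is an assumption; the model names, bin counts and weights are hypothetical.

```python
def splice(models, bins_per_descriptor, donors):
    """Take descriptor i's weight segment from models[donors[i]].

    models: mapping of model name to chromosome (all the same encoding)
    bins_per_descriptor: number of bins of each descriptor, in order
    donors: for each descriptor, the name of the model to copy from
    """
    if len(donors) != len(bins_per_descriptor):
        raise ValueError("one donor is needed per descriptor")
    spliced, offset = [], 0
    for n_bins, donor in zip(bins_per_descriptor, donors):
        spliced.extend(models[donor][offset:offset + n_bins])
        offset += n_bins
    return spliced


# Two hypothetical models over three descriptors with 2, 3 and 1 bins.
models = {
    "A": [1, 1, 2, 2, 2, 3],
    "B": [9, 9, 8, 8, 8, 7],
}
print(splice(models, [2, 3, 1], ["A", "B", "A"]))
```

The donor list is where feedback and intuition enter: the modeller decides, descriptor by descriptor, which model's weights look most sensible.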

Renin Consensus Discrimination Model
[Figure: bar chart of Global Enhancement (y-axis, 0 to 6) by Model (x-axis: M1TR, M1TS1, M1TS2, CM1TS1, CM1TS2, M5TR, M5TS1, M5TS2, CM5TS1, CM5TS2, M6TR, M6TS1, M6TS2, CM6TS1, CM6TS2). Legend: Single Model Result, Consensus Model Result.]

Renin Splice Discrimination Model
[Figure: bar chart of Global Enhancement (y-axis, 0 to 7) by Model (x-axis: M1TS1, M1TS2, CM1TS1, CM1TS2, M5TS1, M5TS2, CM5TS1, CM5TS2, M6TS1, M6TS2, CM6TS1, CM6TS2, SM6TR, SM6TS1, SM6TS2). Legend: Single Model Result, Consensus Model Result, Splice Model Result.]

COX Splice Discrimination Model
[Figure: bar chart of Global Enhancement (y-axis, 0 to 3.5) by Model (x-axis: M1TR, M1TS1, M1TS2, CM1TS1, CM1TS2, M2TR, M2TS1, M2TS2, CM2TS1, CM2TS2). Legend: Single Model Result, Consensus Model Result.]

Comparative Study: Oral vs. Non-Oral Drugs
- Oral vs. non-oral drugs dataset [1]
- GA model compared with models generated with:
  - Naïve Bayes Classifier (NBC)
  - Support Vector Machines (SVM)
- Investigating:
  - Consistency of results
  - Interpretation of models
1. Vieth, M.; Siegel, M. G.; Higgs, R. E.; Watson, I. A.; Robertson, D. H.; Savin, K. A.; Durst, G. L.; Hipskind, P. A. Characteristic Physical Properties and Structural Fragments of Marketed Oral Drugs. J. Med. Chem. 2004, 47, 224-232.

Oral Drug Discrimination Model
[Figure: bar chart of Global Enhancement (y-axis, 0.0 to 1.4) for the Training Set, Test Set 1 and Test Set 2. Legend: GA, NBC, SVM.]

Model Interpretability
- Model weights indicate:
  - Important descriptors
  - Important ranges
- Used to guide decision-making processes:
  - Similarity searching
  - Filtering rules
- Rules are focused on the domain of interest
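
Reading important ranges off an evolved model can be as simple as listing the bins whose weight reaches some cut-off. The sketch below assumes one weight per bin with descriptors laid out one after another; the descriptor boundaries, the weights and the cut-off are hypothetical.

```python
def important_ranges(chromosome, descriptors, min_weight):
    """List (descriptor, low, high, weight) for bins with weight >= min_weight.

    descriptors: list of (name, interior_bin_edges) in chromosome order.
    Each bin covers the half-open interval [low, high); the outermost
    bins are unbounded.
    """
    found, offset = [], 0
    for name, edges in descriptors:
        bounds = [float("-inf"), *edges, float("inf")]
        n_bins = len(edges) + 1
        for b in range(n_bins):
            weight = chromosome[offset + b]
            if weight >= min_weight:
                found.append((name, bounds[b], bounds[b + 1], weight))
        offset += n_bins
    return found


# Hypothetical model: MW with 3 bins followed by ClogP with 2 bins.
descriptors = [("MW", [300.0, 500.0]), ("ClogP", [2.0])]
chromosome = [2, 9, 1, 8, 3]
for name, low, high, weight in important_ranges(chromosome, descriptors, 8):
    print(f"{name}: [{low}, {high}) weight {weight}")
```

Ranges extracted this way can be turned directly into filtering rules, for example keeping only molecules whose values fall inside the highly weighted intervals.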

Conclusions
- Consensus and splice models provide consistently improved results
- GA models provide interpretability greater than or similar to that of the other methods applied here
- Models are transparent as to which descriptors, and which of their ranges, are of greatest importance in discriminating
- Indications that the GA and NBC methods could be applied in combination
  - Investigation of complementarity
1. Ganguly, M.; Brown, N.; Schuffenhauer, A.; Ertl, P.; Gillet, V. J.; Greenidge, P. A. Introducing the Consensus Modeling Concept in Genetic Algorithms: Application to Interpretable Discrimination Analysis. Submitted to J. Chem. Inf. Mod.

Areas the Student Covered
- Cluster analysis
- Druglikeness
- Discriminant analysis
- Variable selection
- Genetic algorithms
- Statistical learning methods
- Java programming
- Method development

What does the student gain?
- Coding and adapting software
- Tackling everyday challenges of research
- Performing research in industry
- Application-context drug research
- Empowered to pursue their own research

What do the mentors gain?
- Freedom to pursue an avenue of interest
- Developing skills in student mentoring
- A new viewpoint with new ideas
- Assisting in training the next generation of scientists

Acknowledgements
- University of Sheffield: Milan Ganguly, Val Gillet, Peter Willett
- UCSF: Jérôme Hert
- Cheminformatics: Peter Ertl, Stephen Jelfs
- Computer-Aided Drug Discovery: Paulette Greenidge, Richard Lewis, Nikolaus Stiefl
- Molecular & Library Informatics: Kamal Azzaoui, Edgar Jacoby, Ansgar Schuffenhauer