Data Visualization in Cheminformatics
Simon Xi
Computational Sciences CoE, Pfizer Cambridge
My Background
Professional Experience
  Senior Principal Scientist, Computational Sciences CoE, Pfizer Cambridge
  9 years of experience in pharmaceutical research, with a focus on developing cheminformatics and bioinformatics applications for research scientists
Education
  MSc in Molecular Cell Biology, UT Dallas
  MSc in Software Engineering, SMU
  Finishing Ph.D. in Bioinformatics, Boston University
What we will cover today
  Introduction to drug discovery
  Cheminformatics basics
  Encoding of chemical structures
  Visualizing data and structures
  Design and optimization of compound libraries
  A case study
The Billion Dollar Molecules
Drug Name    2006 Worldwide Sales    Primary Use
Lipitor      $14,385M                cholesterol
Nexium       $5,182M                 heartburn
Advair       $6,129M                 asthma
Prevacid     $3,425M                 heartburn
Plavix       $6,057M                 anticoagulant
Singulair    $3,579M                 asthma
Seroquel     $3,560M                 depression
Effexor      $3,722M                 depression
Norvasc      $4,866M                 hypertension
Lipitor: $14 billion in annual sales
Industry Productivity vs. Investment: The Challenge
[Figure: total R&D investment ($ billions), number of NMEs, and NME/$ plotted by year, 1970-2000]
Source: PhRMA annual survey, 2000; Nature Reviews Drug Discovery 3, 451-456 (2004)
Attrition on the R&D Process
[Figure: attrition funnel from idea to drug over 11-15 years (time axis 0-15 years). ~100 discovery approaches and millions of compounds screened narrow to 1-2 products. Stages shown: Discovery, Exploratory Development, Full Development; Phase I, Phase II, Phase III. Activities shown: Preclinical Pharmacology, Preclinical Safety, Clinical Pharmacology & Safety.]
Nat Rev Drug Discov. 2007 6:636-49.
What is Chemoinformatics?
The use of computational and information techniques applied to a range of problems in chemistry. These in silico techniques are commonly used by pharmaceutical companies in drug discovery.
Chemistry is a visual science, so data visualization is a key component of cheminformatics.
Encoding Chemical Structures
SD format: [Figure: SD file for Lipitor, with the atom and bond blocks labeled]
SMILES format (Lipitor):
CC(C)C1=C(C(=O)NC2=CC=CC=C2)C(C3=CC=CC=C3)=C(N1CCC(O)CC(O)CC(O)=O)C4=CC=C(F)C=C4
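To make the idea of a line notation concrete, here is a minimal Python sketch that reads the heavy-atom composition straight out of the Lipitor SMILES string. It is not a SMILES parser: the function name `heavy_atom_counts` is hypothetical, and it assumes a bracket-free string with no lowercase aromatic atoms, which happens to hold for the string on this slide. A real toolkit would be the right choice for anything beyond this.

```python
import re
from collections import Counter

# Lipitor SMILES from the slide, with the line-break spaces removed.
LIPITOR_SMILES = (
    "CC(C)C1=C(C(=O)NC2=CC=CC=C2)C(C3=CC=CC=C3)"
    "=C(N1CCC(O)CC(O)CC(O)=O)C4=CC=C(F)C=C4"
)

# Two-letter symbols are tried first so "Cl" is not read as "C".
_ATOM = re.compile(r"Cl|Br|[BCNOPSFI]")


def heavy_atom_counts(smiles):
    """Count heavy atoms in a bracket-free SMILES without aromatic lowercase atoms."""
    if "[" in smiles or any(ch.islower() and ch not in "lr" for ch in smiles):
        raise ValueError("only the simple organic subset is handled in this sketch")
    return Counter(_ATOM.findall(smiles))


counts = heavy_atom_counts(LIPITOR_SMILES)
print(dict(counts))          # C: 33, N: 2, O: 5, F: 1
print(sum(counts.values()))  # 41 heavy atoms
```

Ring-closure digits, parentheses, and bond symbols are simply skipped, which is why counting works even though no connectivity is reconstructed.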
Representing Structure as Fingerprints
[Figure: a chemical structure encoded as a binary fingerprint (bit string)]
Compound Similarity Search
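A similarity search over fingerprints can be sketched in a few lines of Python. The slide does not name a similarity measure; the Tanimoto coefficient used below is an assumption, chosen because it is a common choice for binary fingerprints. The compound names, bit strings, and the 0.7 threshold are made up for illustration.

```python
def bits(bitstring):
    """Convert a '0'/'1' string into the set of positions that are switched on."""
    return {i for i, b in enumerate(bitstring) if b == "1"}


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared on-bits divided by on-bits in either fingerprint."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def similarity_search(query, library, threshold=0.7):
    """Return (name, score) pairs at or above the threshold, best match first."""
    hits = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in hits if score >= threshold]
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Hypothetical 10-bit fingerprints.
library = {
    "cmpd_A": bits("0100100010"),
    "cmpd_B": bits("0100100011"),
    "cmpd_C": bits("1011000100"),
}
query = bits("0100100010")
print(similarity_search(query, library))  # [('cmpd_A', 1.0), ('cmpd_B', 0.75)]
```

Storing fingerprints as sets of on-bit positions keeps the sketch short; production systems typically use packed bit vectors for speed.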
Compound Properties/Descriptors
1D, 2D, 3D, and multi-dimensional properties:
  1D: molecular weight, cLogP, number of atoms, charge, number of H-bond donors and acceptors
  2D: atom pairs, substructures, functional groups
  3D: shape, pharmacophores
  nD: fingerprints, etc.
Chemical series: compounds sharing the same core structure
Series Classifications: Ward's Clustering
Start with each compound as its own node and iteratively merge pairs of nodes until all nodes are merged into one tree. At each step, the two nodes whose merger gives the smallest increase in within-cluster variance are merged into one new node. Once the tree hierarchy is generated, clusters are defined by cutting the tree at a chosen dissimilarity threshold.
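The merge-then-cut procedure can be sketched in plain Python as follows. This is a naive illustration under stated assumptions, not a production implementation: compounds are assumed to be points in a numeric descriptor space, the function names are hypothetical, and the pairwise search makes it suitable only for small sets. Because Ward merge costs never decrease from one step to the next, stopping once the cheapest merge exceeds the threshold gives the same clusters as building the full tree and cutting it at that height.

```python
from itertools import combinations


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b were merged."""
    d2 = sum((x - y) ** 2 for x, y in zip(a["centroid"], b["centroid"]))
    return a["n"] * b["n"] / (a["n"] + b["n"]) * d2


def ward_clusters(points, threshold):
    """Merge the cheapest pair repeatedly; stop when the cheapest merge exceeds threshold."""
    clusters = [{"members": [i], "n": 1, "centroid": list(p)}
                for i, p in enumerate(points)]
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        a, b = clusters[i], clusters[j]
        if ward_cost(a, b) > threshold:
            break  # equivalent to cutting the tree at this dissimilarity
        n = a["n"] + b["n"]
        merged = {
            "members": a["members"] + b["members"],
            "n": n,
            # The merged centroid is the size-weighted mean of the two centroids.
            "centroid": [(a["n"] * x + b["n"] * y) / n
                         for x, y in zip(a["centroid"], b["centroid"])],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(sorted(c["members"]) for c in clusters)


# Made-up 2D descriptor values for five compounds.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
print(ward_clusters(points, threshold=5))     # [[0, 1], [2, 3], [4]]
print(ward_clusters(points, threshold=1000))  # [[0, 1, 2, 3], [4]]
```

Raising the threshold yields fewer, broader series; lowering it yields more, tighter ones.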
What makes a drug?
Primary pharmacology
  In vitro potency
  Cell-based potency
  Functional assays
  Selectivity against other targets
Toxicity properties
  Inhibition of CYP450 isozymes
  PXR transactivation
  Human hepatocyte toxicity
  Mutagenicity
  Mitochondrial toxicity
  Covalent protein binding
  Inhibition of hERG
ADME/physicochemical properties
  Solubility
  Chemical stability
  Hydrophobicity/hydrogen bonding potential
  Intestinal mucosal cell permeation
  Liver and kidney clearance
  Metabolism
  Transporters
  Charge
  Size
  Protein binding
  Blood-brain barrier permeation
  Target cell permeation
Drug-Likeness: Rule of Five
Proposed by C. Lipinski to describe drug-like molecules. Molecules with good oral absorption and/or distribution properties are likely to have:
  Molecular weight ≤ 500
  logP ≤ 5
  H-bond donors ≤ 5
  H-bond acceptors (number of N and O atoms) ≤ 10
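As a filter, the rule reduces to four threshold checks. The Python sketch below treats the limits as inclusive, as in the commonly cited form of the rule; the `Descriptors` class, the function name, and the example values are hypothetical, and the descriptor values themselves would come from a cheminformatics toolkit.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    mol_weight: float
    logp: float
    h_donors: int
    h_acceptors: int  # counted as the number of N and O atoms


def rule_of_five_violations(d):
    """Return the list of Rule of Five criteria that the compound violates."""
    checks = {
        "molecular weight > 500": d.mol_weight > 500,
        "logP > 5": d.logp > 5,
        "H-bond donors > 5": d.h_donors > 5,
        "H-bond acceptors > 10": d.h_acceptors > 10,
    }
    return [name for name, violated in checks.items() if violated]


# Made-up descriptor values for two hypothetical compounds.
print(rule_of_five_violations(Descriptors(350.4, 2.1, 2, 5)))   # []
print(rule_of_five_violations(Descriptors(612.7, 6.3, 4, 11)))  # three violations
```

Returning the violated criteria, not just a pass/fail flag, makes the result easier to show next to a structure in a grid or table view.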
Data Visualization
  Grid View
  Table View
  Plot View
  Heatmap View
  Software Relevance
  Software Usability
  Software Management
Building Predictive Models Using Machine Learning Techniques
  Use computational models to understand the structure-activity relationship (SAR)
  Use computational models to run virtual screens that guide compound selection for synthesis
Interpretability of Predictive Models
  The good part
  The not so good part
  Can we derive this for non-linear models?
Multiple Parameter Optimization in Combinatorial Library Design
Given a 100x100x100 virtual library space and a set of predictive models for various properties (e.g. potency, ADME, selectivity), select the best 300 compounds for synthesis: those with the highest probability of being potent and drug-like, while sampling the chemical space diversely.
[Figure: example diaminopyrimidine library scaffold with R1, R2, and R3 positions]
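One simple way to frame this selection problem is a greedy pick: enumerate the R1 x R2 x R3 combinations, rank them by a combined model score, and cap how often any single R-group may be reused so the picks spread across the space. The Python sketch below is only an illustration of that framing, not the method used in the talk; the function names, the reuse cap as a stand-in for diversity, and the toy per-group scores are all assumptions.

```python
from itertools import product
from math import prod


def select_compounds(r1, r2, r3, score, n_select, max_per_group):
    """Greedily pick top-scoring combinations, capping reuse of any one R-group."""
    ranked = sorted(product(r1, r2, r3), key=score, reverse=True)
    usage = {}
    chosen = []
    for combo in ranked:
        keys = list(enumerate(combo))  # (position, R-group) pairs
        if any(usage.get(k, 0) >= max_per_group for k in keys):
            continue  # this combination would over-sample one R-group
        chosen.append(combo)
        for k in keys:
            usage[k] = usage.get(k, 0) + 1
        if len(chosen) == n_select:
            break
    return chosen


# Toy 2x2x2 library; the per-group scores stand in for model predictions.
GROUP_SCORE = {"A1": 0.9, "A2": 0.5, "B1": 0.8, "B2": 0.6, "C1": 0.7, "C2": 0.4}


def toy_score(combo):
    """Hypothetical combined score: product of per-group scores."""
    return prod(GROUP_SCORE[g] for g in combo)


picked = select_compounds(["A1", "A2"], ["B1", "B2"], ["C1", "C2"],
                          toy_score, n_select=2, max_per_group=1)
print(picked)  # [('A1', 'B1', 'C1'), ('A2', 'B2', 'C2')]
```

For the 100x100x100 case on the slide the same call would rank one million combinations with `n_select=300`; the choice of `max_per_group` then controls the trade-off between score and diversity.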
The Problem of Multiple Parameter Optimization
  The chemical space is huge
  Predictive models are not very predictive
  There are many parameters to optimize, and they sometimes conflict with each other
MPO: a case study with kinase selectivity
~200 compounds from a library were tested against 40 kinases. Can we design another 100 compounds that are highly selective?
Series: trifluoro-diaminopyrimidine (~200 compounds), with variable R1 and R2 groups
Workflow:
  Tested compounds (R1 x R2): only a few R-group/kinase combinations have been previously tested
  Model building (FW): solve R-group contributions using linear regression
  Enumeration: 5-50x expansion of R1 x R2 into a predictable virtual chemical space
  Virtual library profile: identify compounds with the desired selectivity profile in the expanded virtual chemical space
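The model-building step, solving R-group contributions by linear regression, can be sketched as an additive model fitted by least squares. The sketch assumes that "FW" on the slide refers to a Free-Wilson-style additive scheme, which the slide does not spell out. The function names, the reference-level encoding, and the activity values are hypothetical. The model predicts any R1/R2 pairing whose individual R-groups were each seen in training, which is what makes the enumerated space predictable.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("rank deficient: some R-group contribution is not identifiable")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_additive(observations):
    """observations: list of (r1, r2, activity). Returns predict(r1, r2)."""
    r1_levels = sorted({o[0] for o in observations})
    r2_levels = sorted({o[1] for o in observations})
    # Column 0 is the intercept; the first level at each position is the reference.
    cols = {("R1", g): 1 + i for i, g in enumerate(r1_levels[1:])}
    offset = len(r1_levels)
    cols.update({("R2", g): offset + i for i, g in enumerate(r2_levels[1:])})
    p = offset + len(r2_levels) - 1

    def row(r1, r2):
        x = [0.0] * p
        x[0] = 1.0
        for key in (("R1", r1), ("R2", r2)):
            if key in cols:
                x[cols[key]] = 1.0
        return x

    X = [row(r1, r2) for r1, r2, _ in observations]
    y = [act for _, _, act in observations]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(xi[i] * xi[j] for xi in X) for j in range(p)] for i in range(p)]
    xty = [sum(xi[i] * yi for xi, yi in zip(X, y)) for i in range(p)]
    beta = solve(xtx, xty)

    def predict(r1, r2):
        if r1 not in r1_levels or r2 not in r2_levels:
            raise KeyError("R-group not seen in training data")
        return sum(b * x for b, x in zip(beta, row(r1, r2)))

    return predict


# Made-up pIC50-like values for six tested R1/R2 combinations.
tested = [("a", "x", 6.0), ("a", "y", 6.3), ("b", "x", 7.0),
          ("c", "x", 5.5), ("a", "z", 6.8), ("b", "y", 7.3)]
predict = fit_additive(tested)
print(round(predict("c", "z"), 3))  # untested combination of seen R-groups
```

One such model per kinase would give a predicted profile across the panel for every enumerated compound, from which selectivity can be ranked.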
Predictive models - Leave-One-Out Validations
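Leave-one-out validation refits the model once per compound, each time predicting the compound that was held out. The Python sketch below shows the loop with a one-descriptor straight-line fit standing in for the real model; the function names, the cross-validated q² summary, and the data are assumptions made for illustration, not the validation actually reported in the talk.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns a predict(x) function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return lambda x: mean_y + slope * (x - mean_x)


def leave_one_out(xs, ys, fit=fit_line):
    """Predict each point from a model fitted to all the other points (xs, ys are lists)."""
    preds = []
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(model(xs[i]))
    return preds


def q_squared(ys, preds):
    """Cross-validated q2 = 1 - PRESS / total sum of squares."""
    mean_y = sum(ys) / len(ys)
    press = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - press / ss_tot


# Made-up descriptor values and activities.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.1]
preds = leave_one_out(xs, ys)
print(round(q_squared(ys, preds), 3))
```

Any model with the same fit-then-predict shape can be passed as `fit`, so the loop is independent of the model type.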
Experimental Validation of Predictions
~40 compounds in two series were selected for KSS testing.
[Figure: panels plotting KSS pIC50 vs. FW pIC50, arranged from more promiscuous to more selective. Panel r² values as extracted: 0.45, 0.59, 0.92, 0.86, 0.74, 0.83, 0.63, 0.88, 0.85, 0.81, 0.81, 0.85]
Cheminformatics Challenges for Drug Discovery
Information retrieval and knowledge management - rapidly and efficiently present all relevant data and knowledge to scientists at the right time and place
Predictive models - drastically improve the accuracy and interpretability of in silico models for potency and ADME endpoints
Computer-aided design - provide easy-to-use software applications that help scientists analyze and visualize their data and make efficient use of prior knowledge during compound design