Fingerprint-Based Virtual Screening Using Multiple Bioactive Reference Structures



Similar documents
Nathan Brown. The Application of Consensus Modelling and Genetic Algorithms to Interpretable Discriminant Analysis.

Computational Drug Repositioning by Ranking and Integrating Multiple Data Sources

Cheminformatics and its Role in the Modern Drug Discovery Process

Scoring Functions and Docking. Keith Davies Treweren Consultants Ltd 26 October 2005

Cheminformatics and Pharmacophore Modeling, Together at Last

Preview of Period 2: Forms of Energy

Chapter 4: Computer Codes

Introduction to Support Vector Machines. Colin Campbell, Bristol University

Efficient Similarity Search over Encrypted Data


Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification

Statistical Machine Learning

Technical Information

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

An Information Retrieval using weighted Index Terms in Natural Language document collections

How to create and interpret the predictive analysis of a compound

Austin Peay State University Department of Chemistry Chem The Use of the Spectrophotometer and Beer's Law

Chapter 8 How to Do Chemical Calculations

Privacy Aspects in Big Data Integration: Challenges and Opportunities

Several Views of Support Vector Machines

Data Visualization in Cheminformatics. Simon Xi Computational Sciences CoE Pfizer Cambridge

Molecular descriptors and chemometrics: a powerful combined tool for pharmaceutical, toxicological and environmental problems.

1 o Semestre 2007/2008

Azure Machine Learning, SQL Data Mining and R

Classification of Fingerprints. Sarat C. Dass Department of Statistics & Probability

! E6893 Big Data Analytics Lecture 5:! Big Data Analytics Algorithms -- II

Signature verification using Kolmogorov-Smirnov. statistic

Chapter 3. Mass Relationships in Chemical Reactions

A Biologically Inspired Approach to Network Vulnerability Identification

SYMBOLS, FORMULAS AND MOLAR MASSES

= amu. = amu

Risk & Fraud Management Solutions

Is 20TB really Big Data?

A Non-Linear Schema Theorem for Genetic Algorithms

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

The Mole Notes. There are many ways to or measure things. In Chemistry we also have special ways to count and measure things, one of which is the.

How to create a web-based molecular structure database with free software

DE 088 DISTANCE EDUCATION B.A.(PA) DEGREE EXAMINATION, DECEMBER BUSINESS COMMUNICATIONS. PART A (5 8 = 40 marks) Answer any FIVE questions.

The BBP Algorithm for Pi

Section 1.4 Place Value Systems of Numeration in Other Bases

Klaus-Robert Müller et al. Big Data and Machine Learning

Data, Measurements, Features

Document Image Retrieval using Signatures as Queries

INTERNSHIP REPORT CSC410. Shantanu Chaudhary 2010CS50295

CHARGED PARTICLES & MAGNETIC FIELDS - WebAssign

Comparison of Standard and Zipf-Based Document Retrieval Heuristics

Guide to Performance and Tuning: Query Performance and Sampled Selectivity

Traceability System for Manufacturer Accountability

RSA Question 2. Bob thinks that p and q are primes but p isn t. Then, Bob thinks Φ Bob :=(p-1)(q-1) = φ(n). Is this true?

Distances, Clustering, and Classification. Heatmaps

Adaptive Business Intelligence

Hydrogen Bonds The electrostatic nature of hydrogen bonds

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Data Mining and Visualization

ENHANCEMENTS TO SQL SERVER COLUMN STORES. Anuhya Mallempati #

Automated Collaborative Filtering Applications for Online Recruitment Services

Integrating Medicinal Chemistry and Computational Chemistry: The Molecular Forecaster Approach

Pre-Masters. Science and Engineering

NUMBER SYSTEMS. William Stallings

Lecture 2: Data Structures Steven Skiena. skiena

DRUG repositioning, i.e. the prediction of novel therapeutic


1. Oxidation number is 0 for atoms in an element. 3. In compounds, alkalis have oxidation number +1; alkaline earths have oxidation number +2.

2x + y = 3. Since the second equation is precisely the same as the first equation, it is enough to find x and y satisfying the system

Energy transformations

Big Data Analytics CSCI 4030

Understanding the Impact of Weights Constraints in Portfolio Theory

AP CHEMISTRY 2007 SCORING GUIDELINES. Question 2

Evaluation of educational open-source software using multicriteria decision analysis methods

COMP 250 Fall 2012 lecture 2 binary representations Sept. 11, 2012

CREATIVE EUROPE - MEDIA. Distribution and Sales Agents

Chemical Calculations: Formula Masses, Moles, and Chemical Equations

HOMEWORK # 2 SOLUTIO

COMP6053 lecture: Relationship between two variables: correlation, covariance and r-squared.

Lecture 1: Course overview, circuits, and formulas

1. Give the 16 bit signed (twos complement) representation of the following decimal numbers, and convert to hexadecimal:

Paper 1 (7404/1): Inorganic and Physical Chemistry Mark scheme

How To Understand The Chemistry Of A 2D Structure

Fast Contextual Preference Scoring of Database Tuples

Lecture 3: Models of Solutions


1 Solving LPs: The Simplex Algorithm of George Dantzig

Notes on Complexity Theory Last updated: August, Lecture 1

A LEVEL H446 COMPUTER SCIENCE. Code Challenges (1 20) August 2015

Exploiting the Amazon.com People Who Bought Also Bought Algorithm in Reagent Selection. Christian Tyrchan, Niklas Falk and Jonas Boström

CHAPTER 3 SECURITY CONSTRAINED OPTIMAL SHORT-TERM HYDROTHERMAL SCHEDULING

A survey on click modeling in web search

In mathematics, there are four attainment targets: using and applying mathematics; number and algebra; shape, space and measures, and handling data.

Sample Exercise 8.1 Magnitudes of Lattice Energies

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Transcription:

Fingerprint-Based Virtual Screening Using Multiple Bioactive Reference Structures Jérôme Hert, Peter Willett and David J. Wilton (University of Sheffield, Sheffield, UK) Pierre Acklin, Kamal Azzaoui, Edgar Jacoby and Ansgar Schuffenhauer (Novartis Institutes for BioMedical Research, Basel, Switzerland)

Outlines Introduction to similarity-based virtual screening Extension to multiple reference structures Comparison of ranking methods Comparison of descriptors Turbo similarity searching Conclusions

Virtual screening Virtual screening involves scanning databases of compounds to find molecules that may exhibit some bioactivity of interest, so as to prioritise a screening programme

Similarity searching (1) Use of similarity measure to determine the degree of similarity between an active reference structure and each structure in the database Similarity property principle means that high-ranked structures are likely to have a similar activity to that of the reference structure

Similarity searching (2) How can existing methods be used when several diverse structures are available?

Project overview (1) Given a set of molecules of known activity, how can they be used to rank a database in order of decreasing probability of their exhibiting that activity? The various approaches have been evaluated in simulated virtual screening experiments on the MDDR database (ca. 102K molecules) using a range of different activity classes

Project overview (2) Use just 2D fingerprints (of various sorts) Use of the Tanimoto coefficient (will not consider the effect of different similarity coefficients)

Comparison of Methods Three distinct approaches have been investigated Single fingerprint methods Data fusion methods Substructural analysis methods

Single Fingerprint Methods (1) Two methods derived from the Stigmata approach Modal approach Weighted approach Modal / weighted fingerprint used as a query, with the Tanimoto coefficient being used to score molecules in the database

Single Fingerprint Methods (2) Training set of actives: mol 1: 100101100001010 mol 2: 001101000011000 mol 3: 110101001111101 mol 4: 101101101010010 mol 5: 010011100011101 Modal at 40%: 111101101011111 Weighted: 322415302144222

Data Fusion Methods (1) Combination of different rankings of the same sets of molecules with the expectation of improving decision This basic idea has been used previously with considerable success (also consensus scoring) by generating different rankings from the same molecule, using different similarity measures

Data Fusion Methods (2) Here, the different rankings come from different molecules but use the same, Tanimoto-based similarity measure Fusion of ranks Fusion of scores

Data Fusion Methods (3) Two fusion rules investigated SUM : N i = 1 s i MAX : Max ( ) s i

Substructural Analysis Methods (1) Classic substructural analysis (SSA): Training-set containing actives and inactives Weights calculated for each bit from the trainingset using a weighting scheme Sum of the weights for each bit present in a compound gives the score

Substructural Analysis Methods (2) Approximate substructural analysis without known inactives: Reference structures as training-set actives Approximate the training-set by the entire database Use of an appropriate weighting scheme that does not make explicit use of information about the inactives R1: log A T j j N N A T

Substructural Analysis Methods (3) Binary Kernel Discrimination (BKD): ( ) N d i ( ), j d i j i, j λ = 1 λ K, λ λ is a smoothing parameter that is optimised using the training-set active and inactive compounds Compounds are scored by: L A ( j) = i actives K K λ λ i inactives ( i, j) ( i, j)

Substructural Analysis Methods (4) Approximate BKD without known inactives Reference structures as training-set actives Set of 100 randomly chosen compounds from the database as training-set inactives (cf SSA approximation)

Experimental Details MDL Drug Data Report (MDDR) Database 11 activity classes selected 10 sets of 10 randomly chosen compounds from each activity 3 fingerprints investigated: 988-bit Unity, 1052-bit BCI, 2048-bit Daylight Results are average recalls at 5% over the 10 different trials

Results (1) Initial Experiments Best threshold for modal approach is 40% Best combination for data fusion is the combination of scores using the MAX fusion rule The three types of fingerprint give broadly comparable results

Results (2) The basic results obtained with the various methods have been compared with the average and the maximum of all individual, single-molecule similarity searches

Results (3) 70 Average recall at 5% (%) 60 50 40 30 20 Modal Weighted SSA Data fusion BKD Single Max Single Mean

Comparison Of Descriptors Four different types of 2D descriptor investigated using the two best approaches from the initial part of the study Structural keys 1052-bit BCI fingerprints Hashed fingerprints 988-bit Unity fingerprints 2048-bit Daylight fingerprints 2048-bit Avalon fingerprints (internal Novartis system) Circular substructures ECFP_2, ECFP_4, FCFP_2, FCFP_4 Pharmacophore vectors

Pharmacophore Vectors: Similog Similog keys Atom typing scheme based on four properties: hydrogen-bond donor, hydrogen-bond acceptor, bulkiness and electropositivity Atom triplets of strings encoding absence and presence of properties, plus distance encoding form a DABE key Vector contains a count for each of the 8031 possible DABE keys

Pharmacophore Vectors: CATS CATS vectors Based on five atom types: hydrogen-bond acceptor, hydrogen-bond donor, positive, negative, lipophilic Correlation vector representation calculated by C V R T d = 1 A A A i= 1 j = 1 δ T ij, d

Results 70 BKD Data Fusion Average Recall at 5% (%) 60 50 40 30 BCI Daylight Unity Avalon ECFP_2ECFP_4FCFP_2FCFP_4 Similog CATS

Turbo Similarity Searching (1) Similarity property principle: nearest neighbours are likely to exhibit the same activity as the reference structure Data fusion of multiple bioactive compounds is an effective way of improving the identification of active compounds

Turbo Similarity Searching (2) Reference structure Ranked list Nearest neighbours

Probability of being active vs Rank Probability of being active (%) 100 90 80 70 60 50 40 30 20 10 0 95 90 85 80 75 70 65 60 55 50 45 0 5 10 15 20 0 200 400 600 800 1000 Rank

Results 46 45 Average recall at 5% (%) 44 43 42 41 40 39 38 37 36 SS TSS-5 TSS-10 TSS-20 TSS-50 TSS-100 TSS-200

General conclusions This work has demonstrated two effective ways of using multiple active structures in 2D similarity searching (BKD and Data fusion) This work has also demonstrated the general effectiveness of the circular substructure descriptors (ECFP_4 in particular) Turbo similarity searching is a simple way of increasing the cost effectiveness of similarity searching

Acknowledgements Novartis Institutes for BioMedical Research for funding CINF Division for funding MDL Information Systems Inc. for the provision of the MDDR database Barnard Chemical Information Ltd., Daylight Chemical Information Systems Inc., the Royal Society, Tripos Inc., Scitegic Inc. and the Wolfson Foundation for software and laboratory support.