CÉLINE LE BAILLY DE TILLEGHEM. Institut de statistique Université catholique de Louvain Louvain-la-Neuve (Belgium)


1 STATISTICAL CONTRIBUTION TO THE VIRTUAL MULTICRITERIA OPTIMISATION OF COMBINATORIAL MOLECULES LIBRARIES AND TO THE VALIDATION AND APPLICATION OF QSAR MODELS
CÉLINE LE BAILLY DE TILLEGHEM
Institut de statistique, Université catholique de Louvain, Louvain-la-Neuve (Belgium)
Journée Jeunes Chercheurs - September 21st, 2007 p. 1/23

2 Context of the research
- Lead optimisation using combinatorial chemistry
- Diabetes library provided by Eli Lilly and Company: a combinatorial library composed of 3 R-groups and = compounds.
- Objective: select the most promising compounds

3 Proposed methodology:
It gathers in a coherent framework existing and new tools of statistics and chemometrics, mainly:
- the development and validation of QSAR models to predict drugability properties,
- the definition of a desirability index to summarise those properties, and the assessment of how QSAR model prediction errors propagate to it,
- an efficient algorithm to screen the combinatorial library and select the most promising compounds.

4 Problem description
Construction of the combinatorial library
- chemists divide the lead and select reagents to add on each part: = compounds
Definition of the objective
- select the best combinatorial sublibrary of size $n_1 \times n_2 \times n_3$ ($5 \times 5 \times 5$), or
- select the sublibrary with the m best compounds (m = 100)
Definition of the optimised drugability properties (Y)
- min $Y_1$ = quantity of substance to inject around the receptor $R_1$ to have a binding
- min $Y_2$ = quantity of substance to inject around the receptor $R_2$ to have a binding
- max $Y_3$ = quantity of substance to inject around the receptor $R_3$ to have a binding
- max $Y_4$ = quantity of substance to inject around the receptor $R_4$ to have a binding

5 Problem description
Definition of the available chemical descriptors (x)
- descriptors are computed by specific software on the basis of SMILES
- groups of descriptors:
  - describing the molecule as a whole (number of atoms, number of rings, molecular weight...),
  - quantifying the overall charge distribution (total absolute charge, total positive charge, total negative charge...),
  - measuring electrotopological properties, molecular surface properties, connectivity properties,...
- More than 9000 molecular descriptors can be computed at Eli Lilly!!!

6 Proposed methodology:

7 QSAR models development
QSARs (Quantitative Structure-Activity Relationships) are mathematical models approximating the link between chemical properties (x) and biological activities (Y) of compounds.
Model assumptions
For each optimised response, different QSAR models are assumed:
- Multiple Linear Regression (forward regression minimising BIC),
- PLS Regression (minimising the bias-corrected 10-fold CV estimate of the MSEP),
- binary Regression Tree + pruning (minimising a cost-complexity measure based on the RSS and the number of splits) + bagging
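The slide names the selection criteria without writing them out. As a reminder, and assuming the textbook forms (the slides may use variants that differ by constants), the BIC of a Gaussian linear model with $k$ fitted coefficients on $n$ molecules and the cost-complexity measure of a tree $T$ can be written as follows. The symbols $k$, $n$, $\alpha$ and $\mathrm{splits}(T)$ are notation introduced here, not taken from the slides.

```latex
% BIC of a Gaussian linear model, up to an additive constant
\mathrm{BIC} = n \,\ln\!\left(\frac{\mathrm{RSS}}{n}\right) + k \,\ln n

% Cost-complexity measure of a regression tree T with penalty \alpha \ge 0
C_\alpha(T) = \mathrm{RSS}(T) + \alpha \cdot \mathrm{splits}(T)
```

Forward regression adds, at each step, the descriptor that lowers the BIC most and stops when no addition lowers it; pruning keeps the subtree that minimises $C_\alpha(T)$ for the chosen $\alpha$.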

8 QSAR models development
Data collection and models fit
- 4 training sets after pretreatment and cleaning of the collected data:
  [Table: for each response $Y_1$ to $Y_4$, the number of observed molecules, of available descriptors and of descriptors kept after cleaning; values not recoverable]
- MLR, PLSR and RT are fitted on those 4 training sets, selecting the explanatory variables to enter as explained before.

9 QSAR models development
Model selection and assessment
- goodness-of-fit criteria
  [Table, MLR: $N$, $K - 1$, $R^2$, $R^2_{adj}$, $S$ and F-test p-value for $Y_1$ to $Y_4$; values not recoverable]
  [Table, PLSR: $N$, $K$, $R^2_Y$, $R^2_X$ and $S$ for $Y_1$ to $Y_4$; values not recoverable]

10 QSAR models development
Model selection and assessment
- goodness-of-fit criteria
  [Table, RT: $N$ for $Y_1$ to $Y_4$; without bagging, $K$, $R^2$ and $S$ for unpruned and pruned trees; with bagging, $R^2$ and $S$ for unpruned and pruned trees; values not recoverable]

11 QSAR models development
Model selection and assessment
- Fitted vs observed
  [Figure: fitted vs observed plots for $Y_1$ to $Y_4$, one row for MLR and one row for PLSR]

12 QSAR models development
Model selection and assessment
- Fitted vs observed
  [Figure: fitted vs observed plots for $Y_1$ to $Y_4$, one row for RT without pruning and one row for RT with pruning]

13 QSAR models development
Model selection and assessment
- Fitted vs observed
  [Figure: fitted vs observed plots for $Y_1$ to $Y_4$, one row for bagged RT without pruning and one row for bagged RT with pruning]

14 QSAR models development
Model selection and assessment
- Internal predictive power: $Q^2$ = cross-validated $R^2$
  [Table: $Q^2$ for $Y_1$ to $Y_4$ and for each model (RT bagging without pruning, RT bagging with pruning, MLR, PLSR, RT with pruning, RT without pruning); values not recoverable]
  MLR models are selected
- External validation if possible!
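A minimal sketch of how a cross-validated $R^2$ can be computed, assuming the common definition $Q^2 = 1 - \mathrm{PRESS}/\mathrm{TSS}$ with the total sum of squares taken around the overall mean (the slides do not state which variant was used). The helper `fit_line` is a hypothetical one-descriptor stand-in for the MLR, PLSR or RT fits; any function that returns a predictor can be passed instead.

```python
import random


def fit_line(xs, ys):
    """Least-squares line y = a + b*x for a single descriptor; returns a predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x


def q2(xs, ys, fit, k=10, seed=0):
    """Cross-validated R^2, computed as 1 - PRESS / TSS over k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    press = 0.0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Refit on the training part only, then predict the held-out molecules.
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        press += sum((ys[i] - predict(xs[i])) ** 2 for i in held_out)
    mean_y = sum(ys) / len(ys)
    tss = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / tss
```

Unlike $R^2$, $Q^2$ can be negative: a model that predicts held-out molecules worse than the overall mean has PRESS larger than TSS.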

15 QSAR models development
Applicability domain
- Definition: the applicability domain is the set of molecules for which the QSAR model is valid.
- Computation: descriptor ranges, convex hull, leverages, other distance measurements (Euclidean, Mahalanobis or $L_1$ distance), the Hotelling $T^2$, density measurements...
  [Figure: leverage against observation number, one panel for each of $Y_1$ to $Y_4$]
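As an illustration of the leverage option, the sketch below computes $h(x_0) = x_0^\top (X^\top X)^{-1} x_0$ for a new molecule, with an intercept column added to the training descriptors. The cut-off $3p/n$ is a common rule of thumb and is an assumption here; the slides do not give the threshold that was used. The function names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - s) / m[r][r]
    return z


def leverage(train, x0):
    """Leverage h = x0' (X'X)^-1 x0, with an intercept added to X and to x0."""
    rows = [[1.0] + list(r) for r in train]
    v = [1.0] + list(x0)
    p = len(v)
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    z = solve(xtx, v)
    return sum(vi * zi for vi, zi in zip(v, z))


def in_domain(train, x0, factor=3.0):
    """True if the leverage of x0 does not exceed factor * p / n (assumed cut-off)."""
    p, n = len(x0) + 1, len(train)
    return leverage(train, x0) <= factor * p / n
```

A molecule at the centroid of the training descriptors has leverage $1/n$, the smallest possible value; the leverage grows as the molecule moves away from the training set, which is what makes it usable as an extrapolation warning.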

16 Proposed methodology:

17 Molecules optimisation
Definition of the optimised criterion (DF and DI)
- Multicriteria optimisation!!!
- Desirability Functions: $d_1(Y_1)$, $d_2(Y_2)$, $d_3(Y_3)$, $d_4(Y_4)$
  [Figure: each desirability function $d_i(Y_i)$ plotted against $Y_i$, for $i = 1, \dots, 4$]
- Desirability Index of 1 molecule: $E[D(Y \mid x)] = E\left[\prod_{i=1}^{4} \bigl(d_i(Y_i \mid x)\bigr)^{1/4}\right]$
- Loss of a sublibrary with m molecules: $\sum_{i=1}^{m} \bigl(1 - E[D(Y \mid x_i)]\bigr)^2 / m$
- The best sublibrary is the sublibrary with the smallest loss
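The index and the loss can be sketched in a few lines. The shapes of the desirability functions are not recoverable from the slide, so the linear ramps below, their bounds (1 and 10) and the Monte Carlo estimate of the expectation under independent normal prediction errors are all assumptions made for illustration only.

```python
import random


def ramp_down(y, low, high):
    """Desirability for a response to minimise: 1 below `low`, 0 above `high`."""
    if y <= low:
        return 1.0
    if y >= high:
        return 0.0
    return (high - y) / (high - low)


def ramp_up(y, low, high):
    """Desirability for a response to maximise: 0 below `low`, 1 above `high`."""
    return 1.0 - ramp_down(y, low, high)


# Hypothetical desirability functions: minimise Y1 and Y2, maximise Y3 and Y4.
funcs = [
    lambda y: ramp_down(y, 1.0, 10.0),
    lambda y: ramp_down(y, 1.0, 10.0),
    lambda y: ramp_up(y, 1.0, 10.0),
    lambda y: ramp_up(y, 1.0, 10.0),
]


def desirability(ys, funcs):
    """Geometric mean of the individual desirabilities d_i(y_i)."""
    prod = 1.0
    for y, d in zip(ys, funcs):
        prod *= d(y)
    return prod ** (1.0 / len(funcs))


def expected_desirability(means, sds, funcs, n_draws=2000, seed=0):
    """Monte Carlo estimate of E[D(Y|x)], assuming independent normal responses."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        draw = [rng.gauss(m, s) for m, s in zip(means, sds)]
        total += desirability(draw, funcs)
    return total / n_draws


def sublibrary_loss(indexes):
    """Mean squared distance of the desirability indexes from the ideal value 1."""
    return sum((1.0 - d) ** 2 for d in indexes) / len(indexes)
```

Because the index is a geometric mean, a single response with zero desirability drives the whole index to zero, so a molecule cannot compensate one unacceptable property with good values on the others.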

18 Molecules optimisation
WEALD
- WEALD (Weighted Exchanges Algorithm for Library Design) is an efficient algorithm to screen combinatorial libraries of molecules
- Principle: select a sublibrary at random and perform exchanges between reagents to decrease the loss
- Application of WEALD to select the 100 best compounds in the diabetes library: by exploring 4729 molecules (only 4.28% of the whole library), WEALD selects 100 compounds that are within the 105 best compounds of the library
  [Figure: loss against the number of explored molecules]
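The exchange principle can be illustrated with a deliberately simplified search. This is not WEALD itself: the weighting scheme that gives the algorithm its name is not described on the slide, so the sketch below draws exchanges uniformly at random and keeps only those that lower the loss. `toy_desirability` is a hypothetical stand-in for the estimated desirability index of a compound.

```python
import itertools
import random


def toy_desirability(compound, n_reagents=(6, 6, 6)):
    """Hypothetical stand-in for the estimated index: higher reagent numbers are better."""
    return sum(r / (n - 1) for r, n in zip(compound, n_reagents)) / len(compound)


def sublibrary_loss(selection, desirability):
    """Mean of (1 - D)^2 over all compounds formed by the selected reagents."""
    compounds = list(itertools.product(*selection))
    return sum((1.0 - desirability(c)) ** 2 for c in compounds) / len(compounds)


def exchange_search(n_reagents, sizes, desirability, n_iter=2000, seed=0):
    """Greedy random-exchange search for a low-loss combinatorial sublibrary."""
    rng = random.Random(seed)
    # Start from a random sublibrary: sizes[k] reagents at each position k.
    selection = [rng.sample(range(n), k) for n, k in zip(n_reagents, sizes)]
    best = sublibrary_loss(selection, desirability)
    for _ in range(n_iter):
        pos = rng.randrange(len(selection))
        outside = [r for r in range(n_reagents[pos]) if r not in selection[pos]]
        if not outside:
            continue
        # Propose swapping one selected reagent for an unselected one.
        slot = rng.randrange(len(selection[pos]))
        candidate = [list(s) for s in selection]
        candidate[pos][slot] = rng.choice(outside)
        loss = sublibrary_loss(candidate, desirability)
        if loss < best:  # keep only exchanges that decrease the loss
            selection, best = candidate, loss
    return selection, best
```

A call such as `exchange_search((6, 6, 6), (2, 2, 2), toy_desirability)` returns the selected reagents at each position and the final loss. Only compounds that appear in a proposed sublibrary are ever evaluated, which is the reason an exchange search can explore a small fraction of the full library.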

19 Molecules optimisation
Uncertainty analysis
- For all molecules explored by WEALD, drugability properties are estimated by the fitted QSAR models.
- Check, for every explored molecule, whether it lies in the applicability domains of the QSARs.
- Among the 4729 explored molecules, 1948 molecules (more than 41%) are outside at least one applicability domain.
⇒ QSAR models are often extrapolating!

20 Molecules optimisation
Uncertainty analysis
- For a given molecule with descriptors $x_0$, the desirability index is estimated: $\hat{E}[D(Y \mid x_0)]$.
- Construct a confidence interval for $E[D(Y \mid x_0)]$.
- For the 4729 explored molecules, the average CI length is 0.12, but it varies from 0.04 to nearly 1!
⇒ Desirability indexes cannot be compared as if they were exact!
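The slides do not say how the confidence interval is built. One generic option, assumed here purely for illustration, is a percentile interval over replicates of the estimated index (for example obtained by refitting the QSAR models on bootstrap resamples of the training sets). The function below only covers the last step: turning a list of replicates into an interval.

```python
import math


def percentile_interval(replicates, level=0.95):
    """Percentile interval from replicates of an estimate (conservative rounding)."""
    values = sorted(replicates)
    n = len(values)
    alpha = (1.0 - level) / 2.0
    # Round the lower rank down and the upper rank up so the interval is not too short.
    lo = values[int(math.floor(alpha * (n - 1)))]
    hi = values[int(math.ceil((1.0 - alpha) * (n - 1)))]
    return lo, hi
```

The interval length `hi - lo` is the quantity summarised on the slide; a molecule far from the training set would typically produce widely spread replicates and hence a long interval.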

21 Molecules optimisation
Uncertainty analysis
- As the desirability indexes are estimated, some molecules are not significantly worse than the optimal one (Indistinguishable Optimal Zone).
- For any explored molecule with descriptors $x$, test $H_0: E[D(Y \mid x)] \geq E[D(Y \mid x_{opt})]$ against $H_1: E[D(Y \mid x)] < E[D(Y \mid x_{opt})]$.
- Among the 4729 explored molecules, 230 molecules are not significantly worse than the optimal one.
⇒ Desirability indexes of two molecules are compared taking the prediction error of the QSAR models into account.

22 Molecules optimisation
Uncertainty analysis
[Figure: estimated desirability index (axis labelled D^ M(x)) with confidence intervals for the top 100 molecules. One marker type denotes molecules outside at least one applicability domain, the other molecules included in all applicability domains. Green CI for $E[D(Y \mid x)]$: molecules equivalent to the optimal one. Black CI for $E[D(Y \mid x)]$: molecules significantly worse than the optimal one.]

23 Conclusion
Integrated methodology to virtually screen combinatorial molecules libraries
- QSAR models development
- Desirability index
- WEALD
QSAR models should be validated
- Goodness-of-fit
- Internal and external predictivity
- Applicability domain
The uncertainty of the desirability indexes should be quantified
- Confidence interval
- Indistinguishable Optimal Zone

Molecular descriptors and chemometrics: a powerful combined tool for pharmaceutical, toxicological and environmental problems.

Molecular descriptors and chemometrics: a powerful combined tool for pharmaceutical, toxicological and environmental problems. Molecular descriptors and chemometrics: a powerful combined tool for pharmaceutical, toxicological and environmental problems. Roberto Todeschini Milano Chemometrics and QSAR Research Group - Dept. of

More information

Statistical Models in R

Statistical Models in R Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Linear Models in R Regression Regression analysis is the appropriate

More information

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96 1 Final Review 2 Review 2.1 CI 1-propZint Scenario 1 A TV manufacturer claims in its warranty brochure that in the past not more than 10 percent of its TV sets needed any repair during the first two years

More information

A Statistician s View of Big Data

A Statistician s View of Big Data A Statistician s View of Big Data Max Kuhn, Ph.D (Pfizer Global R&D, Groton, CT) Kjell Johnson, Ph.D (Arbor Analytics, Ann Arbor MI) What Does Big Data Mean? The advantages and issues related to Big Data

More information

Running Large Workflows in the Cloud

Running Large Workflows in the Cloud Running Large Workflows in the Cloud Paul Watson School of Computing Science & Digital Institute Newcastle University, UK Paul.Watson@ncl.ac.uk The team: Jacek Cala, Hugo Hiden, Simon Woodman, David Leahy

More information

Analysis and Interpretation of Clinical Trials. How to conclude?

Analysis and Interpretation of Clinical Trials. How to conclude? www.eurordis.org Analysis and Interpretation of Clinical Trials How to conclude? Statistical Issues Dr Ferran Torres Unitat de Suport en Estadística i Metodología - USEM Statistics and Methodology Support

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression A regression with two or more explanatory variables is called a multiple regression. Rather than modeling the mean response as a straight line, as in simple regression, it is

More information

Classification/Decision Trees (II)

Classification/Decision Trees (II) Classification/Decision Trees (II) Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Right Sized Trees Let the expected misclassification rate of a tree T be R (T ).

More information

Simple Linear Regression Inference

Simple Linear Regression Inference Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation

More information

Studying Auto Insurance Data

Studying Auto Insurance Data Studying Auto Insurance Data Ashutosh Nandeshwar February 23, 2010 1 Introduction To study auto insurance data using traditional and non-traditional tools, I downloaded a well-studied data from http://www.statsci.org/data/general/motorins.

More information

Data Mining Techniques Chapter 6: Decision Trees

Data Mining Techniques Chapter 6: Decision Trees Data Mining Techniques Chapter 6: Decision Trees What is a classification decision tree?.......................................... 2 Visualizing decision trees...................................................

More information

Evaluation of Quantitative Data (errors/statistical analysis/propagation of error)

Evaluation of Quantitative Data (errors/statistical analysis/propagation of error) Evaluation of Quantitative Data (errors/statistical analysis/propagation of error) 1. INTRODUCTION Laboratory work in chemistry can be divided into the general categories of qualitative studies and quantitative

More information

Model Validation Techniques

Model Validation Techniques Model Validation Techniques Kevin Mahoney, FCAS kmahoney@ travelers.com CAS RPM Seminar March 17, 2010 Uses of Statistical Models in P/C Insurance Examples of Applications Determine expected loss cost

More information

Multiple Linear Regression in Data Mining

Multiple Linear Regression in Data Mining Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple

More information

Chemical Risk Assessment in Absence of Adequate Toxicological Data

Chemical Risk Assessment in Absence of Adequate Toxicological Data Chemical Risk Assessment in Absence of Adequate Toxicological Data Mark Cronin School of Pharmacy and Chemistry Liverpool John Moores University England m.t.cronin@ljmu.ac.uk The Problem Risk Analytical

More information

Prospective Life Tables

Prospective Life Tables An introduction to time dependent mortality models by Julien Antunes Mendes and Christophe Pochet TRENDS OF MORTALITY Life expectancy at birth among early humans was likely to be about 20 to 30 years.

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models We have previously worked with regression models where the response variable is quantitative and normally distributed. Now we turn our attention to two types of models where the

More information

NISS. Technical Report Number 105 March, 2000

NISS. Technical Report Number 105 March, 2000 NISS A Sequential Approach for Identifying Lead Compounds in Large Chemical Databases Markus Abt, Yong Bin Lim, Jerome Sacks, Minge Xie, and S. Stanley Young Technical Report Number 105 March, 2000 National

More information

Detection of changes in variance using binary segmentation and optimal partitioning

Detection of changes in variance using binary segmentation and optimal partitioning Detection of changes in variance using binary segmentation and optimal partitioning Christian Rohrbeck Abstract This work explores the performance of binary segmentation and optimal partitioning in the

More information

Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression

Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression Objectives: To perform a hypothesis test concerning the slope of a least squares line To recognize that testing for a

More information

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Volkert Siersma siersma@sund.ku.dk The Research Unit for General Practice in Copenhagen Dias 1 Content Quantifying association

More information

We discuss 2 resampling methods in this chapter - cross-validation - the bootstrap

We discuss 2 resampling methods in this chapter - cross-validation - the bootstrap Statistical Learning: Chapter 5 Resampling methods (Cross-validation and bootstrap) (Note: prior to these notes, we'll discuss a modification of an earlier train/test experiment from Ch 2) We discuss 2

More information

The Mole Concept. The Mole. Masses of molecules

The Mole Concept. The Mole. Masses of molecules The Mole Concept Ron Robertson r2 c:\files\courses\1110-20\2010 final slides for web\mole concept.docx The Mole The mole is a unit of measurement equal to 6.022 x 10 23 things (to 4 sf) just like there

More information

Reporting Low-level Analytical Data

Reporting Low-level Analytical Data W. Horwitz, S. Afr. J. Chem., 2000, 53 (3), 206-212, , . [formerly: W. Horwitz, S. Afr. J. Chem.,

More information

The risks of mesothelioma and lung cancer in relation to relatively lowlevel exposures to different forms of asbestos

The risks of mesothelioma and lung cancer in relation to relatively lowlevel exposures to different forms of asbestos WATCH/2008/7 Annex 1 The risks of mesothelioma and lung cancer in relation to relatively lowlevel exposures to different forms of asbestos What statements can reliably be made about risk at different exposure

More information

QsarDB first 100 DOIs for predictive models

QsarDB first 100 DOIs for predictive models QsarDB first 100 DOIs for predictive models Uko Maran Institute of chemistry, University of Tartu, Estonia LOD: Content Data Predictive (and descriptive) models? Goal Components Persistent digital identifiers

More information

A Comparison of Variable Selection Techniques for Credit Scoring

A Comparison of Variable Selection Techniques for Credit Scoring 1 A Comparison of Variable Selection Techniques for Credit Scoring K. Leung and F. Cheong and C. Cheong School of Business Information Technology, RMIT University, Melbourne, Victoria, Australia E-mail:

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde

More information

Cross Validation. Dr. Thomas Jensen Expedia.com

Cross Validation. Dr. Thomas Jensen Expedia.com Cross Validation Dr. Thomas Jensen Expedia.com About Me PhD from ETH Used to be a statistician at Link, now Senior Business Analyst at Expedia Manage a database with 720,000 Hotels that are not on contract

More information

Cheminformatics and Pharmacophore Modeling, Together at Last

Cheminformatics and Pharmacophore Modeling, Together at Last Application Guide Cheminformatics and Pharmacophore Modeling, Together at Last SciTegic Pipeline Pilot Bridging Accord Database Explorer and Discovery Studio Carl Colburn Shikha Varma-O Brien Introduction

More information

11. Analysis of Case-control Studies Logistic Regression

11. Analysis of Case-control Studies Logistic Regression Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:

More information

2.500 Threshold. 2.000 1000e - 001. Threshold. Exponential phase. Cycle Number

2.500 Threshold. 2.000 1000e - 001. Threshold. Exponential phase. Cycle Number application note Real-Time PCR: Understanding C T Real-Time PCR: Understanding C T 4.500 3.500 1000e + 001 4.000 3.000 1000e + 000 3.500 2.500 Threshold 3.000 2.000 1000e - 001 Rn 2500 Rn 1500 Rn 2000

More information

HOW TO USE MINITAB: DESIGN OF EXPERIMENTS. Noelle M. Richard 08/27/14

HOW TO USE MINITAB: DESIGN OF EXPERIMENTS. Noelle M. Richard 08/27/14 HOW TO USE MINITAB: DESIGN OF EXPERIMENTS 1 Noelle M. Richard 08/27/14 CONTENTS 1. Terminology 2. Factorial Designs When to Use? (preliminary experiments) Full Factorial Design General Full Factorial Design

More information

Q-edit: Documentation

Q-edit: Documentation Q-edit: Documentation Scope Q-edit is a new QPRF editor developed under OpenTox which aims at exploiting implemented web services to provide functionalities that facilitate the creation of QPRF reports

More information

Monitoring chemical processes for early fault detection using multivariate data analysis methods

Monitoring chemical processes for early fault detection using multivariate data analysis methods Bring data to life Monitoring chemical processes for early fault detection using multivariate data analysis methods by Dr Frank Westad, Chief Scientific Officer, CAMO Software Makers of CAMO 02 Monitoring

More information

Univariate Regression

Univariate Regression Univariate Regression Correlation and Regression The regression line summarizes the linear relationship between 2 variables Correlation coefficient, r, measures strength of relationship: the closer r is

More information

Data Mining and Visualization

Data Mining and Visualization Data Mining and Visualization Jeremy Walton NAG Ltd, Oxford Overview Data mining components Functionality Example application Quality control Visualization Use of 3D Example application Market research

More information

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

More information

Data Mining Analysis of HIV-1 Protease Crystal Structures

Data Mining Analysis of HIV-1 Protease Crystal Structures Data Mining Analysis of HIV-1 Protease Crystal Structures Gene M. Ko, A. Srinivas Reddy, Sunil Kumar, and Rajni Garg AP0907 09 Data Mining Analysis of HIV-1 Protease Crystal Structures Gene M. Ko 1, A.

More information

Fuzzy Modeling of Labeled Point Cloud Superposition for the Comparison of Protein Binding Sites

Fuzzy Modeling of Labeled Point Cloud Superposition for the Comparison of Protein Binding Sites Fuzzy Modeling of Labeled Point Cloud Superposition for the Comparison of Protein Binding Sites Thomas Fober Eyke Hüllermeier Knowledge Engineering & Bioinformatics Group Mathematics and Computer Science

More information

Lecture 10: Regression Trees

Lecture 10: Regression Trees Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

Cheminformatics and its Role in the Modern Drug Discovery Process

Cheminformatics and its Role in the Modern Drug Discovery Process Cheminformatics and its Role in the Modern Drug Discovery Process Novartis Institutes for BioMedical Research Basel, Switzerland With thanks to my colleagues: J. Mühlbacher, B. Rohde, A. Schuffenhauer

More information

LUCKY AHMED Department of Chemistry and Biochemistry Yale University, New Haven, CT 06511 Email: lucky.ahmed@yale.edu

LUCKY AHMED Department of Chemistry and Biochemistry Yale University, New Haven, CT 06511 Email: lucky.ahmed@yale.edu LUCKY AHMED Department of Chemistry and Biochemistry Yale University, New Haven, CT 06511 Email: lucky.ahmed@yale.edu EDUCATION PhD in Computational Chemistry Spring- Dissertation Title: Computational

More information

Data Mining Methods: Applications for Institutional Research

Data Mining Methods: Applications for Institutional Research Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014

More information

Combinatorial Chemistry and solid phase synthesis seminar and laboratory course

Combinatorial Chemistry and solid phase synthesis seminar and laboratory course Combinatorial Chemistry and solid phase synthesis seminar and laboratory course Topic 1: Principles of combinatorial chemistry 1. Introduction: Why Combinatorial Chemistry? Until recently, a common drug

More information

The INFUSIS Project Data and Text Mining for In Silico Modeling

The INFUSIS Project Data and Text Mining for In Silico Modeling The INFUSIS Project Data and Text Mining for In Silico Modeling Henrik Boström 1,2, Ulf Norinder 3, Ulf Johansson 4, Cecilia Sönströd 4, Tuve Löfström 4, Elzbieta Dura 5, Ola Engkvist 6, Sorel Muresan

More information

MTH 140 Statistics Videos

MTH 140 Statistics Videos MTH 140 Statistics Videos Chapter 1 Picturing Distributions with Graphs Individuals and Variables Categorical Variables: Pie Charts and Bar Graphs Categorical Variables: Pie Charts and Bar Graphs Quantitative

More information

Chapter 7: Simple linear regression Learning Objectives

Chapter 7: Simple linear regression Learning Objectives Chapter 7: Simple linear regression Learning Objectives Reading: Section 7.1 of OpenIntro Statistics Video: Correlation vs. causation, YouTube (2:19) Video: Intro to Linear Regression, YouTube (5:18) -

More information

E(y i ) = x T i β. yield of the refined product as a percentage of crude specific gravity vapour pressure ASTM 10% point ASTM end point in degrees F

E(y i ) = x T i β. yield of the refined product as a percentage of crude specific gravity vapour pressure ASTM 10% point ASTM end point in degrees F Random and Mixed Effects Models (Ch. 10) Random effects models are very useful when the observations are sampled in a highly structured way. The basic idea is that the error associated with any linear,

More information

Decision Trees from large Databases: SLIQ

Decision Trees from large Databases: SLIQ Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values

More information

Handling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza

Handling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza Handling missing data in large data sets Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza The problem Often in official statistics we have large data sets with many variables and

More information

Real-time PCR: Understanding C t

Real-time PCR: Understanding C t APPLICATION NOTE Real-Time PCR Real-time PCR: Understanding C t Real-time PCR, also called quantitative PCR or qpcr, can provide a simple and elegant method for determining the amount of a target sequence

More information

Virtual Met Mast verification report:

Virtual Met Mast verification report: Virtual Met Mast verification report: June 2013 1 Authors: Alasdair Skea Karen Walter Dr Clive Wilson Leo Hume-Wright 2 Table of contents Executive summary... 4 1. Introduction... 6 2. Verification process...

More information

An Overview and Evaluation of Decision Tree Methodology

An Overview and Evaluation of Decision Tree Methodology An Overview and Evaluation of Decision Tree Methodology ASA Quality and Productivity Conference Terri Moore Motorola Austin, TX terri.moore@motorola.com Carole Jesse Cargill, Inc. Wayzata, MN carole_jesse@cargill.com

More information

Atomic Masses. Chapter 3. Stoichiometry. Chemical Stoichiometry. Mass and Moles of a Substance. Average Atomic Mass

Atomic Masses. Chapter 3. Stoichiometry. Chemical Stoichiometry. Mass and Moles of a Substance. Average Atomic Mass Atomic Masses Chapter 3 Stoichiometry 1 atomic mass unit (amu) = 1/12 of the mass of a 12 C atom so one 12 C atom has a mass of 12 amu (exact number). From mass spectrometry: 13 C/ 12 C = 1.0836129 amu

More information

Simple linear regression

Simple linear regression Simple linear regression Introduction Simple linear regression is a statistical method for obtaining a formula to predict values of one variable from another where there is a causal relationship between

More information

4.2 Bias, Standards and Standardization

4.2 Bias, Standards and Standardization 4.2 Bias, Standards and Standardization bias and accuracy, estimation of bias origin of bias and the uncertainty in reference values quantifying by mass, chemical reactions, and physical methods standard

More information

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore. CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

More information

Validation of measurement procedures
