CÉLINE LE BAILLY DE TILLEGHEM. Institut de statistique Université catholique de Louvain Louvain-la-Neuve (Belgium)


1 STATISTICAL CONTRIBUTION TO THE VIRTUAL MULTICRITERIA OPTIMISATION OF COMBINATORIAL MOLECULES LIBRARIES AND TO THE VALIDATION AND APPLICATION OF QSAR MODELS
CÉLINE LE BAILLY DE TILLEGHEM
Institut de statistique, Université catholique de Louvain, Louvain-la-Neuve (Belgium)
Journée Jeunes Chercheurs - September 21st, 2007 p. 1/23

2 Context of the research
- Lead optimisation using combinatorial chemistry
- Diabetes library provided by Eli Lilly and Company: a combinatorial library composed of 3 R-groups and $47 \times 50 \times 47 = 110\,450$ compounds.
- Objective: select the most promising compounds
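The library size follows directly from the reagent counts. The short sketch below reproduces that arithmetic and, as an illustration only, counts how many distinct 5 × 5 × 5 sublibraries exist (the sublibrary size quoted in the problem description). The variable names are hypothetical and not taken from the talk.

```python
from math import comb

# Reagent counts per R-group, as stated for the diabetes library.
reagents_per_group = (47, 50, 47)

# Full combinatorial library: one compound per reagent triple.
library_size = 1
for n in reagents_per_group:
    library_size *= n

# Number of distinct 5 x 5 x 5 sublibraries: 5 reagents are chosen
# independently within each R-group (illustrative quantity only).
n_sublibraries = 1
for n in reagents_per_group:
    n_sublibraries *= comb(n, 5)

print(library_size)
print(n_sublibraries)
```

The second number is far too large for exhaustive enumeration, which is why a search heuristic is needed to screen the library.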

3 Proposed methodology
It gathers in a coherent framework existing and new tools of statistics and chemometrics, mainly:
- the development and validation of QSAR models to predict drugability properties,
- the definition of a desirability index to summarise those properties and the assessment of the propagation of QSAR model predictions, and
- an efficient algorithm to screen the combinatorial library and select the most promising compounds.

4 Problem description
Construction of the combinatorial library
- chemists divide the lead and select reagents to add on each part: $47 \times 50 \times 47 = 110\,450$ compounds
Definition of the objective
- select the best combinatorial sublibrary of size $n_1 \times n_2 \times n_3$ ($5 \times 5 \times 5$), or
- select the sublibrary with the $m$ best compounds ($m = 100$)
Definition of the optimised drugability properties (Y)
- min $Y_1$ = quantity of substance to inject around the receptor $R_1$ to have a binding
- min $Y_2$ = quantity of substance to inject around the receptor $R_2$ to have a binding
- max $Y_3$ = quantity of substance to inject around the receptor $R_3$ to have a binding
- max $Y_4$ = quantity of substance to inject around the receptor $R_4$ to have a binding

5 Problem description
Definition of the available chemical descriptors (x)
- descriptors are computed by dedicated software from the SMILES representation of each molecule
- groups of descriptors: describing the molecule as a whole (number of atoms, number of rings, molecular weight...), quantifying the overall charge distribution (total absolute charge, total positive charge, total negative charge...), measuring electrotopological properties, molecular surface properties, connectivity properties...
- More than 9000 molecular descriptors can be computed at Eli Lilly!

6 Proposed methodology:

7 QSAR models development
QSARs (Quantitative Structure-Activity Relationships) are mathematical models approximating the link between chemical properties (x) and biological activities (Y) of compounds.
Model assumptions
For each optimised response, different QSAR models are considered:
- Multiple Linear Regression (forward regression minimising BIC),
- PLS Regression (minimising the bias-corrected 10-fold CV estimate of the MSEP),
- binary Regression Tree + pruning (minimising a cost-complexity measure based on the RSS and the number of splits) + bagging
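To make the first model concrete, here is a minimal sketch of forward selection minimising BIC for a linear regression. It assumes Gaussian errors and the usual form $n \log(\mathrm{RSS}/n) + k \log n$ of the criterion; the function names (`bic`, `forward_bic`) and the simulated data are hypothetical and do not come from the talk, whose exact implementation is not described.

```python
import numpy as np

def bic(y, design):
    """BIC (up to an additive constant) of a least-squares fit."""
    n, k = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    return n * np.log(rss / n) + k * np.log(n)

def forward_bic(y, X):
    """Add one descriptor at a time while the BIC keeps decreasing."""
    n, p = X.shape
    selected = []
    design = np.ones((n, 1))          # start from the intercept-only model
    best = bic(y, design)
    while len(selected) < p:
        candidates = [j for j in range(p) if j not in selected]
        scores = [bic(y, np.column_stack([design, X[:, j]])) for j in candidates]
        i = int(np.argmin(scores))
        if scores[i] >= best:         # no candidate improves the BIC: stop
            break
        best = scores[i]
        selected.append(candidates[i])
        design = np.column_stack([design, X[:, candidates[i]]])
    return selected, best

# Simulated example (assumption): only descriptors 0 and 3 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(scale=0.5, size=200)
selected, best_bic = forward_bic(y, X)
print(selected)
```

With thousands of candidate descriptors, each step costs one least-squares fit per remaining descriptor, so a production version would update the fit incrementally rather than refit from scratch.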

8 QSAR models development
Data collection and models fit
- 4 training sets after pretreatment and cleaning of the collected data:

        Observed molecules   Available descriptors   Descriptors kept after cleaning
  Y1    558                  …                       …
  Y2    519                  3394                    1769
  Y3    985                  2030                    1964
  Y4    1037                 1914                    1844

- MLR, PLSR and RT are fitted on those 4 training sets, selecting the explanatory variables entered as explained before.

9 QSAR models development
Model selection and assessment - goodness-of-fit criteria

  MLR   N      K-1   $R^2$   $R^2_{adj}$   S       F-test p-value
  Y1    …      …     …       …             …       < …
  Y2    …      …     …       …             …       < …
  Y3    …      …     …       …             …392    $< 2.2 \times 10^{-16}$
  Y4    1037   60    0.836   0.825         0.548   $< 2.2 \times 10^{-16}$

  PLSR  N      K     $R^2_Y$   $R^2_X$   S
  Y1    558    54    83.28     100       0.…
  Y2    …
  Y3    …
  Y4    …

10 QSAR models development
Model selection and assessment - goodness-of-fit criteria

               RT, no pruning         RT, pruning           Bagging, no pruning   Bagging, pruning
        N      K     $R^2$   S        K    $R^2$   S        $R^2$   S             $R^2$   S
  Y1    …      …     …       …        …    …       …        …       …             …       …412
  Y2    519    95    0.950   0.218    15   0.621   0.513    0.896   0.332         0.763   0.445
  Y3    985    116   0.979   0.193    4    0.818   0.509    0.962   0.258         0.…     …
  Y4    …

11 QSAR models development
Model selection and assessment - Fitted vs observed
[Figure: fitted vs observed values for Y1, Y2, Y3 and Y4; one row of panels for MLR and one for PLSR.]

12 QSAR models development
Model selection and assessment - Fitted vs observed
[Figure: fitted vs observed values for Y1, Y2, Y3 and Y4; one row of panels for RT-no pruning and one for RT-pruning.]

13 QSAR models development
Model selection and assessment - Fitted vs observed
[Figure: fitted vs observed values for Y1, Y2, Y3 and Y4; one row of panels for RT-no pruning-bag and one for RT-pruning-bag.]

14 QSAR models development
Model selection and assessment
- Internal predictive power: $Q^2$ = cross-validated $R^2$

                            Y1      Y2      Y3      Y4
  RT bagging - no pruning   …       …       …       …
  RT bagging - pruning      …       …       …       …
  MLR                       …       …613    0.897   0.835
  PLSR                      0.542   0.359   0.804   0.727
  RT pruning                0.403   0.357   0.795   0.714
  RT no pruning             0.307   0.232   0.762   0.…

  MLR models are selected
- External validation if possible!
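As an illustration of the $Q^2$ criterion, the sketch below computes a cross-validated $R^2$ for a linear model as $1 - \mathrm{PRESS}/\mathrm{TSS}$. The slide does not state the cross-validation scheme behind its $Q^2$ values, so the 10-fold split, the function name `q2_kfold` and the simulated data are assumptions.

```python
import numpy as np

def q2_kfold(X, y, n_folds=10, seed=0):
    """Cross-validated R^2 = 1 - PRESS / TSS for a linear model with intercept."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    press = 0.0
    for test_idx in np.array_split(idx, n_folds):
        train_idx = np.setdiff1d(idx, test_idx)
        # Fit on the training folds only.
        A = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        beta, *_ = np.linalg.lstsq(A, y[train_idx], rcond=None)
        # Accumulate squared prediction errors on the held-out fold.
        B = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        press += float(np.sum((y[test_idx] - B @ beta) ** 2))
    tss = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - press / tss

# Simulated example (assumption): three informative descriptors.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(150, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=150)
q2 = q2_kfold(X_demo, y_demo)
print(round(q2, 3))
```

Unlike the fitted $R^2$, this quantity can drop sharply (or even become negative) for an overfitted model, which is why it is used to compare the candidate models.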

15 QSAR models development
Applicability domain
- Definition: the applicability domain is the set of molecules for which the QSAR model is valid.
- Computation: descriptor ranges, convex hull, leverages, other distance measurements (Euclidean, Mahalanobis or $L_1$ distance), the Hotelling $T^2$, density measurements...
[Figure: leverage vs observation number, one panel for each of Y1, Y2, Y3 and Y4.]
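Among the options listed, leverages are the easiest to sketch: for a molecule with descriptor vector $x$, $h = x^\top (X^\top X)^{-1} x$, where $X$ is the training design matrix. The example below is a minimal illustration; the cutoff $3p/n$ is a commonly quoted rule of thumb and an assumption here, since the slide does not give the threshold used, and the function name and data are hypothetical.

```python
import numpy as np

def leverages(X_train, X_new=None):
    """Leverages h = x' (X'X)^{-1} x, with an intercept column added."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    xtx_inv = np.linalg.pinv(A.T @ A)
    B = A if X_new is None else np.column_stack([np.ones(len(X_new)), X_new])
    # Row-wise quadratic form b_i' (X'X)^{-1} b_i.
    return np.einsum("ij,jk,ik->i", B, xtx_inv, B)

# Simulated training set (assumption): 100 molecules, 3 descriptors.
rng = np.random.default_rng(2)
X_train = rng.normal(size=(100, 3))
n, p = X_train.shape[0], X_train.shape[1] + 1   # p counts the intercept
threshold = 3 * p / n                            # assumed rule-of-thumb cutoff

h_train = leverages(X_train)
h_new = leverages(X_train, np.array([[0.0, 0.0, 0.0], [10.0, 10.0, 10.0]]))
inside_domain = h_new <= threshold
print(inside_domain)
```

A molecule flagged as outside the domain is one for which the model is extrapolating, so its prediction should be treated with caution.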

16 Proposed methodology:

17 Molecules optimisation
Definition of the optimised criterion (DF and DI)
- Multicriteria optimisation!
- Desirability Functions: $d_1(Y_1)$, $d_2(Y_2)$, $d_3(Y_3)$, $d_4(Y_4)$
[Figure: the four desirability functions $d_i(Y_i)$ plotted against $Y_1$, $Y_2$, $Y_3$ and $Y_4$.]
- Desirability Index of 1 molecule: $E[D(Y \mid x)] = E\left[\prod_{i=1}^{4} \left(d_i(Y_i \mid x)\right)^{1/4}\right]$
- Loss of a sublibrary with $m$ molecules: $\sum_{i=1}^{m} \left(1 - E[D(Y \mid x_i)]\right)^2 / m$
- The best sublibrary is the sublibrary with the smallest loss
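The two formulas can be sketched numerically as follows. The shapes of the desirability functions are not recoverable from the slide, so the linear ramps below are an assumption, as are the Gaussian predictive distribution used for the Monte Carlo expectation and all names and numeric bounds.

```python
import random

def ramp(y, low, high, maximise):
    """Assumed desirability function: linear between low and high, clipped to [0, 1]."""
    t = min(1.0, max(0.0, (y - low) / (high - low)))
    return t if maximise else 1.0 - t

def desirability(ys, specs):
    """Geometric mean of the individual desirabilities: prod_i d_i(y_i)^(1/k)."""
    prod = 1.0
    for y, (low, high, maximise) in zip(ys, specs):
        prod *= ramp(y, low, high, maximise)
    return prod ** (1.0 / len(specs))

def expected_desirability(means, sds, specs, n_draws=2000, seed=0):
    """Monte Carlo estimate of E[D(Y | x)] under an assumed Gaussian Y | x."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        ys = [rng.gauss(m, s) for m, s in zip(means, sds)]
        total += desirability(ys, specs)
    return total / n_draws

def sublibrary_loss(expected_ds):
    """Loss of a sublibrary: mean of (1 - E[D])^2 over its molecules."""
    return sum((1.0 - d) ** 2 for d in expected_ds) / len(expected_ds)

# Hypothetical bounds: Y1, Y2 are minimised, Y3, Y4 are maximised.
specs = [(4.0, 10.0, False), (4.0, 10.0, False), (4.0, 10.0, True), (4.0, 10.0, True)]
e_d = expected_desirability([5.0, 6.0, 8.0, 9.0], [0.4, 0.4, 0.4, 0.4], specs)
print(round(e_d, 3), sublibrary_loss([e_d, 1.0]))
```

Because the index is a geometric mean, a molecule that is unacceptable on any single property ($d_i = 0$) gets an overall desirability of zero, whatever its other properties.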

18 Molecules optimisation
WEALD
- WEALD (Weighted Exchanges Algorithm for Library Design) is an efficient algorithm to screen combinatorial libraries of molecules
- Principle: select a sublibrary at random and perform exchanges between reagents to decrease the loss
- Application of WEALD to select the 100 best compounds in the diabetes library: by exploring 4729 molecules (only 4.28% of the whole library), WEALD selects 100 compounds that are within the 105 best compounds of the library
[Figure: loss vs number of explored molecules.]
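The exchange principle can be illustrated with a plain random-exchange hill climb. This is not WEALD itself: the weighting scheme that gives the algorithm its name is not described on the slide, so the sketch below picks exchanges uniformly at random, and the toy scoring function, pool sizes and function names are all hypothetical.

```python
import random
from itertools import product

def grid_loss(reagent_sets, score):
    """Mean of (1 - score)^2 over all compounds of the combinatorial sublibrary."""
    cells = list(product(*reagent_sets))
    return sum((1.0 - score(c)) ** 2 for c in cells) / len(cells)

def exchange_search(pool_sizes, sub_sizes, score, n_iter=2000, seed=0):
    """Start from a random sublibrary and accept reagent swaps that lower the loss."""
    rng = random.Random(seed)
    current = [rng.sample(range(n), k) for n, k in zip(pool_sizes, sub_sizes)]
    history = [grid_loss(current, score)]
    for _ in range(n_iter):
        g = rng.randrange(len(current))            # R-group to modify
        pos = rng.randrange(len(current[g]))       # reagent to swap out
        outside = [r for r in range(pool_sizes[g]) if r not in current[g]]
        if not outside:
            continue
        trial = [list(s) for s in current]
        trial[g][pos] = rng.choice(outside)        # reagent to swap in
        loss = grid_loss(trial, score)
        if loss < history[-1]:                     # keep improving exchanges only
            current = trial
            history.append(loss)
    return current, history

# Toy desirability (assumption): lower reagent indices are better.
def toy_score(compound):
    return 1.0 - sum(compound) / 29.0

chosen, history = exchange_search((10, 12, 10), (3, 3, 3), toy_score)
print(chosen, round(history[-1], 4))
```

In a real application each call to `score` would require QSAR predictions for a compound, so the number of distinct compounds evaluated (4729 in the result quoted above) is the natural measure of cost.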

19 Molecules optimisation
Uncertainty analysis
- For all molecules explored by WEALD, drugability properties are estimated by the fitted QSAR models.
- For each explored molecule, check whether it lies in the applicability domains of the QSARs.
- Among the 4729 explored molecules, 1948 molecules (more than 41%) are outside at least one applicability domain.
=> QSAR models are often extrapolating!

20 Molecules optimisation
Uncertainty analysis
- For a given molecule with descriptors $x_0$, the desirability index is estimated: $\hat{E}[D(Y \mid x_0)]$.
- Construct a confidence interval for $E[D(Y \mid x_0)]$.
- For the 4729 explored molecules, the average CI length is 0.12 but may vary from 0.04 to nearly 1!
=> Desirability indexes cannot be compared as if they were exact!

21 Molecules optimisation
Uncertainty analysis
- As the desirability indexes are estimated, some molecules are not significantly worse than the optimal one (Indistinguishable Optimal Zone).
- For any explored molecule with descriptors $x$, test $H_0: E[D(Y \mid x)] \geq E[D(Y \mid x_{opt})]$ against $H_1: E[D(Y \mid x)] < E[D(Y \mid x_{opt})]$.
- Among the 4729 explored molecules, 230 molecules are not significantly worse than the optimal one.
=> Desirability indexes of two molecules are compared taking the prediction error of the QSAR models into account.

22 Molecules optimisation
Uncertainty analysis
[Figure: estimated desirability index with confidence intervals for the TOP 100 molecules (axis label as extracted: "D^ M(x)"). Two marker types distinguish molecules outside at least one applicability domain from molecules included in all applicability domains. Green CI for $E[D(Y \mid x)]$: molecules equivalent to the optimal one. Black CI for $E[D(Y \mid x)]$: molecules significantly worse than the optimal one.]

23 Conclusion
Integrated methodology to virtually screen combinatorial molecules libraries
- QSAR models development
- Desirability index
- WEALD
QSAR models should be validated
- Goodness-of-fit
- Internal and external predictivity
- Applicability domain
The uncertainty of the desirability indexes should be quantified
- Confidence interval
- Indistinguishable Optimal Zone
