Computational Drug Repositioning by Ranking and Integrating Multiple Data Sources

1 Computational Drug Repositioning by Ranking and Integrating Multiple Data Sources: Ping Zhang (IBM T. J. Watson Research Center), Pankaj Agarwal (GlaxoSmithKline), Zoran Obradovic (Temple University)

2 Terms and ideas: Drug, Chemical Compound; Drug Targets, Target Proteins, Off Targets; Indicated Effects, Side Effects; Drug Indication, Indicated Diseases

3 Timescale: drug discovery and development

4 Why drugs fail...

5 Drug repositioning: Drug repositioning (also known as drug repurposing, drug re-profiling, therapeutic switching, and drug re-tasking) is the application of known drugs and compounds to new indications (i.e., new diseases).

6 Shorter timelines & less risk

7 Computational drug repositioning: If two drugs d_x and d_y are found to be similar, and d_y is used to treat disease s, then d_x is a repositioning candidate for disease s. Chemical properties: [Keiser et al., Nature 2009], [Swamidass, Brief Bioinform 2011]. Biological properties: [Li et al., PLoS Comput Biol 2009], [Kotelnikova et al., JBCB 2010]. Phenotypic properties: [Campillos et al., Science 2008], [Hu and Agarwal, PLoS One 2009], [Yang and Agarwal, PLoS One 2011]. Integrate multiple drug data sources for better solutions.

8 Computing similarity of drug chemical structures: Collected 1007 approved small-molecule drugs from DrugBank with their chemical structure information. Used the Chemistry Development Kit (CDK) to encode each drug as an 881-dimensional binary substructure vector, as defined in PubChem. Tanimoto similarity: the number of substructures two molecules have in common, divided by the number of substructures present in either molecule.
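
The Tanimoto similarity described on this slide can be sketched in a few lines. This is a minimal illustration on toy 8-bit fingerprints rather than real 881-bit PubChem vectors; the function name `tanimoto` and the convention of returning 0.0 for two all-zero fingerprints are assumptions, not taken from the slides.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two equal-length binary fingerprints."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)    # bits set in both
    either = sum(1 for x, y in zip(a, b) if x or y)   # bits set in at least one
    # Assumption: two empty fingerprints are treated as having similarity 0.
    return both / either if either else 0.0


# Toy fingerprints (hypothetical, far shorter than the 881-bit vectors).
fp_x = [1, 1, 0, 1, 0, 0, 1, 0]
fp_y = [1, 0, 0, 1, 1, 0, 1, 0]
print(tanimoto(fp_x, fp_y))  # 3 shared bits / 5 bits set in either -> 0.6
```

The same function applies unchanged to the binary side-effect profiles, since those are also compared with Tanimoto similarity.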

9 Computing similarity of drug protein targets: Mapped DrugBank target information to UniProt and extracted 3152 relationships between 1007 drugs and 775 proteins.
$$\mathrm{sim}^{\mathrm{target}}(d_x, d_y) = \frac{1}{|P(d_x)|\,|P(d_y)|} \sum_{i=1}^{|P(d_x)|} \sum_{j=1}^{|P(d_y)|} g\big(P_i(d_x), P_j(d_y)\big)$$
where, for a drug d, P(d) is its set of target proteins, |P(d)| is the size of that set, and P_i(d) is its i-th member. The sequence similarity function g of two proteins is calculated as a Smith-Waterman sequence alignment score.
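
The target-based similarity averages a pairwise sequence score over all target pairs of the two drugs. The sketch below illustrates that averaging with a toy Smith-Waterman scorer. The match, mismatch, and gap values are assumptions (the slides do not give the scoring scheme or any normalization of the raw alignment score), and the sequences are made up.

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-1):
    """Best local alignment score with a linear gap penalty (toy scoring)."""
    prev = [0] * (len(t) + 1)
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def target_similarity(targets_x, targets_y, g=smith_waterman):
    """Mean of g over all pairs of target sequences of the two drugs."""
    if not targets_x or not targets_y:
        return 0.0  # assumption: drugs without known targets get similarity 0
    total = sum(g(p, q) for p in targets_x for q in targets_y)
    return total / (len(targets_x) * len(targets_y))


# Hypothetical short sequences standing in for protein targets.
print(target_similarity(["ACGT"], ["ACGT", "TTTT"]))  # (8 + 2) / 2 -> 5.0
```

Passing `g` as a parameter keeps the averaging step separate from the alignment scorer, so a library implementation with a substitution matrix could be swapped in.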

10 Computing similarity of drug side-effect profiles: Side-effect keywords were obtained from the SIDER database (information from drugs' package inserts). Each drug was represented by a 1385-dimensional binary side-effect profile whose elements encode the presence (1) or absence (0) of each side-effect keyword. Tanimoto similarity then measures side-effect similarity. Obtained 40974 relationships between 613 drugs and 1385 side effects. 394 drugs from the DrugBank approved list could NOT be mapped to SIDER drug names; their missing side-effect profiles were imputed from chemical structure information, using a method similar to [Pauwels et al., BMC Bioinformatics 2011].
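
The slide does not spell out the imputation procedure. As one plausible illustration only, the sketch below fills in a missing side-effect profile as a similarity-weighted average of the profiles of the chemically most similar drugs. The weighting scheme, the choice of k, the set-based fingerprints, and all names are assumptions, and this is not necessarily the method of Pauwels et al. or of the authors.

```python
def tanimoto_sets(a, b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bit indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def impute_profile(query_fp, known, k=2):
    """Estimate a side-effect profile from the k chemically nearest known drugs.

    known: list of (fingerprint_set, side_effect_profile) pairs.
    Returns soft scores in [0, 1], one per side effect.
    """
    if not known:
        return []
    ranked = sorted(known, key=lambda item: tanimoto_sets(query_fp, item[0]),
                    reverse=True)[:k]
    weights = [tanimoto_sets(query_fp, fp) for fp, _ in ranked]
    total = sum(weights)
    n = len(ranked[0][1])
    if total == 0:
        return [0.0] * n  # no chemically similar drug: predict nothing
    return [sum(w * prof[j] for w, (_, prof) in zip(weights, ranked)) / total
            for j in range(n)]


# Hypothetical drugs: (substructure indices, 3-element side-effect profile).
known = [({1, 2, 3}, [1, 0, 1]), ({1, 2}, [1, 1, 0]), ({9}, [0, 0, 1])]
scores = impute_profile({1, 2, 3}, known, k=2)
```

The soft scores would still need a threshold to become a binary profile comparable with the SIDER-derived ones; that threshold is another choice the slides do not specify.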

11 Computing prediction score from a single data source: Obtained each drug's known use(s) from the National Drug File Reference Terminology (NDF-RT). Constructed a gold set of 3250 treatment relationships between 799 drugs and 719 diseases.
$$f^i(d_x, s) = \sum_{d_y \in N_k(d_x)} \mathrm{sim}^i(d_x, d_y) \cdot C\big(s \in \mathrm{indications}(d_y)\big)$$
where C is a characteristic function that returns 1 if d_y has disease indication s and 0 otherwise, and N_k(d_x) are the k nearest neighbors of drug d_x according to the metric sim^i, which is determined by the type of the i-th data source.
[Figure: query drug d_x and the neighborhood of d_x.]
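
The single-source score sums the similarities of those nearest neighbors that are already indicated for the disease. A minimal sketch on toy data follows; the drug and disease names are hypothetical, and excluding the query drug from its own neighborhood is an assumption the slide does not state.

```python
def knn_score(query, disease, sim, indications, k):
    """Sum of sim(query, d) over the k nearest neighbours d indicated for `disease`.

    sim: callable returning the similarity of two drugs for one data source.
    indications: dict mapping each drug to its set of known diseases.
    """
    # Assumption: the query drug is not its own neighbour.
    candidates = [d for d in indications if d != query]
    neighbours = sorted(candidates, key=lambda d: sim(query, d), reverse=True)[:k]
    return sum(sim(query, d) for d in neighbours if disease in indications[d])


# Toy similarity table and indications (hypothetical values).
toy_sim = {("dx", "a"): 0.9, ("dx", "b"): 0.5, ("dx", "c"): 0.1}


def sim(x, y):
    return toy_sim.get((x, y), toy_sim.get((y, x), 0.0))


indications = {"a": {"s1"}, "b": {"s2"}, "c": {"s1"}}
score_k2 = knn_score("dx", "s1", sim, indications, k=2)  # neighbours a, b
score_k3 = knn_score("dx", "s1", sim, indications, k=3)  # neighbours a, b, c
```

With k=2 only drug `a` contributes to disease `s1`; raising k to 3 also admits the weakly similar drug `c`, which shows how k trades coverage against noise.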

12 Combining multiple measures: A new drug repositioning framework, Similarity-based LArge-margin learning of Multiple Sources (SLAMS).

13 Large margin method: Given m scores for a drug-disease pair (d, s), we propose a large margin method that calculates the final score f_E as a weighted average of the individual scores. The weight vector w, used to integrate the m predictions, is found by solving an optimization problem.
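
The exact optimization problem did not survive in this transcript, so the sketch below swaps in a generic large-margin formulation: a pairwise ranking hinge loss with L2 regularization, minimized by plain subgradient descent. It is a stand-in meant to show how weights over m source scores can be learned so that known drug-disease pairs outscore unknown ones, not the SLAMS objective itself. All names, hyperparameters, and data are assumptions.

```python
def fit_ranking_weights(pos, neg, lam=0.01, lr=0.1, epochs=200):
    """Learn weights w so that w.p exceeds w.q by a margin for each positive p, negative q.

    pos, neg: lists of m-dimensional score vectors (one score per data source).
    Minimises (lam/2)*||w||^2 + mean over pairs of max(0, 1 - w.(p - q)).
    """
    m = len(pos[0])
    diffs = [[p[j] - q[j] for j in range(m)] for p in pos for q in neg]
    w = [0.0] * m
    for _ in range(epochs):
        grad = [lam * wj for wj in w]              # gradient of the regulariser
        for d in diffs:
            if sum(wj * dj for wj, dj in zip(w, d)) < 1.0:   # margin violated
                for j in range(m):
                    grad[j] -= d[j] / len(diffs)   # subgradient of the hinge term
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def combined_score(w, scores):
    """Weighted combination of the per-source scores for one drug-disease pair."""
    return sum(wj * sj for wj, sj in zip(w, scores))


# Toy data: source 0 is informative, source 1 is noise (hypothetical values).
pos = [[0.9, 0.5], [0.8, 0.4], [0.7, 0.6]]   # known drug-disease pairs
neg = [[0.2, 0.5], [0.1, 0.6], [0.3, 0.4]]   # pairs without a known association
w = fit_ranking_weights(pos, neg)
```

On this toy data the informative source ends up with the larger weight. The learned weights are not normalized; dividing by their sum would turn the combination into a weighted average in the strict sense, provided the weights are non-negative.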

14 Method comparison: PREDICT (Gottlieb et al., Mol. Syst. Biol. 2011): uses similarity measures as features and learns a logistic regression classifier to yield a classification score. Simple Average: assumes that each data source is equally informative and simply averages all k-NN prediction scores. SLAMS: the algorithm proposed in this study, which uses a large margin method to automatically weight and integrate multiple data sources.

15 Data source comparison Distribution of SLAMS weights for chemical, biological and phenotypic data sources.

16 Analysis of novel predictions: False-positive (FP) drug-disease associations are those predicted by our method but not present in the training set. Some FP associations may be false, but a few may be true and can be considered drug repositioning candidates in real-world drug discovery. Of 4066 drug-disease associations found in ClinicalTrials.gov (not included in the training set), our FP associations cover 21%. Our predictions therefore overlap with drug-disease associations tested in clinical trials, suggesting that the predicted drugs may be regarded as valuable repositioning candidates for further drug discovery research. All data sets and predicted drug-disease associations are available at
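
One way to read the 21% figure is as the fraction of the ClinicalTrials.gov associations that also appear among the FP predictions. Under that reading (an interpretation, not something the slide states explicitly), the coverage is a simple set computation; the drug and disease names below are made up.

```python
def coverage(predicted, reference):
    """Fraction of reference associations that also appear among the predictions."""
    if not reference:
        return 0.0
    return len(predicted & reference) / len(reference)


# Hypothetical (drug, disease) pairs.
predicted_fp = {("drugA", "RA"), ("drugB", "RA"), ("drugC", "asthma")}
trials = {("drugA", "RA"), ("drugC", "asthma"), ("drugD", "RA"), ("drugE", "gout")}
print(coverage(predicted_fp, trials))  # 2 of 4 trial associations -> 0.5
```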

17 Examples of FP predictions for Rheumatoid Arthritis

18 Conclusion: We proposed SLAMS, a new drug repositioning framework that integrates chemical, biological, and phenotypic properties. The method allows easy integration of additional drug information sources. It also ranks the drug information sources by their contribution to the prediction, paving the way for prioritizing data sources and building more reliable drug repositioning models.

19 Future work: integrate more information sources

20 Thank you! Questions? Ping Zhang, Pankaj Agarwal, Zoran Obradovic
