Tamoxifen Resistance in Breast Cancer

Size: px
Start display at page:

Download "Tamoxifen Resistance in Breast Cancer"

Transcription

1 Gene Expression Analysis : Tamoxifen Resistance in Breast Cancer B. Haibe-Kains 1,2 G. Bontempi 2 C. Sotiriou 1 1 Unité Microarray, Institut Jules Bordet 2 Machine Learning Group, Université Libre de Bruxelles September 29, 2006 Benjamin Haibe-Kains (ULB) IRIDIA Microarray Meeting September 29, / 21

2 Biological Question and Datasets Biological question : Can we predict which patients will resist to the Tamoxifen treatment in an adjuvant setting? Tamoxifen treated patients coming from 3 different institutions, can we pool the data? gene-expressions : normalization survival data : model fitting. 255 eligible patients (samples) probes (variables) Benjamin Haibe-Kains (ULB) IRIDIA Microarray Meeting September 29, / 21

3 Survival Data Tamoxifen treated patients survival curves Tamoxifen treated patients hazard functions Probability of survival OXFT KIT GUYT 99 patients (19 events) 69 patients (20 events) 87 patients (28 events) Hazard Rate OXFT KIT GUYT 99 patients (19 events) 69 patients (20 events) 87 patients (28 events) Time Time Benjamin Haibe-Kains (ULB) IRIDIA Microarray Meeting September 29, / 21

4 Tamoxifen Analysis Raw gene expession data cross-validation Normalization Data transformation Feature selection Model fitting In order to keep the design simple, we fix : the number of models to aggregate (see feature selection step) the cutoff selection (see the model fitting step). Performance assessment Benjamin Haibe-Kains (ULB) IRIDIA Microarray Meeting September 29, / 21

5 Normalization Goal : reduce the systematic variability between samples. Method : normalization. Procedure 1 Background correction, expression quantification and normalization were performed using Robust Multichip Average [Irizarry et al., 2003, Bolstad et al., 2003]. 2 RMA performed separately per population 3 Gene median centering per population. No artifact highlighted by unsupervised clustering. Benjamin Haibe-Kains (ULB) IRIDIA Microarray Meeting September 29, / 21

6 Data Transformation Goal : reduce the dimensionality of the problem. Method : cluster highly correlated variables. Procedure 1 Use of an independent dataset of untreated patients (137 patients, probes) 2 Filtering of the less variant probes. 3 Hierarchical clustering (average linkage, uncentered Pearson correlation). 4 Cut the tree at height Keep only clusters with at least 5 known UniGenes. 110 clusters where half are significantly associated with survival (on the Tamoxifen dataset). Benjamin Haibe-Kains (ULB) IRIDIA Microarray Meeting September 29, / 21

7 Data Transformation Feature Pairwise Correlation Histogram of pairwize correlations Frequency pairwize correlation mean: median: std: range: [ 0.651,1] Benjamin Haibe-Kains (ULB) IRIDIA Microarray Meeting September 29, / 21

8 Feature Selection Goal : fast selection of the relevant features. Method : ranking based on univariate model significance. Procedure 1 For each feature, compute the likelihood ratio test of the univariate Cox model. 2 Perform the ranking based on the p-value. 3 Select the top n features (n is fixed). n features are selected. Benjamin Haibe-Kains (ULB) IRIDIA Microarray Meeting September 29, / 21

9 Model Fitting Goal : fit a simple model with low variance. Method : aggregation of n univariate classifiers. Procedure 1 The univariate models for the n top features were computed during the previous step. 2 Compute a linear combination of these classifiers with weight of 1. Continuous score representing the risk of a patient. Benjamin Haibe-Kains (ULB) IRIDIA Microarray Meeting September 29, / 21

10 Performance Assessment We do binary classification from survival data. We can not use traditional statistics (sensitivity, specificity, χ 2 test,... ) Adaptation of such estimators to deal with censoring. Benjamin Haibe-Kains (ULB) IRIDIA Microarray Meeting September 29, / 21

11 Survival Statistics for 2 Groups There exist several ways to assess difference in survival between 2 groups Kaplan-Meier estimator and Logrank test : KM method estimates survivor function such that # death at time t j Ŝ(t) = {}}{ d j [1 ]. n j j:t j t }{{} # at risk at time t j logrank method tests H 0 : S 1 (t) = S 2 (t) t 0. Hazard ratio (HR) : relative hazard between 2 groups using Cox model with one dummy variable (G = 0/1 for low and high-risk groups). Time-dependent ROC Curves and area under ROC curves : extension of traditional ROC curves dealing with censoring data. Benjamin Haibe-Kains (ULB) IRIDIA Microarray Meeting September 29, / 21

12 Performance in LOO Logrank and Time-Dependent ROC Curve Survival curves COXSM13 (LOO classification) Tamoxifen treated patients time dependent ROC curves (5 yrs) Probability of survival logrank = 1.041e 07 loglik = e 06 low risk 179(33) high risk 76(34) sensitivity COXSM13 (AUC=0.767) Time 1 specificity Benjamin Haibe-Kains (ULB) IRIDIA Microarray Meeting September 29, / 21

13 Performance Hazard Ratio and Logrank Training set Test set Hazard ratio Log-rank p OXFT (99/19) KIT/GUYT (156/48) 2.17 [1.2,3.91] 8.46e-3 KIT (69/20) OXFT/GUYT (186/47) 4.07 [2.23,7.41] 7.98e-7 GUYT (87/28) OXFT/KIT (168/39) 5.93 [3,11.75] 1.24e-9 KIT/GUYT (156/48) OXFT (99/19) [5.38,39.52] 1.74e-11 OXFT/GUYT (186/47) KIT (69/20) 3.44 [1.36,8.67] 5.27e-3 OXFT/KIT (168/39) GUYT (87/28) 2.23 [1.05,4.71] 3.11e-2 Leave-one-out c.v [2.32;6.41] 1.04e-7 Multiple 10-fold c.v [2.66,3.84] 9.04e-7 Benjamin Haibe-Kains (ULB) IRIDIA Microarray Meeting September 29, / 21

14 Performance wrt Signature Size Hazard ratio (CI) wrt signature size 10FOLD CV Logrank p value wrt signature size 10FOLD CV log2 hazard ratio log10 logrank p value signature.size signature.size Benjamin Haibe-Kains (ULB) IRIDIA Microarray Meeting September 29, / 21

15 Performance vs Random At level of performance estimation : 1000 random permutations of the labels and perform the whole procedure only 1% of such classifications gives better discrimnation. At level of feature selection : random selection of n features most of the feature selection result in good classifiers because of the high proportion of relevant features. use of the anti -ranking very poor performance. Benjamin Haibe-Kains (ULB) IRIDIA Microarray Meeting September 29, / 21

16 Feature Selection Stability Signature stabilit COXSM13 (10FOLDCV) pclust.79 pclust.148 pclust.120 pclust.112 pclust.375 pclust.201 pclust.859 pclust.521 pclust.784 pclust.360 pclust.231 pclust.110 pclust.337 pclust.971 pclust.13 pclust.1008 pclust.606 pclust.289 pclust.435 pclust.88 pclust.34 pclust.425 pclust.752 pclust.851 pclust.714 pclust.26 pclust.207 pclust.350 pclust.968 pclust.692 pclust.709 pclust.1318 pclust.17 pclust.399 pclust.252 pclust.188 pclust.149 pclust.43 pclust.146 pclust.62 pclust.177 pclust.667 pclust.462 pclust.181 Over the multiple 10-fold cross-validations, several top n features are selected If the same set of features are selected every time, we see high relative frequency for these features relative frequency Benjamin Haibe-Kains (ULB) IRIDIA Microarray Meeting September 29, / 21

17 Feature Selection Stability Signature stabilit COXSM13 (10FOLDCV) pclust.79 pclust.148 pclust.120 pclust.112 pclust.375 pclust.201 pclust.859 pclust.521 pclust.784 pclust.360 pclust.231 pclust.110 pclust.337 pclust.971 pclust.13 pclust.1008 pclust.606 pclust.289 pclust.435 pclust.88 pclust.34 pclust.425 pclust.752 pclust.851 pclust.714 pclust.26 pclust.207 pclust.350 pclust.968 pclust.692 pclust.709 pclust.1318 pclust.17 pclust.399 pclust.252 pclust.188 pclust.149 pclust.43 pclust.146 pclust.62 pclust.177 pclust.667 pclust.462 pclust.181 partial ranking area Over the multiple 10-fold cross-validations, several top n features are selected If the same set of features are selected every time, we see high relative frequency for these features. We can compute the area under partial ranking w.r.t. the number of selected features relative frequency Benjamin Haibe-Kains (ULB) IRIDIA Microarray Meeting September 29, / 21

18 Stability wrt Signature Size Partial ranking stability wrt signature size 10FOLD CV Area Under Partial Ranking n = 13 seems to be a good trade-off between the number of selected features (signature) and its stability Signature Size Benjamin Haibe-Kains (ULB) IRIDIA Microarray Meeting September 29, / 21

19 Final Model Use the same method with all the samples. The result is a model with a set of features. The model and the features are expected from previous results. pclust.120 pclust.360 pclust.201 pclust.148 pclust.521 pclust.784 pclust.375 pclust.231 pclust.110 pclust.337 pclust.112 pclust.79 pclust.859 pclust.120 pclust.360 pclust.201 pclust.148 pclust.521 pclust.784 pclust.375 pclust.231 pclust.110 pclust.337 pclust.112 pclust.79 pclust.859 Benjamin Haibe-Kains (ULB) IRIDIA Microarray Meeting September 29, / 21

20 Future Works Study the stability of the initial clustering (data transformation step). Use of Gene Ontology to study the signature in a biological point of view. Objective comparison with the traditional clinical variables. Benjamin Haibe-Kains (ULB) IRIDIA Microarray Meeting September 29, / 21

21 Current Research Interests Meta-analysis. Ranking statistics. Input space transformation (using biological knowledge, such that gene list enrichment or GO). Optimization framework for binary classification of survival data. Benjamin Haibe-Kains (ULB) IRIDIA Microarray Meeting September 29, / 21

22 Thank you for your attention. Benjamin Haibe-Kains (ULB) IRIDIA Microarray Meeting September 29, / 21

23 Bolstad, B. M., Irizarry, R. A., Astrand, M., and Speed, T. P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2): Irizarry, R. A., Boldstad, B. M., Collin, F., Cope, L. M., Hobbs, B., and Speed, T. R. (2003). Summaries of affymetrix genechip probe level data. Nucleic Acids Research, 31(4). Benjamin Haibe-Kains (ULB) IRIDIA Microarray Meeting September 29, / 21

Predictive Gene Signature Selection for Adjuvant Chemotherapy in Non-Small Cell Lung Cancer Patients

Predictive Gene Signature Selection for Adjuvant Chemotherapy in Non-Small Cell Lung Cancer Patients Predictive Gene Signature Selection for Adjuvant Chemotherapy in Non-Small Cell Lung Cancer Patients by Li Liu A practicum report submitted to the Department of Public Health Sciences in conformity with

More information

Statistics Graduate Courses

Statistics Graduate Courses Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

More information

Personalized Predictive Medicine and Genomic Clinical Trials

Personalized Predictive Medicine and Genomic Clinical Trials Personalized Predictive Medicine and Genomic Clinical Trials Richard Simon, D.Sc. Chief, Biometric Research Branch National Cancer Institute http://brb.nci.nih.gov brb.nci.nih.gov Powerpoint presentations

More information

Gene Expression Analysis

Gene Expression Analysis Gene Expression Analysis Jie Peng Department of Statistics University of California, Davis May 2012 RNA expression technologies High-throughput technologies to measure the expression levels of thousands

More information

Tips for surviving the analysis of survival data. Philip Twumasi-Ankrah, PhD

Tips for surviving the analysis of survival data. Philip Twumasi-Ankrah, PhD Tips for surviving the analysis of survival data Philip Twumasi-Ankrah, PhD Big picture In medical research and many other areas of research, we often confront continuous, ordinal or dichotomous outcomes

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

Frozen Robust Multi-Array Analysis and the Gene Expression Barcode

Frozen Robust Multi-Array Analysis and the Gene Expression Barcode Frozen Robust Multi-Array Analysis and the Gene Expression Barcode Matthew N. McCall October 13, 2015 Contents 1 Frozen Robust Multiarray Analysis (frma) 2 1.1 From CEL files to expression estimates...................

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Comparative genomic hybridization Because arrays are more than just a tool for expression analysis

Comparative genomic hybridization Because arrays are more than just a tool for expression analysis Microarray Data Analysis Workshop MedVetNet Workshop, DTU 2008 Comparative genomic hybridization Because arrays are more than just a tool for expression analysis Carsten Friis ( with several slides from

More information

Step-by-Step Guide to Basic Expression Analysis and Normalization

Step-by-Step Guide to Basic Expression Analysis and Normalization Step-by-Step Guide to Basic Expression Analysis and Normalization Page 1 Introduction This document shows you how to perform a basic analysis and normalization of your data. A full review of this document

More information

Statistical issues in the analysis of microarray data

Statistical issues in the analysis of microarray data Statistical issues in the analysis of microarray data Daniel Gerhard Institute of Biostatistics Leibniz University of Hannover ESNATS Summerschool, Zermatt D. Gerhard (LUH) Analysis of microarray data

More information

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not. Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

More information

Building risk prediction models - with a focus on Genome-Wide Association Studies. Charles Kooperberg

Building risk prediction models - with a focus on Genome-Wide Association Studies. Charles Kooperberg Building risk prediction models - with a focus on Genome-Wide Association Studies Risk prediction models Based on data: (D i, X i1,..., X ip ) i = 1,..., n we like to fit a model P(D = 1 X 1,..., X p )

More information

Classification of Bad Accounts in Credit Card Industry

Classification of Bad Accounts in Credit Card Industry Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition

More information

A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Variance and Bias

A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Variance and Bias A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Variance and Bias B. M. Bolstad, R. A. Irizarry 2, M. Astrand 3 and T. P. Speed 4, 5 Group in Biostatistics, University

More information

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction to Machine Learning Lecture 1 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction Logistics Prerequisites: basics concepts needed in probability and statistics

More information

Quality Assessment of Exon and Gene Arrays

Quality Assessment of Exon and Gene Arrays Quality Assessment of Exon and Gene Arrays I. Introduction In this white paper we describe some quality assessment procedures that are computed from CEL files from Whole Transcript (WT) based arrays such

More information

Performance Metrics for Graph Mining Tasks

Performance Metrics for Graph Mining Tasks Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical

More information

Survival Analysis of Dental Implants. Abstracts

Survival Analysis of Dental Implants. Abstracts Survival Analysis of Dental Implants Andrew Kai-Ming Kwan 1,4, Dr. Fu Lee Wang 2, and Dr. Tak-Kun Chow 3 1 Census and Statistics Department, Hong Kong, China 2 Caritas Institute of Higher Education, Hong

More information

Distances, Clustering, and Classification. Heatmaps

Distances, Clustering, and Classification. Heatmaps Distances, Clustering, and Classification Heatmaps 1 Distance Clustering organizes things that are close into groups What does it mean for two genes to be close? What does it mean for two samples to be

More information

Interpretation of Somers D under four simple models

Interpretation of Somers D under four simple models Interpretation of Somers D under four simple models Roger B. Newson 03 September, 04 Introduction Somers D is an ordinal measure of association introduced by Somers (96)[9]. It can be defined in terms

More information

Adequacy of Biomath. Models. Empirical Modeling Tools. Bayesian Modeling. Model Uncertainty / Selection

Adequacy of Biomath. Models. Empirical Modeling Tools. Bayesian Modeling. Model Uncertainty / Selection Directions in Statistical Methodology for Multivariable Predictive Modeling Frank E Harrell Jr University of Virginia Seattle WA 19May98 Overview of Modeling Process Model selection Regression shape Diagnostics

More information

Factors for success in big data science

Factors for success in big data science Factors for success in big data science Damjan Vukcevic Data Science Murdoch Childrens Research Institute 16 October 2014 Big Data Reading Group (Department of Mathematics & Statistics, University of Melbourne)

More information

MHI3000 Big Data Analytics for Health Care Final Project Report

MHI3000 Big Data Analytics for Health Care Final Project Report MHI3000 Big Data Analytics for Health Care Final Project Report Zhongtian Fred Qiu (1002274530) http://gallery.azureml.net/details/81ddb2ab137046d4925584b5095ec7aa 1. Data pre-processing The data given

More information

Survey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln. Log-Rank Test for More Than Two Groups

Survey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln. Log-Rank Test for More Than Two Groups Survey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln Log-Rank Test for More Than Two Groups Prepared by Harlan Sayles (SRAM) Revised by Julia Soulakova (Statistics)

More information

Data Analysis, Research Study Design and the IRB

Data Analysis, Research Study Design and the IRB Minding the p-values p and Quartiles: Data Analysis, Research Study Design and the IRB Don Allensworth-Davies, MSc Research Manager, Data Coordinating Center Boston University School of Public Health IRB

More information

Supervised and unsupervised learning - 1

Supervised and unsupervised learning - 1 Chapter 3 Supervised and unsupervised learning - 1 3.1 Introduction The science of learning plays a key role in the field of statistics, data mining, artificial intelligence, intersecting with areas in

More information

Using multiple models: Bagging, Boosting, Ensembles, Forests

Using multiple models: Bagging, Boosting, Ensembles, Forests Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Supervised Feature Selection & Unsupervised Dimensionality Reduction

Supervised Feature Selection & Unsupervised Dimensionality Reduction Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or

More information

Evaluation & Validation: Credibility: Evaluating what has been learned

Evaluation & Validation: Credibility: Evaluating what has been learned Evaluation & Validation: Credibility: Evaluating what has been learned How predictive is a learned model? How can we evaluate a model Test the model Statistical tests Considerations in evaluating a Model

More information

Row Quantile Normalisation of Microarrays

Row Quantile Normalisation of Microarrays Row Quantile Normalisation of Microarrays W. B. Langdon Departments of Mathematical Sciences and Biological Sciences University of Essex, CO4 3SQ Technical Report CES-484 ISSN: 1744-8050 23 June 2008 Abstract

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

Linda Staub & Alexandros Gekenidis

Linda Staub & Alexandros Gekenidis Seminar in Statistics: Survival Analysis Chapter 2 Kaplan-Meier Survival Curves and the Log- Rank Test Linda Staub & Alexandros Gekenidis March 7th, 2011 1 Review Outcome variable of interest: time until

More information

Information processing for new generation of clinical decision support systems

Information processing for new generation of clinical decision support systems Information processing for new generation of clinical decision support systems Thomas Mazzocco tma@cs.stir.ac.uk COSIPRA lab - School of Natural Sciences University of Stirling, Scotland (UK) 2nd SPLab

More information

More details on the inputs, functionality, and output can be found below.

More details on the inputs, functionality, and output can be found below. Overview: The SMEEACT (Software for More Efficient, Ethical, and Affordable Clinical Trials) web interface (http://research.mdacc.tmc.edu/smeeactweb) implements a single analysis of a two-armed trial comparing

More information

Data mining for prediction

Data mining for prediction Data mining for prediction Prof. Gianluca Bontempi Département d Informatique Faculté de Sciences ULB Université Libre de Bruxelles email: gbonte@ulb.ac.be Outline Extracting knowledge from observations.

More information

Tutorial for proteome data analysis using the Perseus software platform

Tutorial for proteome data analysis using the Perseus software platform Tutorial for proteome data analysis using the Perseus software platform Laboratory of Mass Spectrometry, LNBio, CNPEM Tutorial version 1.0, January 2014. Note: This tutorial was written based on the information

More information

Analysis of the colorectal tumor microenvironment using integrative bioinformatic tools

Analysis of the colorectal tumor microenvironment using integrative bioinformatic tools MLECNIK Bernhard & BINDEA Gabriela Analysis of the colorectal tumor microenvironment using integrative bioinformatic tools INSERM U872, Jérôme Galon Team15: Integrative Cancer Immunology Cordeliers Research

More information

Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk

Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk Structure As a starting point it is useful to consider a basic questionnaire as containing three main sections:

More information

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat Information Builders enables agile information solutions with business intelligence (BI) and integration technologies. WebFOCUS the most widely utilized business intelligence platform connects to any enterprise

More information

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

More information

Gene expression analysis. Ulf Leser and Karin Zimmermann

Gene expression analysis. Ulf Leser and Karin Zimmermann Gene expression analysis Ulf Leser and Karin Zimmermann Ulf Leser: Bioinformatics, Wintersemester 2010/2011 1 Last lecture What are microarrays? - Biomolecular devices measuring the transcriptome of a

More information

Chapter 7. Feature Selection. 7.1 Introduction

Chapter 7. Feature Selection. 7.1 Introduction Chapter 7 Feature Selection Feature selection is not used in the system classification experiments, which will be discussed in Chapter 8 and 9. However, as an autonomous system, OMEGA includes feature

More information

0 value2. 3. Assign labels to the proteins in each interval of the ranked list

0 value2. 3. Assign labels to the proteins in each interval of the ranked list Ranked list of protein degrees in decreasing order 0 100 Proteins with few connections Proteins with many connections 1. Sort a random number between 80 and 98 (value1) 0 value1 2. Sort a random number

More information

October 31, 2014. The effect of price on youth alcohol. consumption in Canada. Stephenson Strobel & Evelyn Forget. Introduction. Data and Methodology

October 31, 2014. The effect of price on youth alcohol. consumption in Canada. Stephenson Strobel & Evelyn Forget. Introduction. Data and Methodology October 31, 2014 Why should we care? Deaths 0 500 1000 1500 2000 2500 2000 2002 2004 2006 2008 2010 2012 Year Accidents (unintentional injuries) Intentional self-harm (suicide) Other causes of death Assault

More information

affyplm: Fitting Probe Level Models

affyplm: Fitting Probe Level Models affyplm: Fitting Probe Level Models Ben Bolstad bmb@bmbolstad.com http://bmbolstad.com April 16, 2015 Contents 1 Introduction 2 2 Fitting Probe Level Models 2 2.1 What is a Probe Level Model and What is

More information

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer Machine Learning Chapter 18, 21 Some material adopted from notes by Chuck Dyer What is learning? Learning denotes changes in a system that... enable a system to do the same task more efficiently the next

More information

L25: Ensemble learning

L25: Ensemble learning L25: Ensemble learning Introduction Methods for constructing ensembles Combination strategies Stacked generalization Mixtures of experts Bagging Boosting CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna

More information

Data Mining and Visualization

Data Mining and Visualization Data Mining and Visualization Jeremy Walton NAG Ltd, Oxford Overview Data mining components Functionality Example application Quality control Visualization Use of 3D Example application Market research

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

Software and Methods for the Analysis of Affymetrix GeneChip Data. Rafael A Irizarry Department of Biostatistics Johns Hopkins University

Software and Methods for the Analysis of Affymetrix GeneChip Data. Rafael A Irizarry Department of Biostatistics Johns Hopkins University Software and Methods for the Analysis of Affymetrix GeneChip Data Rafael A Irizarry Department of Biostatistics Johns Hopkins University Outline Overview Bioconductor Project Examples 1: Gene Annotation

More information

Data, Measurements, Features

Data, Measurements, Features Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

More information

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses. DESCRIPTIVE STATISTICS The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses. DESCRIPTIVE VS. INFERENTIAL STATISTICS Descriptive To organize,

More information

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics. Course Catalog In order to be assured that all prerequisites are met, students must acquire a permission number from the education coordinator prior to enrolling in any Biostatistics course. Courses are

More information

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.

More information

How To Cluster

How To Cluster Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information

The NCPE has issued a recommendation regarding the use of pertuzumab for this indication. The NCPE does not recommend reimbursement of pertuzumab.

The NCPE has issued a recommendation regarding the use of pertuzumab for this indication. The NCPE does not recommend reimbursement of pertuzumab. Cost Effectiveness of Pertuzumab (Perjeta ) in Combination with Trastuzumab and Docetaxel in Adults with HER2-Positive Metastatic or Locally Recurrent Unresectable Breast Cancer Who Have Not Received Previous

More information

Decision Trees from large Databases: SLIQ

Decision Trees from large Databases: SLIQ Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values

More information

Tests for Two Survival Curves Using Cox s Proportional Hazards Model

Tests for Two Survival Curves Using Cox s Proportional Hazards Model Chapter 730 Tests for Two Survival Curves Using Cox s Proportional Hazards Model Introduction A clinical trial is often employed to test the equality of survival distributions of two treatment groups.

More information

Cluster software and Java TreeView

Cluster software and Java TreeView Cluster software and Java TreeView To download the software: http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm http://bonsai.hgc.jp/~mdehoon/software/cluster/manual/treeview.html Cluster 3.0

More information

Guided Reading 9 th Edition. informed consent, protection from harm, deception, confidentiality, and anonymity.

Guided Reading 9 th Edition. informed consent, protection from harm, deception, confidentiality, and anonymity. Guided Reading Educational Research: Competencies for Analysis and Applications 9th Edition EDFS 635: Educational Research Chapter 1: Introduction to Educational Research 1. List and briefly describe the

More information

Measuring gene expression (Microarrays) Ulf Leser

Measuring gene expression (Microarrays) Ulf Leser Measuring gene expression (Microarrays) Ulf Leser This Lecture Gene expression Microarrays Idea Technologies Problems Quality control Normalization Analysis next week! 2 http://learn.genetics.utah.edu/content/molecules/transcribe/

More information

Strategies for Identifying Students at Risk for USMLE Step 1 Failure

Strategies for Identifying Students at Risk for USMLE Step 1 Failure Vol. 42, No. 2 105 Medical Student Education Strategies for Identifying Students at Risk for USMLE Step 1 Failure Jira Coumarbatch, MD; Leah Robinson, EdS; Ronald Thomas, PhD; Patrick D. Bridge, PhD Background

More information

Analysis of gene expression data. Ulf Leser and Philippe Thomas

Analysis of gene expression data. Ulf Leser and Philippe Thomas Analysis of gene expression data Ulf Leser and Philippe Thomas This Lecture Protein synthesis Microarray Idea Technologies Applications Problems Quality control Normalization Analysis next week! Ulf Leser:

More information

Capturing Best Practice for Microarray Gene Expression Data Analysis

Capturing Best Practice for Microarray Gene Expression Data Analysis Capturing Best Practice for Microarray Gene Expression Data Analysis Gregory Piatetsky-Shapiro KDnuggets gps@kdnuggets.com Tom Khabaza SPSS tkhabaza@spss.com Sridhar Ramaswamy MIT / Whitehead Institute

More information

Learning from Diversity

Learning from Diversity Learning from Diversity Epitope Prediction with Sequence and Structure Features using an Ensemble of Support Vector Machines Rob Patro and Carl Kingsford Center for Bioinformatics and Computational Biology

More information

Package ROC632. February 19, 2015

Package ROC632. February 19, 2015 Type Package Package ROC632 February 19, 2015 Title Construction of diagnostic or prognostic scoring system and internal validation of its discriminative capacities based on ROC curve and 0.633+ boostrap

More information

Histogram-based Outlier Score (HBOS): A fast Unsupervised Anomaly Detection Algorithm

Histogram-based Outlier Score (HBOS): A fast Unsupervised Anomaly Detection Algorithm Histogram-based Outlier Score (HBOS): A fast Unsupervised Anomaly Detection Algorithm Markus Goldstein and Andreas Dengel German Research Center for Artificial Intelligence (DFKI), Trippstadter Str. 122,

More information

Identifying SPAM with Predictive Models

Identifying SPAM with Predictive Models Identifying SPAM with Predictive Models Dan Steinberg and Mikhaylo Golovnya Salford Systems 1 Introduction The ECML-PKDD 2006 Discovery Challenge posed a topical problem for predictive modelers: how to

More information

Credit Card Fraud Detection and Concept-Drift Adaptation with Delayed Supervised Information

Credit Card Fraud Detection and Concept-Drift Adaptation with Delayed Supervised Information Credit Card Fraud Detection and Concept-Drift Adaptation with Delayed Supervised Information Andrea Dal Pozzolo, Giacomo Boracchi, Olivier Caelen, Cesare Alippi, and Gianluca Bontempi 15/07/2015 IEEE IJCNN

More information

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools 2009-2010

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools 2009-2010 Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools 2009-2010 Week 1 Week 2 14.0 Students organize and describe distributions of data by using a number of different

More information

Binary Logistic Regression

Binary Logistic Regression Binary Logistic Regression Main Effects Model Logistic regression will accept quantitative, binary or categorical predictors and will code the latter two in various ways. Here s a simple model including

More information

Data Mining. Nonlinear Classification

Data Mining. Nonlinear Classification Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

More information

Lecture 2 ESTIMATING THE SURVIVAL FUNCTION. One-sample nonparametric methods

Lecture 2 ESTIMATING THE SURVIVAL FUNCTION. One-sample nonparametric methods Lecture 2 ESTIMATING THE SURVIVAL FUNCTION One-sample nonparametric methods There are commonly three methods for estimating a survivorship function S(t) = P (T > t) without resorting to parametric models:

More information

Time series experiments

Time series experiments Time series experiments Time series experiments Why is this a separate lecture: The price of microarrays are decreasing more time series experiments are coming Often a more complex experimental design

More information

An Application of Weibull Analysis to Determine Failure Rates in Automotive Components

An Application of Weibull Analysis to Determine Failure Rates in Automotive Components An Application of Weibull Analysis to Determine Failure Rates in Automotive Components Jingshu Wu, PhD, PE, Stephen McHenry, Jeffrey Quandt National Highway Traffic Safety Administration (NHTSA) U.S. Department

More information

Better credit models benefit us all

Better credit models benefit us all Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis

More information

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05 Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

More information

A General Framework for Weighted Gene Co-expression Network Analysis

A General Framework for Weighted Gene Co-expression Network Analysis Please cite: Statistical Applications in Genetics and Molecular Biology (2005). A General Framework for Weighted Gene Co-expression Network Analysis Bin Zhang and Steve Horvath Departments of Human Genetics

More information

Design and Analysis of Phase III Clinical Trials

Design and Analysis of Phase III Clinical Trials Cancer Biostatistics Center, Biostatistics Shared Resource, Vanderbilt University School of Medicine June 19, 2008 Outline 1 Phases of Clinical Trials 2 3 4 5 6 Phase I Trials: Safety, Dosage Range, and

More information

ROOT: A data mining tool from CERN What can actuaries do with it?

ROOT: A data mining tool from CERN What can actuaries do with it? ROOT: A data mining tool from CERN What can actuaries do with it? Ravi Kumar, Senior Manager, Deloitte Consulting LLP Lucas Lau, Senior Consultant, Deloitte Consulting LLP Southern California Casualty

More information

SUMAN DUVVURU STAT 567 PROJECT REPORT

SUMAN DUVVURU STAT 567 PROJECT REPORT SUMAN DUVVURU STAT 567 PROJECT REPORT SURVIVAL ANALYSIS OF HEROIN ADDICTS Background and introduction: Current illicit drug use among teens is continuing to increase in many countries around the world.

More information

Simple linear regression

Simple linear regression Simple linear regression Introduction Simple linear regression is a statistical method for obtaining a formula to predict values of one variable from another where there is a causal relationship between

More information

Protein Protein Interaction Networks

Protein Protein Interaction Networks Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

More information

Easily Identify Your Best Customers

Easily Identify Your Best Customers IBM SPSS Statistics Easily Identify Your Best Customers Use IBM SPSS predictive analytics software to gain insight from your customer database Contents: 1 Introduction 2 Exploring customer data Where do

More information

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19 PREFACE xi 1 INTRODUCTION 1 1.1 Overview 1 1.2 Definition 1 1.3 Preparation 2 1.3.1 Overview 2 1.3.2 Accessing Tabular Data 3 1.3.3 Accessing Unstructured Data 3 1.3.4 Understanding the Variables and Observations

More information

Identification of rheumatoid arthritis and osteoarthritis patients by transcriptome-based rule set generation

Identification of rheumatoid arthritis and osteoarthritis patients by transcriptome-based rule set generation Identification of rheumatoid arthritis and osterthritis patients by transcriptome-based rule set generation Bering Limited Report generated on September 19, 2014 Contents 1 Dataset summary 2 1.1 Project

More information

Predictive Modeling with JMP Genomics

Predictive Modeling with JMP Genomics Predictive Modeling with JMP Genomics Page 1 Introduction Predictive modeling or biomarker discovery with modern genomics data is a popular activity and is a critical component in delivering on the promise

More information

2.500 Threshold. 2.000 1000e - 001. Threshold. Exponential phase. Cycle Number

2.500 Threshold. 2.000 1000e - 001. Threshold. Exponential phase. Cycle Number application note Real-Time PCR: Understanding C T Real-Time PCR: Understanding C T 4.500 3.500 1000e + 001 4.000 3.000 1000e + 000 3.500 2.500 Threshold 3.000 2.000 1000e - 001 Rn 2500 Rn 1500 Rn 2000

More information

The Artificial Prediction Market

The Artificial Prediction Market The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory

More information

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information

More information

Randomized trials versus observational studies

Randomized trials versus observational studies Randomized trials versus observational studies The case of postmenopausal hormone therapy and heart disease Miguel Hernán Harvard School of Public Health www.hsph.harvard.edu/causal Joint work with James

More information

Methods for Meta-analysis in Medical Research

Methods for Meta-analysis in Medical Research Methods for Meta-analysis in Medical Research Alex J. Sutton University of Leicester, UK Keith R. Abrams University of Leicester, UK David R. Jones University of Leicester, UK Trevor A. Sheldon University

More information

# For usage of the functions, it is necessary to install the "survival" and the "penalized" package.

# For usage of the functions, it is necessary to install the survival and the penalized package. ###################################################################### ### R-script for the manuscript ### ### ### ### Survival models with preclustered ### ### gene groups as covariates ### ### ### ###

More information

Protein Prospector and Ways of Calculating Expectation Values

Protein Prospector and Ways of Calculating Expectation Values Protein Prospector and Ways of Calculating Expectation Values 1/16 Aenoch J. Lynn; Robert J. Chalkley; Peter R. Baker; Mark R. Segal; and Alma L. Burlingame University of California, San Francisco, San

More information

New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction

New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Lecture 15 - ROC, AUC & Lift Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-17-AUC

More information

Analyzing the Effect of Treatment and Time on Gene Expression in Partek Genomics Suite (PGS) 6.6: A Breast Cancer Study

Analyzing the Effect of Treatment and Time on Gene Expression in Partek Genomics Suite (PGS) 6.6: A Breast Cancer Study Analyzing the Effect of Treatment and Time on Gene Expression in Partek Genomics Suite (PGS) 6.6: A Breast Cancer Study The data for this study is taken from experiment GSE848 from the Gene Expression

More information