Segmentation models and applications with R




Franck Picard
UMR 5558 UCB CNRS LBBE, Lyon, France
franck.picard@univ-lyon1.fr
http://pbil.univ-lyon1.fr/members/fpicard/

F. Picard (CNRS-LBBE), Segmentation, INED, 28/04/11

Segmentation is everywhere!
[Figure: EEG segmentation [2]; panels (a) and (b), signal against time (sec)]
[Figure: Market prices segmentation [3]]

Segmentation is everywhere!
[Figure: Climate series segmentation [5]; temperature series TM (H) (°C), 1960 to 2000, for the stations TOULOUSE-BLAGNAC (31069001), CUGNAUX (31157001), ONDES (31403001) and SEGREVILLE (31540001)]
[Figure: Array CGH segmentation [8]; log2 ratio against genomic position (x 10^6), showing amplified segments, an unaltered segment and a deleted segment]

Segmentation to detect copy number variations
- Comparative Genomic Hybridization (CGH) is used to measure gene copy number variations between genomes.
- The copy number is measured by fluorescence at given genomic positions.
- The log-ratio of the signals shows jumps and segments.
- Goal: detect segments that correspond to regions sharing the same copy number on average.
- The baseline is at 0 (no difference).
[Figure: log2 ratio against genomic position (x 10^6), showing amplified segments, an unaltered segment and a deleted segment]

Outline of the presentation
- Explain the statistical developments associated with segmentation models
- Give an overview of the subject (with bibliography)
- Provide an R package dedicated to the analysis of CGH data by segmentation models
- Explain the choices made in building the package
- Introduce the generalization to multiple series segmentation

The cghseg package
- Idea: develop a package for segmentation in the context of CGH data analysis
- The bioinformatics community uses R extensively
- The size of the data can be a problem (discussion)
- Use S4 classes, with 3 main classes: CGHData (CGHd), CGHOptions (CGHo), CGHResults (CGHr)

The CGHData class
- Raw data are in the data.frame() format
- They are stored in a list() format in a CGHd object

    > Y[1:5,1:5]
            Ind1       Ind2        Ind3        Ind4       Ind5
    1 0.15218335  0.1741900  0.03386524  0.14293254  0.2639392
    2 0.46361794 -0.6023429 -0.12644954 -0.27317036 -0.3458813
    3 0.07078370  0.3880629  0.83691230  0.19800776  0.8538934
    4 0.21176834  0.1623984  0.12279919 -0.39214814  0.1802575
    5 0.35821410 -0.1347911 -0.11833753 -0.00863382 -0.4885733
    >
    > CGHd = new("CGHdata",Y)
    > CGHd
    ****** CGHdata show ******
    [CGHd show] Data are in the list format [[patient]]
    [CGHd show] Data sample:
    Y[[Ind1]]
    [1] 0.1521834 0.4636179 0.0707837 0.2117683 0.3582141
    Y[[Ind2]]
    [1]  0.1741900 -0.6023429  0.3880629  0.1623984 -0.1347911
    [CGHd show] probeid sample: NULL
    [CGHd show] genomic positions sample: NULL
    [CGHd show] GC content sample: NULL
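The move from a data.frame to a list indexed by individual can be illustrated in base R alone. The sketch below does not use cghseg; the data are simulated and the object name `Y_list` is only illustrative.

```r
# Toy data: 5 probes (rows) x 3 individuals (columns); values are simulated.
set.seed(1)
Y <- data.frame(Ind1 = rnorm(5), Ind2 = rnorm(5), Ind3 = rnorm(5))

# A data.frame is already a list of columns; as.list() makes that explicit,
# giving one numeric vector per individual.
Y_list <- as.list(Y)

names(Y_list)      # names of the individuals
Y_list[["Ind2"]]   # profile of the second individual
```

A list of vectors does not force all profiles to have the same length, which is one plausible reason for preferring it to a rectangular table.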

A piecewise-constant regression
- We observe a Gaussian process $Y = \{Y_1, \dots, Y_n\}$ with independent observations $Y_t \sim \mathcal{N}(\mu_t, \sigma^2)$.
- We suppose that there exist $K + 1$ change-points $t_0 < \dots < t_K$ such that the mean of the signal is constant between two changes and differs from one segment to the next.
- $I_k = ]t_{k-1}, t_k]$: interval of stationarity; $\mu_k$: mean of the signal between two changes:
  $$\forall t \in I_k, \quad Y_t = \mu_k + E_t, \quad E_t \sim \mathcal{N}(0, \sigma^2).$$
- In its generalization, the parameter subject to change could be the variance, the spectrum, ...
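A signal following this model can be simulated in a few lines of base R. This is a minimal sketch: the change-points, means and noise level are arbitrary values chosen for illustration.

```r
set.seed(42)
n     <- 100
tau   <- c(0, 40, 70, n)    # t_0 < t_1 < ... < t_K, here K = 3 segments
mu    <- c(0, 1, -0.5)      # one mean per segment (illustrative values)
sigma <- 0.3                # common standard deviation (illustrative value)

seg_len <- diff(tau)                  # lengths of the intervals ]t_{k-1}, t_k]
signal  <- rep(mu, times = seg_len)   # piecewise-constant mean mu_t
y       <- signal + rnorm(n, mean = 0, sd = sigma)

plot(y, pch = 20, xlab = "t", ylab = "y")
lines(signal, col = "red", lwd = 2)
```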

Parameters and estimation strategy
- The parameters: $t = \{t_0, \dots, t_K\}$, $\mu = \{\mu_1, \dots, \mu_K\}$ and $\sigma^2$.
- The estimation is done for a given K, which is itself estimated afterwards.
- The log-likelihood of the model is:
  $$\log L_K(Y; t, \mu, \sigma^2) = \sum_{k=1}^{K} \sum_{t=t_{k-1}+1}^{t_k} f(y_t; \mu_k, \sigma^2),$$
  where $f$ is to be read as the log-density of $\mathcal{N}(\mu_k, \sigma^2)$.
- When K and t are known, how to estimate $\mu$?
- When K is known, how to estimate t?
- How to choose K?

Parameter estimation
- When K and t are known, the estimation of $\mu$ is straightforward:
  $$\hat{\mu}_k = \frac{1}{t_k - t_{k-1}} \sum_{t=t_{k-1}+1}^{t_k} y_t, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{k=1}^{K} \sum_{t=t_{k-1}+1}^{t_k} (y_t - \hat{\mu}_k)^2.$$
- Find $\hat{t}$ such that:
  $$\hat{t} = \arg\max_{t} \left\{ \log L_K(Y; t, \hat{\mu}, \hat{\sigma}^2) \right\}.$$
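For known breakpoints, the estimators reduce to segment-wise means and a pooled residual variance. The base R sketch below shows one way to compute them together with the Gaussian log-likelihood; the function name `fit_segments` and the toy data are hypothetical and not part of any package.

```r
# Estimate segment means and the common variance for known breakpoints.
# `tau` holds t_0 = 0 < t_1 < ... < t_K = n.
fit_segments <- function(y, tau) {
  n   <- length(y)
  seg <- rep(seq_len(length(tau) - 1), times = diff(tau))  # segment label of each point
  mu_hat     <- tapply(y, seg, mean)                        # one mean per segment
  fitted     <- as.numeric(mu_hat[seg])                     # piecewise-constant fit
  sigma2_hat <- sum((y - fitted)^2) / n                     # residual variance with divisor n
  loglik     <- sum(dnorm(y, mean = fitted, sd = sqrt(sigma2_hat), log = TRUE))
  list(mu = as.numeric(mu_hat), sigma2 = sigma2_hat, loglik = loglik)
}

y   <- c(0.1, -0.1, 0.0, 2.1, 1.9, 2.0)   # toy signal with one change after t = 3
fit <- fit_segments(y, tau = c(0, 3, 6))
fit$mu
fit$sigma2
fit$loglik
```

With the means and the variance plugged in, the log-likelihood depends on the data only through the residual sum of squares, which is why the search for the breakpoints can be written as a least-squares problem.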

Dynamic Programming to optimize the log-likelihood
- Partitioning n data points into K segments by exhaustive search has complexity $O(n^K)$.
- DP reduces the complexity to $O(n^2)$ when K is fixed.
- Analogy with the shortest path problem (Bellman principle).
- $RSS_k(i, j)$: cost of the path connecting i to j in k segments:
  $$\forall\, 0 \le i < j \le n, \quad RSS_1(i, j) = \sum_{t=i+1}^{j} (y_t - \bar{y}_{ij})^2,$$
  $$\forall\, 1 \le k \le K - 1, \quad RSS_{k+1}(0, j) = \min_{k \le h < j} \left\{ RSS_k(0, h) + RSS_1(h, j) \right\}.$$
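The recursion can be sketched in base R as follows. This is an illustrative implementation, not the code of cghseg: the function name `dp_segment` is hypothetical, segments are taken as intervals ]i, j] (points i+1 to j), and segment costs are obtained from prefix sums so that each cost is computed in constant time.

```r
# Least-squares segmentation into 1, ..., Kmax segments by dynamic programming.
dp_segment <- function(y, Kmax) {
  n <- length(y)
  stopifnot(Kmax >= 1, Kmax <= n)
  s1 <- c(0, cumsum(y))      # prefix sums of y
  s2 <- c(0, cumsum(y^2))    # prefix sums of y^2
  rss1 <- function(i, j) {   # RSS of the segment ]i, j]; i or j may be a vector
    (s2[j + 1] - s2[i + 1]) - (s1[j + 1] - s1[i + 1])^2 / (j - i)
  }
  best <- matrix(Inf, Kmax, n)           # best[k, j]: minimal RSS of y[1:j] in k segments
  arg  <- matrix(NA_integer_, Kmax, n)   # arg[k, j]: position of the last breakpoint
  best[1, ] <- rss1(0, seq_len(n))
  for (k in seq_len(Kmax)[-1]) {
    for (j in k:n) {
      h    <- (k - 1):(j - 1)            # candidate positions for the last breakpoint
      cand <- best[k - 1, h] + rss1(h, j)
      best[k, j] <- min(cand)
      arg[k, j]  <- h[which.min(cand)]
    }
  }
  # Backtrack the breakpoints t_1 < ... < t_{K-1} for each K.
  breaks <- lapply(seq_len(Kmax), function(K) {
    bp <- integer(0)
    j  <- n
    for (k in rev(seq_len(K)[-1])) {
      j  <- arg[k, j]
      bp <- c(j, bp)
    }
    bp
  })
  list(rss = best[, n], breaks = breaks)
}

y   <- c(rep(0, 5), rep(3, 5), rep(-1, 5))   # noise-free toy signal, changes after 5 and 10
res <- dp_segment(y, Kmax = 3)
res$breaks[[3]]   # breakpoints of the best segmentation in 3 segments
res$rss           # minimal RSS for K = 1, 2, 3
```

The two nested loops over j and h give a cost of order $K n^2$, which matches the quadratic complexity quoted for fixed K.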

Dynamic Programming on very large signals?
- Even if DP reduces the computational burden to $O(n^2)$, it may be problematic when $n \sim 10^6$
- Constrain the length of segments (lmin, lmax)
- Find a trick on top of the trick to further decrease the complexity of DP [9]
- Use C++ to externalize heavy computations

Model selection
- The number of segments K should be estimated:
  $$\hat{K} = \arg\max_{K} \left\{ \log L_K(Y; \hat{t}, \hat{\mu}, \hat{\sigma}^2) - \beta\, \mathrm{pen}(K) \right\}.$$
- Main difficulty: breakpoints are discrete parameters
  - the likelihood is not differentiable with respect to t
  - there are $\binom{n-1}{K-1}$ possible segmentations for a model with K segments
  - how to define the dimension of the model?
- How to define $\mathrm{pen}(K)$ and $\beta$? Modified BIC criterion [10], non-asymptotic criterion [4], L-curve criterion [2].
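The penalized criterion can be illustrated on the first three log-likelihood values displayed for Ind1 in the uniseg() example of this presentation. The penalty pen(K) = 2K and the constant beta = log(n)/2 used below are assumptions made for this sketch only; they are a naive BIC-like choice, not the criterion implemented in cghseg nor one of the criteria of [10], [4] or [2].

```r
# Log-likelihoods for K = 1, 2, 3 segments (first three values shown for Ind1).
loglik <- c(-85.64, -50.72, -46.49)
n      <- 100                  # length of the profile in that example
K      <- seq_along(loglik)

pen  <- 2 * K                  # assumed: K means + (K - 1) breakpoints + 1 variance
beta <- log(n) / 2             # assumed BIC-like constant

crit  <- loglik - beta * pen   # penalized log-likelihood
K_hat <- K[which.max(crit)]
K_hat

# Number of candidate segmentations for each K, which shows why the
# breakpoints cannot be treated as ordinary continuous parameters.
choose(n - 1, K - 1)
```

The log-likelihood alone always increases with K, so it is the penalty that stops the selection; here the gain from K = 2 to K = 3 is smaller than the extra penalty.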

uniseg() and the CGHResults class
- From a CGHd object and a CGHo object
- Use uniseg() such that CGHr = uniseg(CGHd,CGHo)
- uniseg() performs automatic model selection

    > CGHr["loglik"]
    $Ind1
    [1] -85.64 -50.72 -46.49 ...
    $Ind2
    [1] -95.43 -58.53 -56.68 ...
    > CGHr["mu"]$Ind1
      begin end        mean
    1     1  77 -0.03122034
    2    78 100 -0.99103873
    > CGHr["mu"]$Ind2
      begin end        mean
    1     1  43  0.02545556
    2    44  56 -0.92030745
    3    57 100 -0.17527263
    ...
    > CGHr["from"]
    [1] "uniseg"

Functions to retrieve information about the model
- Given the size of the data, CGHr stores results in a sparse format
- Small functions are implemented to retrieve the desired information
- bp = getbp(CGHr) to retrieve breakpoints in a 0/1 format
- seg = getsegprofiles(CGHr) to retrieve the predictions of the model
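The link between the sparse segment table and the expanded formats can be shown in base R. The sketch below reuses the table displayed for Ind1 in the uniseg() example; it does not call getbp() or getsegprofiles(), and the 0/1 convention (marking the last position of each segment except the final one) is an assumption made for illustration.

```r
# Segment table for one profile, as displayed for Ind1.
mu <- data.frame(begin = c(1, 78),
                 end   = c(77, 100),
                 mean  = c(-0.03122034, -0.99103873))
n  <- max(mu$end)

# Piecewise-constant prediction: repeat each mean over its segment.
profile <- rep(mu$mean, times = mu$end - mu$begin + 1)

# 0/1 breakpoint indicator (assumed convention: 1 at the end of every
# segment except the last one).
bp <- integer(n)
bp[head(mu$end, -1)] <- 1L

which(bp == 1)     # position of the breakpoint
profile[76:79]     # prediction around the breakpoint
```

Storing three numbers per segment instead of one value per position is what keeps the result object small when profiles are long.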

Joint segmentation of multiple profiles
- When analyzing multiple profiles (or series), one may want to perform a joint analysis [7, 1]
- $Y_i(t)$: the signal for individual $i = 1, \dots, I$, with segments $\{I_k^i\}$:
  $$\forall t \in I_k^i, \quad Y_i(t) = \mu_{ik} + \varepsilon_i(t), \quad \varepsilon_i(t) \sim \mathcal{N}(0, \sigma^2).$$
- $\mu_i$: specific levels of the segments
- $T_i$: specific incidence matrix of the breaks
  $$Y_i = T_i \mu_i + E_i$$
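The matrix form can be made concrete by building the incidence matrix of one individual explicitly. This is a minimal base R sketch; the breakpoints, levels and noise level are hypothetical values.

```r
set.seed(7)
n   <- 10
tau <- c(0, 4, 7, n)        # breaks of individual i (hypothetical values)
mu  <- c(0.2, -1.0, 0.5)    # segment levels mu_i (hypothetical values)

seg <- rep(seq_len(length(tau) - 1), times = diff(tau))  # segment label of each position
T_i <- outer(seg, seq_along(mu), "==") * 1               # n x K incidence matrix of 0/1

mean_i <- drop(T_i %*% mu)            # piecewise-constant mean T_i mu_i
Y_i    <- mean_i + rnorm(n, sd = 0.2) # one simulated profile

T_i
mean_i
```

Each row of the matrix has a single 1, in the column of the segment containing that position, so multiplying by the vector of levels simply copies each level over its segment.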

Power of the S4 classes
- We can still use the CGHd class for the data
- Use a new function adapted to the multi-series setting: CGHr = multiseg(CGHd,CGHo)
- The format of the output is the same, but the computational procedure is different
- multiseg() also uses C++ code to compute the breakpoint positions and the number of segments per series

Conclusions
- Segmentation models are used in many application fields
- Other packages exist, such as CBS [6] for sequential analysis
- Algorithmic considerations are central when using such models
- Developing an R package dedicated to segmentation requires a more efficient language (such as C++) for the heavy computations
- This strategy is becoming standard in computational biology (ultra-high-dimensional data)
- Mixing languages makes submission to CRAN more difficult
- See http://pbil.univ-lyon1.fr/members/fpicard/ for more detailed presentations on the subject

[1] F. Picard, E. Lebarbier, E. Budinska, and S. Robin. Joint segmentation of multivariate Gaussian processes using mixed linear models. CSDA, 55(2), 2011.
[2] M. Lavielle. Using penalized contrasts for the change-point problem. Signal Processing, 85(8):1501-1510, 2005.
[3] M. Lavielle and G. Teyssière. Detection of multiple change-points in multiple time-series. Lithuanian Mathematical Journal, 46(4):351-376, 2006.
[4] E. Lebarbier. Detecting multiple change-points in the mean of Gaussian process by model selection. Signal Processing, 85:717-736, 2005.
[5] O. Mestre. Méthodes statistiques pour l'homogénéisation de longues séries climatiques. PhD thesis, Université Paul Sabatier, Toulouse, 2000.
[6] A. B. Olshen, E. S. Venkatraman, R. Lucito, and M. Wigler. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, 5(4):557-572, 2004.
[7] F. Picard, E. Lebarbier, M. Hoebeke, B. Thiam, and S. Robin. Joint segmentation, calling, and normalization of multiple CGH profiles. Biostatistics, 2011.
[8] F. Picard, S. Robin, M. Lavielle, C. Vaisse, and J-J. Daudin. A statistical approach for CGH microarray data analysis. BMC Bioinformatics, 6:27, 2005.
[9] G. Rigaill. Pruned dynamic programming for optimal multiple change-point detection. Technical report, arXiv:1004.0887v1, 2010.
[10] N. R. Zhang and D. O. Siegmund. A modified Bayes information criterion with applications to the analysis of comparative genomic hybridization data. Biometrics, 63(1):22-32, 2007.