Graphical Modeling for Genomic Data

Carel F.W. Peeters (cf.peeters@vumc.nl)
Joint work with: Wessel N. van Wieringen, Mark A. van de Wiel
Molecular Biostatistics Unit, Dept. of Epidemiology & Biostatistics, VU University medical center, Amsterdam, the Netherlands
Summer School: Big Data in Clinical Medicine, Enschede, 03/07/2014
CFWP (VUmc) Graphs for Genomic Data Enschede, 03/07/2014

Outline
1 Preliminaries I: Molecular Biology and Genomics Data
  - Some Molecular Biology
  - Omics and Genomic Data
  - Approaches and Desire
2 Preliminaries II: Graphical Modeling
  - Pathways and Graphs
  - Undirected Graphical Modeling
  - Directed Graphical Modeling
3 Undirected Graphical Modeling with the Graphical Ridge
  - Sample Covariance and Precision
  - The Ridge Precision Estimator
  - Illustration
4 Directed Cyclic Mixed Graphs for Genomic Data Integration
  - Model
  - Model as Graphical Object
  - Illustration
5 So What and Further Research

1 Preliminaries I: Molecular Biology and Genomics Data

Some Molecular Biology: The eukaryotic cell
- Cell: Smallest independent living unit. Contains a complete copy of the genome.
- Genome: Total genetic constitution of an organism: the full (haploid) set of chromosomes with all its genes.

Some Molecular Biology: Chromosome
- Chromosome: A structure of coiled DNA. Chromosomal DNA encodes genetic information.

Some Molecular Biology: Genes [figure]

Some Molecular Biology: Central dogma of molecular biology [figure]
Illustration: http://tfscientist.hubpages.com/hub/protein-production-a-step-by-step-illustrated-guide

Some Molecular Biology: Complexities
DNA copy number (CN)
- Normal: Each somatic cell contains 2 copies of every chromosome
- Aberration: Abnormal number of copies of one or more sections of DNA
- Logic: CN ↑ → GE ↑; CN ↓ → GE ↓ (GE: gene expression)

Some Molecular Biology: Complexities
DNA methylation
- Refers to the addition of a methyl group to a CpG site
- Pre-transcriptional regulator of gene expression
- Logic: If CpG site methylated → gene off
Illustration: http://www.sigmaaldrich.com/technical-documents/articles/biofiles/introduction-to-dna-methylation.html

Some Molecular Biology: Complexities
[figure: Gene → (transcription) → mRNA → (translation) → Protein, with miR]
micro RNA (miRNA)
- A family of small RNAs, approx. 22 nucleotides in length
- Bind to sequences of complementarity in target mRNA
- Post-transcriptional regulators of mRNA
- Logic: miRNA ↑ → GE ↓; miRNA ↓ → GE ↑
- RNA degradation or limiting of RNA translation
- Implicated in cancer

Some Molecular Biology: Message
- Message: It is not enough to look at gene expression alone
- Integration: The functional statistical integration of data from multiple high-throughput omics platforms
- Why go integrative?
  - Regulatory mechanisms can only be understood at multiple genomic levels
  - Detection of more robust markers (in terms of regulatory significance)

Omics and Genomic Data: Omics and omics data
- -ome: A totality of some (molecular biological) sort
- -omics: Collective quantification of some pool of molecules
- Genomics: The omics of the genome (of some organism)

Omics and Genomic Data: Array data [figure]

Omics and Genomic Data: Design [figure]

Omics and Genomic Data: Profiles [figure]

Omics and Genomic Data: Challenge: Dimensionality genomic data [figure]

Approaches and Desire: Unit of analysis
[figure: DNA gene, DNA region, DNA pathway]

Approaches and Desire: Featurewise and regional analyses
Approach:
- Restrict the dimension of the model
- Test the model across the genome
- Employ familywise error control

Approaches and Desire: Our focus: Pathways [figure]

Approaches and Desire: Motivation
Pathways:
- Knowledge incomplete
- Knowledge biased towards well-known pathways
- Loosely defined using repositories (e.g., KEGG)

Approaches and Desire: Motivation
Desire:
- Consider data from multiple genomic platforms
- Exploratively infer graph (reconstruct topology)
- Cope with high-dimensional situation
- Maintain computational friendliness

2 Preliminaries II: Graphical Modeling

Pathways and Graphs: Graphs
- Representation: Pathways are represented by a graph (or network)
- Vertices: A node or vertex represents a molecular feature
- Edges: An edge or arrow represents some functional relation

Pathways and Graphs: Correlation networks
- Example: Three variables $Y_1$, $Y_2$, and $Y_3$ with $\mathrm{cor}(Y_1, Y_2) = 0$, $\mathrm{cor}(Y_1, Y_3) = 0$, $\mathrm{cor}(Y_2, Y_3) \neq 0$
- Marginal dependence: An undirected edge represents marginal dependence

Pathways and Graphs: Interpretational danger [figure]

Pathways and Graphs: Solution: Conditioning [figure]

Pathways and Graphs: Conditional dependence
- Partial correlation: Measures the degree of association between two random variables when controlling for the other (third) variables
- Conditioned correlations: $\mathrm{cor}(Y_1, Y_2 \mid Y_3)$, $\mathrm{cor}(Y_1, Y_3 \mid Y_2)$, $\mathrm{cor}(Y_2, Y_3 \mid Y_1)$
- If, e.g., $\mathrm{cor}(Y_2, Y_3 \mid Y_1) = 0$, we say $Y_2$ and $Y_3$ are independent given $Y_1$
- Example: $\mathrm{cor}(Y_1, Y_2 \mid Y_3) \neq 0$, $\mathrm{cor}(Y_1, Y_3 \mid Y_2) \neq 0$, $\mathrm{cor}(Y_2, Y_3 \mid Y_1) = 0$
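The difference between marginal and partial correlation can be checked numerically. The following base R sketch is an illustration only: it assumes a toy setting (not taken from the slides) in which $Y_2 = Y_1 + e_2$ and $Y_3 = Y_1 + e_3$ with independent unit-variance terms, and it works with the population covariance matrix rather than with data.

```r
# Population covariance of (Y1, Y2, Y3) when Y2 = Y1 + e2 and Y3 = Y1 + e3,
# with Y1, e2, e3 independent and of unit variance (illustrative assumption).
Sigma <- matrix(c(1, 1, 1,
                  1, 2, 1,
                  1, 1, 2), nrow = 3, byrow = TRUE)

# Marginal correlations: all three pairs are correlated.
marginal <- cov2cor(Sigma)

# Partial correlations: scale the inverse covariance to unit diagonal
# and flip the sign of the off-diagonal entries.
Omega <- solve(Sigma)
partial <- -cov2cor(Omega)
diag(partial) <- 1

print(round(marginal, 3))
print(round(partial, 3))
```

In this toy setting $Y_2$ and $Y_3$ are marginally correlated only because both depend on $Y_1$; their partial correlation given $Y_1$ vanishes, so a conditional independence graph would have no edge between them.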

Undirected Graphical Modeling: Gaussian graphical modeling
- Graphical modeling: A class of probabilistic models utilizing graphs to express conditional (in)dependence relations between random variables
- Gaussian setting:
  - Vertices: Correspond to random variables with normal distribution
  - Edges: Correspond to the conditional dependence structure

Say $y \sim N_p(0, \Sigma)$, and define $\Sigma^{-1} \equiv \Omega$. Then, for $Y_j, Y_{j'} \in$ vertex set $V$, $j \neq j'$:

$$\frac{-\omega_{jj'}}{\sqrt{\omega_{jj}\,\omega_{j'j'}}} = 0 \iff \omega_{jj'} = 0 \iff Y_j \perp\!\!\!\perp Y_{j'} \mid V \setminus \{Y_j, Y_{j'}\} \iff \text{no edge between } Y_j \text{ and } Y_{j'}$$

Example:

$$\begin{pmatrix} \omega_{11} & \omega_{12} & \omega_{13} & \omega_{14} \\ \omega_{21} & \omega_{22} & 0 & 0 \\ \omega_{31} & 0 & \omega_{33} & \omega_{34} \\ \omega_{41} & 0 & \omega_{43} & \omega_{44} \end{pmatrix}$$

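Undirected Graphical Modeling: Gaussian graphical modeling

$$\Omega = \begin{pmatrix} \omega_{11} & \omega_{12} & \omega_{13} \\ \omega_{21} & \omega_{22} & 0 \\ \omega_{31} & 0 & \omega_{33} \end{pmatrix}, \qquad \Sigma = \Omega^{-1} = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_{22} & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_{33} \end{pmatrix}$$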
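A small numerical sketch of the same point: a precision matrix with a zero entry generally implies a covariance matrix without zeros, and the graph is read off from the precision matrix. The numerical values below are made up for illustration; only the zero pattern follows the slide.

```r
# Illustrative precision matrix with omega_23 = omega_32 = 0 (values assumed).
Omega <- matrix(c( 2.0, -0.8, -0.6,
                  -0.8,  1.5,  0.0,
                  -0.6,  0.0,  1.2), nrow = 3, byrow = TRUE)

# Check positive definiteness so that Omega is a valid precision matrix.
stopifnot(all(eigen(Omega, symmetric = TRUE)$values > 0))

# The implied covariance matrix has no zero entries.
Sigma <- solve(Omega)

# Edges of the conditional independence graph: non-zero off-diagonal
# entries of the precision matrix.
adjacency <- (Omega != 0) & !diag(3)

print(round(Sigma, 3))
print(adjacency)
```

The adjacency matrix has edges 1-2 and 1-3 but none between 2 and 3, even though $\sigma_{23} \neq 0$.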

Directed Graphical Modeling: Undirected and directed graphs [figure]

Directed Graphical Modeling: d-separation [figure]

3 Undirected Graphical Modeling with the Graphical Ridge

To start: Easy

Code:
    > library(rags2ridges)
    > CVres <- optPenalty.aLOOCV(Y, 0.00001, 0.01, step=100)      # approximate LOOCV over a penalty grid
    > rprec <- ridgeS(cov(Y), CVres$optLambda)                    # ridge precision estimate at the chosen penalty
    > P0 <- sparsify(symm(rprec), type="localFDR", FDRcut=0.95)   # edge selection by local FDR
    > Ugraph(P0, type="fancy", prune=TRUE)                        # plot the resulting undirected graph

Sample Covariance and Precision: Setting
- Consider: $p$ denotes the number of variables; $n$ denotes the number of observations
- The sample covariance matrix: Let $S$ denote the sample covariance matrix. Its inverse $S^{-1}$ is proportional to the partial correlation matrix: scaling $S^{-1}$ to unit diagonal and flipping the sign of the off-diagonal entries gives the partial correlations
- Usage: Many statistical models depend directly on $S$ and its inverse $S^{-1}$: multivariate regression, factor analysis, structural equation models, graphical models, ...

Sample Covariance and Precision: Problem
- However:
  - When $n$ is close to $p$: $S$ is ill-behaved
  - When $p > n$: $S$ is singular and its inverse $S^{-1}$ is undefined
- Desired: A provision allowing graphical modeling when $p > n$
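The singularity for $p > n$ is easy to reproduce. The sketch below uses simulated standard normal data (an assumption made purely for illustration) and counts the non-zero eigenvalues of the sample covariance matrix.

```r
set.seed(1)
n <- 10   # observations
p <- 25   # variables, so p > n
Y <- matrix(rnorm(n * p), nrow = n, ncol = p)
S <- cov(Y)

# Eigenvalues of S; those that are numerically zero signal rank deficiency.
ev <- eigen(S, symmetric = TRUE, only.values = TRUE)$values
rank_S <- sum(ev > 1e-8 * max(ev))
smallest_eigenvalue <- min(ev)

cat("rank of S:", rank_S, "out of p =", p, "\n")
cat("smallest eigenvalue:", signif(smallest_eigenvalue, 3), "\n")
```

After column centering, the $n$ rows of the data span at most $n - 1$ dimensions, so $S$ has rank at most $n - 1 < p$ and cannot be inverted.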

Sample Covariance and Precision: Explaining the inverse
- The scalar inverse: Let $a$ denote a number (excluding 0). The inverse is then the number $b$ such that $a \cdot b = 1$. Clearly, $b = 1/a$.
- Matrix: A matrix is a generalization of a number, an array of numbers:

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1p} \\ a_{21} & a_{22} & \cdots & a_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ a_{p1} & a_{p2} & \cdots & a_{pp} \end{pmatrix}$$

Sample Covariance and Precision: Explaining the inverse
- The matrix inverse: Consider the matrix $A$. Its inverse $B = A^{-1}$ is defined such that $AB = I$, where

$$I = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}$$

- Solution: Partitioning $A$ into blocks $A_{11}$, $A_{12}$, $A_{21}$, $A_{22}$,

$$A^{-1} = \begin{bmatrix} A_{11}^{-1} + A_{11}^{-1} A_{12} Q^{-1} A_{21} A_{11}^{-1} & -A_{11}^{-1} A_{12} Q^{-1} \\ -Q^{-1} A_{21} A_{11}^{-1} & Q^{-1} \end{bmatrix},$$

with $Q$ denoting the Schur complement, $Q = A_{22} - A_{21} A_{11}^{-1} A_{12}$.
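The block formula can be verified numerically. The sketch below builds an arbitrary positive definite test matrix (random, assumed for illustration), assembles the inverse block by block via the Schur complement, and checks that $AB = I$.

```r
set.seed(2)
p <- 5
X <- matrix(rnorm(20 * p), ncol = p)
A <- crossprod(X)            # symmetric positive definite test matrix (assumed)

i1 <- 1:2                    # indices of the first block
i2 <- 3:p                    # indices of the second block
A11 <- A[i1, i1]; A12 <- A[i1, i2]
A21 <- A[i2, i1]; A22 <- A[i2, i2]

A11inv <- solve(A11)
Q <- A22 - A21 %*% A11inv %*% A12     # Schur complement of A11
Qinv <- solve(Q)

# Assemble the inverse block by block.
B <- rbind(
  cbind(A11inv + A11inv %*% A12 %*% Qinv %*% A21 %*% A11inv, -A11inv %*% A12 %*% Qinv),
  cbind(-Qinv %*% A21 %*% A11inv,                            Qinv)
)

max_error <- max(abs(A %*% B - diag(p)))
cat("largest deviation of A %*% B from the identity:", max_error, "\n")
```

The formula requires both $A_{11}$ and $Q$ to be invertible, which is exactly what fails when the matrix is singular.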

Sample Covariance and Precision: Singularity [figure]

The Ridge Precision Estimator: Ridge estimator of the precision matrix
- Ridge regularization: Analytic penalized ML estimator:

$$\hat{\Omega}(\lambda) = \left\{ \left[ \lambda I_p + \tfrac{1}{4} (S - \lambda T)^2 \right]^{1/2} + \tfrac{1}{2} (S - \lambda T) \right\}^{-1},$$

  where $T$ denotes a p.d. symmetric target matrix and $\lambda \in (0, \infty)$ denotes a penalty parameter
- To do:
  - Choose the value of the penalty parameter
  - Determine the support
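A minimal base R sketch of the estimator above, written directly from the formula and not taken from the rags2ridges source. The function name `ridge_precision` and the identity default for the target $T$ are assumptions made for illustration. Because $S - \lambda T$ is symmetric, the matrix square root and the inverse can both be computed from a single eigendecomposition.

```r
# Hypothetical helper: ridge precision estimate from the closed-form expression.
ridge_precision <- function(S, lambda, target = diag(nrow(S))) {
  stopifnot(lambda > 0, isSymmetric(unname(S)))
  E <- S - lambda * target
  eig <- eigen(E, symmetric = TRUE)
  d <- eig$values
  # Eigenvalues of {[lambda I + E^2 / 4]^(1/2) + E / 2}^(-1); always positive.
  omega_eig <- 1 / (sqrt(lambda + d^2 / 4) + d / 2)
  eig$vectors %*% (omega_eig * t(eig$vectors))
}

# Usage on simulated data with p > n, where S itself is singular.
set.seed(3)
n <- 10; p <- 25
Y <- matrix(rnorm(n * p), nrow = n)
S <- cov(Y)
lambda <- 0.5
Omega_hat <- ridge_precision(S, lambda)

min_eig <- min(eigen(Omega_hat, symmetric = TRUE, only.values = TRUE)$values)
cat("smallest eigenvalue of the ridge estimate:", signif(min_eig, 3), "\n")
```

Each eigenvalue $d$ of $S - \lambda T$ maps to $1 / (\sqrt{\lambda + d^2/4} + d/2) > 0$, so the estimate is positive definite for any $\lambda > 0$, including when $p > n$.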

The Ridge Precision Estimator: Visual explanation [figure]

The Ridge Precision Estimator: Choosing the penalty value
K-fold cross-validation (CV) [figure: single iteration of K-fold CV]

The Ridge Precision Estimator: Choosing the penalty value
- K-fold CV score:

$$\varphi^{K}(\lambda) = \sum_{k=1}^{K} n_k \left\{ -\ln\left|\hat{\Omega}(\lambda)_{\neg k}\right| + \mathrm{tr}\left[\hat{\Omega}(\lambda)_{\neg k} S_k\right] \right\},$$

  where $n_k$ is the size of subset $k$, for $k = 1, \ldots, K$ disjoint subsets; $S_k$ denotes the sample covariance matrix on the $k$th test set; $\hat{\Omega}(\lambda)_{\neg k}$ denotes the estimated regularized precision matrix on the $k$th training set
- Highest predictive accuracy: Choose $n_k = 1$, such that $K = n$ (known as leave-one-out CV, LOOCV)
- Problem: K-fold CV is computationally demanding for large $p$ and/or large $K$
- Solution: Computationally efficient approximate LOOCV score
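The K-fold score can be sketched in base R as follows. This is an illustration under stated assumptions, not the rags2ridges implementation: the helper names `ridge_precision` and `cv_score` are hypothetical, the target is the identity matrix, the data are simulated and column-centred, and the test-set covariance is taken around zero so that it remains defined for a single held-out observation. The score is read as a negative log-likelihood, so smaller values are better.

```r
# Hypothetical helper: closed-form ridge precision estimate (identity target assumed).
ridge_precision <- function(S, lambda, target = diag(nrow(S))) {
  eig <- eigen(S - lambda * target, symmetric = TRUE)
  d <- eig$values
  eig$vectors %*% ((1 / (sqrt(lambda + d^2 / 4) + d / 2)) * t(eig$vectors))
}

# Hypothetical helper: K-fold cross-validated negative log-likelihood score.
cv_score <- function(Y, lambda, K = 5) {
  n <- nrow(Y)
  folds <- split(seq_len(n), rep_len(seq_len(K), n))   # deterministic fold assignment
  score <- 0
  for (test in folds) {
    Ytrain <- Y[-test, , drop = FALSE]
    Ytest  <- Y[test, , drop = FALSE]
    Omega  <- ridge_precision(cov(Ytrain), lambda)
    Sk <- crossprod(Ytest) / nrow(Ytest)               # test covariance around zero mean
    logdet <- as.numeric(determinant(Omega, logarithm = TRUE)$modulus)
    score <- score + nrow(Ytest) * (-logdet + sum(Omega * Sk))  # sum() gives tr(Omega Sk)
  }
  score
}

set.seed(4)
n <- 30; p <- 10
Y <- scale(matrix(rnorm(n * p), nrow = n), center = TRUE, scale = FALSE)
lambdas <- 10^seq(-2, 1, length.out = 20)
scores <- sapply(lambdas, function(l) cv_score(Y, l, K = 5))
best_lambda <- lambdas[which.min(scores)]
cat("penalty with the lowest CV score:", signif(best_lambda, 3), "\n")
```

Setting `K = nrow(Y)` gives exact LOOCV on this grid, at the cost of one precision estimate per observation and per penalty value, which is the expense the approximate LOOCV score is meant to avoid.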

The Ridge Precision Estimator: Edge selection
- Mixture distribution: The partial correlation distribution is modeled by a mixture distribution

$$\eta_0 f_0 + (1 - \eta_0) f_\varepsilon,$$

  where $\eta_0 \in [0, 1]$ is the mixture weight; $f_0$ is the distribution of a null-edge; $f_\varepsilon$ is the distribution of a present edge
- Posterior probability of edge presence: Allows one to determine the empirical posterior probability that an edge is present given the value of the estimated partial correlation
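Written out, the posterior probability follows from Bayes' rule applied to the two-component mixture. The symbol $\rho$ for an estimated partial correlation and $f$ for the mixture density are notation introduced here for illustration; the ratio $\eta_0 f_0(\rho) / f(\rho)$ is the quantity usually called the local false discovery rate.

```latex
\[
f(\rho) = \eta_0 f_0(\rho) + (1 - \eta_0) f_\varepsilon(\rho),
\qquad
P(\text{edge present} \mid \rho)
  = 1 - \frac{\eta_0 f_0(\rho)}{f(\rho)}
  = \frac{(1 - \eta_0) f_\varepsilon(\rho)}{f(\rho)}.
\]
```

An edge is then retained when this probability exceeds a chosen cut-off; the `FDRcut=0.95` argument on the code slide presumably plays this role.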

Illustration: Example
Data:
- TCGA breast cancer data (http://cancergenome.nih.gov/)
- MAPK pathway genes (as defined by KEGG)
- $p = 262$, $n = 496$

Illustration: Comparison
Data:
- UPP ER+ breast cancer data (http://www.bioconductor.org/)
- Apoptosis pathway genes (as defined by KEGG)
- $p = 83$

Illustration: Comparison [figure]

Illustration: Software
- rags2ridges: R package that implements the ridge estimator and supporting functionalities for graphical modeling
- Availability: Available for free from the Comprehensive R Archive Network: http://cran.r-project.org/web/packages/rags2ridges/index.html
- R: R is a free software programming language and software environment for statistical computing

4 Directed Cyclic Mixed Graphs for Genomic Data Integration

Model: Model and assumptions
- Model: The SEM model we consider can be expressed as:

$$y_i := B y_i + \Gamma x_i + \epsilon_i, \qquad i = 1, \ldots, n.$$

- Assumptions:
  1. Properly preprocessed data
  2. $y_i \perp\!\!\!\perp y_{i'}$, $\forall i \neq i'$
  3. $\epsilon_i \sim N_p(0, \Psi)$, with $\Psi \equiv \mathrm{diag}[\psi_{11}, \ldots, \psi_{pp}]$, and $\psi_{jj} > 0$, $\forall j$
  4. $x_i \sim N_q(0, \Phi)$, with $\Phi \succ 0$
  5. $x_i \perp\!\!\!\perp \epsilon_{i'}$, $\forall i, i'$
  6. $(I_p - B)$ is nonsingular and $\beta_{jj} = 0$, $\forall j$
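To make the structural equation concrete, the sketch below simulates data from it in base R. All parameter values are hypothetical and chosen only so that the listed assumptions hold; $B$ has a zero diagonal and contains a feedback cycle between $y_1$ and $y_2$. Because $(I_p - B)$ is nonsingular, the model can be solved for $y_i$ as $y_i = (I_p - B)^{-1}(\Gamma x_i + \epsilon_i)$.

```r
set.seed(5)
n <- 200; p <- 3; q <- 2

# Hypothetical parameter values; B has a zero diagonal and a cycle y1 <-> y2.
B <- matrix(c(0.0, 0.4, 0.0,
              0.3, 0.0, 0.0,
              0.0, 0.5, 0.0), nrow = p, byrow = TRUE)
Gamma <- matrix(c(0.8, 0.0,
                  0.0, 0.6,
                  0.0, 0.0), nrow = p, byrow = TRUE)
Psi <- diag(c(1.0, 0.5, 0.8))           # diagonal error covariance
Phi <- matrix(c(1.0, 0.3,
                0.3, 1.0), nrow = q)    # positive definite covariance of x

IB <- diag(p) - B
stopifnot(abs(det(IB)) > 1e-10)         # assumption 6: (I - B) nonsingular

# Draw x_i and eps_i (as rows), then solve the structural equations for y_i.
X <- matrix(rnorm(n * q), nrow = n) %*% chol(Phi)
E <- matrix(rnorm(n * p), nrow = n) %*% sqrt(Psi)
Y <- (X %*% t(Gamma) + E) %*% t(solve(IB))

# Model-implied covariance of y: (I - B)^{-1} (Gamma Phi Gamma' + Psi) (I - B)^{-T}
Sigma_yy <- solve(IB) %*% (Gamma %*% Phi %*% t(Gamma) + Psi) %*% t(solve(IB))
print(round(Sigma_yy, 3))
```

The simulated rows of `Y` satisfy $y_i = B y_i + \Gamma x_i + \epsilon_i$ exactly, and their sample covariance approaches `Sigma_yy` as $n$ grows.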

Model: Graphical representation [figure]
- Question: Can we read off conditional independencies?

Model as Graphical Object: m-separation [figure]
- Stretching the idea of the collider

Model as Graphical Object: Directed cyclic mixed graph [figure]

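Model as Graphical Object: Approach
Steps:
1. Regularize the joint sample covariance matrix on $y_i$ and $x_i$
2. Test for vanishing partial correlations to obtain a sparse representation
3. Solve for parameters with a simple iterative algorithm

$$\left(\begin{array}{cccc|cccc}
\omega^{yy}_{11} & \omega^{yy}_{12} & \omega^{yy}_{13} & \omega^{yy}_{14} & \omega^{yx}_{11} & \omega^{yx}_{12} & 0 & 0 \\
\omega^{yy}_{21} & \omega^{yy}_{22} & \omega^{yy}_{23} & \omega^{yy}_{24} & \omega^{yx}_{21} & 0 & 0 & 0 \\
\omega^{yy}_{31} & \omega^{yy}_{32} & \omega^{yy}_{33} & 0 & \omega^{yx}_{31} & 0 & 0 & 0 \\
\omega^{yy}_{41} & \omega^{yy}_{42} & 0 & \omega^{yy}_{44} & 0 & \omega^{yx}_{42} & 0 & 0 \\
\hline
\omega^{xy}_{11} & \omega^{xy}_{12} & \omega^{xy}_{13} & 0 & \omega^{xx}_{11} & 0 & \omega^{xx}_{13} & \omega^{xx}_{14} \\
\omega^{xy}_{21} & 0 & 0 & \omega^{xy}_{24} & 0 & \omega^{xx}_{22} & 0 & 0 \\
0 & 0 & 0 & 0 & \omega^{xx}_{31} & 0 & \omega^{xx}_{33} & \omega^{xx}_{34} \\
0 & 0 & 0 & 0 & \omega^{xx}_{41} & 0 & \omega^{xx}_{43} & \omega^{xx}_{44}
\end{array}\right)$$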

Illustration: Application: GBM [figures]

5 So What and Further Research

So what?
Why of interest:
- Enables exploration of networks in situations unsuitable for standard statistics
- Can aid in the identification of more robust markers
- Can point to markers of interest for perturbation experiments
- Can aid in focussing temporal experiments

Further research
Extensions:
- Consider data from more than 2 platforms
- Modeling differential networks
- Modeling temporal networks

References
- Koster, J.T.A. (1996) Markov Properties of Nonrecursive Causal Models. Annals of Statistics, 24:2148
- Pearl, J. (2009, 2nd ed.) Causality: Models, reasoning, and inference. Cambridge, UK: Cambridge University Press
- Peeters, C.F.W., & van Wieringen, W.N. (2014) rags2ridges: Ridge estimation of precision matrices from high-dimensional data. R Package Version 1.2
- Peeters, C.F.W., van Wieringen, W.N., & van de Wiel, M.A. (in preparation) Gaussian Directed Cyclic Mixed Graph Modeling for Genomic Data Integration.
- Richardson, T. (2003) Markov properties for acyclic directed mixed graphs. Scandinavian Journal of Statistics, 30:145
- Schäfer, J., & Strimmer, K. (2005) A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4:32
- Vujačić, I., Abbruzzo, A., & Wit, E.C. (2014) A computationally fast alternative to cross-validation in penalized Gaussian graphical models. arXiv:1309.6216v2 [stat.ME]
- van Wieringen, W.N., & Peeters, C.F.W. (under review) Ridge Estimation of Inverse Covariance Matrices from High-Dimensional Data. arXiv:1403.0904 [stat.ME]