An Introduction to the Use of Bayesian Networks to Analyze Gene Expression Data
Cristina Manfredotti
Dipartimento di Informatica, Sistemistica e Comunicazione (D.I.S.Co.)
Università degli Studi Milano-Bicocca
manfredotti@disco.unimib.it
Introduction
- A central goal of molecular biology is to understand the regulation of protein synthesis.
- DNA microarray experiments can measure thousands of gene expression levels simultaneously.
- An important challenge is to develop methodologies that are both statistically sound and computationally tractable.
- Proposed approach: Bayesian network learning.
Biological Background
DNA
- DNA is a double-stranded molecule.
- It encodes hereditary information.
Gene
- A gene is a segment of DNA.
- It contains the information required to make a protein.
Motivations
- Each gene encodes a protein, and proteins are the functional units of life.
- Every gene is present in every cell, but only a fraction of the genes are expressed at any time.
- Many diseases result from the interaction between genes.
- Understanding the mechanisms that determine which genes are expressed, and when, is the key to developing new treatments for diseases.
Bayesian Networks
Prior work
- Clustering of expression data.
- Groups together genes with similar expression patterns.
- Disadvantage: does not reveal structural relations between genes.
Big challenge
- Extract meaningful information from the expression data.
- Discover interactions between genes based on the measurements.
Bayesian Networks
A Bayesian Network (BN) is a graphical representation of a probability distribution.
- Compact and intuitive representation.
- Useful for describing processes composed of locally interacting components.
- Has a good statistical foundation.
- Efficient model learning algorithms.
- Captures causal relationships.
- Deals with noisy data.
Representing Distributions
A Bayesian network is a representation of a joint probability distribution. It has two components:
- $G$: a directed acyclic graph structure.
- $\Theta$: a set of parameters for the conditional distribution of each variable.
The joint probability distribution of $\{X_1, \dots, X_n\}$ is represented by a Bayesian network as follows:
$$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\big(X_i \mid Pa^G(X_i)\big)$$
where $Pa^G(X_i)$ is the set of parents of $X_i$ in the graph $G$.
An Example of a Simple BN
[Figure: a network over Gene A, Gene E, Gene B, Gene D and Gene C, with edges A → B, E → B, A → D and B → C]
- Gene B and Gene D are independent given Gene A.
- Gene B asserts dependency between Gene A and Gene E.
- Gene A and Gene C are independent given Gene B.
$$P(A,B,C,D,E) = P(A)\,P(B \mid A)\,P(C \mid A,B)\,P(D \mid A,B,C)\,P(E \mid A,B,C,D) = P(A)\,P(B \mid A,E)\,P(C \mid B)\,P(D \mid A)\,P(E)$$
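The factorization of the five-gene example can be checked numerically. The sketch below is only illustrative: it assumes binary (ON/OFF) genes, and every probability in `P_ON` is a made-up value, not taken from the slides.

```python
from itertools import product

# Parents of each gene in the example structure: A -> B <- E, A -> D, B -> C.
PARENTS = {"A": (), "E": (), "B": ("A", "E"), "D": ("A",), "C": ("B",)}

# Hypothetical CPTs: P(gene = 1 | parent values), keyed by the tuple of parent values.
P_ON = {
    "A": {(): 0.3},
    "E": {(): 0.6},
    "B": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.9},
    "D": {(0,): 0.2, (1,): 0.8},
    "C": {(0,): 0.25, (1,): 0.75},
}

def joint(assignment):
    """P(A, B, C, D, E) as the product of one local conditional term per gene."""
    p = 1.0
    for gene, parents in PARENTS.items():
        key = tuple(assignment[q] for q in parents)
        p_on = P_ON[gene][key]
        p *= p_on if assignment[gene] == 1 else 1.0 - p_on
    return p

genes = list(PARENTS)
# Summing the product form over all 32 ON/OFF assignments should give 1.
total = sum(joint(dict(zip(genes, values))) for values in product((0, 1), repeat=len(genes)))
print(round(total, 10))
```

Only 1 + 1 + 4 + 2 + 2 = 10 numbers are needed here, instead of the 31 required for an unrestricted joint distribution over five binary variables, which is the sense in which the representation is compact.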
Learning Bayesian Networks
Given a training set $D = \{x_1, \dots, x_N\}$ of independent instances of $X$, find a network $B = \langle G, \Theta \rangle$ that best matches $D$.
The score function for a network is defined as
$$S(G : D) = P(G \mid D) = \frac{P(D \mid G)\,P(G)}{P(D)}$$
where
$$P(D \mid G) = \int P(D \mid G, \Theta)\,P(\Theta \mid G)\,d\Theta$$
is the marginal likelihood, which averages the probability of the data over all possible parameter assignments to $G$.
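For discrete variables the integral over $\Theta$ has a closed form when each row of each conditional probability table is given a Dirichlet prior. The sketch below assumes binary variables and a symmetric Dirichlet(alpha, alpha) prior; both the prior and the toy data are assumptions for illustration, not choices stated on the slides.

```python
import math
from collections import Counter

def log_marginal_likelihood(data, parents, alpha=1.0):
    """log P(D | G) for binary variables with a Dirichlet(alpha, alpha) prior per CPT row.

    data: list of dicts mapping variable name -> 0 or 1.
    parents: dict mapping variable name -> tuple of parent names (the graph G).
    """
    total = 0.0
    for var, pa in parents.items():
        # counts[(parent configuration, value)] = number of matching instances
        counts = Counter((tuple(x[p] for p in pa), x[var]) for x in data)
        for config in {c for c, _ in counts}:
            n0, n1 = counts[(config, 0)], counts[(config, 1)]
            total += math.lgamma(2 * alpha) - math.lgamma(2 * alpha + n0 + n1)
            total += math.lgamma(alpha + n0) + math.lgamma(alpha + n1) - 2 * math.lgamma(alpha)
    return total

# Hypothetical toy data in which gene B always copies gene A.
data = [{"A": a, "B": a} for a in (0, 1, 0, 1, 1, 0, 1, 0)]
g_empty = {"A": (), "B": ()}      # no edge
g_edge = {"A": (), "B": ("A",)}   # A -> B

# With a uniform prior P(G), ranking graphs by log P(D | G) ranks them by P(G | D).
print(log_marginal_likelihood(data, g_empty), log_marginal_likelihood(data, g_edge))
```

On this toy data the structure with the edge gets the higher marginal likelihood, since the averaging over parameters penalizes the extra table rows less than the dependence rewards them.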
Learning Bayesian Networks
Directed acyclic graph structure G [figure]
A simple example
We want to construct a BN of a system composed of 3 genes (A, B and C), each of which can be ON or OFF. Given the training set D:
- Fix a number of iterations M.
- Choose M structures $G_j$ at random (each a binary square matrix).
- Learn the conditional probability tables for each structure.
- Choose the graph that has the maximum score.
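The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the training set `D` is made up, structures are kept acyclic by drawing edges along a random ordering, tables are estimated by counting with add-one smoothing, and the score is the mean likelihood of the training instances. None of these details is confirmed as the procedure used in the slides.

```python
import random
from itertools import product

def random_dag(n, rng):
    """Random binary square matrix, adj[i][j] = 1 meaning edge i -> j, acyclic by construction."""
    order = list(range(n))
    rng.shuffle(order)  # edges only go from earlier to later in this random order
    adj = [[0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            if rng.random() < 0.5:
                adj[order[a]][order[b]] = 1
    return adj

def learn_cpts(adj, data):
    """Estimate P(X_j = 1 | parents) by counting, with add-one smoothing (an assumption)."""
    n = len(adj)
    cpts = []
    for j in range(n):
        pa = [i for i in range(n) if adj[i][j]]
        table = {}
        for config in product((0, 1), repeat=len(pa)):
            rows = [x for x in data if tuple(x[i] for i in pa) == config]
            table[config] = (sum(x[j] for x in rows) + 1) / (len(rows) + 2)
        cpts.append((pa, table))
    return cpts

def likelihood(x, cpts):
    """P(x | G, learned tables) as a product of one CPT entry per gene."""
    p = 1.0
    for j, (pa, table) in enumerate(cpts):
        p_on = table[tuple(x[i] for i in pa)]
        p *= p_on if x[j] == 1 else 1.0 - p_on
    return p

def score(adj, data):
    """Mean likelihood of the training instances under the structure."""
    cpts = learn_cpts(adj, data)
    return sum(likelihood(x, cpts) for x in data) / len(data)

def random_search(data, m, seed=0):
    """Draw m random structures and return the one with the highest score."""
    rng = random.Random(seed)
    candidates = [random_dag(len(data[0]), rng) for _ in range(m)]
    return max(candidates, key=lambda adj: score(adj, data))

# Hypothetical training set: rows are (A, B, C), 1 = ON, 0 = OFF.
D = [(1, 1, 1), (0, 0, 0), (1, 1, 0), (0, 0, 1), (1, 1, 1), (0, 0, 0)]
best = random_search(D, m=6)
print(best, score(best, D))
```

A likelihood-based score with no complexity penalty tends to favor denser graphs, which is one reason a Bayesian score with a marginal likelihood is usually preferred for real data.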
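A simple example
[Table: the training set D, ON/OFF observations of A, B and C, with its size |D| and the number of candidate structures M]
Structures:
[Figure: the candidate structures $G_1, \dots, G_j, \dots$ over A, B and C, including $G_5$]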
$G_1$: [Tables: the probability tables for A, B and C learned from D under structure $G_1$]
$G_5$: [Tables: the probability tables for A, B and C learned from D under structure $G_5$]
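A simple example
[Table: the training set D over A, B and C]
$G_1$: for each instance $D_i$ of the training set, $P(D_i \mid G_1)\,P(G_1)$ is computed as a product of table entries and the structure prior. Score $= \frac{1}{n} \sum_i P(D_i \mid G_1)$
$G_5$: for each instance $D_i$ of the training set, $P(D_i \mid G_5)\,P(G_5)$ is computed in the same way. Score $= \frac{1}{n} \sum_i P(D_i \mid G_5)$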
Analyzing Expression Data
Practical problem: small data sets.
- Variables: hundreds or thousands of genes.
- Samples: just tens of microarray experiments.
On the positive side, genetic regulation networks are sparse.
- Characterize and learn features that are common to most of these networks.
Analyzing Expression Data: The first feature
Markov relations
- Symmetric relation: Y is in X's Markov blanket iff there is either an edge between them, or both are parents of another variable (Pearl 1988).
- Biological interpretation: a Markov relation indicates that the two genes are related in some joint biological interaction or process.
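The Markov relation can be read directly off a graph structure. The sketch below computes the Markov blanket of a node from a parent map; the five-gene structure used to exercise it is an assumption for illustration.

```python
def markov_blanket(parents, x):
    """Parents of x, children of x, and the other parents of x's children."""
    children = [v for v, pa in parents.items() if x in pa]
    blanket = set(parents[x]) | set(children)
    for child in children:
        blanket |= set(parents[child])
    blanket.discard(x)
    return blanket

# Assumed example structure: A -> B <- E, A -> D, B -> C.
PARENTS = {"A": (), "E": (), "B": ("A", "E"), "D": ("A",), "C": ("B",)}

print(markov_blanket(PARENTS, "A"))  # B and D are children; E shares the child B
print(markov_blanket(PARENTS, "E"))  # B is a child; A shares the child B
```

A and E have no edge between them, yet each is in the other's blanket because both are parents of B, which illustrates the second clause of the definition and the symmetry of the relation.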
Analyzing Expression Data: The second feature
Order relations
- Global property: A is an ancestor of B in all the equivalent Bayesian networks learned.
- Biological interpretation: an order relation indicates that the transcription of one gene is a cause (possibly indirect) of the transcription of another gene.
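An order relation has to hold in every network of an equivalence class, not just in one graph. The sketch below checks this for an explicitly listed collection of graphs; enumerating the equivalence class is left out, and the two small classes shown are written by hand as illustrative assumptions.

```python
def ancestors(parents, x):
    """All variables with a directed path to x in the DAG described by `parents`."""
    found, stack = set(), list(parents[x])
    while stack:
        v = stack.pop()
        if v not in found:
            found.add(v)
            stack.extend(parents[v])
    return found

def order_holds_in_all(dags, x, y):
    """True if x is an ancestor of y in every DAG of the given collection."""
    return all(x in ancestors(g, y) for g in dags)

# The chain A -> B -> C is equivalent to A <- B <- C and A <- B -> C.
chain_class = [
    {"A": (), "B": ("A",), "C": ("B",)},
    {"A": ("B",), "B": ("C",), "C": ()},
    {"A": ("B",), "B": (), "C": ("B",)},
]
# The collider A -> B <- E has no other equivalent orientation.
collider_class = [{"A": (), "E": (), "B": ("A", "E")}]

print(order_holds_in_all(chain_class, "A", "C"))     # edge directions are not identifiable here
print(order_holds_in_all(collider_class, "A", "B"))  # the direction is shared by the whole class
```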
Estimating Statistical Confidence in Features
To what extent does the data support a given feature?
An effective and relatively simple approach for estimating confidence is the bootstrap method.
- For $i = 1, \dots, m$:
  - Re-sample with replacement $N$ instances from $D$. Denote by $D_i$ the resulting dataset.
  - Apply the learning procedure on $D_i$ to induce a network structure $G_i$.
- For each feature $f$ of interest calculate
$$\mathrm{conf}(f) = \frac{1}{m} \sum_{i=1}^{m} f(G_i)$$
where $f(G)$ is 1 if $f$ is a feature in $G$, and 0 otherwise.
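The bootstrap loop is independent of the learning procedure, so it can be sketched with the learner passed in as a function. In the sketch below, `learn_toy` is a hypothetical stand-in for a real structure learner and the dataset is made up; only `bootstrap_confidence` follows the procedure described above.

```python
import random

def bootstrap_confidence(data, learn, features, m=100, seed=0):
    """conf(f) = (1/m) * sum_i f(G_i), with G_i learned from the i-th resample of the data."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(m):
        resample = [rng.choice(data) for _ in data]  # N draws with replacement
        graph = learn(resample)
        for name, f in features.items():
            hits[name] += 1 if f(graph) else 0
    return {name: count / m for name, count in hits.items()}

def learn_toy(data):
    """Hypothetical stand-in learner: link two genes when they agree in at least 90% of rows."""
    n_vars = len(data[0])
    edges = set()
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            agree = sum(x[i] == x[j] for x in data) / len(data)
            if agree >= 0.9:
                edges.add((i, j))
    return edges

# Hypothetical data: rows are (A, B, C); B always equals A, C is unrelated to A.
D = [(1, 1, 0), (0, 0, 0), (1, 1, 1), (0, 0, 1), (1, 1, 0), (0, 0, 1), (1, 1, 1), (0, 0, 0)]
features = {"A-B": lambda g: (0, 1) in g, "A-C": lambda g: (0, 2) in g}
conf = bootstrap_confidence(D, learn_toy, features, m=100)
print(conf)
```

Features that appear in nearly every resampled network get confidence close to 1, while features that depend on a few particular samples get low confidence.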
How to collect data:
- Gene knock down
- Gene knock out
- Compound
- Tissue microarray
- Time course
Where are we going