An Introduction to the Use of Bayesian Network to Analyze Gene Expression Data




Cristina Manfredotti
Dipartimento di Informatica, Sistemistica e Comunicazione (D.I.S.Co.), Università degli Studi Milano-Bicocca
manfredotti@disco.unimib.it

Introduction
- A central goal of molecular biology is to understand the regulation of protein synthesis.
- DNA microarray experiments can measure thousands of gene expression levels simultaneously.
- An important challenge is to develop methodologies that are both statistically sound and computationally tractable.
- Approach presented here: Bayesian network learning.

Biological Background
- DNA
  - DNA is a double-stranded molecule.
  - It encodes hereditary information.
- Gene
  - A gene is a segment of DNA.
  - It contains the information required to make a protein.

Motivations
- Each gene encodes a protein, and proteins are the functional units of life.
- Every gene is present in every cell, but only a fraction of the genes are expressed at any time.
- Many diseases result from the interaction between genes.
- Understanding the mechanisms that determine which genes are expressed, and when they are expressed, is the key to the development of new treatments for diseases.

Bayesian Networks
- Prior work: clustering of expression data
  - Groups together genes with similar expression patterns.
  - Disadvantage: does not reveal structural relations between genes.
- Big challenge
  - Extract meaningful information from the expression data.
  - Discover interactions between genes based on the measurements.

Bayesian Networks
A Bayesian Network (BN) is a graphical representation of a probability distribution.
- Compact and intuitive representation
- Useful for describing processes composed of locally interacting components
- Good statistical foundation
- Efficient model learning algorithms
- Can capture causal relationships
- Deals with noisy data

Representing Distributions
A Bayesian network is a representation of a joint probability distribution. It has two components:
- G: a directed acyclic graph structure
- Θ: a set of parameters for the conditional distribution of each variable

The joint probability distribution of $\{X_1, \ldots, X_n\}$ is represented by a Bayesian network as follows:

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}_G(X_i)\bigr)$$

where $\mathrm{Pa}_G(X_i)$ is the set of parents of $X_i$ in the graph G.
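The factorization above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-variable network (A → B, A → C), the variable names and every probability value are hypothetical and do not come from the slides.

```python
from itertools import product

# Hypothetical network (not from the slides): A -> B and A -> C, all binary.
parents = {"A": (), "B": ("A",), "C": ("A",)}

# cpt[var][parent_values] = P(var = 1 | parents = parent_values); made-up numbers.
cpt = {
    "A": {(): 0.3},
    "B": {(0,): 0.2, (1,): 0.9},
    "C": {(0,): 0.5, (1,): 0.1},
}

def joint_probability(assignment):
    """Multiply P(X_i | Pa_G(X_i)) over all variables for one full assignment."""
    p = 1.0
    for var, pa in parents.items():
        pa_vals = tuple(assignment[q] for q in pa)
        p_one = cpt[var][pa_vals]
        p *= p_one if assignment[var] == 1 else 1.0 - p_one
    return p

# Summing the product over all 2^3 assignments should give 1.
total = sum(
    joint_probability(dict(zip("ABC", vals)))
    for vals in product((0, 1), repeat=3)
)
print(total)
```

Storing one small table per variable, rather than one table over all joint assignments, is what makes the representation compact when each variable has few parents.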

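An Example of a Simple BN
[Figure: network over Gene A, Gene B, Gene C, Gene D and Gene E, with edges A → B, E → B, B → C and A → D.]
- Gene B and Gene D are independent given Gene A.
- Gene B asserts dependency between Gene A and Gene E.
- Gene A and Gene C are independent given Gene B.

$$P(A,B,C,D,E) = P(A)\,P(B \mid A)\,P(C \mid A,B)\,P(D \mid A,B,C)\,P(E \mid A,B,C,D)$$
$$P(A,B,C,D,E) = P(A)\,P(B \mid A,E)\,P(C \mid B)\,P(D \mid A)\,P(E)$$

The first line is the general chain rule; the second is the factorization implied by the graph.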
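The independence statements of this example can be checked numerically. The sketch below assumes the example network factorizes as P(A) P(B | A, E) P(C | B) P(D | A) P(E); all probability values are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical parameters (not from the slides); each value is P(var = 1 | ...).
P_A = 0.4
P_E = 0.7
P_B = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.95}  # keyed by (A, E)
P_C = {0: 0.2, 1: 0.8}  # keyed by B
P_D = {0: 0.3, 1: 0.9}  # keyed by A

NAMES = "ABCDE"

def bern(p_one, value):
    """Probability of a binary value given P(value = 1)."""
    return p_one if value == 1 else 1.0 - p_one

def joint(a, b, c, d, e):
    """Assumed factorization P(A) P(E) P(B|A,E) P(C|B) P(D|A)."""
    return (bern(P_A, a) * bern(P_E, e) * bern(P_B[(a, e)], b)
            * bern(P_C[b], c) * bern(P_D[a], d))

def prob(**fixed):
    """Marginal probability of a partial assignment, by brute-force enumeration."""
    total = 0.0
    for vals in product((0, 1), repeat=5):
        x = dict(zip(NAMES, vals))
        if all(x[k] == v for k, v in fixed.items()):
            total += joint(*vals)
    return total

def cond_indep(x, y, given, tol=1e-12):
    """True if P(x, y | given) = P(x | given) P(y | given) for all values."""
    for vx, vy, vz in product((0, 1), repeat=3):
        pz = prob(**{given: vz})
        lhs = prob(**{x: vx, y: vy, given: vz}) / pz
        rhs = (prob(**{x: vx, given: vz}) / pz) * (prob(**{y: vy, given: vz}) / pz)
        if abs(lhs - rhs) > tol:
            return False
    return True

print(cond_indep("B", "D", "A"))  # B and D given A
print(cond_indep("A", "C", "B"))  # A and C given B
print(cond_indep("A", "E", "B"))  # A and E become dependent once B is observed
```

With these particular numbers the first two checks hold and the third fails, which matches the reading that B, as a common child, induces a dependency between A and E.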

Learning Bayesian Networks
Given a training set $D = \{x_1, \ldots, x_N\}$ of independent instances of X, find a network $B = \langle G, \Theta \rangle$ that best matches D.

The score function for a network is defined as

$$S(G : D) = P(G \mid D) = \frac{P(D \mid G)\, P(G)}{P(D)}$$

where

$$P(D \mid G) = \int P(D \mid G, \Theta)\, P(\Theta \mid G)\, d\Theta$$

is the marginal likelihood, which averages the probability of the data over all possible parameter assignments to G.

Learning Bayesian Networks
Directed acyclic graph structure G:
[Figure not recoverable from the transcript.]


A simple example
We want to construct a BN of a system composed of 3 genes (A, B and C) that can be ON or OFF. Given the training set D:
- Fix a number of iterations M.
- Choose (randomly) M structures G_j, each encoded as a binary square matrix.
- Learn the Conditional Probability Table for each structure.
- Choose the graph that has the max score.
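The procedure above can be sketched as follows. Several choices here are assumptions rather than the slides' method: the training data is made up, cyclic random matrices are rejected by resampling, CPTs are estimated with add-one smoothing, and the score is the plain log-likelihood of the data (which tends to favor denser graphs, so a real score would add a prior or penalty).

```python
import math
import random
from itertools import product

def is_acyclic(adj):
    """Kahn's algorithm on an adjacency matrix; adj[i][j] == 1 means i -> j."""
    n = len(adj)
    indeg = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    queue = [j for j in range(n) if indeg[j] == 0]
    seen = 0
    while queue:
        i = queue.pop()
        seen += 1
        for j in range(n):
            if adj[i][j]:
                indeg[j] -= 1
                if indeg[j] == 0:
                    queue.append(j)
    return seen == n

def random_dag(n, rng):
    """Draw random binary square matrices until one encodes a DAG."""
    while True:
        adj = [[rng.randint(0, 1) if i != j else 0 for j in range(n)]
               for i in range(n)]
        if is_acyclic(adj):
            return adj

def learn_cpts(adj, data):
    """Estimate P(X_j = 1 | parents) by smoothed relative frequencies."""
    n = len(adj)
    cpts = []
    for j in range(n):
        pa = [i for i in range(n) if adj[i][j]]
        table = {}
        for pa_vals in product((0, 1), repeat=len(pa)):
            rows = [x for x in data if tuple(x[i] for i in pa) == pa_vals]
            ones = sum(x[j] for x in rows)
            table[pa_vals] = (ones + 1) / (len(rows) + 2)  # add-one smoothing
        cpts.append((pa, table))
    return cpts

def log_likelihood(cpts, data):
    """Log-probability of the data under the learned CPTs (stand-in score)."""
    ll = 0.0
    for x in data:
        for j, (pa, table) in enumerate(cpts):
            p_one = table[tuple(x[i] for i in pa)]
            ll += math.log(p_one if x[j] == 1 else 1.0 - p_one)
    return ll

def search(data, m, seed=0):
    """Sample m random structures and keep the best-scoring one."""
    rng = random.Random(seed)
    best = None
    for _ in range(m):
        adj = random_dag(len(data[0]), rng)
        score = log_likelihood(learn_cpts(adj, data), data)
        if best is None or score > best[0]:
            best = (score, adj)
    return best

# Hypothetical ON/OFF measurements for genes (A, B, C); not the slides' data.
DATA = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1), (1, 1, 1), (0, 0, 0),
        (1, 0, 1), (0, 1, 0), (1, 1, 1), (0, 0, 0), (1, 1, 0), (0, 0, 1),
        (1, 1, 1)]

best_score, best_adj = search(DATA, m=6)
print(best_score, best_adj)
```

Random sampling of structures is workable only for very small systems; with three genes there are just 25 possible DAGs, so exhaustive enumeration would also be feasible.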

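A simple example (continued)
[Table: training set D over genes A, B and C, with M = 6 candidate structures. Entries not recoverable; the CPT denominators imply |D| = 13.]

Structures G_j
[Figure: candidate graphs over A, B and C, among them G_1 and G_5. Edges not recoverable.]

Conditional probability tables
[Tables: CPTs for G_1 and G_5, estimated from D as relative frequencies, e.g. root-node probabilities 6/13 and 7/13 in G_1, and 4/13 and 9/13 in G_5. Remaining entries not recoverable.]

Scoring
For each structure G and each training instance D_i, the CPT entries that match D_i are multiplied by the structure prior P(G), giving P(D_i | G) P(G). The per-instance terms are then combined into a score for G_1 and for G_5, and the structure with the highest score is chosen.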

Analyzing Expression Data
- Practical problem: small data sets
  - Variables: hundreds or thousands of genes
  - Samples: just tens of microarray experiments
- On the positive side, genetic regulation networks are sparse.
- Goal: characterize and learn features that are common to most of these networks.

Analyzing Expression Data: The first feature
Markov relations
- Symmetric relation: Y is in X's Markov blanket iff there is either an edge between them, or both are parents of another variable (Pearl 1988).
- Biological interpretation: a Markov relation indicates that the two genes are related in some joint biological interaction or process.
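The definition of a Markov relation translates directly into a check on a graph's adjacency matrix. The sketch below is illustrative only: the matrix encoding (adj[i][j] == 1 means i → j) and the five-gene example graph with edges A → B, E → B, B → C, A → D are assumptions made for the demonstration.

```python
def markov_relations(adj):
    """Return the unordered pairs (i, j), i < j, that are Markov neighbors:
    joined by an edge in either direction, or both parents of some variable."""
    n = len(adj)
    pairs = set()
    for i in range(n):
        for j in range(i + 1, n):
            edge = adj[i][j] or adj[j][i]
            common_child = any(adj[i][k] and adj[j][k] for k in range(n))
            if edge or common_child:
                pairs.add((i, j))
    return pairs

# Assumed example graph, indices A=0, B=1, C=2, D=3, E=4.
ADJ = [
    [0, 1, 0, 1, 0],  # A -> B, A -> D
    [0, 0, 1, 0, 0],  # B -> C
    [0, 0, 0, 0, 0],  # C
    [0, 0, 0, 0, 0],  # D
    [0, 1, 0, 0, 0],  # E -> B
]

relations = markov_relations(ADJ)
print(sorted(relations))
```

In this graph A and E have no edge between them, yet they form a Markov relation because both are parents of B.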

Analyzing Expression Data: The second feature
Order relations
- Global property: A is an ancestor of B in all the equivalent Bayesian networks learned.
- Biological interpretation: an order relation is taken as an indication that the transcription of one gene is a cause, direct or indirect, of the transcription of another gene.
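For a single graph, the ancestor test behind an order relation is a simple reachability search. The sketch below covers only that single-graph step; the slide's feature additionally requires the relation to hold in every equivalent network, which this code does not check. The matrix encoding and the example graph (A → B, E → B, B → C, A → D) are assumptions.

```python
def ancestors(adj, target):
    """Return the set of nodes with a directed path to `target`;
    adj[i][j] == 1 means i -> j."""
    n = len(adj)
    found = set()
    stack = [i for i in range(n) if adj[i][target]]
    while stack:
        i = stack.pop()
        if i not in found:
            found.add(i)
            stack.extend(p for p in range(n) if adj[p][i])
    return found

# Assumed example graph, indices A=0, B=1, C=2, D=3, E=4.
ADJ = [
    [0, 1, 0, 1, 0],  # A -> B, A -> D
    [0, 0, 1, 0, 0],  # B -> C
    [0, 0, 0, 0, 0],  # C
    [0, 0, 0, 0, 0],  # D
    [0, 1, 0, 0, 0],  # E -> B
]

print(ancestors(ADJ, 2))  # ancestors of C
print(ancestors(ADJ, 3))  # ancestors of D
```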

Estimating Statistical Confidence in Features
To what extent does the data support a given feature? An effective and relatively simple approach for estimating confidence is the bootstrap method.
- For i = 1, ..., m:
  - Re-sample with replacement N instances from D. Denote by D_i the resulting dataset.
  - Apply the learning procedure on D_i to induce a network structure G_i.
- For each feature f of interest, calculate

$$\mathrm{conf}(f) = \frac{1}{m} \sum_{i=1}^{m} f(G_i)$$

where f(G) is 1 if f is a feature in G, and 0 otherwise.
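The bootstrap loop above can be sketched generically. In this illustration the learning procedure is passed in as a function; `toy_learner` is a hypothetical stand-in (it links two genes when their values agree in at least 80% of the rows) and is not a Bayesian network learner, and the data is made up.

```python
import random

def bootstrap_confidence(data, learn, features, m, seed=0):
    """conf(f) = (1/m) * sum_i f(G_i), where G_i is learned from the i-th
    resample (with replacement, same size as data)."""
    rng = random.Random(seed)
    counts = {name: 0 for name in features}
    for _ in range(m):
        resampled = [rng.choice(data) for _ in range(len(data))]
        graph = learn(resampled)
        for name, f in features.items():
            counts[name] += 1 if f(graph) else 0
    return {name: c / m for name, c in counts.items()}

def toy_learner(rows, threshold=0.8):
    """Hypothetical stand-in learner: returns undirected pairs (i, j) whose
    values agree in at least `threshold` of the rows."""
    n = len(rows[0])
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            agree = sum(r[i] == r[j] for r in rows) / len(rows)
            if agree >= threshold:
                edges.add((i, j))
    return edges

# Made-up ON/OFF data for three genes; genes 0 and 1 always agree.
DATA = [(0, 0, 1), (1, 1, 0), (1, 1, 1), (0, 0, 0),
        (1, 1, 0), (0, 0, 1), (1, 1, 1), (0, 0, 0)]

FEATURES = {
    "0-1": lambda g: (0, 1) in g,
    "0-2": lambda g: (0, 2) in g,
}

conf = bootstrap_confidence(DATA, toy_learner, FEATURES, m=200)
print(conf)
```

A feature that appears in nearly every resampled network gets a confidence near 1, while one that depends on a few particular samples gets a lower value.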

How to collect data:
- Gene knock down
- Gene knock out
- Compound
- Tissue microarray
- Time course

Where are we going