Integrating DNA Motif Discovery and Genome-Wide Expression Analysis. Erin M. Conlon

1 Integrating DNA Motif Discovery and Genome-Wide Expression Analysis. Department of Mathematics and Statistics, University of Massachusetts Amherst. Statistics in Functional Genomics Workshop, Ascona, Switzerland, June 30, 2004.

2 Motif Discovery. Identify short patterns in DNA sequence. These patterns play a role in the control of gene expression. Finding sites will help: develop disease treatments; understand disease susceptibility.

3 Motif Discovery
Whole genome sequence:
ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTC
TCATCTTCACATCGCATCACCAGTTCAGGATAGACACGG
ACGGCCTCGATTGACGGTGGTACAATTTACCGATGGCTG
CACTATGCCCTATCGATCGACCTCTCATGCTTCACATCG
CATCACCAGTTCAGGATAGACACGGTCACATCGCATCAC
Microarray information
Regulatory sequence upstream from genes:
GATGGCTGCACCTCATCGTATGCCCTACGACCTCTCGC
CACATCGCATCTCATCGACCAGTTCAGACACGGACGGC
GCCTCGCTCATCGGTGGTACAGTTCAAACCTGACTAAA
TCTCGTTAGGACCATCTCATCGACCCACATCGAGAGCG
CGCTAGCCCTCATCGGATCTTGTTCGAGAATTGCCTAT

4 Gene Expression Control. [Diagram labels: transcription factor; gene expression; CTCATCG; upstream DNA sequence; gene.]

5 Transcription Factor Binding Sites
Upstream sequence of co-expressed genes (each sequence ends at the transcription start):
GATGGGGGCTCATCGACGTGTATGC...ACGATGTCTC  Gene 1
CACACCCCCTCTCATCGCGTCCCTT...CGCCCCCCCG  Gene 2
GCCTCCTCATCGGTGGTACTCCAGT...TACATGACTA  Gene 3
TCTCATGCTCATCGCATCACGTGTA...GCAATGAGAG
CGCCTCATCGTGGATCTTGCGAATT...AGAATGGCCT  Gene 100

6 1) Motif Matrix; 2) Sequence Logo; 3) Consensus Sequence: CTCATCG
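The motif matrix and the consensus sequence on this slide are closely related, and a minimal sketch may help show how. The Python below builds a position count matrix from a few aligned sites and reads off the most frequent base per position. The aligned sites and the function names are illustrative assumptions, not data or code from the talk.

```python
from collections import Counter

BASES = "ACGT"

def count_matrix(sites):
    """Return one {base: count} dict per motif position."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same width")
    matrix = []
    for pos in range(width):
        counts = Counter(site[pos] for site in sites)
        matrix.append({base: counts.get(base, 0) for base in BASES})
    return matrix

def consensus(matrix):
    """Most frequent base per position; ties go to the earlier base in ACGT."""
    return "".join(max(BASES, key=lambda base: col[base]) for col in matrix)

# Hypothetical aligned binding sites (assumed for illustration only).
sites = ["CTCATCG", "CTCATCG", "CTCGTCG", "TTCATCG", "CTCATCA"]
matrix = count_matrix(sites)
print(consensus(matrix))
```

A sequence logo would be drawn from the same matrix, with letter heights scaled by the information content of each column.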

7 MDscan Motif Finding Algorithm. Uses the 100 most highly expressed genes to find 30 candidate motifs for each width in [5,15]. Confirms motifs using the 500 most highly expressed genes. Repeats the procedure for the lowest expressed genes.

8 Motifs Correlated with Expression. Goal: relate global gene expression to motif matrices. For each motif: calculate a sequence score for each gene (the score reflects the number of copies of the motif in the gene's upstream sequence); regress gene expression on the motif scores; determine the significant motifs.
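As a rough illustration of the sequence score, the sketch below counts exact copies of a consensus string in each upstream sequence. This is a deliberate simplification: it assumes exact matches on the given strand only, whereas the talk scores sequences against a motif matrix. The first three fragments are taken from the slide on transcription factor binding sites; the last two sequences and the function name are hypothetical.

```python
def count_copies(upstream, motif):
    """Count exact, possibly overlapping, occurrences of motif in upstream."""
    if not motif:
        raise ValueError("motif must be non-empty")
    count = 0
    start = upstream.find(motif)
    while start != -1:
        count += 1
        start = upstream.find(motif, start + 1)
    return count

upstream = {
    "Gene 1": "GATGGGGGCTCATCGACGTGTATGC",
    "Gene 2": "CACACCCCCTCTCATCGCGTCCCTT",
    "Gene 3": "GCCTCCTCATCGGTGGTACTCCAGT",
    "two_copies": "GGCTCATCGAACTCATCGTT",   # hypothetical sequence
    "no_copies": "ACGTACGTACGT",            # hypothetical sequence
}
scores = {gene: count_copies(seq, "CTCATCG") for gene, seq in upstream.items()}
for gene, score in scores.items():
    print(gene, score)
```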

9 Single Motif Regression. [Figure: expression plotted against sequence score (number of motif copies).]

10 Linear Regression Model. For each motif:

$$Y_g = \alpha + \beta_m S_{mg} + e_g$$

where $Y_g$ = $\log_2$-ratio of expression, $\beta_m$ = regression coefficient, $S_{mg}$ = sequence score, $e_g$ = error.
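A minimal sketch of this single-motif fit, using only the Python standard library, is given below. It estimates the intercept and slope by least squares and returns the t-statistic for the slope. The data values and the function name are hypothetical; a p-value would come from a t distribution with n - 2 degrees of freedom, which is omitted here.

```python
import math

def simple_regression(scores, expression):
    """Fit Y_g = alpha + beta * S_g by least squares.

    Returns (alpha, beta, t), where t is the t-statistic for beta = 0.
    """
    n = len(scores)
    if n != len(expression) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_s = sum(scores) / n
    mean_y = sum(expression) / n
    sxx = sum((s - mean_s) ** 2 for s in scores)
    if sxx == 0:
        raise ValueError("sequence scores have no variance")
    sxy = sum((s - mean_s) * (y - mean_y) for s, y in zip(scores, expression))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_s
    rss = sum((y - alpha - beta * s) ** 2 for s, y in zip(scores, expression))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    t_stat = beta / std_err if std_err > 0 else math.copysign(math.inf, beta)
    return alpha, beta, t_stat

# Hypothetical data: motif copies per gene and log2 expression ratios.
scores = [0, 0, 1, 1, 2, 3]
expression = [0.2, -0.1, -0.4, -0.6, -1.1, -1.4]
alpha, beta, t_stat = simple_regression(scores, expression)
print(f"alpha={alpha:.3f} beta={beta:.3f} t={t_stat:.2f}")
```

A negative slope, as in this toy data, would correspond to a motif associated with repression.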

11 Over-expression of a Transcription Factor. Rox1p is a transcription factor in yeast that binds to the 10-mer TCTATTGTTT (from the SCPD database of transcription factor binding sites).

12 Rox1p Over-expression. Yeast expression data under Rox1p over-expression for 5,838 genes. 800 base pairs of upstream sequence for each gene. Use the most repressed genes to find and refine 330 candidate motifs of width [5,15]. Regress against global gene expression to calculate p-values and rank the motifs.

13 Overexpressing a Transcription Factor Known binding site: TCTATTGTTT

14 Comparison to Other Motif-Finding Algorithms. Statistically based algorithms: 1) AlignACE (Roth et al. 1998): Gibbs sampling approach; 2) MEME (Grundy et al. 1996): expectation maximization (EM). Both use iterative procedures to update random initial probability matrices. Drawback: they may become trapped in local maxima.

15 Over-expressing a Transcription Factor Known binding site: TCTATTGTTT

16 Combinatorial Effects of Motifs. Identify motifs that work together to control gene expression. Method: MDscan generates 660 motifs of width [5,15], covering motifs that enhance and motifs that inhibit expression. Remove non-significant motifs. Use stepwise regression to determine the final additive model.

17 Multiple Regression Model to Determine Motifs Working Together

$$Y_g = \alpha + \sum_{m=1}^{M} \beta_m S_{mg} + e_g$$

where $Y_g$ = $\log_2$-ratio of expression, $\beta_m$ = regression coefficient, $S_{mg}$ = sequence score, $M$ = number of motifs in the subset of significant motifs, $e_g$ = error.
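One way to arrive at such an additive model is forward stepwise selection, sketched below with the standard library only. The talk does not state its entry or removal criteria, so the F-to-enter threshold of 4.0, the forward-only search (no removal step), and all names and data here are assumptions made for illustration.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def rss(columns, y):
    """Residual sum of squares of y regressed on an intercept plus columns."""
    design = [[1.0] * len(y)] + columns
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in design] for ci in design]
    xty = [sum(p * v for p, v in zip(ci, y)) for ci in design]
    coef = solve(xtx, xty)
    fitted = [sum(c * col[g] for c, col in zip(coef, design)) for g in range(len(y))]
    return sum((v - f) ** 2 for v, f in zip(y, fitted))

def forward_stepwise(scores, y, f_to_enter=4.0):
    """scores maps motif name -> list of S_mg; returns motifs in entry order."""
    selected = []
    current = rss([], y)
    while len(selected) < len(scores):
        best = None
        for motif in scores:
            if motif in selected:
                continue
            try:
                candidate = rss([scores[m] for m in selected + [motif]], y)
            except ValueError:
                continue  # collinear with motifs already in the model
            if best is None or candidate < best[1]:
                best = (motif, candidate)
        df = len(y) - len(selected) - 2  # residual df once the motif is added
        if best is None or df <= 0:
            break
        gain = current - best[1]
        if best[1] > 0:
            f_stat = gain / (best[1] / df)
        else:
            f_stat = float("inf") if gain > 0 else 0.0
        if f_stat < f_to_enter:
            break
        selected.append(best[0])
        current = best[1]
    return selected

# Hypothetical scores for three motifs across eight genes.
scores = {
    "m1": [1, 1, 1, 1, 0, 0, 0, 0],
    "m2": [1, 1, 0, 0, 1, 1, 0, 0],
    "m3": [1, 0, 1, 0, 1, 0, 1, 0],
}
noise = [0.1, -0.1, -0.1, 0.1, -0.1, 0.1, 0.1, -0.1]
y = [2 * a - b + e for a, b, e in zip(scores["m1"], scores["m2"], noise)]
print(forward_stepwise(scores, y))
```

In this toy data the expression depends on m1 and m2 only, so m3 should fail the entry test.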

18 Yeast Amino Acid Starvation Experiment. Expression for 5,970 genes. Find motifs both enhancing and inhibiting expression. 235 significant motifs. Stepwise regression yields 25 final motifs.

19 Multiple Motifs Influencing Expression

20 Known Motifs. Positive coefficients: STRE, URS1 (respond to stress); PHO4, MET4 (nutrient scavenging); GCN4 (amino acid production). Negative coefficients: M3A, M3B, RAP1 (slow cell growth).

21 Motifs Influencing Expression over Time. Yeast cell cycle information (Spellman et al. 1998): 2 cell cycles, 18 time points, 7-minute intervals. Examine expression patterns over time.

22 Time Series Expression. Use Motif Regressor to find multiple motifs at each time point: 273 motifs in total. Each motif is then regressed against the expression at each of the other 17 time points.

23 Motif: ACGCGTCGCG. [Figure; x-axis: phase, with M/G1, G1, S, G2, M repeated for two cell cycles.]

24 Motif: GCTCATCGC. [Figure; x-axis: phase, with M/G1, G1, S, G2, M repeated for two cell cycles.]

25 Motif Clustering. Method: hierarchically cluster the motif patterns using Euclidean distance, giving 20 clusters. Plot the average coefficients for each cluster.
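The clustering step can be sketched as bottom-up agglomeration of the motifs' coefficient profiles, followed by averaging within each cluster. The slide names Euclidean distance but not the linkage rule, so average linkage is an assumption here, as are the motif names, the profile values, and the choice of 2 clusters for the toy data.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length coefficient profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(profiles, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain.

    profiles maps motif name -> list of regression coefficients over time.
    Uses average linkage (an assumption; the talk does not specify one).
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[name] for name in profiles]

    def linkage(c1, c2):
        total = sum(euclidean(profiles[a], profiles[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def cluster_mean(profiles, cluster):
    """Average coefficient at each time point across the motifs in a cluster."""
    n_points = len(profiles[cluster[0]])
    return [sum(profiles[m][t] for m in cluster) / len(cluster)
            for t in range(n_points)]

# Hypothetical coefficient profiles over four time points.
profiles = {
    "motif_a": [1.0, 0.0, -1.0, 0.0],
    "motif_b": [0.9, 0.1, -1.1, 0.0],
    "motif_c": [-1.0, 0.0, 1.0, 0.0],
    "motif_d": [-0.8, 0.0, 1.2, 0.1],
}
clusters = agglomerate(profiles, n_clusters=2)
for cluster in clusters:
    print(cluster, cluster_mean(profiles, cluster))
```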

26 Cluster 1: Known Motif SCB (6 motifs). [Figure: regression coefficient by phase (M/G1, G1, S, G2, M, repeated for two cell cycles).]

27 Known Cell Cycle Motifs. [Figure: regression coefficient against cell cycle time points, by phase (M/G1, G1, S, G2, M, repeated for two cell cycles).]

28 Other Cell Cycle Motifs. [Figure: regression coefficient against cell cycle time points, by phase (M/G1, G1, S, G2, M, repeated for two cell cycles).]

29 Non Cell Cycle Motifs. [Figure: regression coefficient against cell cycle time points, by phase (M/G1, G1, S, G2, M, repeated for two cell cycles).]

30 Simulation Study. Randomly assign yeast cell cycle expression to 5,838 genes. Use MDscan to find candidate motifs. Use simple linear regression to determine p-values of the motifs. Repeat 100 times to generate 40,324 motifs.
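The randomization idea can be sketched for a single motif as follows. This is only a partial analogue of the study: the talk reruns MDscan on each randomized data set to find new candidate motifs, whereas the sketch keeps one motif's scores fixed and only shuffles expression across genes. The data, the seed, and the function names are hypothetical.

```python
import random

def slope(scores, expression):
    """Least-squares slope of expression on sequence score."""
    n = len(scores)
    mean_s = sum(scores) / n
    mean_y = sum(expression) / n
    sxx = sum((s - mean_s) ** 2 for s in scores)
    sxy = sum((s - mean_s) * (y - mean_y) for s, y in zip(scores, expression))
    return sxy / sxx

def null_slopes(scores, expression, n_repeats=100, seed=0):
    """Slopes obtained after randomly reassigning expression to genes."""
    rng = random.Random(seed)
    shuffled = list(expression)
    slopes = []
    for _ in range(n_repeats):
        rng.shuffle(shuffled)
        slopes.append(slope(scores, shuffled))
    return slopes

def empirical_p(observed, null):
    """Share of null slopes at least as extreme as observed (add-one form)."""
    extreme = sum(1 for b in null if abs(b) >= abs(observed))
    return (extreme + 1) / (len(null) + 1)

# Hypothetical motif scores and log2 expression ratios for eight genes.
scores = [0, 0, 1, 1, 2, 2, 3, 3]
expression = [0.3, 0.1, -0.2, -0.5, -0.9, -1.0, -1.6, -1.3]
observed = slope(scores, expression)
null = null_slopes(scores, expression, n_repeats=100, seed=0)
print(f"observed slope={observed:.3f} empirical p={empirical_p(observed, null):.3f}")
```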

31 Simulation Results. [Figure panels: Motifs From Real Sequences; Motifs From Random Sequences.]

32 Summary. Microarray and sequence information are combined to find transcription factor binding sites. Stepwise regression identifies motifs working together to control expression. We find known motifs and new putative motifs in both single experiments and time-course experiments.

33 Acknowledgements. X. Shirley Liu and Jun Liu, Departments of Biostatistics and Statistics, Harvard University. Jason Lieb, Department of Biology, University of North Carolina. This work was partially supported by NIH National Library of Medicine grant 1F37LM

34 Reference. Conlon, E.M., Liu, X.S., Lieb, J.D., Liu, J.S. (2003) Integrating regulatory motif discovery and genome-wide expression analysis. Proc Natl Acad Sci USA 100:


More information

A Weighted SNP Correlation Network Analysis for the Estimation of Polygenic Risk Scores

A Weighted SNP Correlation Network Analysis for the Estimation of Polygenic Risk Scores A Weighted SNP Correlation Network Analysis for the Estimation of Polygenic Risk Scores Morgan Levine Department of Human Genetics, UCLA PERSONALIZED MEDICINE Genetic association studies were expected

More information

Computational localization of promoters and transcription start sites in mammalian genomes

Computational localization of promoters and transcription start sites in mammalian genomes Computational localization of promoters and transcription start sites in mammalian genomes Thomas Down This dissertation is submitted for the degree of Doctor of Philosophy Wellcome Trust Sanger Institute

More information

Lecture 19: Proteins, Primary Struture

Lecture 19: Proteins, Primary Struture CPS260/BGT204.1 Algorithms in Computational Biology November 04, 2003 Lecture 19: Proteins, Primary Struture Lecturer: Pankaj K. Agarwal Scribe: Qiuhua Liu 19.1 The Building Blocks of Protein [1] Proteins

More information

Chris Slaughter, DrPH. GI Research Conference June 19, 2008

Chris Slaughter, DrPH. GI Research Conference June 19, 2008 Chris Slaughter, DrPH Assistant Professor, Department of Biostatistics Vanderbilt University School of Medicine GI Research Conference June 19, 2008 Outline 1 2 3 Factors that Impact Power 4 5 6 Conclusions

More information

Data Mining: Overview. What is Data Mining?

Data Mining: Overview. What is Data Mining? Data Mining: Overview What is Data Mining? Recently * coined term for confluence of ideas from statistics and computer science (machine learning and database methods) applied to large databases in science,

More information

Building risk prediction models - with a focus on Genome-Wide Association Studies. Charles Kooperberg

Building risk prediction models - with a focus on Genome-Wide Association Studies. Charles Kooperberg Building risk prediction models - with a focus on Genome-Wide Association Studies Risk prediction models Based on data: (D i, X i1,..., X ip ) i = 1,..., n we like to fit a model P(D = 1 X 1,..., X p )

More information

Comparing Functional Data Analysis Approach and Nonparametric Mixed-Effects Modeling Approach for Longitudinal Data Analysis

Comparing Functional Data Analysis Approach and Nonparametric Mixed-Effects Modeling Approach for Longitudinal Data Analysis Comparing Functional Data Analysis Approach and Nonparametric Mixed-Effects Modeling Approach for Longitudinal Data Analysis Hulin Wu, PhD, Professor (with Dr. Shuang Wu) Department of Biostatistics &

More information

1 Solving LPs: The Simplex Algorithm of George Dantzig

1 Solving LPs: The Simplex Algorithm of George Dantzig Solving LPs: The Simplex Algorithm of George Dantzig. Simplex Pivoting: Dictionary Format We illustrate a general solution procedure, called the simplex algorithm, by implementing it on a very simple example.

More information

A Comparison of Consensus Clustering Methods

A Comparison of Consensus Clustering Methods A Comparison of Consensus Clustering Methods Chuck Wessell Carl Meyer NCSU College of Charleston Ranking and Clustering Workshop August 14, 2009 Outline What is consensus clustering? Details on the AML_ALL

More information

MORPHEUS. http://biodev.cea.fr/morpheus/ Prediction of Transcription Factors Binding Sites based on Position Weight Matrix.

MORPHEUS. http://biodev.cea.fr/morpheus/ Prediction of Transcription Factors Binding Sites based on Position Weight Matrix. MORPHEUS http://biodev.cea.fr/morpheus/ Prediction of Transcription Factors Binding Sites based on Position Weight Matrix. Reference: MORPHEUS, a Webtool for Transcripton Factor Binding Analysis Using

More information

Predictive Gene Signature Selection for Adjuvant Chemotherapy in Non-Small Cell Lung Cancer Patients

Predictive Gene Signature Selection for Adjuvant Chemotherapy in Non-Small Cell Lung Cancer Patients Predictive Gene Signature Selection for Adjuvant Chemotherapy in Non-Small Cell Lung Cancer Patients by Li Liu A practicum report submitted to the Department of Public Health Sciences in conformity with

More information

Figure 1. IBM SPSS Statistics Base & Associated Optional Modules

Figure 1. IBM SPSS Statistics Base & Associated Optional Modules IBM SPSS Statistics: A Guide to Functionality IBM SPSS Statistics is a renowned statistical analysis software package that encompasses a broad range of easy-to-use, sophisticated analytical procedures.

More information

Please follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software

Please follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software STATA Tutorial Professor Erdinç Please follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software 1.Wald Test Wald Test is used

More information

Partial Least Squares (PLS) Regression.

Partial Least Squares (PLS) Regression. Partial Least Squares (PLS) Regression. Hervé Abdi 1 The University of Texas at Dallas Introduction Pls regression is a recent technique that generalizes and combines features from principal component

More information

Basics of microarrays. Petter Mostad 2003

Basics of microarrays. Petter Mostad 2003 Basics of microarrays Petter Mostad 2003 Why microarrays? Microarrays work by hybridizing strands of DNA in a sample against complementary DNA in spots on a chip. Expression analysis measure relative amounts

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

Mixture Models. Jia Li. Department of Statistics The Pennsylvania State University. Mixture Models

Mixture Models. Jia Li. Department of Statistics The Pennsylvania State University. Mixture Models Mixture Models Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Clustering by Mixture Models General bacground on clustering Example method: -means Mixture model based

More information

Programme du parcours Clinical Epidemiology 2014-2015. UMR 1. Methods in therapeutic evaluation A Dechartres/A Flahault

Programme du parcours Clinical Epidemiology 2014-2015. UMR 1. Methods in therapeutic evaluation A Dechartres/A Flahault Programme du parcours Clinical Epidemiology 2014-2015 UR 1. ethods in therapeutic evaluation A /A Date cours Horaires 15/10/2014 14-17h General principal of therapeutic evaluation (1) 22/10/2014 14-17h

More information

Clustering & Visualization

Clustering & Visualization Chapter 5 Clustering & Visualization Clustering in high-dimensional databases is an important problem and there are a number of different clustering paradigms which are applicable to high-dimensional data.

More information

Imputing Missing Data for Gene Expression Arrays

Imputing Missing Data for Gene Expression Arrays Imputing Missing Data for Gene Expression Arrays Trevor Hastie, Robert Tibshirani, Gavin Sherlock, Michael Eisen, Patrick Brown, David Botstein September 9, 999 Technical Report, Division of Biostatistics,

More information

Exercise with Gene Ontology - Cytoscape - BiNGO

Exercise with Gene Ontology - Cytoscape - BiNGO Exercise with Gene Ontology - Cytoscape - BiNGO This practical has material extracted from http://www.cbs.dtu.dk/chipcourse/exercises/ex_go/goexercise11.php In this exercise we will analyze microarray

More information

MeDIP-chip service report

MeDIP-chip service report MeDIP-chip service report Wednesday, 20 August, 2008 Sample source: Cells from University of *** Customer: ****** Organization: University of *** Contents of this service report General information and

More information

Interaktionen von Nukleinsäuren und Proteinen

Interaktionen von Nukleinsäuren und Proteinen Sonja Prohaska Computational EvoDevo Universitaet Leipzig June 9, 2015 DNA is never naked in a cell DNA is usually in association with proteins. In all domains of life there are small, basic chromosomal

More information