Statistical issues in the analysis of microarray data
Daniel Gerhard
Institute of Biostatistics, Leibniz University of Hannover
ESNATS Summerschool, Zermatt
D. Gerhard (LUH) Analysis of microarray data 23. Sep 09
Table of Contents
1. Outline
2. Experimental design
3. Statistical modelling
4. Hypotheses testing
5. Gene set enrichment analysis
6. Classification
Outline
Focus is set on:
- Single-channel microarrays
- One sample per array
- Gene expression values for thousands of oligonucleotides
- Identifying genes that are differentially expressed due to a treatment
- Finding significantly differentially expressed genes with a given error probability
- (Predicting a treatment level given the gene expression data)
Controlled experiments
Independent replications:
- Multiple sources of variability are present: sample, array, environmental variability, ...
- Account for this variability in the experimental design by several replications of arrays, samples, multiple timepoints, ...
Randomisation:
- Needed to separate treatment effects from other factors that might influence gene expression
Experimental design
Planning an experiment:
- Multiple arrays per sample? This enables estimating array variability, but a large amount of RNA is needed.
- More complex designs need a larger number of arrays and samples.
- Measure covariates that are not directly of interest but might influence gene expression.
Simple classic design:
- 2 treatments (control/treatment), multiple arrays/samples per treatment
Data structure
y_ij is the expression value of gene i on array j; the arrays are grouped by treatment (Treatment A, Treatment B, ...).
          Array 1   Array 2   Array 3   Array 4   Array 5   Array 6   ...
Gene 1    y_11      y_12      y_13      y_14      y_15      y_16      ...
Gene 2    y_21      y_22      y_23      y_24      y_25      y_26      ...
Gene 3    y_31      y_32      y_33      y_34      y_35      y_36      ...
...
Data example
Generating artificial data:
- 2 treatments (A, B)
- 20 arrays per treatment
- 5000 genes per array
- Normally distributed residuals and array effects: within-array sd = 1; between-array sd = [value missing]
- [number missing] genes show an effect (δ = ±2)
- 2^x transformation
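The data-generating recipe above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the between-array sd (0.5) and the number of genes with an effect (100) are not legible in the transcript and are placeholder values, and the function name `simulate_arrays`, the random sign of δ, and applying the effect to treatment B are hypothetical choices.

```python
import random


def simulate_arrays(n_genes=5000, n_per_group=20, n_effect=100,
                    delta=2.0, sd_within=1.0, sd_between=0.5, seed=1):
    """Return (data, labels, effects); data[g][a] is on the 2**x scale.

    n_effect and sd_between are assumed values, not taken from the slides.
    """
    rng = random.Random(seed)
    labels = ["A"] * n_per_group + ["B"] * n_per_group
    # True log2-scale effect per gene: +/- delta for the first n_effect genes.
    effects = [rng.choice([-delta, delta]) if g < n_effect else 0.0
               for g in range(n_genes)]
    # One random shift per array models the between-array variability.
    array_shift = [rng.gauss(0.0, sd_between) for _ in labels]
    data = []
    for g in range(n_genes):
        row = []
        for a, lab in enumerate(labels):
            x = array_shift[a] + rng.gauss(0.0, sd_within)
            if lab == "B":
                x += effects[g]  # treatment effect only in group B
            row.append(2.0 ** x)  # the 2^x transformation
        data.append(row)
    return data, labels, effects


# Small instance so the example runs quickly.
data, labels, effects = simulate_arrays(n_genes=200, n_per_group=5, n_effect=10)
```

The 2^x step makes the simulated values positive and right-skewed, which is why a log2 transformation is applied again before modelling.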
Data example (figure: expression values by array)
Data example (figure: density of the expression values x)
Normalisation
Preliminary data processing:
- Checking for hybridisation errors
- Variability between arrays might bias the results
- Only a few genes are expected to show an effect
- Using all observations, or known expression values of reference genes, to standardise arrays
- Trying to bring the data closer to a normal distribution (commonly by a log2 transformation)
Example data transformation (figure: densities of the original values x and of the transformed values log2(x))
Median normalisation (figure: densities of log2(x) per array, transformed vs. normalised)
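A minimal sketch of the two preprocessing steps named above, log2 transformation followed by median normalisation. It assumes that median normalisation means subtracting each array's median log2 value, which is one common variant; the function name `log2_median_normalise` is hypothetical.

```python
import math
import statistics


def log2_median_normalise(data):
    """data[g][a] is the positive raw value of gene g on array a."""
    logged = [[math.log2(v) for v in row] for row in data]
    n_arrays = len(logged[0])
    # Median of each array (column) across all genes.
    medians = [statistics.median(row[a] for row in logged)
               for a in range(n_arrays)]
    # Centre every array at zero by subtracting its own median.
    return [[row[a] - medians[a] for a in range(n_arrays)] for row in logged]


raw = [[1.0, 4.0],
       [2.0, 8.0],
       [4.0, 16.0]]
norm = log2_median_normalise(raw)
```

In this toy input the second array is a constant factor of 4 brighter than the first; after normalisation both arrays have identical values, so the array-level shift no longer looks like a treatment effect.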
Estimating treatment effects
Statistical models:
- Trying to explain the effects by only a few parameters in a statistical model
- Estimating parameters, e.g. by minimising residuals
- Due to limited computational resources, models can be fitted separately for each gene
2-sample design:
- For the simple treatment-control design, the difference between arithmetic means and its standard error can be estimated for each gene.
- After applying the inverse of the log2 transformation, the fold change is the parameter of interest (strictly, the back-transformed difference of log2 means is a ratio of geometric means on the original scale).
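For a single gene in the two-sample design, the estimate could be computed roughly as follows. The pooled-variance standard error assumes equal variances in both groups, and the function name `gene_effect` is hypothetical.

```python
import math
import statistics


def gene_effect(values_a, values_b):
    """Difference in mean log2 expression (B - A), its SE, and the fold change."""
    na, nb = len(values_a), len(values_b)
    diff = statistics.mean(values_b) - statistics.mean(values_a)
    # Pooled variance, assuming equal variances in both groups.
    pooled = ((na - 1) * statistics.variance(values_a)
              + (nb - 1) * statistics.variance(values_b)) / (na + nb - 2)
    se = math.sqrt(pooled * (1 / na + 1 / nb))
    # Back-transform the log2 difference to a fold change.
    return diff, se, 2.0 ** diff


diff, se, fold = gene_effect([1.0, 2.0, 3.0], [3.0, 4.0, 5.0])
```

Here the log2 difference is 2, which corresponds to a four-fold change on the original scale.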
Parametric vs. non-parametric methods
Parametric methods:
- Assuming a normal distribution after log2 transformation
- Summarising the data by means and standard errors is adequate under the assumptions of a general linear model
Nonparametric methods:
- For skewed distributions, providing only means and standard errors might be misleading
- Instead using medians, IQR, range, ...
- Applying rank transformation, resampling methods, ...
- Interpretation of treatment comparisons might be more complicated in models with fewer assumptions
- Lack of power at small sample sizes
Independent observations?
No complete randomisation:
- Observations from non-randomised experimental units might be correlated, e.g.
  - Multiple arrays for the same sample
  - Samples of the same individual over time
  - Block structures
  - ...
- Assuming independence of correlated observations may lead to underestimation of variability
- Introducing multiple error terms in the model
- Increased complexity of the model; a larger sample size is needed
Hypotheses testing
Test for a single gene:
- Setting up hypotheses of interest (e.g. H_0: parameter of interest equals 0)
- Constructing test statistics for each gene
- Calculating p-values under the assumption of a null distribution for the test statistic
Borrowing information from multiple genes:
- At small sample sizes the genewise estimation of standard errors is difficult
- Adding a fudge factor to the standard error to minimise the coefficient of variation
- Borrowing information about variability from all genes by empirical Bayes
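The fudge-factor idea can be illustrated with a small sketch: a constant s0 is added to every genewise standard error so that genes with a very small estimated standard error do not dominate the ranking. Setting s0 to the median standard error is an assumption made here for brevity; the slide describes choosing it to minimise the coefficient of variation, which is not implemented below. The function name `moderated_statistics` is hypothetical.

```python
import statistics


def moderated_statistics(diffs, std_errs, s0=None):
    """Return d_g / (se_g + s0) for every gene g."""
    if s0 is None:
        # Assumed default: median of the genewise standard errors.
        s0 = statistics.median(std_errs)
    return [d / (se + s0) for d, se in zip(diffs, std_errs)]


diffs = [2.0, 0.1, -1.5]
std_errs = [0.5, 0.01, 0.5]

# Ordinary t-like statistics: the second gene ranks first only because
# its standard error happens to be tiny.
plain = [d / se for d, se in zip(diffs, std_errs)]
moderated = moderated_statistics(diffs, std_errs)
```

With the fudge factor, the second gene (difference 0.1) drops behind the first gene (difference 2.0), which is the intended stabilising effect.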
t-test results for the example (figure: histogram of the p-values; x-axis: p-value, y-axis: frequency)
Error rates
As multiple hypotheses are tested, there is a choice of controlling different error rates, and the individual type I error might not be adequate.
              # H_0 not rejected   # H_0 rejected   total
# true H_0    U                    V                m_0
# false H_0   T                    S                m - m_0
total         m - R                R                m (known)
- PCER, per-comparison error rate: $E(V)/m$
- FWER, family-wise error rate: $P(V > 0)$
- FDR, false discovery rate: $E(V/R)$
- ...
FWER controlling procedures
Calculating adjusted p-values $\tilde p_i$ ($i = 1, \dots, m$):
- Bonferroni: $\tilde p_i = \min\{1, m\, p_i\}$ (single-step)
- Holm: $\tilde p_{(i)} = \min\{1, \max\{\tilde p_{(i-1)}, (m - i + 1)\, p_{(i)}\}\}$ with $\tilde p_{(0)} = 0$ (step-down, for $p_{(1)} \le \dots \le p_{(i)} \le \dots \le p_{(m)}$)
- Utilising a multivariate distribution, resampling methods (single-step)
- ...
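The Bonferroni and Holm adjustments can be written out directly from their definitions. The sketch below uses only the Python standard library; the function names are illustrative, not taken from any package.

```python
def bonferroni(pvals):
    """Single-step adjustment: min(1, m * p_i)."""
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]


def holm(pvals):
    """Step-down adjustment, returned in the original order of pvals."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):  # rank = 0 for the smallest p-value
        # (m - rank) equals (m - i + 1) for the 1-based position i = rank + 1.
        running_max = max(running_max, (m - rank) * pvals[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(pvals)
holm_adj = holm(pvals)
```

Holm's adjusted p-values are never larger than Bonferroni's, which is why Holm rejects at least as many hypotheses at the same level.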
FWER control using data-driven weights
Weighted step-down procedure:
- Weight the m unadjusted p-values, $p_i^* = p_i / w_i$, and order them, $p_{(1)}^* \le \dots \le p_{(m)}^*$
- Reject $H_{(i)}$ as long as $p_{(i)}^* \le \alpha / \sum_{k=i}^{m} w_{(k)}$
Obtaining weights:
- Choosing weights independently of the significance of the test
- Gathering information about the distribution of hypotheses under the null or under the alternative
Examples:
- Weighting by the total variance of the entire sample, $w_i = S_i$
- Weighting by nondecreasing monotone functions of the total variance, $w_i = f(S_i)$
- Using principal components to define weights
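A sketch of the weighted step-down rule, reading the threshold as α divided by the sum of the weights of the hypotheses not yet rejected (with all weights equal to 1 this reduces to Holm's procedure). The function name `weighted_step_down` and the example weights are hypothetical; in practice the weights would come from something like the total variance, chosen independently of the test's significance.

```python
def weighted_step_down(pvals, weights, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected."""
    m = len(pvals)
    weighted = [p / w for p, w in zip(pvals, weights)]
    order = sorted(range(m), key=lambda i: weighted[i])
    remaining = sum(weights)  # sum of w_(k) for k = i, ..., m
    rejected = [False] * m
    for idx in order:
        if weighted[idx] <= alpha / remaining:
            rejected[idx] = True
            remaining -= weights[idx]
        else:
            break  # step-down: stop at the first non-rejection
    return rejected


pvals = [0.01, 0.04, 0.03, 0.005]
equal = weighted_step_down(pvals, [1, 1, 1, 1])
upweighted = weighted_step_down(pvals, [1, 1, 4, 1])
```

With equal weights only the two smallest p-values are rejected; giving the third hypothesis a larger weight lets the procedure continue past it, which illustrates how informative weights can increase power.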
FDR controlling procedures
Calculating adjusted p-values $\tilde p_{(j)}$ ($j = 1, \dots, m$):
- Benjamini-Hochberg: $\tilde p_{(j)} = \min_{i \ge j} \{ \frac{m}{i}\, p_{(i)} \}$ (step-up, for $p_{(1)} \le \dots \le p_{(i)} \le \dots \le p_{(m)}$)
- Benjamini-Yekutieli: correction under dependence (step-up)
- Storey: pFDR (estimating $m_0/m$)
- ...
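The Benjamini-Hochberg adjusted p-values follow directly from the step-up formula. A minimal standard-library sketch, with an illustrative function name:

```python
def benjamini_hochberg(pvals):
    """Step-up adjusted p-values, returned in the original order of pvals."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, carrying the minimum over i >= j.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, m / rank * pvals[idx])
        adjusted[idx] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]
bh = benjamini_hochberg(pvals)
```

For these four p-values every BH-adjusted value is at most 0.04, so all four hypotheses would be rejected at an FDR level of 0.05, whereas Holm's FWER adjustment of the same values rejects only two.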
Comparison of adjustment methods using adjusted p-values for the data example
Method         # H_0 rejected   # H_0 falsely rejected
unadjusted     [missing]        [missing]
Bonferroni     91               0
Holm           92               0
S_i-weighted   44               0
min-p          82               0
BH             [missing]        [missing]
BY             98               0
Storey         [missing]        [missing]
Volcano plot (figure: -log10 p-value against log2 fold change; genes marked for unadjusted, Bonferroni, BH, BY and min-p)
Gene set enrichment analysis
- Define multiple sets of genes
- Test differential expression for these gene sets
- Small effects of single genes are hard to detect
- Combination of multiple small effects to get the big picture
- Reduction of the dimensionality of the multiple testing problem
- Test effects for whole pathways, functional groups, etc.
Assigning genetic features to known classes
Classification:
- Reformulating the problem into a setting with p regressors to estimate the class membership probability (control/treatment) for each gene
- Finding a classification rule by e.g.
  - Logistic regression
  - Discriminant analysis
  - SVM
  - ...
Validation:
- Fitting the model to training data
- Validation of the model by test data
- Cross-validation to validate the model on training data
Problem of high dimensions: p >> n
Problem:
- Logistic regression and LDA require that the number of observations is larger than the number of variables
- Reducing the number of variables by feature selection
- Using penalized logistic regression, ...
Feature selection
Filtering genes:
- Multiple testing approaches can be used as a filter
- Select all variables corresponding to genes with a p-value $p \le p_0$
- Perform, for example, logistic regression to model the posterior probability of K classes:
  $\log \frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{k0} + \beta_k^T x$
A second example
Generating artificial training data:
- 2 treatments (A, B)
- 20 arrays per treatment
- 5000 genes per array
- Normally distributed residuals and array effects: within-array sd = 1; between-array sd = 0.5
- Genes show N(0, 0.25) distributed effects
- 2^x transformation
Generating test data:
- 10 arrays per treatment
- Same effects as in the training data
Both datasets are log2 transformed and median normalised.
Feature selection / Classification
- Choosing only the 10 genes with the best t-test results as covariates
- Performing LDA and logistic regression
- Validation by the test set
LDA:
     A   B
A    8   2
B    2   8
Logistic regression:
     A   B
A    7   2
B    3   8
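The two confusion matrices can be summarised by the share of correctly classified test arrays. The small sketch below only reads the diagonal, so it does not depend on whether rows hold the true or the predicted classes (the transcript does not say which); the function name `accuracy` is illustrative.

```python
def accuracy(confusion):
    """Share of observations on the diagonal of a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


lda = [[8, 2],
       [2, 8]]
logistic = [[7, 2],
            [3, 8]]

acc_lda = accuracy(lda)            # 16 of 20 test arrays
acc_logistic = accuracy(logistic)  # 15 of 20 test arrays
```

With only 20 test arrays, the one-array difference between the two classifiers is well within what chance alone could produce.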
References
- Dudoit, S and Van der Laan, M (2008): Multiple Testing Procedures with Applications to Genomics. Springer Series in Statistics.
- Gentleman, R, Carey, VJ, Huber, W, Irizarry, RA, and Dudoit, S (2005): Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer, Statistics for Biology and Health.
- Hastie, T, Tibshirani, R, and Friedman, J (2001): The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics.
- Benjamini, Y and Hochberg, Y (1995): Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57.
- Benjamini, Y and Yekutieli, D (2001): The control of the false discovery rate in multiple testing under dependency. Annals of Statistics 29.
- Finos, L and Salmaso, L (2007): FDR- and FWE-controlling methods using data-driven weights. JSPI 137.
- Kropf, S and Läuter, J (2002): Multiple tests for different sets of variables using a data-driven ordering of hypotheses, with an application to gene expression data. Biometrical Journal 44.
- Saeys, Y, Inza, I, and Larrañaga, P (2007): A review of feature selection techniques in bioinformatics. Bioinformatics 23.
- Schwender, H, Ickstadt, K, and Rahnenführer, J (2008): Classification with High-Dimensional Genetic Data: Assigning Patients and Genetic Features to Known Classes. Biometrical Journal 50.
- Storey, JD and Tibshirani, R (2003): Statistical significance for genomewide studies. PNAS 100.
More informationPrinciples of Data Mining by Hand&Mannila&Smyth
Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences
More informationTOWARD BIG DATA ANALYSIS WORKSHOP
TOWARD BIG DATA ANALYSIS WORKSHOP 邁 向 巨 量 資 料 分 析 研 討 會 摘 要 集 2015.06.05-06 巨 量 資 料 之 矩 陣 視 覺 化 陳 君 厚 中 央 研 究 院 統 計 科 學 研 究 所 摘 要 視 覺 化 (Visualization) 與 探 索 式 資 料 分 析 (Exploratory Data Analysis, EDA)
More informationPart 2: Analysis of Relationship Between Two Variables
Part 2: Analysis of Relationship Between Two Variables Linear Regression Linear correlation Significance Tests Multiple regression Linear Regression Y = a X + b Dependent Variable Independent Variable
More informationSoftware Tools for Scientific Data Analysis and Visualization: And Bioconductor. Stowers Science Club
Software Tools for Scientific Data Analysis and Visualization: And Bioconductor Stowers Science Club Earl F. Glynn Scientific Programmer Bioinformatics 20 May 2005 1 Topics What is R? What is Bioconductor?
More informationResearch Methods & Experimental Design
Research Methods & Experimental Design 16.422 Human Supervisory Control April 2004 Research Methods Qualitative vs. quantitative Understanding the relationship between objectives (research question) and
More informationStatistical inference and data mining: false discoveries control
Statistical inference and data mining: false discoveries control Stéphane Lallich 1 and Olivier Teytaud 2 and Elie Prudhomme 1 1 Université Lyon 2, Equipe de Recherche en Ingénierie des Connaissances 5
More informationStudy Guide for the Final Exam
Study Guide for the Final Exam When studying, remember that the computational portion of the exam will only involve new material (covered after the second midterm), that material from Exam 1 will make
More informationMinería de Datos ANALISIS DE UN SET DE DATOS.! Visualization Techniques! Combined Graph! Charts and Pies! Search for specific functions
Minería de Datos ANALISIS DE UN SET DE DATOS! Visualization Techniques! Combined Graph! Charts and Pies! Search for specific functions Data Mining on the DAG ü When working with large datasets, annotation
More informationGraphical Modeling for Genomic Data
Graphical Modeling for Genomic Data Carel F.W. Peeters cf.peeters@vumc.nl Joint work with: Wessel N. van Wieringen Mark A. van de Wiel Molecular Biostatistics Unit Dept. of Epidemiology & Biostatistics
More informationQUANTITATIVE METHODS BIOLOGY FINAL HONOUR SCHOOL NON-PARAMETRIC TESTS
QUANTITATIVE METHODS BIOLOGY FINAL HONOUR SCHOOL NON-PARAMETRIC TESTS This booklet contains lecture notes for the nonparametric work in the QM course. This booklet may be online at http://users.ox.ac.uk/~grafen/qmnotes/index.html.
More informationErik Parner 14 September 2016. Basic Biostatistics - Day 2-21 September, 2016 1
PhD course in Basic Biostatistics Day Erik Parner, Department of Biostatistics, Aarhus University Log-transformation of continuous data Exercise.+.4+Standard- (Triglyceride) Logarithms and exponentials
More informationA Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Variance and Bias
A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Variance and Bias B. M. Bolstad, R. A. Irizarry 2, M. Astrand 3 and T. P. Speed 4, 5 Group in Biostatistics, University
More informationVertical data integration for melanoma prognosis. Australia 3 Melanoma Institute Australia, NSW 2060 Australia. kaushala@maths.usyd.edu.au.
Vertical integration for melanoma prognosis Kaushala Jayawardana 1,4, Samuel Müller 1, Sarah-Jane Schramm 2,3, Graham J. Mann 2,3 and Jean Yang 1 1 School of Mathematics and Statistics, University of Sydney,
More informationSTATISTICAL METHODS FOR DATA MINING
Chapter 1 STATISTICAL METHODS FOR DATA MINING Yoav Benjamini Department of Statistics, School of Mathematical Sciences, Sackler Faculty for Exact Sciences Tel Aviv University ybenja@post.tau.ac.il Moshe
More informationNon-Inferiority Tests for Two Means using Differences
Chapter 450 on-inferiority Tests for Two Means using Differences Introduction This procedure computes power and sample size for non-inferiority tests in two-sample designs in which the outcome is a continuous
More informationQuantitative Methods for Finance
Quantitative Methods for Finance Module 1: The Time Value of Money 1 Learning how to interpret interest rates as required rates of return, discount rates, or opportunity costs. 2 Learning how to explain
More informationMULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS
MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MSR = Mean Regression Sum of Squares MSE = Mean Squared Error RSS = Regression Sum of Squares SSE = Sum of Squared Errors/Residuals α = Level of Significance
More informationAnalysis of Illumina Gene Expression Microarray Data
Analysis of Illumina Gene Expression Microarray Data Asta Laiho, Msc. Tech. Bioinformatics research engineer The Finnish DNA Microarray Centre Turku Centre for Biotechnology, Finland The Finnish DNA Microarray
More informationParametric and non-parametric statistical methods for the life sciences - Session I
Why nonparametric methods What test to use? Rank Tests Parametric and non-parametric statistical methods for the life sciences - Session I Liesbeth Bruckers Geert Molenberghs Interuniversity Institute
More informationModule 5: Statistical Analysis
Module 5: Statistical Analysis To answer more complex questions using your data, or in statistical terms, to test your hypothesis, you need to use more advanced statistical tests. This module reviews the
More informationFinal Exam Practice Problem Answers
Final Exam Practice Problem Answers The following data set consists of data gathered from 77 popular breakfast cereals. The variables in the data set are as follows: Brand: The brand name of the cereal
More information