Finding statistical patterns in Big Data

1 Finding statistical patterns in Big Data
Patrick Rubin-Delanchy, University of Bristol & Heilbronn Institute for Mathematical Research
IAS Research Workshop: Data science for the real world (workshop 1), 1st May 2015

2 This talk is about hypothesis testing/feature extraction/anomaly detection for problems involving large amounts of data.
Computational issues are largely ignored. Instead, the focus is on some of the pitfalls of hypothesis testing.
The solutions presented are simple.
Recommendations are meant to help exploratory research; I'm not taking any position on publishing standards.

7 Pitfalls of hypothesis testing with Big Data
This talk focusses on two pervasive issues:
1. Hypothesis testing when the model is wrong
2. Multiple testing

8 The cyber-security application
Network flow data at Los Alamos National Laboratory: 30 GB/day
Typical attack pattern:
A. Opportunistic infection
B. Network traversal
C. Data exfiltration
Figure: Network traversal, source: Neil et al. (2013)

9 The cyber-security application
Global cost of cybercrime is estimated at $400 billion (CSIS, 2014)
Botnet behind a third of the spam sent in 2010: earned about $2.7 million; spam prevention cost over $1 billion (Anderson et al., 2013)
UK National Security Strategy Priority Risks (Cabinet Office, 2010):
International terrorism affecting the UK or its interests, including a chemical, biological, radiological or nuclear attack by terrorists; and/or a significant increase in the levels of terrorism relating to Northern Ireland.
Hostile attacks upon UK cyber space by other states and large scale cyber crime.
A major accident or natural hazard which requires a national response, such as severe coastal flooding affecting three or more regions of the UK, or an influenza pandemic.
An international military crisis between states, drawing in the UK, and its allies as well as other states and non-state actors.

10 Hypothesis testing framework
1. Null hypothesis $H_0$, e.g., the drug has no effect; any difference between the two groups is due to chance
2. Alternative hypothesis $H_1$, e.g., the drug has a (positive) effect
3. Test statistic $T$, e.g., difference in treatment outcomes
4. P-value: $p = P_0(T^* \geq T)$, where $P_0$ is the distribution of $T$ under $H_0$ and $T^*$ is a replicate of $T$ under $H_0$
5. We reject the null hypothesis when $p$ is small, e.g., less than 5%

11 Hypothesis testing with the wrong model
Example hypothesis test:
$H_0$: $\underbrace{\text{the data are Gaussian}}_{\text{FALSE}}$, $\underbrace{\mu_1 = \mu_2}_{\text{TRUE}}$
$H_1$: $\underbrace{\text{the data are Gaussian}}_{\text{FALSE}}$, $\underbrace{\mu_1 > \mu_2}_{\text{FALSE}}$
It's right to reject the null hypothesis, but wrong to accept the alternative. In practice this seems to lead to p-values with U-shaped distributions.

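14 Example: two-sample t-test
[Figure: observations in the two groups $G_1$ and $G_2$]
Difference: $\bar{G}_1 - \bar{G}_2$
t-test: $T = \dfrac{\bar{G}_1 - \bar{G}_2}{s_{12}\sqrt{1/n_1 + 1/n_2}}$
P-value: $p \approx 0.03$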
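The statistic on this slide can be sketched in a few lines of Python. This is only an illustration: it assumes $s_{12}$ is the usual pooled standard deviation (the transcription does not define it), the function name `pooled_t_statistic` is hypothetical, and the data are made up rather than taken from the slide.

```python
import math
from statistics import mean


def pooled_t_statistic(g1, g2):
    """Two-sample t statistic, assuming s_12 is the pooled standard deviation."""
    n1, n2 = len(g1), len(g2)
    m1, m2 = mean(g1), mean(g2)
    # Pooled variance: within-group sums of squares over n1 + n2 - 2 degrees of freedom.
    ss1 = sum((x - m1) ** 2 for x in g1)
    ss2 = sum((x - m2) ** 2 for x in g2)
    s12 = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))
    return (m1 - m2) / (s12 * math.sqrt(1 / n1 + 1 / n2))


# Made-up illustrative data, not the data behind the slide.
g1 = [5.1, 6.3, 5.8, 7.0, 6.1]
g2 = [4.9, 5.2, 5.5, 4.7, 5.9]
t_stat = pooled_t_statistic(g1, g2)
print(t_stat)
```

Turning this statistic into a p-value with the t distribution is where the Gaussian assumption enters.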

15 Example: two-sample t-test
The t-test assumes:
1. $H_0$: the data are independent and Gaussian with the same mean and variance.
2. $H_1$: $\mu_1 > \mu_2$
Although we might correctly reject $H_0$, we don't know if we are rejecting:
a) the data are Gaussian
b) the data have the same mean
c) the data have the same variance

16 Example: two-sample t-test
A vector $X_1, \ldots, X_n$ is exchangeable if its joint distribution is the same as that of $X_{\sigma(1)}, \ldots, X_{\sigma(n)}$ for any permutation $\sigma$.
Instead of assuming a Gaussian model under $H_0$, assume
1. $H_0$: the data are exchangeable
2. $H_1$: the data are not exchangeable, $\mu_1 > \mu_2$
Can think of $T$ as a random draw from $T_1, \ldots, T_M$, where $T_i$ is the statistic computed on the $i$th permutation of $G_1$ and $G_2$. (Formally, we are conditioning on a sufficient statistic for the unknowns.)

18 Example: two-sample t-test
[Figure: resampled groups $G_1^*$ and $G_2^*$]
Resampled difference: $\bar{G}_1^* - \bar{G}_2^*$
Resampled statistic: $T^* = \dfrac{\bar{G}_1^* - \bar{G}_2^*}{s_{12}^*\sqrt{1/n_1 + 1/n_2}}$

20 Example: two-sample t-test
Difference: $\bar{G}_1 - \bar{G}_2$
Statistic: $T = \dfrac{\bar{G}_1 - \bar{G}_2}{s_{12}\sqrt{1/n_1 + 1/n_2}}$
P-value (permutation-based): $\hat{p} = \dfrac{1}{M+1} \sum_{i=0}^{M} I(T_i \geq T) \approx 0.21$, where $T_0 = T$.
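A minimal sketch of the permutation-based p-value, under stated assumptions: for brevity it uses the plain difference in means as the statistic instead of the studentised $T$ on the slide, it draws random permutations rather than enumerating all of them, and the function name, default `num_permutations` and `seed` are hypothetical choices.

```python
import random
from statistics import mean


def permutation_p_value(g1, g2, num_permutations=999, seed=0):
    """One-sided Monte Carlo permutation p-value for mean(g1) > mean(g2)."""
    rng = random.Random(seed)
    n1 = len(g1)
    pooled = list(g1) + list(g2)
    observed = mean(g1) - mean(g2)
    # The i = 0 term (T_0 = T) always counts, so the estimate is never zero.
    count = 1
    for _ in range(num_permutations):
        rng.shuffle(pooled)  # random relabelling of the pooled observations
        resampled = mean(pooled[:n1]) - mean(pooled[n1:])
        if resampled >= observed:
            count += 1
    return count / (num_permutations + 1)


# Made-up illustrative data, not the data behind the slide.
p_hat = permutation_p_value([5.1, 6.3, 5.8, 7.0, 6.1], [4.9, 5.2, 5.5, 4.7, 5.9])
print(p_hat)
```

The number of permutations $M$ is the knob that trades computational effort against simulation error in $\hat{p}$.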

21 Reasons to consider a non-parametric approach (i.e. reasons modelling could be hard):
1. Visualisation and curation are difficult (e.g. for logistical or privacy reasons)
2. The same analytic is to be used on different data sources
3. A lack of domain expertise
4. The data are complicated objects, e.g. graphs
Reasons not to:
1. Sometimes a non-parametric approach is not available
2. Model-based approaches can have greater power
3. There is a question of how to balance computational effort against simulation error

23 Multiple testing
Michael Jordan: "When you have large amounts of data, your appetite for hypotheses tends to get even larger." (IEEE Spectrum, 20th October 2014)
In fact, the number of hypotheses tested often grows much faster than the data.
The basic problem: as the number of tests gets large, the probability of finding a significant result becomes very high.
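The basic problem can be made concrete with a short calculation. The sketch below assumes that every null hypothesis is true and that the tests are independent, each run at level `alpha` (assumptions added here for illustration, not stated on the slide); in that case the chance of at least one significant result is $1 - (1 - \alpha)^n$. The function name is hypothetical.

```python
def prob_at_least_one_significant(num_tests, alpha=0.05):
    """P(at least one p-value <= alpha) for independent tests when all nulls hold."""
    return 1.0 - (1.0 - alpha) ** num_tests


for n in (1, 10, 100, 1000):
    print(n, round(prob_at_least_one_significant(n), 4))
```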

24 Example: spurious correlations
Figure: Spurious correlations [source missing in the transcription]

25 Two approaches to multiple testing
Suppose we have p-values $p_1, \ldots, p_n$. The two canonical tasks are:
1. sub-select a set for further analysis
2. combine the p-values into one overall score of significance

26 Sub-selection
Define the proportion of false discoveries to be (Benjamini and Hochberg, 1995):
$Q = \begin{cases} 0 & \text{if no hypothesis is rejected} \\ \dfrac{\#\text{incorrect rejections}}{\#\text{total rejections}} & \text{otherwise} \end{cases}$
Benjamini and Hochberg (1995) propose that the expectation of $Q$, the false discovery rate, is the quantity we want to control.
1. Let $p_{(1)} \leq \cdots \leq p_{(n)}$ denote the ordered p-values
2. Let $k$ be the largest $i$ such that $p_{(i)} \leq \frac{i}{n} q$
3. Reject hypotheses corresponding to $p_{(1)}, \ldots, p_{(k)}$
4. Then, if the p-values corresponding to the true null hypotheses are independent, $E(Q) \leq q$
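The four steps of the procedure can be sketched as follows. The function name `benjamini_hochberg` and the default `q=0.05` are illustrative choices, and the p-values in the usage line are made up.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the (sorted) indices of the hypotheses rejected at level q."""
    n = len(p_values)
    # Indices of the hypotheses, ordered by increasing p-value.
    order = sorted(range(n), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the LARGEST rank whose p-value is below its threshold (step-up rule).
        if p_values[idx] <= rank * q / n:
            k = rank
    return sorted(order[:k])


rejected = benjamini_hochberg([0.9, 0.001, 0.025, 0.012, 0.5], q=0.05)
print(rejected)
```

Note that the rule is step-up: a p-value that misses its own threshold is still rejected if a larger p-value meets its threshold.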

28 Combining p-values I
In the second approach, we consider the joint hypothesis test:
$H_0$: all of the null hypotheses hold
$H_1$: at least one alternative holds
One method for combining p-values:
1. Let $\pi = \min_{i \in \{1, \ldots, n\}} \left\{ \dfrac{n\, p_{(i)}}{i} \right\}$
2. Then if the p-values are independent under $H_0$ (Simes, 1986), $\pi \sim \text{uniform}[0, 1]$
3. Furthermore, if the p-values are positively dependent (Sarkar and Chang, 1997), $\pi \geq_{st} \text{uniform}[0, 1]$, where $\geq_{st}$ denotes the usual stochastic order.
In statistical terminology, rejecting on the basis of $\pi$ is conservative.
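The combined score $\pi$ is a one-line computation once the p-values are sorted. In this sketch the function name `simes_combined_p_value` is hypothetical and the input p-values are made up for illustration.

```python
def simes_combined_p_value(p_values):
    """Simes combination: min over i of n * p_(i) / i, with p_(i) the ordered p-values."""
    n = len(p_values)
    ordered = sorted(p_values)
    return min(n * p / i for i, p in enumerate(ordered, start=1))


pi = simes_combined_p_value([0.5, 0.01, 0.04])
print(pi)
```

Taking only the $i = 1$ term would give the Bonferroni-corrected minimum p-value, so the Simes score is never larger than that.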

32 Combining p-values II
Consider the more specific needles-in-a-haystack scenario
1. $n$ very large
2. Most of the tests are expected to have no signal:
$H_0$: all of the null hypotheses hold
$H_1$: a vanishing proportion of the alternatives hold
Let $\pi = \max_{0 < \alpha \leq \alpha_0} \sqrt{n}\, \{[(\text{fraction significant at } \alpha) - \alpha] / \sqrt{\alpha(1 - \alpha)}\}$
Then under sparse conditions set out in Donoho and Jin (2004), $\pi$ will manage to detect the alternative whenever it is asymptotically theoretically possible.
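A rough sketch of this statistic, under assumptions worth flagging: the maximum over $\alpha$ is evaluated only at the observed p-values not exceeding $\alpha_0$ (between consecutive p-values the standardised excess decreases in $\alpha$, so nothing larger is missed), the default `alpha0=0.5` is an illustrative choice, and the function returns 0.0 when no excess is positive. The function name `higher_criticism` is hypothetical.

```python
import math


def higher_criticism(p_values, alpha0=0.5):
    """Higher criticism score, evaluated at the ordered p-values up to alpha0."""
    n = len(p_values)
    best = 0.0  # limiting value as alpha -> 0 when no p-value gives a positive excess
    for i, p in enumerate(sorted(p_values), start=1):
        if p > alpha0:
            break
        if p <= 0.0 or p >= 1.0:
            continue  # the standardisation sqrt(p * (1 - p)) is zero here
        # i / n is the fraction of p-values significant at level alpha = p.
        score = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p))
        best = max(best, score)
    return best


# Made-up illustration: an evenly spread "null" set, and the same set with ten tiny p-values.
null_p = [(i - 0.5) / 1000 for i in range(1, 1001)]
signal_p = [1e-6] * 10 + null_p[10:]
print(higher_criticism(null_p), higher_criticism(signal_p))
```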

34 Conclusion
1. We've advertised a few simple techniques that can help with hypothesis testing at scale
2. Computational issues have been largely ignored, e.g. the permutation test is more effort (but we can control that)
3. Some of the concepts touched upon, e.g. exchangeability, dependence and stochastic orders, have a much deeper theory that is possibly very relevant to the theory of Big Data.

35 References
Anderson, R., Barton, C., Böhme, R., Clayton, R., Van Eeten, M. J., Levi, M., Moore, T., and Savage, S. (2013). Measuring the cost of cybercrime. In The economics of information security and privacy. Springer.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological).
Cabinet Office and National security and intelligence (2010). A strong Britain in an age of uncertainty: the national security strategy.
Center for Strategic and International Studies (2014). Net losses: estimating the global cost of cybercrime. Economic impact of cybercrime II.
Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. Annals of Statistics.
Neil, J., Hash, C., Brugh, A., Fisk, M., and Storlie, C. B. (2013). Scan statistics for the online detection of locally anomalous subgraphs. Technometrics, 55(4).
Sarkar, S. K. and Chang, C.-K. (1997). The Simes method for multiple hypothesis testing with positively dependent test statistics. Journal of the American Statistical Association, 92(440).
Simes, R. J. (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika, 73(3).


More information

Two-Sample T-Tests Assuming Equal Variance (Enter Means)

Two-Sample T-Tests Assuming Equal Variance (Enter Means) Chapter 4 Two-Sample T-Tests Assuming Equal Variance (Enter Means) Introduction This procedure provides sample size and power calculations for one- or two-sided two-sample t-tests when the variances of

More information

LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING

LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING In this lab you will explore the concept of a confidence interval and hypothesis testing through a simulation problem in engineering setting.

More information

Correlational Research

Correlational Research Correlational Research Chapter Fifteen Correlational Research Chapter Fifteen Bring folder of readings The Nature of Correlational Research Correlational Research is also known as Associational Research.

More information

A MULTIVARIATE TEST FOR SIMILARITY OF TWO DISSOLUTION PROFILES

A MULTIVARIATE TEST FOR SIMILARITY OF TWO DISSOLUTION PROFILES Journal of Biopharmaceutical Statistics, 15: 265 278, 2005 Copyright Taylor & Francis, Inc. ISSN: 1054-3406 print/1520-5711 online DOI: 10.1081/BIP-200049832 A MULTIVARIATE TEST FOR SIMILARITY OF TWO DISSOLUTION

More information

200627 - AC - Clinical Trials

200627 - AC - Clinical Trials Coordinating unit: Teaching unit: Academic year: Degree: ECTS credits: 2014 200 - FME - School of Mathematics and Statistics 715 - EIO - Department of Statistics and Operations Research MASTER'S DEGREE

More information

Statistics Graduate Courses

Statistics Graduate Courses Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

More information

Exact Nonparametric Tests for Comparing Means - A Personal Summary

Exact Nonparametric Tests for Comparing Means - A Personal Summary Exact Nonparametric Tests for Comparing Means - A Personal Summary Karl H. Schlag European University Institute 1 December 14, 2006 1 Economics Department, European University Institute. Via della Piazzuola

More information

Statistical Models in R

Statistical Models in R Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Structure of models in R Model Assessment (Part IA) Anova

More information

Introduction. Hypothesis Testing. Hypothesis Testing. Significance Testing

Introduction. Hypothesis Testing. Hypothesis Testing. Significance Testing Introduction Hypothesis Testing Mark Lunt Arthritis Research UK Centre for Ecellence in Epidemiology University of Manchester 13/10/2015 We saw last week that we can never know the population parameters

More information

Applied Statistics. J. Blanchet and J. Wadsworth. Institute of Mathematics, Analysis, and Applications EPF Lausanne

Applied Statistics. J. Blanchet and J. Wadsworth. Institute of Mathematics, Analysis, and Applications EPF Lausanne Applied Statistics J. Blanchet and J. Wadsworth Institute of Mathematics, Analysis, and Applications EPF Lausanne An MSc Course for Applied Mathematicians, Fall 2012 Outline 1 Model Comparison 2 Model

More information

QUANTITATIVE METHODS BIOLOGY FINAL HONOUR SCHOOL NON-PARAMETRIC TESTS

QUANTITATIVE METHODS BIOLOGY FINAL HONOUR SCHOOL NON-PARAMETRIC TESTS QUANTITATIVE METHODS BIOLOGY FINAL HONOUR SCHOOL NON-PARAMETRIC TESTS This booklet contains lecture notes for the nonparametric work in the QM course. This booklet may be online at http://users.ox.ac.uk/~grafen/qmnotes/index.html.

More information

Cancer Biostatistics Workshop Science of Doing Science - Biostatistics

Cancer Biostatistics Workshop Science of Doing Science - Biostatistics Cancer Biostatistics Workshop Science of Doing Science - Biostatistics Yu Shyr, PhD Jan. 18, 2008 Cancer Biostatistics Center Vanderbilt-Ingram Cancer Center Yu.Shyr@vanderbilt.edu Aims Cancer Biostatistics

More information

How To Create An Insight Analysis For Cyber Security

How To Create An Insight Analysis For Cyber Security IBM i2 Enterprise Insight Analysis for Cyber Analysis Protect your organization with cyber intelligence Highlights Quickly identify threats, threat actors and hidden connections with multidimensional analytics

More information

Fixed-Effect Versus Random-Effects Models

Fixed-Effect Versus Random-Effects Models CHAPTER 13 Fixed-Effect Versus Random-Effects Models Introduction Definition of a summary effect Estimating the summary effect Extreme effect size in a large study or a small study Confidence interval

More information

Uncertainty quantification for the family-wise error rate in multivariate copula models

Uncertainty quantification for the family-wise error rate in multivariate copula models Uncertainty quantification for the family-wise error rate in multivariate copula models Thorsten Dickhaus (joint work with Taras Bodnar, Jakob Gierl and Jens Stange) University of Bremen Institute for

More information

t Tests in Excel The Excel Statistical Master By Mark Harmon Copyright 2011 Mark Harmon

t Tests in Excel The Excel Statistical Master By Mark Harmon Copyright 2011 Mark Harmon t-tests in Excel By Mark Harmon Copyright 2011 Mark Harmon No part of this publication may be reproduced or distributed without the express permission of the author. mark@excelmasterseries.com www.excelmasterseries.com

More information

University of Chicago Graduate School of Business. Business 41000: Business Statistics

University of Chicago Graduate School of Business. Business 41000: Business Statistics Name: University of Chicago Graduate School of Business Business 41000: Business Statistics Special Notes: 1. This is a closed-book exam. You may use an 8 11 piece of paper for the formulas. 2. Throughout

More information

. (3.3) n Note that supremum (3.2) must occur at one of the observed values x i or to the left of x i.

. (3.3) n Note that supremum (3.2) must occur at one of the observed values x i or to the left of x i. Chapter 3 Kolmogorov-Smirnov Tests There are many situations where experimenters need to know what is the distribution of the population of their interest. For example, if they want to use a parametric

More information

Intrusion Detection via Machine Learning for SCADA System Protection

Intrusion Detection via Machine Learning for SCADA System Protection Intrusion Detection via Machine Learning for SCADA System Protection S.L.P. Yasakethu Department of Computing, University of Surrey, Guildford, GU2 7XH, UK. s.l.yasakethu@surrey.ac.uk J. Jiang Department

More information

Intrusion Detection: Game Theory, Stochastic Processes and Data Mining

Intrusion Detection: Game Theory, Stochastic Processes and Data Mining Intrusion Detection: Game Theory, Stochastic Processes and Data Mining Joseph Spring 7COM1028 Secure Systems Programming 1 Discussion Points Introduction Firewalls Intrusion Detection Schemes Models Stochastic

More information

Two-sample t-tests. - Independent samples - Pooled standard devation - The equal variance assumption

Two-sample t-tests. - Independent samples - Pooled standard devation - The equal variance assumption Two-sample t-tests. - Independent samples - Pooled standard devation - The equal variance assumption Last time, we used the mean of one sample to test against the hypothesis that the true mean was a particular

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction The area of fault detection and diagnosis is one of the most important aspects in process engineering. This area has received considerable attention from industry and academia because

More information

Simple Second Order Chi-Square Correction

Simple Second Order Chi-Square Correction Simple Second Order Chi-Square Correction Tihomir Asparouhov and Bengt Muthén May 3, 2010 1 1 Introduction In this note we describe the second order correction for the chi-square statistic implemented

More information

Big Data-ready, Secure & Sovereign Cloud

Big Data-ready, Secure & Sovereign Cloud Copernicus Big Data Workshop Big Data-ready, Secure & Sovereign Cloud A Technology Enabler for Copernicus Data Innovation March 14 th, 2014 Brussels F. BOUJEMAA R&D Manager E. MICONNET - Head of Cyber

More information

Module 2 Probability and Statistics

Module 2 Probability and Statistics Module 2 Probability and Statistics BASIC CONCEPTS Multiple Choice Identify the choice that best completes the statement or answers the question. 1. The standard deviation of a standard normal distribution

More information

MONT 107N Understanding Randomness Solutions For Final Examination May 11, 2010

MONT 107N Understanding Randomness Solutions For Final Examination May 11, 2010 MONT 07N Understanding Randomness Solutions For Final Examination May, 00 Short Answer (a) (0) How are the EV and SE for the sum of n draws with replacement from a box computed? Solution: The EV is n times

More information

Sample Size and Power in Clinical Trials

Sample Size and Power in Clinical Trials Sample Size and Power in Clinical Trials Version 1.0 May 011 1. Power of a Test. Factors affecting Power 3. Required Sample Size RELATED ISSUES 1. Effect Size. Test Statistics 3. Variation 4. Significance

More information

Factors affecting online sales

Factors affecting online sales Factors affecting online sales Table of contents Summary... 1 Research questions... 1 The dataset... 2 Descriptive statistics: The exploratory stage... 3 Confidence intervals... 4 Hypothesis tests... 4

More information

Hypothesis testing. c 2014, Jeffrey S. Simonoff 1

Hypothesis testing. c 2014, Jeffrey S. Simonoff 1 Hypothesis testing So far, we ve talked about inference from the point of estimation. We ve tried to answer questions like What is a good estimate for a typical value? or How much variability is there

More information

Internet Safety and Security: Strategies for Building an Internet Safety Wall

Internet Safety and Security: Strategies for Building an Internet Safety Wall Internet Safety and Security: Strategies for Building an Internet Safety Wall Sylvanus A. EHIKIOYA, PhD Director, New Media & Information Security Nigerian Communications Commission Abuja, NIGERIA Internet

More information

UNIVERSITY OF NAIROBI

UNIVERSITY OF NAIROBI UNIVERSITY OF NAIROBI MASTERS IN PROJECT PLANNING AND MANAGEMENT NAME: SARU CAROLYNN ELIZABETH REGISTRATION NO: L50/61646/2013 COURSE CODE: LDP 603 COURSE TITLE: RESEARCH METHODS LECTURER: GAKUU CHRISTOPHER

More information

The effect of cybercrime on a Bank's finances

The effect of cybercrime on a Bank's finances ISSN: 2347-3215 Volume-2 Number 2 (February-2014) pp.173-178 www.ijcrar.com The effect of cybercrime on a Bank's finances A.R. Raghavan 1 and Latha Parthiban 2* 1 Flat no 20, Door no 9, Prashanth Manor,

More information

Lecture Notes Module 1

Lecture Notes Module 1 Lecture Notes Module 1 Study Populations A study population is a clearly defined collection of people, animals, plants, or objects. In psychological research, a study population usually consists of a specific

More information

How To Check For Differences In The One Way Anova

How To Check For Differences In The One Way Anova MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

More information

Hypothesis Testing. Steps for a hypothesis test:

Hypothesis Testing. Steps for a hypothesis test: Hypothesis Testing Steps for a hypothesis test: 1. State the claim H 0 and the alternative, H a 2. Choose a significance level or use the given one. 3. Draw the sampling distribution based on the assumption

More information

Each copy of any part of a JSTOR transmission must contain the same copyright notice that appears on the screen or printed page of such transmission.

Each copy of any part of a JSTOR transmission must contain the same copyright notice that appears on the screen or printed page of such transmission. Robust Tests for the Equality of Variances Author(s): Morton B. Brown and Alan B. Forsythe Source: Journal of the American Statistical Association, Vol. 69, No. 346 (Jun., 1974), pp. 364-367 Published

More information

Data Mining: An Overview. David Madigan http://www.stat.columbia.edu/~madigan

Data Mining: An Overview. David Madigan http://www.stat.columbia.edu/~madigan Data Mining: An Overview David Madigan http://www.stat.columbia.edu/~madigan Overview Brief Introduction to Data Mining Data Mining Algorithms Specific Eamples Algorithms: Disease Clusters Algorithms:

More information

Data Quality Assessment: A Reviewer s Guide EPA QA/G-9R

Data Quality Assessment: A Reviewer s Guide EPA QA/G-9R United States Office of Environmental EPA/240/B-06/002 Environmental Protection Information Agency Washington, DC 20460 Data Quality Assessment: A Reviewer s Guide EPA QA/G-9R FOREWORD This document is

More information

The VAR models discussed so fare are appropriate for modeling I(0) data, like asset returns or growth rates of macroeconomic time series.

The VAR models discussed so fare are appropriate for modeling I(0) data, like asset returns or growth rates of macroeconomic time series. Cointegration The VAR models discussed so fare are appropriate for modeling I(0) data, like asset returns or growth rates of macroeconomic time series. Economic theory, however, often implies equilibrium

More information

Statistics 3202 Introduction to Statistical Inference for Data Analytics 4-semester-hour course

Statistics 3202 Introduction to Statistical Inference for Data Analytics 4-semester-hour course Statistics 3202 Introduction to Statistical Inference for Data Analytics 4-semester-hour course Prerequisite: Stat 3201 (Introduction to Probability for Data Analytics) Exclusions: Class distribution:

More information

Data Science Center Eindhoven. Big Data: Challenges and Opportunities for Mathematicians. Alessandro Di Bucchianico

Data Science Center Eindhoven. Big Data: Challenges and Opportunities for Mathematicians. Alessandro Di Bucchianico Data Science Center Eindhoven Big Data: Challenges and Opportunities for Mathematicians Alessandro Di Bucchianico Dutch Mathematical Congress April 15, 2015 Contents 1. Big Data terminology 2. Various

More information