# Monte Carlo testing with Big Data

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Monte Carlo testing with Big Data Patrick Rubin-Delanchy University of Bristol & Heilbronn Institute for Mathematical Research Joint work with: Axel Gandy (Imperial College London) with contributions from: Nick Heard (Imperial College London), Dan Lawson (University of Bristol), Niall Adams (Imperial College London), Melissa Turcotte (Los Alamos National Laboratory) 8th January / 21

2 Michael Jordan: When you have large amounts of data, your appetite for hypotheses tends to get even larger. (IEEE Spectrum, 20th October 2014). His point is about the risk of spurious discoveries when the number of possible hypotheses grows exponentially. Whilst our focus is mostly on the algorithmic challenge of implementing many tests, the theory of multiple testing is at the heart of our approach. 2 / 21

3 The hypothesis testing framework The null hypothesis, H 0, gives a probabilistic description of the data if there is no effect. The alternative, H 1, describes the case where the effect is present. We have a test statistic t which would tend be small under H 0, large under H 1. The p-value of the test is where T is a replicate of t under H 0. p = P(T t), It is straightforward to prove that p is uniformly distributed on [0, 1] under H 0 (if T is absolutely continuous). A low p-value, e.g. 1/100, is considered evidence in favour of the alternative. 3 / 21

4 Monte Carlo test Often P(T t) cannot be calculated exactly. A typical solution: simulate replicates of T under the null hypothesis, T 1,..., T N, and estimate ˆp = 1 N I(T j t), N where I is the indicator function, N large. j=1 4 / 21

5 Permutation test Let A and B denote two groups of data. Under H 0, the data are exchangeable the A/B labelling is arbitrary. Let t be some measure of discrepancy between A and B (e.g. the difference in means). Simulate T j by randomly relabelling the data, and recomputing the discrepancy. Estimate ˆp = 1 N N I(T j t). j=1 5 / 21

6 The permutation test essentially works by combining two important ideas: exchangeability and conditioning. 1 If the false positive rate is controlled at a level α (often 5%) given X = x, for all x, then it is also controlled marginally. 2 The simulated T j, in the permutation test, come from conditioning on the observed values, but not their labelling as A or B. More generally, my view is that exchangeability and other sorts of stochastic orders are key to robust inference on Big Data. 6 / 21

7 Running a large number of Monte Carlo tests Some possible principles: 1 Make each individual test faster, see e.g. Gandy (2009). 2 View task as a resource allocation problem: define an error and try to minimise its mean square. Choosing the error without taking into account the special nature the problem, e.g. using the Euclidean distance from the true vector of p-values, doesn t make much sense. For example, we could easily end up investing most of the effort on p-values around 0.5, which are typically not of interest. I will argue that for some Big Data problems it is natural to define the error on the basis of a method to combine p-values. 7 / 21

8 Running a large number of Monte Carlo tests Some possible principles: 1 Make each individual test faster, see e.g. Gandy (2009). 2 View task as a resource allocation problem: define an error and try to minimise its mean square. Choosing the error without taking into account the special nature the problem, e.g. using the Euclidean distance from the true vector of p-values, doesn t make much sense. For example, we could easily end up investing most of the effort on p-values around 0.5, which are typically not of interest. I will argue that for some Big Data problems it is natural to define the error on the basis of a method to combine p-values. 7 / 21

9 Running a large number of Monte Carlo tests Some possible principles: 1 Make each individual test faster, see e.g. Gandy (2009). 2 View task as a resource allocation problem: define an error and try to minimise its mean square. Choosing the error without taking into account the special nature the problem, e.g. using the Euclidean distance from the true vector of p-values, doesn t make much sense. For example, we could easily end up investing most of the effort on p-values around 0.5, which are typically not of interest. I will argue that for some Big Data problems it is natural to define the error on the basis of a method to combine p-values. 7 / 21

10 Combining p-values Consider the global hypothesis test: H 0 : all hypotheses hold, the p-values are independent and uniformly distributed, versus, H 1 : not all null hypotheses hold. Combining p-values is testing the p-values, rather than the original data. Note that we are using the p-values together to make one, global, decision. We are ignoring the local errors of accepting or rejecting each individual hypothesis. This can be reasonable if: 1 The data are heterogeneous there isn t an obvious way to combine them into an overall test statistic. 2 There is too much data a divide-and-conquer strategy is necessary. 8 / 21

11 Some well-known test statistics and their distributions under H 0 (when known): 1 Fisher s method: 2 log(p i ) χ 2 2m (Mosteller and Fisher, 1948). 2 Stouffer s score: Φ 1 (1 p i )/ m normal(0, 1) (Stouffer et al., 1949). 3 Simes s test: min{p (i) m/i} uniform[0, 1] (Simes, 1989). 4 Donoho s statistic (Donoho and Jin, 2004): / max m{i/m p(i) } p (i) {1 p (i) }. 1 i α 0 m 9 / 21

12 Probabilistic framework 1 Let p = (p 1,..., p m ) T be the vector of true (and unobservable) p-values. 2 We can construct any number of independent Bernoulli variables with success probability p i, for i = 1,..., m. 3 Let f be the function used to combine p-values, and let var N {f (ˆp)} = N 1/2 {f (ˆp) f (p)}, where N is the total number of samples. 4 We want to adaptively allocate the number of samples per test in order to minimise var N {f (ˆp)}. 10 / 21

13 Potential gains over even allocation For tests that focus attention on one p-value, e.g. min{p (i) m/i} (Simes), we can reduce the asymptotic variance by a factor of m (the number of tests). This is also true for some smooth functions of the p-values, including Fisher and Stouffer: Lemma If there exists i 1,..., m such that min k i sup i f (x) / k f (x) =, x [0,1] m then var {f (ˆp naïve )} sup p i (0,1) m var {f (ˆp opt )} = m. 11 / 21

14 Outline of algorithm For known p and f we can calculate the optimal asymptotic allocation. E.g. with Fisher s method p i should get {p i /(1 p i )} 1/2 of samples. In batches of : 1 Decide how to allocate samples amongst m streams using current ˆp. 2 Sample test replicates. 3 Update ˆp. Various optimisations are possible (and necessary), e.g. how to estimate p, how to estimate allocation from p, whether we try to correct for previously wrongly allocated effort. The main point is that our algorithm seems to achieve the optimal asymptotic variance. 12 / 21

15 Outline of algorithm For known p and f we can calculate the optimal asymptotic allocation. E.g. with Fisher s method p i should get {p i /(1 p i )} 1/2 of samples. In batches of : 1 Decide how to allocate samples amongst m streams using current ˆp. 2 Sample test replicates. 3 Update ˆp. Various optimisations are possible (and necessary), e.g. how to estimate p, how to estimate allocation from p, whether we try to correct for previously wrongly allocated effort. The main point is that our algorithm seems to achieve the optimal asymptotic variance. 12 / 21

16 Outline of algorithm For known p and f we can calculate the optimal asymptotic allocation. E.g. with Fisher s method p i should get {p i /(1 p i )} 1/2 of samples. In batches of : 1 Decide how to allocate samples amongst m streams using current ˆp. 2 Sample test replicates. 3 Update ˆp. Various optimisations are possible (and necessary), e.g. how to estimate p, how to estimate allocation from p, whether we try to correct for previously wrongly allocated effort. The main point is that our algorithm seems to achieve the optimal asymptotic variance. 12 / 21

17 Outline of algorithm For known p and f we can calculate the optimal asymptotic allocation. E.g. with Fisher s method p i should get {p i /(1 p i )} 1/2 of samples. In batches of : 1 Decide how to allocate samples amongst m streams using current ˆp. 2 Sample test replicates. 3 Update ˆp. Various optimisations are possible (and necessary), e.g. how to estimate p, how to estimate allocation from p, whether we try to correct for previously wrongly allocated effort. The main point is that our algorithm seems to achieve the optimal asymptotic variance. 12 / 21

18 Cyber-security example: change detection in Netflow Traffic from one computer on Imperial College s network (with thanks to Andy Thomas) over a day. The data has an artificial changepoint where the user knowingly changed behaviour. We split the computer s traffic by edge (the other IP address), bin the data per hour, and throw away any edge with less than three bins. This results in approximately 100 time series. 13 / 21

19 Cyber-security example: change detection in Netflow cont d 1 For any proposed changepoint, count the absolute difference between the number of flows before and after the changepoint on each edge, resulting in a statistic t i. 2 For each edge: 1 Randomly permute the binned data. 2 Re-compute the absolute difference between the number of flows, resulting in a simulated statistic T ij. 3 Get a running estimate of the changepoint p-value for that edge, ˆp i. 3 Use our algorithm to combine the p-values using Fisher s score. 14 / 21

20 p value 1e 07 1e 05 1e 03 1e Time Figure : Overall p-value in favour of a changepoint 15 / 21

21 Time Figure : Most significant edges. Samples taken: 15039, 14767, 11598, 7985, / 21

22 Time Figure : Least significant edges. Samples taken: 55, 56, 58, 50, / 21

23 Other applications in Big Data Detecting changes in user authentification graphs (joint work with Melissa Turcotte and Los Alamos National Laboratory). Hypothesis tests on networks (e.g. finding link predictors, scoring a graph motif). Information flow (e.g. correlations in timing, matching words or phrases). Generic model-checking and anomaly detection. and much more / 21

24 Conclusion The proposed procedure has two advantages: 1 it focusses simulation effort on the interesting hypotheses. 2 it gives a running estimate of overall significance of the experiment, guarding against multiple testing. Some questions: how to integrate with a search (e.g. for a changepoint)? Can this procedure arrive at significance in less samples than a theoretically optimal test? 19 / 21

25 David Donoho and Jiashun Jin. Higher criticism for detecting sparse heterogeneous mixtures. Annals of Statistics, pages , Axel Gandy. Sequential implementation of Monte Carlo tests with uniformly bounded resampling risk. Journal of the American Statistical Association, 104(488): , Michael I. Jordan. Machine-learning maestro Michael Jordan on the delusions of Big Data and other huge engineering efforts. IEEE Spectrum: Accessed: / 21

26 Frederick Mosteller and R. A. Fisher. Questions and answers. The American Statistician, 2(5):pp , R John Simes. An improved bonferroni procedure for multiple tests of significance. Biometrika, 73(3): , Samuel A Stouffer, Edward A Suchman, Leland C DeVinney, Shirley A Star, and Robin M Williams Jr. The American soldier: adjustment during army life.(studies in social psychology in World War II, Vol. 1.). Princeton Univ. Press, / 21

### Finding statistical patterns in Big Data

Finding statistical patterns in Big Data Patrick Rubin-Delanchy University of Bristol & Heilbronn Institute for Mathematical Research IAS Research Workshop: Data science for the real world (workshop 1)

### Anomaly detection for Big Data, networks and cyber-security

Anomaly detection for Big Data, networks and cyber-security Patrick Rubin-Delanchy University of Bristol & Heilbronn Institute for Mathematical Research Joint work with Nick Heard (Imperial College London),

### False Discovery Rates

False Discovery Rates John D. Storey Princeton University, Princeton, USA January 2010 Multiple Hypothesis Testing In hypothesis testing, statistical significance is typically based on calculations involving

### Combining Weak Statistical Evidence in Cyber Security

Combining Weak Statistical Evidence in Cyber Security Nick Heard Department of Mathematics, Imperial College London; Heilbronn Institute for Mathematical Research, University of Bristol Intelligent Data

### Multiple Testing. Joseph P. Romano, Azeem M. Shaikh, and Michael Wolf. Abstract

Multiple Testing Joseph P. Romano, Azeem M. Shaikh, and Michael Wolf Abstract Multiple testing refers to any instance that involves the simultaneous testing of more than one hypothesis. If decisions about

### Least Squares Estimation

Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

### The Variability of P-Values. Summary

The Variability of P-Values Dennis D. Boos Department of Statistics North Carolina State University Raleigh, NC 27695-8203 boos@stat.ncsu.edu August 15, 2009 NC State Statistics Departement Tech Report

### GLM, insurance pricing & big data: paying attention to convergence issues.

GLM, insurance pricing & big data: paying attention to convergence issues. Michaël NOACK - michael.noack@addactis.com Senior consultant & Manager of ADDACTIS Pricing Copyright 2014 ADDACTIS Worldwide.

### Detection of changes in variance using binary segmentation and optimal partitioning

Detection of changes in variance using binary segmentation and optimal partitioning Christian Rohrbeck Abstract This work explores the performance of binary segmentation and optimal partitioning in the

### Package HHG. July 14, 2015

Type Package Package HHG July 14, 2015 Title Heller-Heller-Gorfine Tests of Independence and Equality of Distributions Version 1.5.1 Date 2015-07-13 Author Barak Brill & Shachar Kaufman, based in part

### Median of the p-value Under the Alternative Hypothesis

Median of the p-value Under the Alternative Hypothesis Bhaskar Bhattacharya Department of Mathematics, Southern Illinois University, Carbondale, IL, USA Desale Habtzghi Department of Statistics, University

### On Small Sample Properties of Permutation Tests: A Significance Test for Regression Models

On Small Sample Properties of Permutation Tests: A Significance Test for Regression Models Hisashi Tanizaki Graduate School of Economics Kobe University (tanizaki@kobe-u.ac.p) ABSTRACT In this paper we

### Notes for STA 437/1005 Methods for Multivariate Data

Notes for STA 437/1005 Methods for Multivariate Data Radford M. Neal, 26 November 2010 Random Vectors Notation: Let X be a random vector with p elements, so that X = [X 1,..., X p ], where denotes transpose.

### Bootstrapping Big Data

Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu

### Introduction to Hypothesis Testing. Point estimation and confidence intervals are useful statistical inference procedures.

Introduction to Hypothesis Testing Point estimation and confidence intervals are useful statistical inference procedures. Another type of inference is used frequently used concerns tests of hypotheses.

### Invariance and optimality Linear rank statistics Permutation tests in R. Rank Tests. Patrick Breheny. October 7. STA 621: Nonparametric Statistics

Rank Tests October 7 Power Invariance and optimality Permutation testing allows great freedom to use a wide variety of test statistics, all of which lead to exact level-α tests regardless of the distribution

### Correlational Research

Correlational Research Chapter Fifteen Correlational Research Chapter Fifteen Bring folder of readings The Nature of Correlational Research Correlational Research is also known as Associational Research.

### Monte Carlo Simulation

1 Monte Carlo Simulation Stefan Weber Leibniz Universität Hannover email: sweber@stochastik.uni-hannover.de web: www.stochastik.uni-hannover.de/ sweber Monte Carlo Simulation 2 Quantifying and Hedging

### A SURVEY ON CONTINUOUS ELLIPTICAL VECTOR DISTRIBUTIONS

A SURVEY ON CONTINUOUS ELLIPTICAL VECTOR DISTRIBUTIONS Eusebio GÓMEZ, Miguel A. GÓMEZ-VILLEGAS and J. Miguel MARÍN Abstract In this paper it is taken up a revision and characterization of the class of

### Information Theory and Stock Market

Information Theory and Stock Market Pongsit Twichpongtorn University of Illinois at Chicago E-mail: ptwich2@uic.edu 1 Abstract This is a short survey paper that talks about the development of important

### Descriptive Data Summarization

Descriptive Data Summarization (Understanding Data) First: Some data preprocessing problems... 1 Missing Values The approach of the problem of missing values adopted in SQL is based on nulls and three-valued

### Quantitative Biology Lecture 5 (Hypothesis Testing)

15 th Oct 2015 Quantitative Biology Lecture 5 (Hypothesis Testing) Gurinder Singh Mickey Atwal Center for Quantitative Biology Summary Classification Errors Statistical significance T-tests Q-values (Traditional)

### 4. Joint Distributions

Virtual Laboratories > 2. Distributions > 1 2 3 4 5 6 7 8 4. Joint Distributions Basic Theory As usual, we start with a random experiment with probability measure P on an underlying sample space. Suppose

### Econometric Analysis of Cross Section and Panel Data Second Edition. Jeffrey M. Wooldridge. The MIT Press Cambridge, Massachusetts London, England

Econometric Analysis of Cross Section and Panel Data Second Edition Jeffrey M. Wooldridge The MIT Press Cambridge, Massachusetts London, England Preface Acknowledgments xxi xxix I INTRODUCTION AND BACKGROUND

### Exact Nonparametric Tests for Comparing Means - A Personal Summary

Exact Nonparametric Tests for Comparing Means - A Personal Summary Karl H. Schlag European University Institute 1 December 14, 2006 1 Economics Department, European University Institute. Via della Piazzuola

### Message-passing sequential detection of multiple change points in networks

Message-passing sequential detection of multiple change points in networks Long Nguyen, Arash Amini Ram Rajagopal University of Michigan Stanford University ISIT, Boston, July 2012 Nguyen/Amini/Rajagopal

### Checklists and Examples for Registering Statistical Analyses

Checklists and Examples for Registering Statistical Analyses For well-designed confirmatory research, all analysis decisions that could affect the confirmatory results should be planned and registered

### Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don t have to worry about

### Package cpm. July 28, 2015

Package cpm July 28, 2015 Title Sequential and Batch Change Detection Using Parametric and Nonparametric Methods Version 2.2 Date 2015-07-09 Depends R (>= 2.15.0), methods Author Gordon J. Ross Maintainer

### A Perspective on Statistical Tools for Data Mining Applications

A Perspective on Statistical Tools for Data Mining Applications David M. Rocke Center for Image Processing and Integrated Computing University of California, Davis Statistics and Data Mining Statistics

### Chapter 1. Introduction

Chapter 1 Introduction 1.1. Motivation Network performance analysis, and the underlying queueing theory, was born at the beginning of the 20th Century when two Scandinavian engineers, Erlang 1 and Engset

### Intrusion Detection: Game Theory, Stochastic Processes and Data Mining

Intrusion Detection: Game Theory, Stochastic Processes and Data Mining Joseph Spring 7COM1028 Secure Systems Programming 1 Discussion Points Introduction Firewalls Intrusion Detection Schemes Models Stochastic

### A Study of Local Optima in the Biobjective Travelling Salesman Problem

A Study of Local Optima in the Biobjective Travelling Salesman Problem Luis Paquete, Marco Chiarandini and Thomas Stützle FG Intellektik, Technische Universität Darmstadt, Alexanderstr. 10, Darmstadt,

### Multivariate Analysis of Ecological Data

Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology

### Data Analysis and Uncertainty Part 3: Hypothesis Testing/Sampling

Data Analysis and Uncertainty Part 3: Hypothesis Testing/Sampling Instructor: Sargur N. University at Buffalo The State University of New York srihari@cedar.buffalo.edu Topics 1. Hypothesis Testing 1.

### How to Conduct a Hypothesis Test

How to Conduct a Hypothesis Test The idea of hypothesis testing is relatively straightforward. In various studies we observe certain events. We must ask, is the event due to chance alone, or is there some

### 1. How different is the t distribution from the normal?

Statistics 101 106 Lecture 7 (20 October 98) c David Pollard Page 1 Read M&M 7.1 and 7.2, ignoring starred parts. Reread M&M 3.2. The effects of estimated variances on normal approximations. t-distributions.

### Poisson Models for Count Data

Chapter 4 Poisson Models for Count Data In this chapter we study log-linear models for count data under the assumption of a Poisson error structure. These models have many applications, not only to the

### Probabilistic Methods for Time-Series Analysis

Probabilistic Methods for Time-Series Analysis 2 Contents 1 Analysis of Changepoint Models 1 1.1 Introduction................................ 1 1.1.1 Model and Notation....................... 2 1.1.2 Example:

### Component Ordering in Independent Component Analysis Based on Data Power

Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals

### Hypothesis tests, confidence intervals, and bootstrapping

Hypothesis tests, confidence intervals, and bootstrapping Business Statistics 41000 Fall 2015 1 Topics 1. Hypothesis tests Testing a mean: H0 : µ = µ 0 Testing a proportion: H0 : p = p 0 Testing a difference

### LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING

LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING In this lab you will explore the concept of a confidence interval and hypothesis testing through a simulation problem in engineering setting.

### 3.6: General Hypothesis Tests

3.6: General Hypothesis Tests The χ 2 goodness of fit tests which we introduced in the previous section were an example of a hypothesis test. In this section we now consider hypothesis tests more generally.

### Uncertainty quantification for the family-wise error rate in multivariate copula models

Uncertainty quantification for the family-wise error rate in multivariate copula models Thorsten Dickhaus (joint work with Taras Bodnar, Jakob Gierl and Jens Stange) University of Bremen Institute for

### Multiple testing with gene expression array data

Multiple testing with gene expression array data Anja von Heydebreck Max Planck Institute for Molecular Genetics, Dept. Computational Molecular Biology, Berlin, Germany heydebre@molgen.mpg.de Slides partly

### Test of Hypotheses. Since the Neyman-Pearson approach involves two statistical hypotheses, one has to decide which one

Test of Hypotheses Hypothesis, Test Statistic, and Rejection Region Imagine that you play a repeated Bernoulli game: you win \$1 if head and lose \$1 if tail. After 10 plays, you lost \$2 in net (4 heads

### fifty Fathoms Statistics Demonstrations for Deeper Understanding Tim Erickson

fifty Fathoms Statistics Demonstrations for Deeper Understanding Tim Erickson Contents What Are These Demos About? How to Use These Demos If This Is Your First Time Using Fathom Tutorial: An Extended Example

### We shall turn our attention to solving linear systems of equations. Ax = b

59 Linear Algebra We shall turn our attention to solving linear systems of equations Ax = b where A R m n, x R n, and b R m. We already saw examples of methods that required the solution of a linear system

### Counting the number of Sudoku s by importance sampling simulation

Counting the number of Sudoku s by importance sampling simulation Ad Ridder Vrije University, Amsterdam, Netherlands May 31, 2013 Abstract Stochastic simulation can be applied to estimate the number of

### Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

### E205 Final: Version B

Name: Class: Date: E205 Final: Version B Multiple Choice Identify the choice that best completes the statement or answers the question. 1. The owner of a local nightclub has recently surveyed a random

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

### MODELING RANDOMNESS IN NETWORK TRAFFIC

MODELING RANDOMNESS IN NETWORK TRAFFIC - LAVANYA JOSE, INDEPENDENT WORK FALL 11 ADVISED BY PROF. MOSES CHARIKAR ABSTRACT. Sketches are randomized data structures that allow one to record properties of

### . (3.3) n Note that supremum (3.2) must occur at one of the observed values x i or to the left of x i.

Chapter 3 Kolmogorov-Smirnov Tests There are many situations where experimenters need to know what is the distribution of the population of their interest. For example, if they want to use a parametric

### Chapter 16 Multiple Choice Questions (The answers are provided after the last question.)

Chapter 16 Multiple Choice Questions (The answers are provided after the last question.) 1. Which of the following symbols represents a population parameter? a. SD b. σ c. r d. 0 2. If you drew all possible

### Femtocell Coverage Optimisation Using Statistical Verification

Femtocell Coverage Optimisation Using Statistical Verification Tiejun Ma and Peter Pietzuch Department of Computing, Imperial College London, London SW7 2AZ, United Kingdom {tma, prp}@doc.ic.ac.uk Abstract.

### Optimal Gateway Selection in Multi-domain Wireless Networks: A Potential Game Perspective

Optimal Gateway Selection in Multi-domain Wireless Networks: A Potential Game Perspective Yang Song, Starsky H.Y. Wong, and Kang-Won Lee Wireless Networking Research Group IBM T. J. Watson Research Center

### The Bonferonni and Šidák Corrections for Multiple Comparisons

The Bonferonni and Šidák Corrections for Multiple Comparisons Hervé Abdi 1 1 Overview The more tests we perform on a set of data, the more likely we are to reject the null hypothesis when it is true (i.e.,

### Report on application of Probability in Risk Analysis in Oil and Gas Industry

Report on application of Probability in Risk Analysis in Oil and Gas Industry Abstract Risk Analysis in Oil and Gas Industry Global demand for energy is rising around the world. Meanwhile, managing oil

### Data Mining. Nonlinear Classification

Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

### Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Unit # 10 Sajjad Haider Fall 2012 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

### Chapter 7 Section 1 Homework Set A

Chapter 7 Section 1 Homework Set A 7.15 Finding the critical value t *. What critical value t * from Table D (use software, go to the web and type t distribution applet) should be used to calculate the

### Competitive Analysis of QoS Networks

Competitive Analysis of QoS Networks What is QoS? The art of performance analysis What is competitive analysis? Example: Scheduling with deadlines Example: Smoothing real-time streams Example: Overflow

### An introduction to OBJECTIVE ASSESSMENT OF IMAGE QUALITY. Harrison H. Barrett University of Arizona Tucson, AZ

An introduction to OBJECTIVE ASSESSMENT OF IMAGE QUALITY Harrison H. Barrett University of Arizona Tucson, AZ Outline! Approaches to image quality! Why not fidelity?! Basic premises of the task-based approach!

### Handling attrition and non-response in longitudinal data

Longitudinal and Life Course Studies 2009 Volume 1 Issue 1 Pp 63-72 Handling attrition and non-response in longitudinal data Harvey Goldstein University of Bristol Correspondence. Professor H. Goldstein

### Department of Economics

Department of Economics On Testing for Diagonality of Large Dimensional Covariance Matrices George Kapetanios Working Paper No. 526 October 2004 ISSN 1473-0278 On Testing for Diagonality of Large Dimensional

### An Introduction to Using WinBUGS for Cost-Effectiveness Analyses in Health Economics

Slide 1 An Introduction to Using WinBUGS for Cost-Effectiveness Analyses in Health Economics Dr. Christian Asseburg Centre for Health Economics Part 1 Slide 2 Talk overview Foundations of Bayesian statistics

### p Values and Alternative Boundaries

p Values and Alternative Boundaries for CUSUM Tests Achim Zeileis Working Paper No. 78 December 2000 December 2000 SFB Adaptive Information Systems and Modelling in Economics and Management Science Vienna

### Statistics in Retail Finance. Chapter 6: Behavioural models

Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics:- Behavioural

Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@csberkeleyedu Elad Hazan IBM Almaden Research Center 650

### Lecture Notes on Spanning Trees

Lecture Notes on Spanning Trees 15-122: Principles of Imperative Computation Frank Pfenning Lecture 26 April 26, 2011 1 Introduction In this lecture we introduce graphs. Graphs provide a uniform model

### NPTEL STRUCTURAL RELIABILITY

NPTEL Course On STRUCTURAL RELIABILITY Module # 02 Lecture 6 Course Format: Web Instructor: Dr. Arunasis Chakraborty Department of Civil Engineering Indian Institute of Technology Guwahati 6. Lecture 06:

### Calculating Interval Forecasts

Calculating Chapter 7 (Chatfield) Monika Turyna & Thomas Hrdina Department of Economics, University of Vienna Summer Term 2009 Terminology An interval forecast consists of an upper and a lower limit between

### Package dunn.test. January 6, 2016

Version 1.3.2 Date 2016-01-06 Package dunn.test January 6, 2016 Title Dunn's Test of Multiple Comparisons Using Rank Sums Author Alexis Dinno Maintainer Alexis Dinno

### Permutation Tests for Comparing Two Populations

Permutation Tests for Comparing Two Populations Ferry Butar Butar, Ph.D. Jae-Wan Park Abstract Permutation tests for comparing two populations could be widely used in practice because of flexibility of

### Permutation P-values Should Never Be Zero: Calculating Exact P-values When Permutations Are Randomly Drawn

Permutation P-values Should Never Be Zero: Calculating Exact P-values When Permutations Are Randomly Drawn Gordon K. Smyth & Belinda Phipson Walter and Eliza Hall Institute of Medical Research Melbourne,

### Machine Learning Big Data using Map Reduce

Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories

### Generating Random Samples from the Generalized Pareto Mixture Model

Generating Random Samples from the Generalized Pareto Mixture Model MUSTAFA ÇAVUŞ AHMET SEZER BERNA YAZICI Department of Statistics Anadolu University Eskişehir 26470 TURKEY mustafacavus@anadolu.edu.tr

### Managing uncertainty in call centers using Poisson mixtures

Managing uncertainty in call centers using Poisson mixtures Geurt Jongbloed and Ger Koole Vrije Universiteit, Division of Mathematics and Computer Science De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands

### LAB 4 ASSIGNMENT CONFIDENCE INTERVALS AND HYPOTHESIS TESTING. Using Data to Make Decisions

LAB 4 ASSIGNMENT CONFIDENCE INTERVALS AND HYPOTHESIS TESTING This lab assignment will give you the opportunity to explore the concept of a confidence interval and hypothesis testing in the context of a

### The Effect Of Different Degrees Of Freedom Of The Chi-square Distribution On The Statistical Power Of The t, Permutation t, And Wilcoxon Tests

Journal of Modern Applied Statistical Methods Volume 6 Issue 2 Article 9 11-1-2007 The Effect Of Different Degrees Of Freedom Of The Chi-square Distribution On The Statistical Of The t, Permutation t,

### Tutorial 5: Hypothesis Testing

Tutorial 5: Hypothesis Testing Rob Nicholls nicholls@mrc-lmb.cam.ac.uk MRC LMB Statistics Course 2014 Contents 1 Introduction................................ 1 2 Testing distributional assumptions....................

### The Relative Worst Order Ratio for On-Line Algorithms

The Relative Worst Order Ratio for On-Line Algorithms Joan Boyar 1 and Lene M. Favrholdt 2 1 Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark, joan@imada.sdu.dk

### From the help desk: Bootstrapped standard errors

The Stata Journal (2003) 3, Number 1, pp. 71 80 From the help desk: Bootstrapped standard errors Weihua Guan Stata Corporation Abstract. Bootstrapping is a nonparametric approach for evaluating the distribution

### Adjusted Means in Analysis of Covariance: Are They Meaningful?

Clason & Mundfrom Adjusted Means in Analysis of Covariance: Are They Meaningful? Dennis L. Clason Daniel J. Mundfrom New Mexico State University Eastern Kentucky University In many data analytical applications

### The alternative hypothesis,, is the statement that the parameter value somehow differs from that claimed by the null hypothesis. : 0.5 :>0.5 :<0.

Section 8.2-8.5 Null and Alternative Hypotheses... The null hypothesis,, is a statement that the value of a population parameter is equal to some claimed value. :=0.5 The alternative hypothesis,, is the

### Dealing with large datasets

Dealing with large datasets (by throwing away most of the data) Alan Heavens Institute for Astronomy, University of Edinburgh with Ben Panter, Rob Tweedie, Mark Bastin, Will Hossack, Keith McKellar, Trevor

### Learning Vector Quantization: generalization ability and dynamics of competing prototypes

Learning Vector Quantization: generalization ability and dynamics of competing prototypes Aree Witoelar 1, Michael Biehl 1, and Barbara Hammer 2 1 University of Groningen, Mathematics and Computing Science

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

### Randomized algorithms

Randomized algorithms March 10, 2005 1 What are randomized algorithms? Algorithms which use random numbers to make decisions during the executions of the algorithm. Why would we want to do this?? Deterministic

### HOW TO WRITE A LABORATORY REPORT

HOW TO WRITE A LABORATORY REPORT Pete Bibby Dept of Psychology 1 About Laboratory Reports The writing of laboratory reports is an essential part of the practical course One function of this course is to

### ANALYZING NETWORK TRAFFIC FOR MALICIOUS ACTIVITY

CANADIAN APPLIED MATHEMATICS QUARTERLY Volume 12, Number 4, Winter 2004 ANALYZING NETWORK TRAFFIC FOR MALICIOUS ACTIVITY SURREY KIM, 1 SONG LI, 2 HONGWEI LONG 3 AND RANDALL PYKE Based on work carried out

### The Basics of Graphical Models

The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures

### Efficient Fault-Tolerant Infrastructure for Cloud Computing

Efficient Fault-Tolerant Infrastructure for Cloud Computing Xueyuan Su Candidate for Ph.D. in Computer Science, Yale University December 2013 Committee Michael J. Fischer (advisor) Dana Angluin James Aspnes

### OBJECTIVE ASSESSMENT OF FORECASTING ASSIGNMENTS USING SOME FUNCTION OF PREDICTION ERRORS

OBJECTIVE ASSESSMENT OF FORECASTING ASSIGNMENTS USING SOME FUNCTION OF PREDICTION ERRORS CLARKE, Stephen R. Swinburne University of Technology Australia One way of examining forecasting methods via assignments

### Composite performance measures in the public sector Rowena Jacobs, Maria Goddard and Peter C. Smith

Policy Discussion Briefing January 27 Composite performance measures in the public sector Rowena Jacobs, Maria Goddard and Peter C. Smith Introduction It is rare to open a newspaper or read a government

### Chapter 21. More About Tests and Intervals. Copyright 2012, 2008, 2005 Pearson Education, Inc.

Chapter 21 More About Tests and Intervals Copyright 2012, 2008, 2005 Pearson Education, Inc. Zero In on the Null Null hypotheses have special requirements. To perform a hypothesis test, the null must be

### Inferential Statistics

Inferential Statistics Sampling and the normal distribution Z-scores Confidence levels and intervals Hypothesis testing Commonly used statistical methods Inferential Statistics Descriptive statistics are