Program for statistical comparison of histograms



Similar documents
Theory versus Experiment. Prof. Jorgen D Hondt Vrije Universiteit Brussel jodhondt@vub.ac.be

Chapter 6: Point Estimation. Fall Probability & Statistics

Predictor Coef StDev T P Constant X S = R-Sq = 0.0% R-Sq(adj) = 0.

12: Analysis of Variance. Introduction

Measurement of Neutralino Mass Differences with CMS in Dilepton Final States at the Benchmark Point LM9

NCSS Statistical Software

Data Analysis Tools. Tools for Summarizing Data

NCSS Statistical Software

Introduction. The Supplement shows results of locking using a range of smoothing parameters α, and checkerboard tests.

Bowerman, O'Connell, Aitken Schermer, & Adcock, Business Statistics in Practice, Canadian edition

5 Analysis of Variance models, complex linear models and Random effects models

Probability Distribution for Discrete Random Variables

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

1.5 Oneway Analysis of Variance

Statistical Functions in Excel

Recall this chart that showed how most of our course would be organized:

Face Recognition in Low-resolution Images by Using Local Zernike Moments

Java Modules for Time Series Analysis

How To Test For Significance On A Data Set

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

Web-based pre-analysis Tools

Measurement of the Mass of the Top Quark in the l+ Jets Channel Using the Matrix Element Method

MEASURES OF LOCATION AND SPREAD

Top rediscovery at ATLAS and CMS

Chapter 3 RANDOM VARIATE GENERATION

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)

Statistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013

Simple Linear Regression

Determining Measurement Uncertainty for Dimensional Measurements

Intro Root and statistics

Adequacy of Biomath. Models. Empirical Modeling Tools. Bayesian Modeling. Model Uncertainty / Selection

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median

How To Run Statistical Tests in Excel

Normality Testing in Excel

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Northumberland Knowledge

Lecture 25. December 19, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University.

2 Binomial, Poisson, Normal Distribution

ANALYZING NETWORK TRAFFIC FOR MALICIOUS ACTIVITY

Nonparametric Statistics

5. Linear Regression

Modeling Individual Claims for Motor Third Party Liability of Insurance Companies in Albania

Millikan Oil Drop Experiment Matthew Norton, Jurasits Christopher, Heyduck William, Nick Chumbley. Norton 0

LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING

ON MODELING INSURANCE CLAIMS

HLM software has been one of the leading statistical packages for hierarchical

Quantitative Methods for Finance

Data Preparation and Statistical Displays

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Basic Probability and Statistics Review. Six Sigma Black Belt Primer

Using R for Linear Regression

Prentice Hall Connected Mathematics 2, 7th Grade Units 2009

Notes on the Negative Binomial Distribution

Copyright PEOPLECERT Int. Ltd and IASSC

CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS

Bill Burton Albert Einstein College of Medicine April 28, 2014 EERS: Managing the Tension Between Rigor and Resources 1

Multiple Choice: 2 points each

Gerry Hobbs, Department of Statistics, West Virginia University

Descriptive Statistics

Statistical Distributions in Astronomy

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

Geostatistics Exploratory Analysis

Need for Sampling. Very large populations Destructive testing Continuous production process

Two-Group Hypothesis Tests: Excel 2013 T-TEST Command

Sampling Distributions

A Case Study in Software Enhancements as Six Sigma Process Improvements: Simulating Productivity Savings

Real Time Tracking with ATLAS Silicon Detectors and its Applications to Beauty Hadron Physics

Generalized Linear Models

Simulation Exercises to Reinforce the Foundations of Statistical Thinking in Online Classes

The Assumption(s) of Normality

Simulating Investment Portfolios

Performance Metrics for Graph Mining Tasks

Study of the B D* ℓ ν with the Partial Reconstruction Technique

Some Essential Statistics The Lure of Statistics

STAT 360 Probability and Statistics. Fall 2012

Binomial Distribution n = 20, p = 0.3

Good luck! BUSINESS STATISTICS FINAL EXAM INSTRUCTIONS. Name:

CRLS Mathematics Department Algebra I Curriculum Map/Pacing Guide

The Turning of JMP Software into a Semiconductor Analysis Software Product:

Multiple Optimization Using the JMP Statistical Software Kodak Research Conference May 9, 2005

Winter 2016 MATH 631 Online University of Waterloo

Condensing PQ Data and Visualization Analytics. Thomas Cooke

STAT 830 Convergence in Distribution

Chapter 13 Introduction to Linear Regression and Correlation Analysis

Analysis of Variance. MINITAB User s Guide 2 3-1

Statistical Process Control (SPC) Training Guide

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools

2. Simple Linear Regression

99.37, 99.38, 99.38, 99.39, 99.39, 99.39, 99.39, 99.40, 99.41, cm

Supplementary PROCESS Documentation

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Financial Econometrics MFE MATLAB Introduction. Kevin Sheppard University of Oxford

How To Find The Higgs Boson

RISK MITIGATION IN FAST TRACKING PROJECTS

Transcription:

ROOT 3, Saas-Fee, March, Program for statistical comparison of histograms S. Bityukov (IHEP, INR), N. Tsirova (NPI MSU) Scope Introduction Statistical comparison of two histograms Normalized significance Example Distance measure Internal resolution of the method Script stat analyzer.c Missing E T : large difference between histograms P T light jet: only statistical difference Output of the script Conclusion

ROOT 3, Saas-Fee, March, Introduction Sometimes we have two histograms and are faced with the question: Are the realizations of random variables in the form of these histograms taken from the same statistical population? There are many approaches for comparison of two histograms: Kolmogorov-Smirnov test, Cramer-von- Mises test, Anderson-Darling test, Likelihood ratio test for shape and for slope,.... The review on this subject is in the paper Testing Consistency of Two Histograms by F. Porter. http://arxiv.org/abs/84.38 Usually, a special test-statistic is used for comparing the two histograms (for example, χ, NIM, A64 () 87 ).

ROOT 3, Saas-Fee, March, Statistical comparison of two histograms A method, which uses the distribution of test-statistics for each bin of histograms, is proposed in note arxiv:3.65. These test-statistics obeys the distribution close to standard normal distribution if both histograms are obtained from the same statistical population by the same analyzer of data. Correspondingly, the joint distribution must be close to standard normal distribution. For example, if we consider observed values which are produced by the same Poisson flow of events during equal independent time ranges then a test statistic Ŝ c = ( ˆn i ˆn i ), where ˆn i is a number events in bin i of histogram, ˆn i is a number events in bin i of histogram, is the test statistic of such type. It is shown in paper PoS(ACAT8)8. Often test-statistics of such type are named as significances of a deviation or significances of an enhancement.

ROOT 3, Saas-Fee, March, Normalized significance Let us consider a model with two histograms where the random variable in each bin obeys the normal distribution ϕ(x n ik ) = e (x n ik ) σ ik. () πσik Here the expected value in the bin i is equal to n ik and the variance σ n ik is also equal to n ik. k is the histogram number (k =, ). Let integral in the histogram equals N and integral in the histogram equals N (for example, if we have different integrated luminosity in experiments). We define the normalized significance as Ŝ i (K) = ˆn i Kˆn i ˆσ ni + Kˆσ n i. () Here ˆn ik is an observed value in the bin i of the histogram k, ˆσ n ik = ˆn ik and coefficient of normalization K = N N.

ROOT 3, Saas-Fee, March, Example hist h Entries Mean 666.6 RMS 35.8 s_i 3 u Entries Mean 499 RMS 9.3 8 6 4 - - 3 4 5 6 7 8 9-3 3 4 5 6 7 8 9 hist 9 8 7 h Entries Mean 665.8 RMS 35.6 Signif u Entries Mean -.873 RMS.35 6 8 5 4 6 3 3 4 5 6 7 8 9 4-4 - 4 6 8 4 Figure : Triangle distributions (K =, M = ): the observed values ˆn i in the first histogram (left,up), the observed values n i in the second histogram (left, down), observed normalized significances Ŝi bin-by-bin (right, up), the distribution of observed normalized significances (right, down). The example with histograms produced from the same events flow during unequal independent time ranges shows that the standard deviation (RMS in the ROOT notation) of the distribution in the picture (right, down) can be used as estimator of the statistical difference between histograms (this distribution is close to standard normal distribution in our example).

ROOT 3, Saas-Fee, March, Distance measure The RMS as a distance measure in our case has a clear interpretation (in fact, we set a scale of this differmeter ): RMS = histograms are identical; RMS both histograms are obtained (by the using independent samples) from the same parent distribution; RMS >> histograms are obtained from different parent distributions. An accuracy (internal resolution) of the method depends on the number of bins M in histograms and on the normalized coefficient K. The accuracy can be estimated via Monte Carlo experiments. RMS p Entries 5 Mean.5 RMS.4395 χ / ndf.8 / Constant. ±. Mean.4 ±. Sigma.457 ±.48 8 6 4.6.7.8.9...3.4 Figure : Distribution of RMS for 5 comparisons of histograms (triangle distribution, K =, M = 3).

ROOT 3, Saas-Fee, March, Internal resolution of the method The dependencies of the internal resolution on the bin numbers M and on the value of coefficient of normalization K are shown in Fig. 3. standard deviation of RMS & binning for histogram σ RMS.35.3.5..5 The precision of the RMS estimation Monte Carlo (5 trials for each point). K =.5.5 K = 4 6 8 M, bins Figure 3: The dependence of the standard deviation of RMS on number of bins M. This dependence is shown for two values of normalized coefficient K (K = and K =.5).

ROOT 3, Saas-Fee, March, Script stat analyzer.c Two input files (*.root) with a set of THD histograms to compare. User should indicate which histograms he/she wants to compare. Processing: calculate mean value, RMS and p-value for each pair of histograms; sort variables by RMS in descending order; print sorted results (variable p-value RMS mean). Output: to screen; to text file. Commands for start-up of script in ROOT root[].l stat analyzer.c++ root[] stat analyzer( signal rv, signal )

ROOT 3, Saas-Fee, March, Missing E T : large difference between histograms.8 MET, signal s_i 5 Significances, bin-by-bin s Entries Mean 54.33 RMS 3..7.6 5.5.4.3-5. -. 3 4 5 6 7 8 9-5 3 4 5 6 7 8 9.9 MET, anomalous Wtb coupling Signif 4 Distribution of significances d Entries Mean.993 RMS.74.8 3.5.7 3.6.5 p-value =.73499895835e-3.5.4.3.. 3 4 5 6 7 8 9.5.5 - -5 - -5 5 5 Figure 4: MET: the observed values in the first histogram (left,up), the observed values in the second histogram (left, down), observed normalized significances bin-by-bin (right, up), the distribution of observed normalized significances (right, down). Let us consider the distributions of the probability (with errors) of the missing E T to be in corresponding bin of histogram in the case of registration of Standard Model events and the events which are produced due to the presence of anomalous Wtb coupling. The comparison of these distributions is shown in Fig. 4 (RMS =.74, Mean =.99).

ROOT 3, Saas-Fee, March, P T light jet: only statistical difference.6.4 Pt_LJ, signal s_i.5 Significances, bin-by-bin s Entries Mean 83.3 RMS 55.64...5.8.6 -.5.4 -. -.5 4 6 8 4 6 8 4 6 8 4 6 8.6 Pt_LJ, anomalous Wtb coupling Signif 8 Distribution of significances d Entries Mean.446 RMS.984.4.. 7 6 5 p-value =.543645364349959.8.6 4 3.4. 4 6 8 4 6 8 - -8-6 -4-4 6 8 Figure 5: Pt LJ: the observed values in the first histogram (left,up), the observed values in the second histogram (left, down), observed normalized significances bin-by-bin (right, up), the distribution of observed normalized significances (right, down). The case of the absence of the difference between distributions for the production of single top quark in frame of SM and model with anomalous Wtb coupling is shown in Fig. 5 for P t distribution of jet from light quark (RMS =.98, Mean =.4).

ROOT 3, Saas-Fee, March, Output of the script Figure 6: Output of the script

ROOT 3, Saas-Fee, March, Conclusions We discussed the possible tool for comparison of histograms in frame of the ROOT system (script stat analyzer.c). This tool very easy in use and very easy to understand the results. This tool can be used in tasks of monitoring of the equipment. In this moment the method is used for choice of the most significant variables in multivariate analysis. This tool requires the additional development (now the method works for histograms THD).

ROOT 3, Saas-Fee, March, η of lepton: the difference exits.9 Eta_Lep, signal s_i 3 Significances, bin-by-bin s Entries Mean.45 RMS.6696.8.7.6.5.4.3 -.. - -.5 - -.5 - -.5.5.5.5.5.5.5. Eta_Lep, anomalous Wtb coupling Signif 7 Distribution of significances d Entries Mean.67 RMS.374.8 6 p-value =.6558479587 5.6 4.4 3. -.5 - -.5 - -.5.5.5.5 - -8-6 -4-4 6 8 Figure 7: Eta Lep: the observed values in the first histogram (left,up), the observed values in the second histogram (left, down), observed normalized significances bin-by-bin (right, up), the distribution of observed normalized significances (right, down). The case of small difference between the histograms is shown in Fig. 7 (RMS =.37, Mean =.9).