ROOT 3, Saas-Fee, March

Program for statistical comparison of histograms
S. Bityukov (IHEP, INR), N. Tsirova (NPI MSU)

Scope
  Introduction
  Statistical comparison of two histograms
  Normalized significance
  Example
  Distance measure
  Internal resolution of the method
  Script stat_analyzer.c
  Missing E_T: large difference between histograms
  P_T light jet: only statistical difference
  Output of the script
  Conclusion
Introduction

Sometimes we have two histograms and face the question: are the realizations of random variables recorded in these histograms drawn from the same statistical population?

There are many approaches to comparing two histograms: the Kolmogorov-Smirnov test, the Cramér-von Mises test, the Anderson-Darling test, likelihood ratio tests for shape and for slope, and others. A review of the subject is given in the paper "Testing Consistency of Two Histograms" by F. Porter.

Usually a dedicated test statistic is used to compare the two histograms (for example, χ², NIM, A64 () 87).
Statistical comparison of two histograms

A method that uses the distribution of test statistics computed for each bin of the histograms is proposed in the note arxiv:3.65. Each of these test statistics obeys a distribution close to the standard normal distribution if both histograms are obtained from the same statistical population by the same analyzer of data. Correspondingly, the joint distribution of the test statistics over all bins must also be close to the standard normal distribution.

For example, if we consider observed values produced by the same Poisson flow of events during equal independent time ranges, then a test statistic $\hat{S}_c$ computed from $\hat{n}_{i1}$ and $\hat{n}_{i2}$, where $\hat{n}_{i1}$ is the number of events in bin $i$ of histogram 1 and $\hat{n}_{i2}$ is the number of events in bin $i$ of histogram 2, is a test statistic of this type. This is shown in the paper PoS(ACAT8)8.

Test statistics of this type are often called significances of a deviation or significances of an enhancement.
Normalized significance

Let us consider a model with two histograms where the random variable in each bin obeys the normal distribution

$$\varphi(x \mid n_{ik}) = \frac{1}{\sqrt{2\pi}\,\sigma_{ik}}\, e^{-\frac{(x - n_{ik})^2}{2\sigma_{ik}^2}}. \qquad (1)$$

Here the expected value in bin $i$ is equal to $n_{ik}$ and the variance $\sigma^2_{n_{ik}}$ (that is, $\sigma_{ik}^2$ in Eq. (1)) is also equal to $n_{ik}$; $k$ is the histogram number ($k = 1, 2$).

Let the integral of histogram 1 be $N_1$ and the integral of histogram 2 be $N_2$ (for example, if the experiments have different integrated luminosities). We define the normalized significance as

$$\hat{S}_i(K) = \frac{\hat{n}_{i1} - K\hat{n}_{i2}}{\sqrt{\hat{\sigma}^2_{n_{i1}} + K^2\hat{\sigma}^2_{n_{i2}}}}. \qquad (2)$$

Here $\hat{n}_{ik}$ is the observed value in bin $i$ of histogram $k$, $\hat{\sigma}^2_{n_{ik}} = \hat{n}_{ik}$, and the coefficient of normalization is $K = N_1 / N_2$.
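The definition of the normalized significance can be illustrated with a minimal sketch. It assumes that each histogram is given as a plain list of bin contents, that the integral of a histogram is the sum of its bins, and that a bin that is empty in both histograms gets significance 0. The function name `normalized_significances` and the toy bin contents are hypothetical and are not taken from the script discussed in this talk.

```python
import math


def normalized_significances(h1, h2):
    """Return (K, [S_i]) for two histograms given as lists of bin contents.

    K = N1 / N2 is the ratio of the histogram integrals, and
    S_i = (n_i1 - K * n_i2) / sqrt(n_i1 + K^2 * n_i2),
    using the estimate sigma^2 = n for every bin.
    """
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    n1_total = sum(h1)
    n2_total = sum(h2)
    if n2_total == 0:
        raise ValueError("second histogram is empty")
    k = n1_total / n2_total
    significances = []
    for a, b in zip(h1, h2):
        variance = a + k * k * b
        if variance == 0:
            # Assumption: a bin empty in both histograms contributes 0.
            significances.append(0.0)
        else:
            significances.append((a - k * b) / math.sqrt(variance))
    return k, significances


# Toy input: the second histogram has the same shape and half the statistics.
k_same, s_same = normalized_significances([10, 20, 30, 20, 10], [5, 10, 15, 10, 5])
# Toy input: equal integrals, mirrored shape.
k_diff, s_diff = normalized_significances([9, 16], [16, 9])
print(k_same, s_same)
print(k_diff, s_diff)
```

With the first toy input the shapes agree exactly after scaling by K, so every significance vanishes; with the second the two bins deviate in opposite directions.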
Example

Figure 1: Triangle distributions (K =, M = ): the observed values $\hat{n}_{i1}$ in the first histogram (left, up), the observed values $\hat{n}_{i2}$ in the second histogram (left, down), the observed normalized significances $\hat{S}_i$ bin-by-bin (right, up), the distribution of the observed normalized significances (right, down).

The example, with histograms produced from the same flow of events during unequal independent time ranges, shows that the standard deviation (RMS in ROOT notation) of the distribution in the lower right panel can be used as an estimator of the statistical difference between the histograms (in our example this distribution is close to the standard normal distribution).
Distance measure

The RMS as a distance measure has a clear interpretation in our case (in fact, we set the scale of this "differmeter"):

  RMS = 0: the histograms are identical;
  RMS ≈ 1: both histograms are obtained (using independent samples) from the same parent distribution;
  RMS >> 1: the histograms are obtained from different parent distributions.

The accuracy (internal resolution) of the method depends on the number of bins M in the histograms and on the normalization coefficient K. The accuracy can be estimated via Monte Carlo experiments.

Figure 2: Distribution of RMS for 5 comparisons of histograms (triangle distribution, K =, M = 3).
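A Monte Carlo estimate of this kind can be sketched as follows. This is only an illustration under stated assumptions: both histograms are filled from a symmetric triangle distribution on [0, 1], RMS is taken as the population standard deviation of the bin-by-bin significances, and bins that are empty in both histograms are skipped. The function names, the default sample size and the number of trials are hypothetical choices and not the settings used for the figures in this talk.

```python
import math
import random
import statistics


def fill_histogram(n_events, n_bins, rng):
    """Fill n_bins equal-width bins on [0, 1] from a triangle distribution."""
    counts = [0] * n_bins
    for _ in range(n_events):
        x = rng.triangular(0.0, 1.0, 0.5)
        counts[min(int(x * n_bins), n_bins - 1)] += 1
    return counts


def significance_rms(h1, h2):
    """Standard deviation of the normalized significances of two histograms."""
    k = sum(h1) / sum(h2)
    values = []
    for a, b in zip(h1, h2):
        variance = a + k * k * b
        if variance > 0:  # assumption: skip bins empty in both histograms
            values.append((a - k * b) / math.sqrt(variance))
    return statistics.pstdev(values)


def rms_resolution(n_bins, k, n_events=10000, trials=200, seed=1):
    """Return (mean, standard deviation) of RMS over repeated comparisons."""
    rng = random.Random(seed)
    rms_values = []
    for _ in range(trials):
        h1 = fill_histogram(n_events, n_bins, rng)
        # The second sample is smaller by the factor k, so that N1 / N2 is about k.
        h2 = fill_histogram(int(n_events / k), n_bins, rng)
        rms_values.append(significance_rms(h1, h2))
    return statistics.mean(rms_values), statistics.pstdev(rms_values)


mean_rms, sigma_rms = rms_resolution(n_bins=20, k=1.0, n_events=2000, trials=100)
print(f"mean RMS = {mean_rms:.3f}, sigma of RMS = {sigma_rms:.3f}")
```

Since both samples come from the same parent distribution, the mean RMS should stay near 1, and the spread of RMS over the trials plays the role of the internal resolution. Calling `rms_resolution` for several values of `n_bins` and `k` gives a rough scan of how the resolution depends on M and K.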
Internal resolution of the method

The dependence of the internal resolution on the number of bins M and on the value of the normalization coefficient K is shown in Fig. 3.

[Plot "standard deviation of RMS & binning for histogram": σ_RMS versus M, bins. The precision of the RMS estimation, Monte Carlo (5 trials for each point).]

Figure 3: The dependence of the standard deviation of RMS on the number of bins M. This dependence is shown for two values of the normalization coefficient K (K = and K =.5).
Script stat_analyzer.c

Input: two files (*.root) with a set of TH1D histograms to compare. The user indicates which histograms to compare.

Processing:
  calculate the mean value, RMS and p-value for each pair of histograms;
  sort the variables by RMS in descending order;
  print the sorted results (variable, p-value, RMS, mean).

Output:
  to screen;
  to a text file.

Commands to start the script in ROOT:

  root .l stat_analyzer.c++
  root stat_analyzer("signal_rv", "signal")
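The processing steps of the script can be sketched in a few lines. This is not the script itself: it is a minimal illustration that assumes each variable is given as a pair of plain bin-content lists, that RMS is the population standard deviation of the normalized significances, and that the ranking key is RMS alone. The p-value is left out because its definition is not given on the slide. The names `compare`, `rank_variables`, `var_same` and `var_shifted`, and the toy bin contents, are hypothetical.

```python
import math
import statistics


def compare(h1, h2):
    """Return (mean, rms) of the bin-by-bin normalized significances."""
    k = sum(h1) / sum(h2)
    values = [
        (a - k * b) / math.sqrt(a + k * k * b)
        for a, b in zip(h1, h2)
        if a + k * k * b > 0  # assumption: skip bins empty in both histograms
    ]
    return statistics.mean(values), statistics.pstdev(values)


def rank_variables(pairs):
    """Return rows (name, mean, rms) sorted by rms in descending order."""
    rows = [(name,) + compare(h1, h2) for name, (h1, h2) in pairs.items()]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Toy input, not taken from the talk: one identical pair, one shifted pair.
pairs = {
    "var_same": ([100, 200, 100], [100, 200, 100]),
    "var_shifted": ([100, 200, 100], [200, 100, 100]),
}
ranking = rank_variables(pairs)
for name, mean, rms in ranking:
    print(f"{name:12s} RMS = {rms:.3f}  mean = {mean:.3f}")
```

Sorting by RMS in descending order puts the variables whose histograms differ most at the top of the list, which is how such a ranking could feed the choice of variables for a multivariate analysis.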
Missing E_T: large difference between histograms

Figure 4: MET: the observed values in the first histogram, "MET, signal" (left, up), the observed values in the second histogram, "MET, anomalous Wtb coupling" (left, down), the observed normalized significances bin-by-bin (right, up), the distribution of the observed normalized significances (right, down).

Let us consider the distributions of the probability (with errors) for the missing E_T to fall in the corresponding histogram bin, for Standard Model events and for events produced in the presence of an anomalous Wtb coupling. The comparison of these distributions is shown in Fig. 4 (RMS =.74, Mean =.99).
P_T light jet: only statistical difference

Figure 5: Pt_LJ: the observed values in the first histogram, "Pt_LJ, signal" (left, up), the observed values in the second histogram, "Pt_LJ, anomalous Wtb coupling" (left, down), the observed normalized significances bin-by-bin (right, up), the distribution of the observed normalized significances (right, down).

The case where there is no difference between the distributions for single top quark production in the SM and in the model with an anomalous Wtb coupling is shown in Fig. 5 for the P_T distribution of the jet from a light quark (RMS =.98, Mean =.4).
Output of the script

Figure 6: Output of the script
Conclusions

We discussed a possible tool for comparing histograms within the ROOT system (the script stat_analyzer.c).
The tool is very easy to use and its results are very easy to understand.
It can be used in equipment monitoring tasks.
At the moment the method is used to select the most significant variables in multivariate analysis.
The tool requires additional development (currently the method works for TH1D histograms).
η of lepton: the difference exists

Figure 7: Eta_Lep: the observed values in the first histogram, "Eta_Lep, signal" (left, up), the observed values in the second histogram, "Eta_Lep, anomalous Wtb coupling" (left, down), the observed normalized significances bin-by-bin (right, up), the distribution of the observed normalized significances (right, down).

The case of a small difference between the histograms is shown in Fig. 7 (RMS =.37, Mean =.9).