Program for statistical comparison of histograms

ROOT 3, Saas-Fee, March, Program for statistical comparison of histograms S. Bityukov (IHEP, INR), N. Tsirova (NPI MSU) Scope Introduction Statistical comparison of two histograms Normalized significance Example Distance measure Internal resolution of the method Script stat analyzer.c Missing E T : large difference between histograms P T light jet: only statistical difference Output of the script Conclusion

ROOT 3, Saas-Fee, March, Introduction Sometimes we have two histograms and are faced with the question: Are the realizations of random variables in the form of these histograms taken from the same statistical population? There are many approaches for comparison of two histograms: Kolmogorov-Smirnov test, Cramer-von- Mises test, Anderson-Darling test, Likelihood ratio test for shape and for slope,.... The review on this subject is in the paper Testing Consistency of Two Histograms by F. Porter. http://arxiv.org/abs/84.38 Usually, a special test-statistic is used for comparing the two histograms (for example, χ, NIM, A64 () 87 ).

ROOT 3, Saas-Fee, March, Statistical comparison of two histograms A method, which uses the distribution of test-statistics for each bin of histograms, is proposed in note arxiv:3.65. These test-statistics obeys the distribution close to standard normal distribution if both histograms are obtained from the same statistical population by the same analyzer of data. Correspondingly, the joint distribution must be close to standard normal distribution. For example, if we consider observed values which are produced by the same Poisson flow of events during equal independent time ranges then a test statistic Ŝ c = ( ˆn i ˆn i ), where ˆn i is a number events in bin i of histogram, ˆn i is a number events in bin i of histogram, is the test statistic of such type. It is shown in paper PoS(ACAT8)8. Often test-statistics of such type are named as significances of a deviation or significances of an enhancement.

ROOT 3, Saas-Fee, March, Normalized significance Let us consider a model with two histograms where the random variable in each bin obeys the normal distribution ϕ(x n ik ) = e (x n ik ) σ ik. () πσik Here the expected value in the bin i is equal to n ik and the variance σ n ik is also equal to n ik. k is the histogram number (k =, ). Let integral in the histogram equals N and integral in the histogram equals N (for example, if we have different integrated luminosity in experiments). We define the normalized significance as Ŝ i (K) = ˆn i Kˆn i ˆσ ni + Kˆσ n i. () Here ˆn ik is an observed value in the bin i of the histogram k, ˆσ n ik = ˆn ik and coefficient of normalization K = N N.

ROOT 3, Saas-Fee, March, Example hist h Entries Mean 666.6 RMS 35.8 s_i 3 u Entries Mean 499 RMS 9.3 8 6 4 - - 3 4 5 6 7 8 9-3 3 4 5 6 7 8 9 hist 9 8 7 h Entries Mean 665.8 RMS 35.6 Signif u Entries Mean -.873 RMS.35 6 8 5 4 6 3 3 4 5 6 7 8 9 4-4 - 4 6 8 4 Figure : Triangle distributions (K =, M = ): the observed values ˆn i in the first histogram (left,up), the observed values n i in the second histogram (left, down), observed normalized significances Ŝi bin-by-bin (right, up), the distribution of observed normalized significances (right, down). The example with histograms produced from the same events flow during unequal independent time ranges shows that the standard deviation (RMS in the ROOT notation) of the distribution in the picture (right, down) can be used as estimator of the statistical difference between histograms (this distribution is close to standard normal distribution in our example).

ROOT 3, Saas-Fee, March, Distance measure The RMS as a distance measure in our case has a clear interpretation (in fact, we set a scale of this differmeter ): RMS = histograms are identical; RMS both histograms are obtained (by the using independent samples) from the same parent distribution; RMS >> histograms are obtained from different parent distributions. An accuracy (internal resolution) of the method depends on the number of bins M in histograms and on the normalized coefficient K. The accuracy can be estimated via Monte Carlo experiments. RMS p Entries 5 Mean.5 RMS.4395 χ / ndf.8 / Constant. ±. Mean.4 ±. Sigma.457 ±.48 8 6 4.6.7.8.9...3.4 Figure : Distribution of RMS for 5 comparisons of histograms (triangle distribution, K =, M = 3).

ROOT 3, Saas-Fee, March, Internal resolution of the method The dependencies of the internal resolution on the bin numbers M and on the value of coefficient of normalization K are shown in Fig. 3. standard deviation of RMS & binning for histogram σ RMS.35.3.5..5 The precision of the RMS estimation Monte Carlo (5 trials for each point). K =.5.5 K = 4 6 8 M, bins Figure 3: The dependence of the standard deviation of RMS on number of bins M. This dependence is shown for two values of normalized coefficient K (K = and K =.5).

ROOT 3, Saas-Fee, March, Script stat analyzer.c Two input files (*.root) with a set of THD histograms to compare. User should indicate which histograms he/she wants to compare. Processing: calculate mean value, RMS and p-value for each pair of histograms; sort variables by RMS in descending order; print sorted results (variable p-value RMS mean). Output: to screen; to text file. Commands for start-up of script in ROOT root[].l stat analyzer.c++ root[] stat analyzer( signal rv, signal )

ROOT 3, Saas-Fee, March, Missing E T : large difference between histograms.8 MET, signal s_i 5 Significances, bin-by-bin s Entries Mean 54.33 RMS 3..7.6 5.5.4.3-5. -. 3 4 5 6 7 8 9-5 3 4 5 6 7 8 9.9 MET, anomalous Wtb coupling Signif 4 Distribution of significances d Entries Mean.993 RMS.74.8 3.5.7 3.6.5 p-value =.73499895835e-3.5.4.3.. 3 4 5 6 7 8 9.5.5 - -5 - -5 5 5 Figure 4: MET: the observed values in the first histogram (left,up), the observed values in the second histogram (left, down), observed normalized significances bin-by-bin (right, up), the distribution of observed normalized significances (right, down). Let us consider the distributions of the probability (with errors) of the missing E T to be in corresponding bin of histogram in the case of registration of Standard Model events and the events which are produced due to the presence of anomalous Wtb coupling. The comparison of these distributions is shown in Fig. 4 (RMS =.74, Mean =.99).

ROOT 3, Saas-Fee, March, P T light jet: only statistical difference.6.4 Pt_LJ, signal s_i.5 Significances, bin-by-bin s Entries Mean 83.3 RMS 55.64...5.8.6 -.5.4 -. -.5 4 6 8 4 6 8 4 6 8 4 6 8.6 Pt_LJ, anomalous Wtb coupling Signif 8 Distribution of significances d Entries Mean.446 RMS.984.4.. 7 6 5 p-value =.543645364349959.8.6 4 3.4. 4 6 8 4 6 8 - -8-6 -4-4 6 8 Figure 5: Pt LJ: the observed values in the first histogram (left,up), the observed values in the second histogram (left, down), observed normalized significances bin-by-bin (right, up), the distribution of observed normalized significances (right, down). The case of the absence of the difference between distributions for the production of single top quark in frame of SM and model with anomalous Wtb coupling is shown in Fig. 5 for P t distribution of jet from light quark (RMS =.98, Mean =.4).

ROOT 3, Saas-Fee, March, Output of the script Figure 6: Output of the script

ROOT 3, Saas-Fee, March, Conclusions We discussed the possible tool for comparison of histograms in frame of the ROOT system (script stat analyzer.c). This tool very easy in use and very easy to understand the results. This tool can be used in tasks of monitoring of the equipment. In this moment the method is used for choice of the most significant variables in multivariate analysis. This tool requires the additional development (now the method works for histograms THD).

ROOT 3, Saas-Fee, March, η of lepton: the difference exits.9 Eta_Lep, signal s_i 3 Significances, bin-by-bin s Entries Mean.45 RMS.6696.8.7.6.5.4.3 -.. - -.5 - -.5 - -.5.5.5.5.5.5.5. Eta_Lep, anomalous Wtb coupling Signif 7 Distribution of significances d Entries Mean.67 RMS.374.8 6 p-value =.6558479587 5.6 4.4 3. -.5 - -.5 - -.5.5.5.5 - -8-6 -4-4 6 8 Figure 7: Eta Lep: the observed values in the first histogram (left,up), the observed values in the second histogram (left, down), observed normalized significances bin-by-bin (right, up), the distribution of observed normalized significances (right, down). The case of small difference between the histograms is shown in Fig. 7 (RMS =.37, Mean =.9).