Plug-in martingales for testing exchangeability on-line




Valentina Fedorova, Alex Gammerman, Ilia Nouretdinov, and Vladimir Vovk
Computer Learning Research Centre, Royal Holloway, University of London, UK
{valentina,ilia,alex,vovk}@cs.rhul.ac.uk

"The practical conclusions of probability theory can be justified as consequences of hypotheses about the limiting complexity, under given constraints, of the phenomena under study."

On-line Compression Modelling Project (New Series), Working Paper #4, April 13, 2012
Project web site: http://alrw.net

Abstract

A standard assumption in machine learning is the exchangeability of data, which is equivalent to assuming that the examples are generated from the same probability distribution independently. This paper is devoted to testing the assumption of exchangeability on-line: the examples arrive one by one, and after receiving each example we would like to have a valid measure of the degree to which the assumption of exchangeability has been falsified. Such measures are provided by exchangeability martingales. We extend known techniques for constructing exchangeability martingales and show that our new method is competitive with the martingales introduced before. Finally, we investigate the performance of our testing method on two benchmark datasets, USPS and Statlog Satellite data; for the former, the known techniques give satisfactory results, but for the latter our new, more flexible method becomes necessary.

Contents

1 Introduction
  1.1 Related work
  1.2 This paper
2 Exchangeability martingales
  2.1 Exchangeability
  2.2 Martingales for testing
  2.3 On-line calculation of p-values
3 Martingales based on p-values
  3.1 Previous results: power and simple mixture martingales
  3.2 New plug-in approach
    3.2.1 Plug-in martingale
    3.2.2 Assumptions
    3.2.3 Growth rate of plug-in martingale
4 Empirical results
  4.1 USPS dataset
  4.2 Statlog Satellite dataset
5 Discussion and conclusions

1 Introduction

Many machine learning algorithms have been developed to deal with real-life high dimensional data. In order to state and prove properties of such algorithms it is standard to assume that the data satisfy the exchangeability assumption (although some algorithms make different assumptions or, in the case of prediction with expert advice, do not make any statistical assumptions at all). These properties can be violated if the assumption is not satisfied, which makes it important to test whether the data satisfy it.

Note that the popular assumption that the data is i.i.d. (independent and identically distributed) has the same meaning for testing as the exchangeability assumption. A joint distribution of an infinite sequence of examples is exchangeable if it is invariant with respect to any permutation of examples. Hence if the data is i.i.d., its distribution is exchangeable. On the other hand, by de Finetti's theorem (see, e.g., Schervish, 1995, p. 28) any exchangeable distribution on the data (a potentially infinite sequence of examples) is a mixture of distributions under which the data is i.i.d. Therefore, testing for exchangeability is equivalent to testing for being i.i.d.

Traditional statistical approaches to testing are inappropriate for high dimensional data (see, e.g., Vapnik, 1998, pp. 6–7). To address this challenge a previous study (Vovk et al., 2003) suggested a way of on-line testing by employing the theory of conformal prediction and calculating exchangeability martingales. Testing proceeds in two steps. The first step is implemented by a conformal predictor that outputs a sequence of p-values. The sequence is generated in the on-line mode: examples are presented one by one and for each new example a p-value is calculated from this and all the previous examples. For the second step the authors introduced exchangeability martingales that are functions of the p-values and track the deviation from the assumption. Once the martingale grows to a large value (20 and 100 are convenient rules of thumb) the exchangeability assumption can be rejected for the data.
This paper proposes a new way of constructing martingales in the second step of testing. To construct an exchangeability martingale based on the sequence of p-values we need a betting function, which determines the contribution of a p-value to the value of the martingale. In contrast to the previous studies, which use a fixed betting function, the new martingale tunes its betting function to the sequence in order to detect any deviation from the assumption. We show that this martingale, which we call a plug-in martingale, is competitive with all the martingales covered by the previous studies; namely, asymptotically the former grows faster than the latter.

1.1 Related work

The first procedure for testing exchangeability on-line is described in Vovk et al. (2003). The core testing mechanism is an exchangeability martingale. Exchangeability martingales are constructed using a sequence of p-values. The algorithm for generating p-values assigns small p-values to unusual examples.

This leads to the idea of designing martingales that take a large value if too many small p-values are generated, and Vovk et al. (2003) suggest corresponding power martingales. Other martingales (simple mixture and sleepy jumper) implemented more complicated strategies, but followed the same idea of scoring on small p-values. Ho (2005) applied power martingales to the problem of change detection in time-varying data streams. The author shows that small p-values inflate the martingale values and suggests using the martingale difference as another test for the problem.

1.2 This paper

To the best of our knowledge, no study has aimed to find any other ways of translating p-values into a martingale value. In this paper we propose a new, more flexible method of constructing martingales for the given sequence of p-values. The rest of the paper is organized as follows. Section 2 gives the definition of exchangeability martingales. Section 3 presents the construction of plug-in exchangeability martingales, explains the rationale behind them, and compares them to the power martingales that have been used previously. Section 4 shows experimental results of testing two real-life datasets for exchangeability; for one of these datasets power martingales work satisfactorily and for the other one the greater flexibility of plug-in martingales becomes essential. Section 5 summarises the work.

2 Exchangeability martingales

This section outlines the necessary definitions and results of the previous studies.

2.1 Exchangeability

Consider a sequence of random variables $(Z_1, Z_2, \ldots)$ that all take values in the same example space. Then the joint probability distribution $P(Z_1, \ldots, Z_N)$ of a finite number of the random variables is exchangeable if it is invariant under any permutation of the random variables. The joint distribution of an infinite number of random variables $(Z_1, Z_2, \ldots)$ is exchangeable if the marginal distribution $P(Z_1, \ldots, Z_N)$ is exchangeable for every $N$.

2.2 Martingales for testing

As in Vovk et al. (2003), the main tool for testing exchangeability on-line is a martingale.
The value of the martingale reflects the strength of evidence against the exchangeability assumption. An exchangeability martingale is a sequence of non-negative random variables $S_0, S_1, \ldots$ that keep the conditional expectation:

$$S_n \ge 0, \qquad S_n = E(S_{n+1} \mid S_1, \ldots, S_n),$$

where $E$ refers to the expected value with respect to any exchangeable distribution on examples. We also assume $S_0 = 1$. To understand the idea behind martingale testing we can imagine a game where a player starts from a capital of 1, places bets on the outcomes of a sequence of events, and never risks bankruptcy. Then a martingale corresponds to a strategy of the player, and its value reflects the acquired capital. According to Ville's inequality (see Ville, 1939, p. 100),

$$P\{\exists n : S_n \ge C\} \le 1/C, \qquad \forall C \ge 1,$$

it is unlikely for any $S_n$ to have a large value. For the problem of testing exchangeability, if the final value of a martingale is large then the exchangeability assumption for the data can be rejected with the corresponding probability.

2.3 On-line calculation of p-values

Let $(z_1, z_2, \ldots)$ denote a sequence of examples. Each example $z_i$ is the vector representing a set of attributes $x_i$ and a label $y_i$: $z_i = (x_i, y_i)$. In this paper we use conformal predictors to generate a sequence of p-values that corresponds to the given examples. The general idea of conformal prediction is to test how well a new example fits the previously observed examples. For this purpose a nonconformity measure is defined. This is a function that estimates the strangeness of one example with respect to others:

$$\alpha_i = A\bigl(z_i, \{z_1, \ldots, z_n\}\bigr),$$

where in general $\{\ldots\}$ stands for a multiset (the same element may be repeated more than once) rather than a set. Typically, each example is assigned a nonconformity score $\alpha_i$ based on some prediction method. In this paper we deal with the classification problem, and the 1-Nearest Neighbor (1-NN) algorithm is used as the underlying method to compute the nonconformity scores. A natural way to define the nonconformity score of an example is by comparing its distance to the examples with the same label to its distance to the examples with a different label:

$$\alpha_i = \frac{\min_{j \ne i : y_i = y_j} d(x_i, x_j)}{\min_{j \ne i : y_i \ne y_j} d(x_i, x_j)},$$

where $d(x_i, x_j)$ is the Euclidean distance.
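The 1-NN nonconformity score above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `nn_nonconformity`, the representation of an example as an `(attributes, label)` pair, and the conventions for the degenerate cases (no other example with a different label, or a zero distance to a differently labelled example) are all assumptions made here.

```python
import math

def nn_nonconformity(examples, i):
    """1-NN nonconformity score of examples[i] within the bag `examples`.

    Each example is a pair (x, y): x is a tuple of numbers, y is a label.
    Returns (distance to nearest same-label example) divided by
    (distance to nearest different-label example).
    """
    x_i, y_i = examples[i]
    d_same = math.inf  # min over an empty set is taken to be +infinity
    d_diff = math.inf
    for j, (x_j, y_j) in enumerate(examples):
        if j == i:
            continue
        d = math.dist(x_i, x_j)  # Euclidean distance
        if y_j == y_i:
            d_same = min(d_same, d)
        else:
            d_diff = min(d_diff, d)
    # Conventions for degenerate cases (assumptions, not from the paper):
    if d_diff == 0.0:
        return math.inf  # same attributes, different label: maximally strange
    if math.isinf(d_diff):
        return 0.0       # no differently labelled example seen yet
    return d_same / d_diff
```

A score well above 1 means the example sits closer to another class than to its own; a score near 0 means it is well inside its own class.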
The 1-NN is a simple algorithm but it works well enough in many cases (see, e.g., Hastie et al., 2001, pp. 422–427). According to the chosen nonconformity measure, $\alpha_i$ is high if the example is close to another example with a different label and far from any examples with the same label.

Using the calculated nonconformity scores of all observed examples, the p-value $p_n$ that corresponds to an example $z_n$ is calculated as

$$p_n = \frac{\#\{i : \alpha_i > \alpha_n\} + \theta_n \, \#\{i : \alpha_i = \alpha_n\}}{n},$$

where $\theta_n$ is a random number from $[0, 1]$ and the symbol $\#$ means the cardinality of a set. Algorithm 1 summarises the process of on-line calculation of p-values (it is clear that it can also be applied to a finite dataset $(z_1, \ldots, z_n)$, producing a finite sequence $(p_1, \ldots, p_n)$ of p-values).

Algorithm 1 Generating p-values on-line
Input: $(z_1, z_2, \ldots)$ data for testing
Output: $(p_1, p_2, \ldots)$ sequence of p-values
for $i = 1, 2, \ldots$ do
    observe a new example $z_i$
    for $j = 1$ to $i$ do
        $\alpha_j = A\bigl(z_j, \{z_1, \ldots, z_i\}\bigr)$
    end for
    $p_i = \bigl(\#\{j : \alpha_j > \alpha_i\} + \theta_i \, \#\{j : \alpha_j = \alpha_i\}\bigr) / i$
end for

The following is a standard result in the theory of conformal prediction (see, e.g., Vovk et al., 2003, Theorem 1).

Theorem 1. If examples $(z_1, z_2, \ldots)$ (resp. $(z_1, z_2, \ldots, z_n)$) satisfy the exchangeability assumption, Algorithm 1 produces p-values $(p_1, p_2, \ldots)$ (resp. $(p_1, p_2, \ldots, p_n)$) that are independent and uniformly distributed in $[0, 1]$.

The property that the examples generated by an exchangeable distribution provide uniformly and independently distributed p-values allows us to test exchangeability by calculating martingales as functions of the p-values.

3 Martingales based on p-values

This section focuses on the second part of testing: given the sequence of p-values, a martingale is calculated as a function of the p-values. For each $i \in \{1, 2, \ldots\}$, let $f_i : [0, 1]^i \to [0, \infty)$. Let $(p_1, p_2, \ldots)$ be the sequence of p-values generated by Algorithm 1. We consider martingales $S_n$ of the form

$$S_n = \prod_{i=1}^{n} f_i(p_i), \qquad n = 1, 2, \ldots, \tag{1}$$

where we denote $f_i(p) = f_i(p_1, \ldots, p_{i-1}, p)$ and call the function $f_i(p)$ a betting function. To be sure that (1) is indeed a martingale we need the following constraint on the betting functions $f_i$:

$$\int_0^1 f_i(p) \, dp = 1, \qquad i = 1, 2, \ldots$$
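Returning to Algorithm 1, its loop can be sketched in Python as below. This is an illustrative sketch under stated assumptions: the function name `online_p_values`, the calling convention `nonconformity(bag, j)`, and the toy measure `distance_to_mean` for scalar examples are hypothetical and only serve to make the block self-contained; the paper itself uses the 1-NN measure of Section 2.3.

```python
import random

def distance_to_mean(bag, j):
    """Toy nonconformity measure for scalar examples (hypothetical):
    distance of bag[j] from the mean of the whole bag."""
    return abs(bag[j] - sum(bag) / len(bag))

def online_p_values(examples, nonconformity, rng=None):
    """Smoothed conformal p-values, one per example, computed on-line.

    `nonconformity(bag, j)` must return the score of bag[j] within bag.
    """
    rng = rng or random.Random()
    p_values = []
    for n in range(1, len(examples) + 1):
        seen = examples[:n]  # z_1, ..., z_n
        # Recompute all scores, since each depends on the whole bag.
        alphas = [nonconformity(seen, j) for j in range(n)]
        a_n = alphas[-1]
        greater = sum(1 for a in alphas if a > a_n)
        equal = sum(1 for a in alphas if a == a_n)  # counts z_n itself
        theta = rng.random()  # uniform on [0, 1), used for tie-breaking
        p_values.append((greater + theta * equal) / n)
    return p_values
```

For example, `online_p_values([0.0, 0.0, 0.0, 10.0], distance_to_mean)` gives the outlying last example a p-value below 0.25, because no earlier score exceeds its own.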

With this constraint the martingale property can be checked directly:

$$E(S_{n+1} \mid S_0, \ldots, S_n) = \int_0^1 \prod_{i=1}^{n} f_i(p_i) \, f_{n+1}(p) \, dp = \prod_{i=1}^{n} f_i(p_i) \int_0^1 f_{n+1}(p) \, dp = \prod_{i=1}^{n} f_i(p_i) = S_n.$$

Using representation (1) we can update the martingale on-line: having calculated a p-value $p_i$ for a new example in Algorithm 1, the current martingale value becomes $S_i = S_{i-1} \cdot f_i(p_i)$. To define the martingales completely we need to describe the betting functions $f_i$.

3.1 Previous results: power and simple mixture martingales

Previous studies (Vovk et al., 2003) proposed to use a fixed betting function

$$\forall i : \; f_i(p) = \varepsilon p^{\varepsilon - 1},$$

where $\varepsilon \in [0, 1]$. Several martingales were constructed using this function. The power martingale for some $\varepsilon$, denoted $M_n^{\varepsilon}$, was defined as

$$M_n^{\varepsilon} = \prod_{i=1}^{n} \varepsilon p_i^{\varepsilon - 1}.$$

The simple mixture martingale, denoted $M_n$, is the mixture of power martingales over different $\varepsilon \in [0, 1]$:

$$M_n = \int_0^1 M_n^{\varepsilon} \, d\varepsilon.$$

We note that such a martingale will grow only if there are many small p-values in the sequence. This follows from the shape of the betting functions (see Figure 1). If the generated p-values concentrate in any other part of the unit interval, we cannot expect the martingale to grow. So it might be difficult to reject the assumption of exchangeability for such sequences.

3.2 New plug-in approach

3.2.1 Plug-in martingale

Let us use an estimated probability density function as the betting function $f_i(p)$. At each step the probability density function is estimated using the accumulated p-values:

$$\rho_i(p) = \rho(p_1, \ldots, p_{i-1}, p), \tag{2}$$

where $\rho(p_1, \ldots, p_{i-1}, p)$ is the estimate of the probability density function using the p-values $p_1, \ldots, p_{i-1}$ output by Algorithm 1. Substituting these betting functions into (1) we get a new martingale that we call a plug-in martingale. The martingale avoids betting if the p-values are distributed uniformly, but if there is any peak it will be used for betting.
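The power and simple mixture martingales of Section 3.1 can be computed on a logarithmic scale, which avoids overflow for long sequences. The sketch below is illustrative only: the function names are hypothetical, the integral over $\varepsilon$ is approximated by a midpoint rule on a grid (a numerical choice made here, not taken from the paper), and p-values are clamped away from zero so that the logarithm is defined.

```python
import math

TINY = 1e-300  # clamp to avoid log(0); an assumption of this sketch

def log_power_martingale(p_values, eps):
    """log M_n^eps = sum_i [ log(eps) + (eps - 1) * log(p_i) ], for eps in (0, 1]."""
    return sum(math.log(eps) + (eps - 1.0) * math.log(max(p, TINY))
               for p in p_values)

def log_simple_mixture(p_values, grid=1000):
    """log of the integral over eps in [0, 1] of M_n^eps (midpoint rule)."""
    logs = [log_power_martingale(p_values, (k + 0.5) / grid)
            for k in range(grid)]
    m = max(logs)  # log-sum-exp trick for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in logs) / grid)
```

Comparing the returned value with `math.log(20)` or `math.log(100)` corresponds to the rule-of-thumb thresholds mentioned in the introduction.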

[Figure 1: The betting functions that are used to construct the power and simple mixture martingales. The functions provide growth of the martingales for small p-values. Axes: p-values $p$ (horizontal), power martingale betting function $\varepsilon p^{\varepsilon - 1}$ (vertical), with one curve per value of $\varepsilon$.]

Estimating a probability density function. In our experiments we have used the statistical environment and language R. The density function in its stats package implements kernel density estimation with different parameters. But since p-values always lie in the unit interval, the standard methods of kernel density estimation lead to poor results for the points that are near the boundary. To get better results for the boundary points the sequence of p-values is reflected to the left from zero and to the right from one. Then the kernel density estimate is calculated using the bigger sequence

$$\{-p_i, \; p_i, \; 2 - p_i\}, \qquad i = 1, \ldots, n.$$

The estimated function is set to zero outside the unit interval and then normalized to integrate to one. For the results presented in this paper the parameters used are the Gaussian kernel and Silverman's rule of thumb for bandwidth selection. Other settings have been tried as well, but the results are comparable and lead to the same conclusions.

The values $S_n$ of the plug-in martingale can be updated recursively. Suppose computing the nonconformity scores $(\alpha_1, \ldots, \alpha_n)$ from $(z_1, \ldots, z_n)$ takes time $g(n)$ and evaluating (2) takes time $h(n)$. Then updating $S_{n-1}$ to $S_n$ takes time $O(g(n) + n + h(n))$: indeed, it is easy to see that calculating the rank of $\alpha_n$ in the multiset $\{\alpha_1, \ldots, \alpha_n\}$ takes time $\Theta(n)$.
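The density estimation step and the recursive update can be sketched in Python as follows. This is a rough re-creation under assumptions, not the authors' R code: the function names are hypothetical, the bandwidth is computed from the reflected sample with the usual form of Silverman's rule, $0.9 \min(\sigma, \mathrm{IQR}/1.34)\, m^{-1/5}$, the normalization over $[0, 1]$ is done analytically with the Gaussian distribution function, and the betting function is taken to be uniform until two p-values have been observed.

```python
import math
import statistics

def _pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def _cdf(u):
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def silverman_bandwidth(sample):
    """Silverman's rule of thumb: 0.9 * min(sd, IQR / 1.34) * m^(-1/5)."""
    sd = statistics.stdev(sample)
    q1, _, q3 = statistics.quantiles(sample, n=4)
    spread = min(sd, (q3 - q1) / 1.34)
    if spread <= 0.0:  # fallback for degenerate samples (assumption)
        spread = sd if sd > 0.0 else 1.0
    return 0.9 * spread * len(sample) ** -0.2

def plug_in_betting_function(past):
    """Density estimate on [0, 1] built from the past p-values."""
    if len(past) < 2:
        return lambda p: 1.0  # too little data: do not bet (assumption)
    # Reflect the p-values to the left of 0 and to the right of 1.
    centers = [c for q in past for c in (-q, q, 2.0 - q)]
    h = silverman_bandwidth(centers)
    # Total kernel mass falling inside [0, 1], used for normalization.
    mass = sum(_cdf((1.0 - c) / h) - _cdf(-c / h) for c in centers)

    def f(p):
        if not 0.0 <= p <= 1.0:
            return 0.0
        return sum(_pdf((p - c) / h) for c in centers) / (h * mass)

    return f

def log_plug_in_martingale(p_values):
    """log S_n, where S_i = S_{i-1} * rho_i(p_i)."""
    log_s, past = 0.0, []
    for p in p_values:
        f = plug_in_betting_function(past)  # uses p_1, ..., p_{i-1} only
        log_s += math.log(f(p))
        past.append(p)
    return log_s
```

Each step of this sketch costs time linear in the number of past p-values, so a full run is quadratic; a production implementation would likely evaluate the estimate on a grid or update it incrementally.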

The performance of the plug-in martingale on real-life datasets will be presented in Section 4. The rest of the current section proves that the plug-in martingale provides asymptotically a better growth rate than any martingale with a fixed betting function. To prove this asymptotic property of the plug-in martingale we need the following assumptions.

3.2.2 Assumptions

Consider an infinite sequence of p-values $(p_1, p_2, \ldots)$. (This is simply a deterministic sequence.) For its finite prefix $(p_1, \ldots, p_n)$ define the corresponding empirical probability measure $P_n$: for a Borel set $A$ in $\mathbb{R}$,

$$P_n(A) = \frac{\#\{i = 1, \ldots, n : p_i \in A\}}{n}.$$

We say that the sequence $(p_1, p_2, \ldots)$ is stable if there exists a probability measure $P$ on $\mathbb{R}$ such that:

1. $P_n \to P$ weakly;
2. there exists a positive continuous density function $\rho(p)$ for $P$: for any Borel set $A$ in $\mathbb{R}$, $P(A) = \int_A \rho(p) \, dp$.

Intuitively, stability means that asymptotically the sequence of p-values can be described well by a probability distribution.

Consider a sequence $(f_1(p), f_2(p), \ldots)$ of betting functions. (This is simply a deterministic sequence of functions $f_i : [0, 1] \to [0, \infty)$, although we are particularly interested in the functions $f_i(p) = \rho_i(p)$, as defined in (2).) We say that this sequence is consistent if

$$\log\bigl(f_n(p)\bigr) \to \log\bigl(\rho(p)\bigr) \quad \text{uniformly in } p \text{ as } n \to \infty.$$

Intuitively, consistency is an assumption about the algorithm that we use to estimate the function $\rho(p)$; in the limit we want a good approximation.

3.2.3 Growth rate of plug-in martingale

The following result says that, under our assumptions, the logarithmic growth rate of the plug-in martingale is better than that of any martingale with a fixed betting function (remember that by a betting function we mean any function mapping $[0, 1]$ to $[0, \infty)$).

Theorem 2. If a sequence $(p_1, p_2, \ldots) \in [0, 1]^{\infty}$ is stable and a sequence of betting functions $\bigl(f_1(p), f_2(p), \ldots\bigr)$ is consistent then, for any positive continuous betting function $f$,

$$\liminf_{n \to \infty} \left( \frac{1}{n} \sum_{i=1}^{n} \log\bigl(f_i(p_i)\bigr) - \frac{1}{n} \sum_{i=1}^{n} \log\bigl(f(p_i)\bigr) \right) \ge 0.$$

First we explain the meaning of Theorem 2 and then prove it. According to representation (1), after $n$ steps the martingale grows to

$$\prod_{i=1}^{n} f_i(p_i). \tag{3}$$

Note that if for any p-value $p \in [0, 1]$ we have $f_i(p) = 0$ then the martingale can become zero and will never change after that. Therefore, it is reasonable to consider positive $f_i(p)$. Then we can rewrite product (3) as the sum of logarithms, which gives us the logarithmic growth of the martingale:

$$\sum_{i=1}^{n} \log f_i(p_i).$$

We assume that the sequence of p-values is stable and the sequence of estimated probability density functions that is used to construct the plug-in martingale is consistent. Then the limit inequality from Theorem 2 states that the logarithmic growth rate of the plug-in martingale is asymptotically at least as high as that of any martingale with a fixed betting function (such as those suggested in previous studies). To prove Theorem 2 we will use the following lemma.

Lemma 1. For any probability density functions $\rho$ and $f$ (so that $\int_0^1 \rho(p) \, dp = 1$ and $\int_0^1 f(p) \, dp = 1$),

$$\int_0^1 \log\bigl(\rho(p)\bigr) \rho(p) \, dp \ge \int_0^1 \log\bigl(f(p)\bigr) \rho(p) \, dp.$$

Proof of Lemma 1. It is well known (Kullback, 1959, p. 14) that the Kullback–Leibler divergence is always non-negative:

$$\int_0^1 \log\left(\frac{\rho(p)}{f(p)}\right) \rho(p) \, dp \ge 0.$$

This is equivalent to the inequality asserted by Lemma 1.

Proof of Theorem 2. Suppose that, contrary to the statement of Theorem 2, there exists $\delta > 0$ such that

$$\liminf_{n \to \infty} \left( \frac{1}{n} \sum_{i=1}^{n} \log\bigl(f_i(p_i)\bigr) - \frac{1}{n} \sum_{i=1}^{n} \log\bigl(f(p_i)\bigr) \right) < -\delta. \tag{4}$$

Then choose an $\varepsilon$ satisfying $0 < \varepsilon < \delta/4$. Substituting the definition of $\rho(p)$ into Lemma 1 we obtain

$$\int \log\bigl(\rho(p)\bigr) \, P(dp) \ge \int \log\bigl(f(p)\bigr) \, P(dp). \tag{5}$$

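From the stability of $(p_1, p_2, \ldots)$ it follows that there exists a number $N_1 = N_1(\varepsilon)$ such that, for all $n > N_1$,

$$\left| \int \log\bigl(f(p)\bigr) \, P_n(dp) - \int \log\bigl(f(p)\bigr) \, P(dp) \right| < \varepsilon$$

and

$$\left| \int \log\bigl(\rho(p)\bigr) \, P_n(dp) - \int \log\bigl(\rho(p)\bigr) \, P(dp) \right| < \varepsilon.$$

The inequality (5) implies that, for all $n > N_1$,

$$\int \log\bigl(\rho(p)\bigr) \, P_n(dp) \ge \int \log\bigl(f(p)\bigr) \, P_n(dp) - 2\varepsilon.$$

By the definition of the probability measure $P_n$, the last inequality is the same thing as

$$\frac{1}{n} \sum_{i=1}^{n} \log \rho(p_i) \ge \frac{1}{n} \sum_{i=1}^{n} \log f(p_i) - 2\varepsilon. \tag{6}$$

By the consistency of $\bigl(f_1(p), f_2(p), \ldots\bigr)$ there exists a number $N_2 = N_2(\varepsilon)$ such that, for all $i > N_2$ and all $p \in [0, 1]$,

$$\bigl| \log\bigl(f_i(p)\bigr) - \log\bigl(\rho(p)\bigr) \bigr| < \varepsilon. \tag{7}$$

Let us define the number

$$M = \max_{i, p} \bigl| \log\bigl(f_i(p)\bigr) - \log\bigl(\rho(p)\bigr) \bigr|. \tag{8}$$

From (7) and (8) we have

$$\bigl| \log\bigl(f_i(p)\bigr) - \log\bigl(\rho(p)\bigr) \bigr| \le \begin{cases} M, & i \le N_2 \\ \varepsilon, & i > N_2. \end{cases} \tag{9}$$

Denote $N_3 = \max(N_1, N_2)$. Then, using (9) and (6), we obtain, for all $n > N_3$,

$$\frac{1}{n} \sum_{i=1}^{n} \log f_i(p_i) - \frac{1}{n} \sum_{i=1}^{n} \log f(p_i) \ge -3\varepsilon - \frac{M N_3}{n}.$$

Denoting $N_4 = \max\bigl(N_3, \frac{M N_3}{\varepsilon}\bigr)$, we can rewrite the last inequality as

$$\frac{1}{n} \sum_{i=1}^{n} \log f_i(p_i) - \frac{1}{n} \sum_{i=1}^{n} \log f(p_i) \ge -4\varepsilon$$

for all $n > N_4$. Finally, recalling that $\varepsilon < \delta/4$, we have, for all $n > N_4$,

$$\frac{1}{n} \sum_{i=1}^{n} \log f_i(p_i) - \frac{1}{n} \sum_{i=1}^{n} \log f(p_i) \ge -\delta.$$

This contradicts (4) and therefore completes the proof of Theorem 2.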
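Lemma 1, which drives the proof, can be illustrated numerically. The sketch below compares the limiting logarithmic growth rate $\int_0^1 \rho \log f \, dp$ obtained by betting with the true density ($f = \rho$) with the best rate achievable inside the power family. Everything specific here is an assumption made for illustration: the two-peaked density `rho`, the midpoint-rule integration, the grid of $\varepsilon$ values, and the function name `expected_log_growth`.

```python
import math

def expected_log_growth(rho, f, grid=5000):
    """Midpoint-rule approximation of the integral of rho(p) * log f(p) over [0, 1]."""
    total = 0.0
    for k in range(grid):
        p = (k + 0.5) / grid
        total += rho(p) * math.log(f(p))
    return total / grid

def rho(p):
    """Hypothetical positive continuous density with peaks near 0 and 1."""
    return 1.0 + 0.8 * math.cos(2.0 * math.pi * p)

# Growth rate when the betting function equals the limiting density.
plug_in_rate = expected_log_growth(rho, rho)

# Best growth rate over power betting functions eps * p^(eps - 1).
epsilons = [k / 100.0 for k in range(1, 101)]
best_power_rate = max(
    expected_log_growth(rho, lambda p, e=e: e * p ** (e - 1.0))
    for e in epsilons
)
```

For this density `plug_in_rate` exceeds `best_power_rate`, as Lemma 1 requires: the power family can only exploit the peak near zero, while the density itself also exploits the peak near one.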

[Figure 2: The growth of the simple mixture martingale and the plug-in martingale for the USPS dataset randomly shuffled before on-line testing. The exchangeability assumption is satisfied: the final values of the martingales are about .11. Axes: index of examples (horizontal), $\log_{10} M_n$ (vertical).]

4 Empirical results

In this section we investigate the performance of our plug-in martingale and compare it with that of the simple mixture martingale. Two real-life datasets have been tested for exchangeability: the USPS dataset and the Statlog Satellite dataset.

4.1 USPS dataset

Data. The US Postal Service (USPS) dataset consists of 7291 training examples and 2007 test examples of handwritten digits, from 0 to 9. The data were collected from real-life zip codes. Each example is described by 256 attributes, representing the pixels of a 16 × 16 gray-scale image of a digit, and its label. It is well known that the examples in this dataset are not perfectly exchangeable (Vovk et al., 2003), and any reasonable test should reject exchangeability there. In our experiments we merge the training and test sets and perform testing for the full dataset of 9298 examples.

[Figure 3: The growth of the simple mixture martingale and the plug-in martingale for the Statlog Satellite dataset randomly shuffled before on-line testing. The exchangeability assumption is satisfied: the final values of the martingales are about .1. Axes: index of examples (horizontal), $\log_{10} M_n$ (vertical).]

Figure 2 shows the typical performance of the martingales when the exchangeability assumption is satisfied for sure: all examples were randomly shuffled before the testing. Figure 4 shows the performance of the martingales when the examples arrive in the original order: first the 7291 examples of the training set and then the 2007 examples of the test set. The p-values are generated on-line by Algorithm 1 and the two martingales are calculated from the same sequence of p-values. The final value for the simple mixture martingale is $2.0 \times 10^{10}$, and the final value for the plug-in martingale is $3.9 \times 10^{8}$.

Figure 6 shows the betting functions that correspond to the plug-in martingale and the best power martingale. For the plug-in martingale, the function is the estimated probability density function calculated using the whole sequence of p-values. The betting function for the family of power martingales corresponds to the parameter $\varepsilon$ that provides the largest final value among all power martingales. It gives a clue as to why we could not see advantages of the new approach for this dataset: both martingales grew to approximately the same level. There is not much difference between the best betting functions for the old and new methods, and the new method suffers because of its greater flexibility.

[Figure 4: The growth of the simple mixture martingale and the plug-in martingale for the full USPS dataset. For the examples in the original order the exchangeability assumption is rejected: the final values of the martingales are greater than $3.8 \times 10^{8}$. Axes: index of examples (horizontal), $\log_{10} M_n$ (vertical).]

4.2 Statlog Satellite dataset

Data. The Satellite dataset (Frank & Asuncion, 2010) consists of 6435 satellite images (divided into 4435 training examples and 2000 test examples). The examples are 3 × 3 pixel sub-areas of the satellite picture, where each pixel is described by four spectral values in different spectral bands. Each example is represented by 36 attributes and a label indicating the classification of the central pixel. Labels are numbers from 1 to 7, excluding 6.

The testing results are described below. Figure 3 shows the performance of the martingales for randomly shuffled examples of the dataset. As expected, the martingales do not reject the exchangeability assumption there. Figure 5 presents the performance of the martingales when the examples arrive in the original order. The final value for the simple mixture martingale is $5.6 \times 10^{2}$ and the final value for the plug-in martingale is $1.8 \times 10^{17}$. Again, the corresponding betting functions for the plug-in martingale and the best power martingale are presented in Figure 7. For this dataset the generated p-values have a tricky distribution. The family of power betting functions $\varepsilon p^{\varepsilon - 1}$ cannot provide a good approximation. The power martingales lose on p-values close to the second peak of the p-value distribution. But the plug-in martingale is more flexible and ends up with a much higher final value.

[Figure 5: The growth of the simple mixture martingale and the plug-in martingale for the Statlog Satellite dataset. For the examples in the original order the exchangeability assumption is rejected: the final value of the simple mixture martingale is $5.6 \times 10^{2}$, and the final value of the plug-in martingale is $1.8 \times 10^{17}$. Axes: index of examples (horizontal), $\log_{10} M_n$ (vertical).]

It can be argued that both methods, old and new, work for the Satellite dataset in the sense of rejecting the exchangeability assumption at any of the commonly used thresholds (such as 20 or 100). However, the situation would have been different had the dataset consisted of only the first 1000 examples: the final value of the simple mixture martingale would have been .13 whereas the final value of the plug-in martingale would have been $3.74 \times 10^{15}$.

5 Discussion and conclusions

In this paper we have introduced a new way of constructing martingales for testing exchangeability on-line. We have shown that for stable sequences of p-values the new, more adaptive martingale provides asymptotically the best result compared with any other martingale with a fixed betting function. The experiments of testing two real-life datasets have been presented. Using the same sequence of p-values, the plug-in martingale extracts approximately the same or more information about the data-generating distribution than the previously introduced power martingales.

[Figure 6: The best betting functions for testing the USPS dataset, the examples arriving in the original order. The best power function $\varepsilon p^{\varepsilon - 1}$ is chosen according to the maximal final value of the power martingales. The betting function for the plug-in martingale is the estimated pdf. Axes: p-values (horizontal), betting function (vertical); curves: betting function from the power family, estimated pdf.]

Our goal has been to find an exchangeability martingale that does not need any assumptions about the p-values generated by the method of conformal prediction. Our proposed martingale adapts to the unknown distribution of the p-values by estimating a good betting function from the past data. This is an example of the plug-in approach. It is generally believed that the Bayesian approach is more efficient than the plug-in approach (see, e.g., Bernardo & Smith, 2000, p. 483). In our present context, the Bayesian approach would involve choosing a prior distribution on the betting functions and integrating the exchangeability martingales corresponding to these betting functions over the prior distribution. It is not clear yet whether this can be done efficiently and, if yes, whether this can improve the performance of exchangeability martingales.

[Figure 7: The best betting functions for testing the Statlog Satellite dataset, the examples arriving in the original order. The best power function $\varepsilon p^{\varepsilon - 1}$ is chosen according to the maximal final value of the power martingales. The betting function for the plug-in martingale is the estimated pdf. Axes: p-values (horizontal), betting function (vertical); curves: betting function from the power family, estimated pdf.]

References

Bernardo, José M. and Smith, Adrian F. M. Bayesian Theory. Wiley, Chichester, 2000.

Frank, A. and Asuncion, A. UCI repository, 2010. URL http://archive.ics.uci.edu/ml.

Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.

Ho, S.-S. A martingale framework for concept change detection in time-varying data streams. In Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), pp. 321–327, 2005.

Kullback, S. Information theory and statistics. Wiley, New York, 1959.

Schervish, M. J. Theory of statistics. Springer, New York, 1995.

Vapnik, V. N. Statistical learning theory. Wiley, New York, 1998.

Ville, J. Etude critique de la notion de collectif. Gauthier-Villars, Paris, 1939.

Vovk, V., Nouretdinov, I., and Gammerman, A. Testing exchangeability on-line. In Proceedings of the 20th International Conference on Machine Learning (ICML 2003), pp. 768–775, 2003.