Statistical foundations of machine learning


Machine learning p. 1/45 Statistical foundations of machine learning INFO-F-422 Gianluca Bontempi Département d'Informatique Boulevard de Triomphe - CP 212 http://www.ulb.ac.be/di

Machine learning p. 2/45 Testing hypothesis Hypothesis testing is the second major area of statistical inference. A statistical hypothesis is an assertion or conjecture about the distribution of one or more random variables. A test of a statistical hypothesis is a rule or procedure for deciding whether to reject the assertion on the basis of the observed data. The basic idea is to formulate a statistical hypothesis and to check whether the data provide any evidence to reject it.

Machine learning p. 3/45 A hypothesis testing problem Consider the model of the traffic in the boulevard. Suppose that the measured inter-arrival times are D_N = {10, 11, 1, 21, 2,... } seconds. Can we say that the mean inter-arrival time θ is different from 10? Consider the grades of two different school sections. Section A had {15, 10, 12, 19, 5, 7}. Section B had {14, 11, 11, 12, 6, 7}. Can we say that Section A had better grades than Section B? Consider two protein-coding genes and their expression levels in a cell. Are the two genes differentially expressed? A statistical test is a procedure that aims to answer such questions.

Machine learning p. 4/45 Types of hypothesis We start by declaring the working (basic, null) hypothesis H to be tested, in the form θ = θ_0 or θ ∈ ω ⊂ Θ, where θ_0 or ω are given. The hypothesis can be: Simple: it fully specifies the distribution of z. Composite: it partially specifies the distribution of z. Example: if D_N constitutes a random sample of size N from N(µ, σ²), the hypothesis H : µ = µ_0, σ = σ_0 (with µ_0 and σ_0 known values) is simple, while the hypothesis H : µ = µ_0 is composite since it leaves open the value of σ in (0, ∞).

Machine learning p. 5/45 Types of statistical test Suppose we have collected N samples D_N = {z_1, ..., z_N} from a distribution F_z and we have declared a null hypothesis H about F. The three most common types of statistical test are:
Pure significance test: the data D_N are used to assess the inferential evidence against H.
Significance test: the inferential evidence against H is used to judge whether H is inappropriate. In other words, it is a rule for rejecting H.
Hypothesis test: the data D_N are used to assess the hypothesis H against a specific alternative hypothesis H̄. In other words, this is a rule for rejecting H in favour of H̄.

Machine learning p. 6/45 Pure significance test Suppose that the null hypothesis H is simple. Let t(D_N) be a statistic such that the larger its value, the more it casts doubt on H. The quantity t(D_N) is called the test statistic or discrepancy measure. Let t_N = t(D_N) be the value of t calculated on the basis of the sample data D_N. Let us consider the p-value quantity p = Prob{t(D_N) > t_N | H}. If p is small, the sample data D_N are highly inconsistent with H, and p (significance probability or significance level) is the measure of such inconsistency.

Machine learning p. 7/45 Some considerations p is the proportion of situations, under the hypothesis H, where we would observe a degree of inconsistency at least as large as the one represented by t_N. t_N is the observed value of the statistic for a given D_N. Different D_N yield different values of p ∈ (0, 1). It is essential that the distribution of t(D_N) under H is known. We cannot say that p is the probability that H is true; rather, p is the probability that a dataset like D_N is observed given that H is true. Open issues: 1. What if H is composite? 2. How to choose t(D_N)?

Machine learning p. 8/45 Tests of significance Suppose that the value p is known. If p is small, either a rare event has occurred or perhaps H is not true. Idea: if p is less than some stated value α, we reject H. We choose a critical level α, we observe D_N, and we reject H at level α if Prob{t(D_N) > t_N | H} ≤ α. This is equivalent to choosing some critical value t_α and rejecting H if t_N > t_α. We obtain two regions in the space of sample data: the critical region S_0, where if D_N ∈ S_0 we reject H, and the non-critical region S_1, where the sample data D_N give us no reason to reject H on the basis of the level-α test.

Machine learning p. 9/45 Some considerations The principle is that we will accept H unless we witness some event that has a sufficiently small probability of arising when H is true. If H were true, we could still obtain data in S_0 and consequently wrongly reject H with probability Prob{D_N ∈ S_0 | H} = Prob{t(D_N) > t_α | H} = α. The significance level α provides an upper bound to the maximum probability of incorrectly rejecting H. The p-value is the probability that the test statistic is more extreme than its observed value. The p-value changes with the observed data (i.e. it is a random variable), while α is a level fixed by the user.

Machine learning p. 10/45 Standard normal distribution [Figure: standard normal density and distribution functions (µ=0, σ=1).] Remember that z_0.05 ≈ 1.645. This means that, if z ~ N(0, 1), then Prob{z ≥ z_0.05} = 0.05 and also that Prob{|z| ≥ z_0.05} = 2 · 0.05 = 0.1. For a generic z ~ N(µ, σ²): Prob{|z − µ|/σ ≥ z_0.05} = 2 · 0.05 = 0.1.
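
A quick check of these numbers in base R (a minimal sketch, not part of the original slides):

    alpha <- 0.05
    z.alpha <- qnorm(alpha, lower.tail = FALSE)   # upper alpha quantile: 1.6449
    pnorm(z.alpha, lower.tail = FALSE)            # Prob{z >= z_alpha} = 0.05
    2 * pnorm(z.alpha, lower.tail = FALSE)        # Prob{|z| >= z_alpha} = 0.1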

Machine learning p. 11/45 TP: example Let D_N consist of N independent observations of x ~ N(µ, σ²), with known variance σ². We want to test the hypothesis H : µ = µ_0, with µ_0 known. Consider as test statistic t(D_N) the quantity |ˆµ − µ_0|, where ˆµ is the sample average estimator. If H is true, we know that ˆµ ~ N(µ_0, σ²/N). Let us calculate the value t(D_N) = |ˆµ − µ_0| and assume that the rejection region is S_0 = {D_N : |ˆµ − µ_0| > t_α}. Let us put a significance level α = 10% = 0.1. This means that t_α should satisfy Prob{t(D_N) > t_α | H} = Prob{|ˆµ − µ_0| > t_α | H} = Prob{(ˆµ − µ_0 > t_α) OR (ˆµ − µ_0 < −t_α) | H} = 0.1.

Machine learning p. 12/45 TP: example (II) For a normal variable x ~ N(µ, σ²), Prob{x − µ > 1.645σ} = 1 − F_x(µ + 1.645σ) = 0.05, and consequently Prob{x − µ > 1.645σ OR x − µ < −1.645σ} = 0.05 + 0.05 = 0.1. It follows that, since ˆµ ~ N(µ_0, σ²/N) (i.e. (ˆµ − µ_0)/(σ/√N) ~ N(0, 1)), once we put t_α = 1.645 σ/√N we have Prob{|ˆµ − µ_0| > t_α | H} = 0.1, and the critical region is S_0 = {D_N : |ˆµ − µ_0| > 1.645 σ/√N}.

Machine learning p. 13/45 TP: example (III) Suppose that σ = 0.1 and that we want to test whether µ = µ_0 = 10 with a significance level of 10%. After N = 6 observations we have D_N = {10, 11, 12, 13, 14, 15}. On the basis of the dataset we compute ˆµ = (10 + 11 + 12 + 13 + 14 + 15)/6 = 12.5 and t(D_N) = |ˆµ − µ_0| = 2.5. Since t_α = 1.645 · 0.1/√6 = 0.0672 and t(D_N) > t_α, the observations D_N are in the critical region. The hypothesis is rejected.
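
The same computation can be scripted in R; the following sketch (not part of the original slides) reproduces the numbers above:

    ## Two-sided z-test with known sigma, via the critical-region approach.
    DN     <- c(10, 11, 12, 13, 14, 15)
    mu0    <- 10
    sigma  <- 0.1
    N      <- length(DN)
    mu.hat <- mean(DN)                     # 12.5
    t.obs  <- abs(mu.hat - mu0)            # 2.5
    t.alpha <- qnorm(0.05, lower.tail = FALSE) * sigma / sqrt(N)  # ~0.0672
    t.obs > t.alpha                        # TRUE: D_N is in S_0, reject H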

Machine learning p. 14/45 Hypothesis testing: types of error So far we considered a single hypothesis. Let us now consider two alternative hypotheses: H and H̄. Type I error: the error we make when we reject H although it is true. The significance level represents the probability of making a type I error. Type II error: the error we make when we accept H although it is false. In order to define this error, we are forced to declare an alternative hypothesis H̄ as a formal definition of what is meant by H being false. The probability of a type II error is the probability that the test leads to acceptance of H when in fact H̄ prevails. When the alternative hypothesis is composite, there is no unique type II error.

Machine learning p. 15/45 An analogy Consider the analogy with a murder trial, where the suspect is Mr. Bean. The null hypothesis H is "Mr. Bean is innocent". The dataset is the amount of evidence collected by the police against Mr. Bean. The Type I error is the error we make if Mr. Bean is innocent and we sentence him to death. The Type II error is the error we make if Mr. Bean is guilty and we acquit him.

Machine learning p. 16/45 Hypothesis testing Suppose we have some data {z_1, ..., z_N} drawn from a distribution F. H and H̄ represent two hypotheses about F. On the basis of the data, one is accepted and the other is rejected. Note that the two hypotheses have a different philosophical status (asymmetry): H is a conservative hypothesis, not to be rejected unless the evidence is clear. This means that a type I error is more serious than a type II error (benefit of the doubt). It is often assumed that F belongs to a parametric family F(z, θ); the test on F then becomes a test on θ. A particular example of hypothesis test is the goodness-of-fit test, where we test H : F = F_0 against H̄ : F ≠ F_0.

Machine learning p. 17/45 The five steps of hypothesis testing
1. Declare the null hypothesis (e.g. H: the student is honest) and the alternative hypothesis (H̄: the student cheated).
2. Choose the numeric value of the type I error (e.g. the risk I am willing to run).
3. Choose a procedure to obtain the test statistic (e.g. the number of similar lines).
4. Determine the critical value of the test statistic (e.g. 4 identical lines) that leads to a rejection of H. This is done in order to ensure the type I error defined in Step 2.
5. Obtain the data and determine whether the observed value of the test statistic leads to acceptance or rejection of H.

Machine learning p. 18/45 Quality of the test Suppose that N students took part in the exam: N_N did not copy, N_P copied, ˆN_N were considered not guilty and passed the exam, ˆN_P were considered guilty and rejected, F_P honest students were refused, and F_N cheating students passed.

Machine learning p. 19/45 Confusion matrix Then we have:

                                 Not refused   Refused   Total
    H: not guilty student (−)    T_N           F_P       N_N
    H̄: guilty student (+)        F_N           T_P       N_P
    Total                        ˆN_N          ˆN_P      N

F_P is the number of false positives, and the ratio F_P/N_N represents the type I error. F_N is the number of false negatives, and the ratio F_N/N_P represents the type II error.

Machine learning p. 20/45 Specificity and sensitivity Specificity: the ratio (to be maximized) SP = T_N/(F_P + T_N) = T_N/N_N = (N_N − F_P)/N_N = 1 − F_P/N_N, with 0 ≤ SP ≤ 1. It increases as the number of false positives decreases. Sensitivity: the ratio (to be maximized) SE = T_P/(T_P + F_N) = T_P/N_P = (N_P − F_N)/N_P = 1 − F_N/N_P, with 0 ≤ SE ≤ 1. It increases as the number of false negatives decreases and corresponds to the power of the test (i.e. it estimates the quantity 1 − Type II error).

Machine learning p. 21/45 Specificity and sensitivity (II) There exists a trade-off between these two quantities. In the case of a test that always returns H (e.g. a very kind professor) we have ˆN_P = 0, ˆN_N = N, F_P = 0, T_N = N_N, and SP = 1 but SE = 0. In the case of a test that always returns H̄ (e.g. a very suspicious professor) we have ˆN_P = N, ˆN_N = 0, F_N = 0, T_P = N_P, and SE = 1 but SP = 0.

Machine learning p. 22/45 False positive and false negative rate False positive rate: FPR = 1 − SP = 1 − T_N/(F_P + T_N) = F_P/(F_P + T_N) = F_P/N_N, with 0 ≤ FPR ≤ 1. It decreases as the number of false positives decreases, and it estimates the type I error. False negative rate: FNR = 1 − SE = 1 − T_P/(T_P + F_N) = F_N/(T_P + F_N) = F_N/N_P, with 0 ≤ FNR ≤ 1. It decreases as the number of false negatives decreases.

Machine learning p. 23/45 Predictive value Positive predictive value: the ratio (to be maximized) PPV = T_P/(T_P + F_P) = T_P/ˆN_P, with 0 ≤ PPV ≤ 1. Negative predictive value: the ratio (to be maximized) PNV = T_N/(T_N + F_N) = T_N/ˆN_N, with 0 ≤ PNV ≤ 1. False discovery rate: the ratio (to be minimized) FDR = F_P/(T_P + F_P) = F_P/ˆN_P = 1 − PPV, with 0 ≤ FDR ≤ 1.
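
The quantities of the last slides can be computed directly from the four counts of the confusion matrix; the following R sketch uses made-up counts for illustration (the numbers are hypothetical, not from the slides):

    TN <- 90; FP <- 10    # N_N = 100 honest students
    FN <- 5;  TP <- 15    # N_P = 20 cheating students
    SP  <- TN / (FP + TN)      # specificity: 0.90
    SE  <- TP / (TP + FN)      # sensitivity (power): 0.75
    FPR <- FP / (FP + TN)      # estimates the type I error, = 1 - SP
    FNR <- FN / (TP + FN)      # estimates the type II error, = 1 - SE
    PPV <- TP / (TP + FP)      # positive predictive value
    PNV <- TN / (TN + FN)      # negative predictive value (PNV in the slides)
    FDR <- FP / (TP + FP)      # false discovery rate, = 1 - PPV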

Machine learning p. 24/45 Receiver Operating Characteristic curve The Receiver Operating Characteristic (also known as ROC curve) is a plot of the true positive rate (i.e. sensitivity, or power) against the false positive rate (type I error) for the different possible decision thresholds of a test. Consider an example where t_+ ~ N(1, 1) and t_− ~ N(−1, 1). Suppose that the examples are classed as positive if t > THR and negative if t < THR, where THR is a threshold. If THR = −∞, all the examples are classed as positive: T_N = F_N = 0, which implies SE = T_P/N_P = 1 and FPR = F_P/(F_P + T_N) = 1. If THR = +∞, all the examples are classed as negative: T_P = F_P = 0, which implies SE = 0 and FPR = 0.

Machine learning p. 25/45 ROC curve [Figure: ROC curve, SE plotted against FPR, both ranging from 0 to 1.] R script roc.r
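
The script roc.r is not reproduced in the transcript; a plausible sketch of what it computes (an assumption, the actual script may differ) for the example t_+ ~ N(1, 1), t_− ~ N(−1, 1) is:

    ## Empirical ROC curve over a grid of decision thresholds.
    set.seed(0)
    t.pos <- rnorm(1000, mean =  1)   # test statistic for the positives
    t.neg <- rnorm(1000, mean = -1)   # test statistic for the negatives
    THR <- seq(-5, 5, by = 0.1)
    SE  <- sapply(THR, function(thr) mean(t.pos > thr))  # true positive rate
    FPR <- sapply(THR, function(thr) mean(t.neg > thr))  # false positive rate
    plot(FPR, SE, type = "l", xlab = "FPR", ylab = "SE", main = "ROC curve")
    abline(0, 1, lty = 2)   # diagonal: the ROC of a random classifier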

Machine learning p. 26/45 Choice of test The choice of test, and consequently the choice of the partition {S_0, S_1}, is based on two steps: 1. Define a significance level α, that is, the probability of a type I error Prob{reject H | H} = Prob{D_N ∈ S_0 | H} ≤ α, i.e. the probability of incorrectly rejecting H. 2. Among the set of tests {S_0, S_1} of level α, choose the test that minimizes the probability of a type II error Prob{accept H | H̄} = Prob{D_N ∈ S_1 | H̄}, i.e. the probability of incorrectly accepting H. This is equivalent to maximizing the power of the test Prob{reject H | H̄} = Prob{D_N ∈ S_0 | H̄} = 1 − Prob{D_N ∈ S_1 | H̄}, which is the probability of correctly rejecting H. The higher the power, the better!

Machine learning p. 27/45 TP example Consider a r.v. z ~ N(µ, σ²), where σ is known, and a set of N i.i.d. observations. We want to test the null hypothesis µ = µ_0 = 0, with α = 0.1. Consider the 3 critical regions S_0 (see the sketch after this list):
1. |ˆµ − µ_0| > 1.645 σ/√N
2. ˆµ − µ_0 > 1.282 σ/√N
3. |ˆµ − µ_0| < 0.126 σ/√N
For all these tests Prob{D_N ∈ S_0 | H} ≤ α, hence the significance level is the same. However, if H̄ : µ = 10, the type II error of the three tests is significantly different. What is the best one?
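
An R sketch of the power comparison. Caveats: σ and N are not fixed by the slide (the values below are illustrative only), and region 3 is read here as |ˆµ − µ_0| < 0.126 σ/√N, which also has probability ≈ 0.1 under H:

    ## Power = Prob{D_N in S_0 | H-bar: mu = 10}, with mu.hat ~ N(mu, se^2).
    sigma <- 10; N <- 10; mu0 <- 0; mu1 <- 10
    se <- sigma / sqrt(N)                      # standard error of mu.hat
    ## Test 1: |mu.hat - mu0| > 1.645 se
    p1 <- pnorm(mu0 + 1.645 * se, mu1, se, lower.tail = FALSE) +
          pnorm(mu0 - 1.645 * se, mu1, se)
    ## Test 2: mu.hat - mu0 > 1.282 se
    p2 <- pnorm(mu0 + 1.282 * se, mu1, se, lower.tail = FALSE)
    ## Test 3: |mu.hat - mu0| < 0.126 se
    p3 <- pnorm(mu0 + 0.126 * se, mu1, se) - pnorm(mu0 - 0.126 * se, mu1, se)
    c(p1, p2, p3)   # roughly 0.94, 0.97, 0.001: test 3 has almost no power

With these values the one-sided test 2 is the most powerful against H̄ : µ = 10, since it spends all of α on the tail where the alternative lies.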

Machine learning p. 28/45 TP example (II) [Figure: two sampling distributions of ˆµ, with a one-sided critical region.] On the left: distribution of the test statistic ˆµ if H : µ_0 = 0 is true. On the right: distribution of the test statistic ˆµ if H̄ : µ_1 = 10 is true. The interval marked by S_1 denotes the set of observed ˆµ values for which H is accepted (non-critical region). The interval marked by S_0 denotes the set of observed ˆµ values for which H is rejected (critical region). The area of the black pattern region on the right equals Prob{D_N ∈ S_0 | H}, i.e. the probability of rejecting H when H is true (type I error). The area of the grey shaded region on the left equals the probability of accepting H when H is false (type II error).

Machine learning p. 29/45 TP example (III) [Figure: two sampling distributions of ˆµ, with a two-sided critical region.] On the left: distribution of the test statistic ˆµ if H : µ_0 = 0 is true. On the right: distribution of the test statistic ˆµ if H̄ : µ_1 = 10 is true. The two intervals marked by S_1 denote the set of observed ˆµ values for which H is accepted (non-critical region). The interval marked by S_0 denotes the set of observed ˆµ values for which H is rejected (critical region). The area of the pattern region equals Prob{D_N ∈ S_0 | H}, i.e. the probability of rejecting H when H is true (type I error). Which area corresponds to the probability of the type II error?

Machine learning p. 30/45 Types of parametric tests Consider random variables with a parametric distribution F(·, θ). One-sample vs. two-sample: in the one-sample test we consider a single r.v. and we formulate hypotheses about its distribution. In the two-sample test we consider two r.v.s z_1 and z_2 and we formulate hypotheses about their differences/similarities. Simple vs. composite: the test is simple if H describes completely the distributions of the involved r.v.s; otherwise it is composite. Single-sided (or one-tailed) vs. two-sided (or two-tailed): in the single-sided test the region of rejection concerns only one tail of the null distribution. This means that H̄ indicates the predicted direction of the difference (e.g. H̄ : θ > θ_0). In the two-sided test, the region of rejection concerns both tails of the null distribution. This means that H̄ does not indicate the predicted direction of the difference (e.g. H̄ : θ ≠ θ_0).

Machine learning p. 31/45 Example of parametric test Consider a parametric test on the distribution of a Gaussian r.v., and suppose that the null hypothesis is H : θ = θ_0, where θ_0 is given and represents the mean. The test is one-sample and composite. In order to know whether it is one- or two-sided, we have to define the alternative configuration: if H̄ : θ < θ_0 the test is one-sided down; if H̄ : θ > θ_0 the test is one-sided up; if H̄ : θ ≠ θ_0 the test is two-sided.

Machine learning p. 32/45 z-test (one-sample and one-sided) Consider a random sample D_N from x ~ N(µ, σ²) with µ unknown and σ² known. STEP 1: Consider the null hypothesis and the alternative (composite and one-sided) H : µ = µ_0; H̄ : µ > µ_0. STEP 2: Fix the value α of the type I error. STEP 3: Choose a test statistic: if H is true, then the distribution of ˆµ is N(µ_0, σ²/N). This means that the variable z = (ˆµ − µ_0)√N / σ ~ N(0, 1). It is convenient to rephrase the test in terms of the test statistic z.

Machine learning p. 33/45 z-test (one-sample and one-sided) (II) STEP 4: Determine the critical value for z. The hypothesis H is rejected if z_N > z_α, where z_α is such that Prob{N(0, 1) > z_α} = α. Ex.: for α = 0.05 we would take z_α = 1.645, since 5% of the standard normal distribution lies to the right of 1.645. R command: z_α = qnorm(alpha, lower.tail=FALSE). STEP 5: Once the dataset D_N is measured, the value of the test statistic is z_N = (ˆµ − µ_0)√N / σ.

Machine learning p. 34/45 TP: example z-test Consider a r.v. z ~ N(µ, 1). We want to test H : µ = 5 against H̄ : µ > 5 with significance level 0.05. Suppose that the data is D_N = {5.1, 5.5, 4.9, 5.3}. Then ˆµ = 5.2 and z_N = (5.2 − 5) · 2/1 = 0.4. Since this is less than z_α = 1.645, we do not reject the null hypothesis.
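
A minimal R sketch of this one-sided z-test (not from the original slides):

    DN <- c(5.1, 5.5, 4.9, 5.3)
    mu0 <- 5; sigma <- 1; alpha <- 0.05
    zN <- (mean(DN) - mu0) * sqrt(length(DN)) / sigma   # 0.4
    z.alpha <- qnorm(alpha, lower.tail = FALSE)         # 1.645
    zN > z.alpha                                        # FALSE: do not reject H
    pnorm(zN, lower.tail = FALSE)                       # one-sided p-value ~0.345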

Machine learning p. 35/45 Two-sided parametric tests Assumption: all the variables are normal!

    Name      one/two sample   known          H              H̄
    z-test    one              σ²             µ = µ_0        µ ≠ µ_0
    z-test    two              σ_1² = σ_2²    µ_1 = µ_2      µ_1 ≠ µ_2
    t-test    one                             µ = µ_0        µ ≠ µ_0
    t-test    two                             µ_1 = µ_2      µ_1 ≠ µ_2
    χ²-test   one              µ              σ² = σ_0²      σ² ≠ σ_0²
    χ²-test   one                             σ² = σ_0²      σ² ≠ σ_0²
    F-test    two                             σ_1² = σ_2²    σ_1² ≠ σ_2²

Machine learning p. 36/45 Student's t-distribution If x ~ N(0, 1) and y ~ χ²_N are independent, then the Student's t-distribution with N degrees of freedom is the distribution of the r.v. z = x / √(y/N). We denote this with z ~ t_N. If z_1, ..., z_N are i.i.d. N(µ, σ²), then √N(ˆµ − µ) / √(ŜS/(N − 1)) = √N(ˆµ − µ)/ˆσ ~ t_{N−1}.
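
A short simulation check of this distributional result (illustrative, not from the slides):

    ## Compare simulated values of sqrt(N)(mu.hat - mu)/sigma.hat with
    ## the quantiles of a Student distribution with N-1 degrees of freedom.
    set.seed(0)
    N <- 5; mu <- 2; sigma <- 3
    Tstat <- replicate(10000, {
      z <- rnorm(N, mu, sigma)
      sqrt(N) * (mean(z) - mu) / sd(z)
    })
    qqplot(qt(ppoints(10000), df = N - 1), Tstat,
           xlab = "t quantiles (N-1 df)", ylab = "simulated T")
    abline(0, 1)   # points near the diagonal support T ~ t_{N-1}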

Machine learning p. 37/45 t-test: one-sample and two-sided Consider a random sample from N(µ, σ²) with σ² unknown. Let H : µ = µ_0; H̄ : µ ≠ µ_0. Let t(D_N) = T = √N(ˆµ − µ_0) / √( (1/(N − 1)) Σ_{i=1}^N (z_i − ˆµ)² ) = (ˆµ − µ_0)√N / ˆσ be a statistic computed using the data set D_N.

Machine learning p. 38/45 t-test: one-sample and two-sided (II) It can be shown that, if the hypothesis H holds, T ~ T_{N−1}, i.e. it is a r.v. with a Student distribution with N − 1 degrees of freedom. The size-α t-test consists in rejecting H if |T| > k = t_{α/2,N−1}, where t_{α/2,N−1} is the upper α/2 point of a T-distribution on N − 1 degrees of freedom, i.e. Prob{t_{N−1} > t_{α/2,N−1}} = α/2, where t_{N−1} ~ T_{N−1}. In other terms, H is rejected when |T| is large. R command: t_{α/2,N−1} = qt(alpha/2, N-1, lower.tail=FALSE).

Machine learning p. 39/45 TP example Does jogging lead to a reduction in pulse rate? Eight non-jogging volunteers engaged in a one-month jogging programme. Their pulses were taken before and after the programme:

    pulse rate before   74  86  98  102  78  84  79  70
    pulse rate after    70  85  90  110  71  80  69  74
    decrease             4   1   8   -8   7   4  10  -4

Suppose that the decreases are samples from N(µ, σ²) for some unknown σ². We want to test H : µ = µ_0 = 0 against H̄ : µ ≠ 0 with a significance level α = 0.05. We have N = 8, ˆµ = 2.75, T = 1.263, t_{α/2,N−1} = 2.365. Since |T| ≤ t_{α/2,N−1}, the data are not sufficient to reject the hypothesis H. In other terms, we do not have enough evidence to show that there is a reduction in pulse rate.
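
The same conclusion can be obtained with R's built-in t.test; a sketch using the data above:

    before <- c(74, 86, 98, 102, 78, 84, 79, 70)
    after  <- c(70, 85, 90, 110, 71, 80, 69, 74)
    decrease <- before - after            # 4 1 8 -8 7 4 10 -4
    t.test(decrease, mu = 0)              # t = 1.263, df = 7, p-value ~ 0.25
    ## Equivalently: t.test(before, after, paired = TRUE)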

Machine learning p. 40/45 The chi-squared distribution For a positive integer N, a r.v. z has a χ²_N distribution if z = x_1² + ... + x_N², where x_1, x_2, ..., x_N are i.i.d. random variables ~ N(0, 1). The probability distribution is a gamma distribution with parameters (N/2, 1/2); E[z] = N and Var[z] = 2N. The distribution is called a chi-squared distribution with N degrees of freedom.

Machine learning p. 41/45 χ²-test: one-sample and two-sided Consider a random sample from N(µ, σ²) with µ known. Let H : σ² = σ_0²; H̄ : σ² ≠ σ_0². Let ŜS = Σ_i (z_i − µ)². It can be shown that, if H is true, then ŜS/σ_0² ~ χ²_N. The size-α χ²-test rejects H if ŜS/σ_0² < a_1 or ŜS/σ_0² > a_2, where Prob{ŜS/σ_0² < a_1} + Prob{ŜS/σ_0² > a_2} = α. If µ is unknown, you must 1. replace µ with ˆµ in the quantity ŜS, and 2. use a χ²_{N−1} distribution.
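
A sketch of this test in R, with hypothetical data and the common convention (an assumption, since the slide leaves the split of α open) of giving each tail probability α/2:

    z <- c(9.8, 10.3, 9.5, 10.1, 10.4)   # hypothetical sample
    mu <- 10; sigma2.0 <- 0.1; alpha <- 0.05
    N  <- length(z)
    SS <- sum((z - mu)^2)                              # SS-hat with known mu
    a1 <- qchisq(alpha / 2, df = N)                    # lower critical value
    a2 <- qchisq(alpha / 2, df = N, lower.tail = FALSE)  # upper critical value
    (SS / sigma2.0 < a1) || (SS / sigma2.0 > a2)       # FALSE: do not reject H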

Machine learning p. 42/45 t-test: two-sample, two-sided Consider two r.v.s x ~ N(µ_1, σ²) and y ~ N(µ_2, σ²) with the same variance. Let D_N^x and D_M^y be two independent sets of samples. We want to test H : µ_1 = µ_2 against H̄ : µ_1 ≠ µ_2. Let ˆµ_x = (Σ_{i=1}^N x_i)/N, SS_x = Σ_{i=1}^N (x_i − ˆµ_x)², ˆµ_y = (Σ_{i=1}^M y_i)/M, SS_y = Σ_{i=1}^M (y_i − ˆµ_y)². Once we define the statistic T = (ˆµ_x − ˆµ_y) / √( (1/M + 1/N) · (SS_x + SS_y)/(M + N − 2) ) ~ T_{M+N−2}, it can be shown that a test of size α rejects H if |T| > t_{α/2,M+N−2}.
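
In R this pooled test is computed by t.test with var.equal = TRUE (matching the equal-variance assumption above); a sketch reusing the section grades from p. 3/45:

    x <- c(15, 10, 12, 19, 5, 7)    # Section A
    y <- c(14, 11, 11, 12, 6, 7)    # Section B
    t.test(x, y, var.equal = TRUE)  # tests H: mu1 = mu2 against mu1 != mu2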

Machine learning p. 43/45 F-distribution Let x ~ χ²_M and y ~ χ²_N be two independent r.v.s. A r.v. z has an F-distribution F_{M,N} with M and N degrees of freedom if z = (x/M)/(y/N). If z ~ F_{M,N}, then 1/z ~ F_{N,M}. If z ~ t_N, then z² ~ F_{1,N}.

Machine learning p. 44/45 F-distribution [Figure: density and cumulative distribution of F_{M,N} with M = 20, N = 10.] R script s_f.r.

Machine learning p. 45/45 F-test: two-sample, two-sided Consider a random sample x_1, ..., x_M from N(µ_1, σ_1²) and a random sample y_1, ..., y_N from N(µ_2, σ_2²), with µ_1 and µ_2 unknown. Suppose we want to test H : σ_1² = σ_2²; H̄ : σ_1² ≠ σ_2². Let us consider the statistic f = ˆσ_1²/ˆσ_2² = (ŜS_1/(M − 1)) / (ŜS_2/(N − 1)) ~ (σ_1² χ²_{M−1}/(M − 1)) / (σ_2² χ²_{N−1}/(N − 1)) = (σ_1²/σ_2²) F_{M−1,N−1}. It can be shown that, if H is true, the ratio f has an F-distribution F_{M−1,N−1}. We reject H if the ratio f is large, i.e. f > F_{α,M−1,N−1}, where Prob{z > F_{α,M−1,N−1}} = α if z ~ F_{M−1,N−1}.
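
R's var.test computes the ratio f = ˆσ_1²/ˆσ_2² and the F_{M−1,N−1} reference distribution; a sketch on synthetic data (note that var.test is two-sided by default, whereas the rejection rule above is one-sided):

    set.seed(0)
    x <- rnorm(20, sd = 1)    # M = 20 samples from N(mu1, sigma1^2)
    y <- rnorm(10, sd = 2)    # N = 10 samples from N(mu2, sigma2^2)
    var.test(x, y)            # reports f, df = (M-1, N-1) and the p-value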