Prof. Dr. J. Franke, All of Statistics

1.52 Binary response variables - logistic regression

Response variables assume only two values, say Y_j = 1 or Y_j = 0, called success and failure (spam detection, credit scoring, contracting an infection, ...).

Model: E Y_j = pr(Y_j = 1) = π(x_j) ∈ [0, 1]

π(x) = p(x, b) = ψ(m(x, b)), where m(x, b) = b_1 + Σ_{k=2}^d b_k f_k(x) as in linear regression, and ψ(u) is the logistic function:

ψ(u) = e^u / (1 + e^u) = 1 / (1 + e^{-u})
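The model can be sketched in a few lines of Python (the helper names and the monomial choice of regressors f_k(x) = x^(k-1) are illustrative assumptions, not part of the slides):

```python
import math

def psi(u):
    """Logistic function: psi(u) = e^u / (1 + e^u) = 1 / (1 + e^{-u})."""
    return 1.0 / (1.0 + math.exp(-u))

def p(x, b):
    """Success probability p(x, b) = psi(m(x, b)) with
    m(x, b) = b_1 + sum_{k=2}^d b_k f_k(x); here f_k(x) = x^(k-1)."""
    m = b[0] + sum(bk * x ** k for k, bk in enumerate(b[1:], start=1))
    return psi(m)
```

Note that ψ maps the whole real line into (0, 1), so p(x, b) is always a valid probability regardless of b.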
1.53 [figure] logistic function ψ(u): R → [0, 1]
1.54 Y_1, ..., Y_N independent Bernoulli (0-1) variables with parameters pr(Y_j = 1) = p(x_j, b).

Likelihood = probability of observing the data, as a function of the parameter:

L(b | Y_1, ..., Y_N) = Π_{j=1}^N p(x_j, b)^{Y_j} (1 − p(x_j, b))^{1 − Y_j}

Maximizing the log-likelihood

l(b | Y_1, ..., Y_N) = Σ_{j=1}^N [ Y_j log p(x_j, b) + (1 − Y_j) log(1 − p(x_j, b)) ]

yields the maximum likelihood estimate b̂ of b.

Quite similar to non-Gaussian linear regression, e.g. b̂ approximately normal for large N, model selection using AIC, BIC or cross-validation, ...
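A minimal numerical sketch of the ML estimate: plain gradient ascent on the log-likelihood, using the standard gradient formula grad l(b) = Xᵀ(Y − p). Step size, iteration count and the simulated data are illustrative assumptions, not the method on the slides:

```python
import numpy as np

def log_lik(b, X, Y):
    """l(b) = sum_j [Y_j log p_j + (1 - Y_j) log(1 - p_j)]."""
    p = 1.0 / (1.0 + np.exp(-X @ b))
    return np.sum(Y * np.log(p) + (1 - Y) * np.log(1 - p))

def logistic_mle(X, Y, lr=0.3, steps=5000):
    """Gradient ascent on the log-likelihood; grad l(b) = X^T (Y - p)."""
    b = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        b += lr * X.T @ (Y - p) / len(Y)   # averaged gradient step
    return b

# data simulated from pr(Y_j = 1) = psi(-3 + x_j), x_j uniform on [0, 6]
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 6.0, size=2000)
X = np.column_stack([np.ones_like(x), x])
Y = (rng.uniform(size=x.size) < 1.0 / (1.0 + np.exp(-(-3.0 + x)))).astype(float)
b_hat = logistic_mle(X, Y)
```

In practice one would use Newton's method (as glmfit does) rather than plain gradient ascent; the ascent version just makes the "maximize l(b)" step explicit.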
1.55 [figure] logistic regression: p(x) = ψ(5x − 15), x_j uniformly distributed
1.55 [figure] logistic regression, including estimated p(x)
1.56 [figure] logistic regression with overfitted estimate p̂(x) = ψ(Σ_{l=0}^3 b̂_{l+1} x^l)
1.57 Regression and classification

Classification problem: given some object belonging to one of the classes C_1, ..., C_m, decide to which one! Based on observed features ξ_1, ..., ξ_p.

class indicator: Y = k ⟺ object belongs to class C_k

class probabilities given the feature values: pr(Y = k | ξ_1, ..., ξ_p) = p_k(ξ_1, ..., ξ_p)

Bayes classifier: Ŷ = arg max_{k=1,...,m} p_k(ξ_1, ..., ξ_p) ∈ {1, ..., m}
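The Bayes classifier is just an argmax over the class probabilities. A sketch (class_probs is a hypothetical vector (p_1(ξ), ..., p_m(ξ)); classes are numbered 1, ..., m):

```python
def bayes_classify(class_probs):
    """Bayes classifier: Y_hat = arg max_{k=1,...,m} p_k(xi_1, ..., xi_p)."""
    return max(range(1, len(class_probs) + 1),
               key=lambda k: class_probs[k - 1])
```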
1.58 p_k, k = 1, ..., m are estimated from a training set (data) (Y_j, ξ_{j1}, ..., ξ_{jp}), j = 1, ..., N, which are assumed to be independent.

For m = 2 classes, e.g. logistic regression may be used:

pr(Y_j = 1 | ξ_{j1}, ..., ξ_{jp}) = p_1(ξ_{j1}, ..., ξ_{jp}, b) = ψ(m(ξ_{j1}, ..., ξ_{jp}, b))

where m(u_1, ..., u_p, b) = b_1 + Σ_{k=2}^d b_k f_k(u_1, ..., u_p), and

pr(Y_j = 2 | ξ_{j1}, ..., ξ_{jp}) = p_2(ξ_{j1}, ..., ξ_{jp}, b) = 1 − p_1(ξ_{j1}, ..., ξ_{jp}, b)

For general m: p_m = 1 − (p_1 + ... + p_{m−1}), and for l = 1, ..., m − 1

pr(Y_j = l | ξ_{j1}, ..., ξ_{jp}) = p_l(ξ_{j1}, ..., ξ_{jp}, b) = ψ(m_l(ξ_{j1}, ..., ξ_{jp}, b(l)))
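A sketch of how the last class probability is completed from the m − 1 fitted ones (the input values here are placeholders, not fitted probabilities):

```python
def full_distribution(fitted_probs):
    """Given p_1, ..., p_{m-1} from the m - 1 fitted models, complete the
    class distribution via p_m = 1 - (p_1 + ... + p_{m-1})."""
    return list(fitted_probs) + [1.0 - sum(fitted_probs)]
```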
1.59 Using qualitative information

In regression and classification, qualitative predictor variables or features appear.

Example (regression): Steel rods of various material characteristics ξ_1, ..., ξ_p (usually quantitative) and shape, e.g. quadratic, hexagonal, octagonal and circular cross section (qualitative). How does bending strength depend on, in particular, the shape?

Transform qualitative into quantitative variables using dummy variables, e.g. in the example:

quadratic: ξ_{p+1} = 0, ξ_{p+2} = 0
hexagonal: ξ_{p+1} = 0, ξ_{p+2} = 1
octagonal: ξ_{p+1} = 1, ξ_{p+2} = 0
circular: ξ_{p+1} = 1, ξ_{p+2} = 1
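The dummy coding of the example as a sketch (the dictionary and helper name are illustrative):

```python
# dummy coding (xi_{p+1}, xi_{p+2}) for the four cross-section shapes
SHAPE_DUMMIES = {
    "quadratic": (0, 0),
    "hexagonal": (0, 1),
    "octagonal": (1, 0),
    "circular":  (1, 1),
}

def add_shape_dummies(xi, shape):
    """Append the two dummy variables encoding the qualitative feature
    'shape' to the quantitative features xi_1, ..., xi_p."""
    return list(xi) + list(SHAPE_DUMMIES[shape])
```

Two dummies suffice for four categories; using four separate 0-1 indicators together with an intercept would make the design matrix rank-deficient.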
1.60 Quick MATLAB regression

Linear regression: Y_j = Σ_{k=1}^d b_k f_k(ξ_j) + Z_j

design matrix X_{j,k} = f_k(ξ_j), j = 1, ..., N, k = 1, ..., d

b = regress(Y, X) — least squares estimate

[b, bint, r, rint, stats] = regress(Y, X) — vector r of sample residuals, confidence intervals bint, rint for the coordinates of b and r (the latter for outlier detection), stats = (R², F-statistic, p-value, σ̂²)

Logistic regression: pr(Y_j = 1) = ψ(b_1 + Σ_{k=2}^d b_k f_k(ξ_j))

design matrix X_{j,k} = f_k(ξ_j), j = 1, ..., N, k = 2, ..., d

b = glmfit(X, Y, 'binomial') — ML estimate; glmfit also covers other generalized linear models
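For readers without MATLAB, a rough NumPy counterpart of the least squares part (only the first output of regress; the logistic fit would need an iterative ML solver, e.g. from scipy or statsmodels):

```python
import numpy as np

def regress(Y, X):
    """Least squares estimate b minimizing ||Y - X b||^2,
    like the first output of MATLAB's regress(Y, X)."""
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return b
```

Sanity check on exactly collinear data: for points (0, 1), (1, 3), (2, 5) the fitted line is Y = 1 + 2x.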
2.1 Design of Experiments (Versuchsplanung)

Limited amount of time/money ⇒ sample size N fixed.

Given this constraint, can we increase the quality of an estimate or the power of a test by a clever choice of observations?

Example 1: Linear regression Y_j = m(x_j, b) + Z_j, j = 1, ..., N. How to choose x_1, ..., x_N such that the mean squared estimation error of the least squares estimate b̂,

mse(b̂) = E ||b̂ − b||² = Σ_{k=1}^d var b̂_k (due to unbiasedness of least squares),

is minimal?
2.2 [figure] equidistant design on [0, 1]: x_j − x_{j−1} = 1/N
2.3 [figure] optimal design on [0, 1]: x_1 = ... = x_50 = 0, x_51 = ... = x_100 = 1
2.4 [figure] least squares regression lines for both designs and true curve
2.5 [figure] other realization
2.6 data generating mechanism: Y_j = 0.5 + 1·x_j + Z_j, with Z_1, ..., Z_N i.i.d. N(0, 1), where the same Z_j have been chosen for both designs.

For general linear regression, always E b̂ = b, covariance matrix of b̂ = σ² (XᵀX)^{−1},

mse(b̂) = σ² tr((XᵀX)^{−1}),

where tr = trace = sum of diagonal elements.

In the example:
equidistant: var b̂_1 = 0.04, var b̂_2 = 0.12, mse(b̂) = 0.16
optimal: var b̂_1 = 0.02, var b̂_2 = 0.04, mse(b̂) = 0.06
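The variance comparison can be checked numerically. A sketch, assuming σ² = 1 and the equidistant points x_j = j/N (both assumptions are consistent with the reported numbers):

```python
import numpy as np

def b_hat_variances(x, sigma2=1.0):
    """Diagonal and trace of sigma^2 (X^T X)^{-1} for the straight-line
    model Y_j = b_1 + b_2 x_j + Z_j."""
    X = np.column_stack([np.ones_like(x), x])
    C = sigma2 * np.linalg.inv(X.T @ X)   # covariance matrix of b_hat
    return np.diag(C), np.trace(C)        # (var b_1, var b_2), mse(b_hat)

x_eq = np.arange(1, 101) / 100.0                     # equidistant design on [0, 1]
x_opt = np.concatenate([np.zeros(50), np.ones(50)])  # optimal: half at 0, half at 1
```

For the optimal design, XᵀX = [[100, 50], [50, 50]], whose inverse has diagonal (0.02, 0.04), reproducing the slide's numbers exactly.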
2.7 [figure] optimal design
2.8 [figure] residual plot - no warning for model misspecification possible
2.9 [figure] equidistant design
2.10 [figure] residual plot
2.11 [figure] 90% optimal design, 10% safeguard for model misspecification
2.12 [figure] residual plot
2.13 Example 2: Classification

2 classes C_0, C_1, class indicator Y_j ∈ {0, 1}, class probabilities depending on the feature values:

pr(Y_j = 1 | ξ_{j1}, ..., ξ_{jp}) = p(ξ_{j1}, ..., ξ_{jp}) = p(ξ_j)
pr(Y_j = 0 | ξ_{j1}, ..., ξ_{jp}) = 1 − p(ξ_j)

p(ξ_j) = ψ(b_1 + Σ_{k=2}^d b_k f_k(ξ_j))

Bayes classification: Ŷ_j = 1 if p(ξ_j) > 1/2, and Ŷ_j = 0 else.

Goal: misclassification probability pr(Y_j ≠ Ŷ_j) small!

Frequently, one type of misclassification is more important, e.g. pr(Y_j ≠ Ŷ_j | Y_j = 1) small!

Problem if most ξ_j lie in {z : p(z) ≤ 1/2}.
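The Bayes rule and its asymmetric variant as a sketch: lowering the threshold below 1/2 is one simple way (an illustrative choice, not prescribed by the slides) to make pr(Y_j ≠ Ŷ_j | Y_j = 1) small at the expense of the other error type:

```python
def bayes_rule(p_xi, threshold=0.5):
    """Y_hat = 1 if p(xi) > threshold, else 0.
    threshold = 1/2 gives the Bayes classification rule."""
    return 1 if p_xi > threshold else 0
```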
2.14 [figure] logistic regression with overfitted estimate p̂(x) = ψ(Σ_{l=0}^3 b̂_{l+1} x^l)
2.15 Misclassification error probabilities

classification rule applied to 100000 new (x_i, Y_i):

                 using p̂(x)   using true p(x)
all                0.011          0.009
only Y_j = 1       0.366          0.222
only Y_j = 0       0.001          0.003

Unbalanced design favours the majority ⇒ use a more balanced design or advanced techniques to improve the function estimate in particular regions.
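The three frequencies in the table are computed as overall and class-conditional error rates. A sketch (the arrays in the check are hypothetical toy predictions, not the 100000-point experiment):

```python
def misclassification_rates(Y, Y_hat):
    """Overall and class-conditional misclassification frequencies:
    'all', 'only Y_j = 1', 'only Y_j = 0'."""
    pairs = list(zip(Y, Y_hat))
    def rate(subset):
        return sum(y != yh for y, yh in subset) / len(subset)
    return {
        "all": rate(pairs),
        "Y=1": rate([q for q in pairs if q[0] == 1]),
        "Y=0": rate([q for q in pairs if q[0] == 0]),
    }
```

The conditional rates make the imbalance visible: an overall error of about 1% can hide a 37% error on the minority class.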