Comment on "On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes"


Jing-Hao Xue (jinghao@stats.gla.ac.uk) and D. Michael Titterington (mike@stats.gla.ac.uk)

Department of Statistics, University of Glasgow, Glasgow G12 8QQ, UK

Abstract. Comparison of generative and discriminative classifiers is an everlasting topic. As an important contribution to this topic, based on their theoretical and empirical comparisons between the naïve Bayes classifier and linear logistic regression, Ref. [6] claimed that there exist two distinct regimes of performance between the generative and discriminative classifiers with regard to the training-set size. In this paper, our empirical and simulation studies, as a complement of their work, suggest that the existence of the two distinct regimes may not be so reliable. In addition, for real-world datasets, so far there is no theoretically correct, general criterion for choosing between the discriminative and the generative approaches to the classification of an observation x into a class y; the choice depends on the relative confidence we have in the correctness of the specification of either p(y|x) or p(x, y) for the data. This demonstrates to some extent why Ref. [3] and [7] prefer normal-based linear discriminant analysis (LDA) when no model mis-specification occurs, whereas other empirical studies may prefer linear logistic regression instead. Furthermore, we suggest that the pairing of either LDA assuming a common diagonal covariance matrix (LDA-Λ) or the naïve Bayes classifier with linear logistic regression may not be perfect, and hence it may not be reliable to generalise to all generative and discriminative classifiers any claim derived from the comparison between LDA-Λ or the naïve Bayes classifier and linear logistic regression.
Keywords: Asymptotic relative efficiency; Discriminative classifiers; Generative classifiers; Logistic regression; Normal-based discriminant analysis; Naïve Bayes classifier

© 2008 Kluwer Academic Publishers. Printed in the Netherlands.

Xue-Titterington-SCH2008.tex; 5/08/2008; 22:19; p.1

Abbreviations: LDA/QDA Normal-based Linear/Quadratic Discriminant Analysis; AIC Akaike Information Criterion; GAM Generalised Additive Model

1. Introduction

Classification is a ubiquitous problem tackled in statistics, machine learning, pattern recognition and data mining [4]. Generative classifiers, also termed the sampling paradigm [2], such as normal-based discriminant analysis and the naïve Bayes classifier, model the joint distribution p(x, y) of the measured features x and the class labels y, factorised in the form p(x|y)p(y), and learn the model parameters through maximisation of the likelihood given by p(x|y)p(y). Discriminative classifiers, also termed the diagnostic paradigm [2], such as logistic regression, model the conditional distribution p(y|x) of the class labels given the features, and learn the model parameters through maximisation of the conditional likelihood based on p(y|x). Comparison of generative and discriminative classifiers is an everlasting topic [3, 6, 7, 10, 12]. Results from such comparisons, in particular in terms of misclassification rates, can not only guide the selection of an appropriate classifier, either generative or discriminative, but also shed light on how to exploit the best of both worlds, and the topic has thus been attracting long-standing interest from both researchers and practitioners.
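The contrast between the two estimation principles can be sketched side by side. The following is a minimal Python illustration (the paper's experiments use R; the one-dimensional data, the learning rate and the iteration count here are arbitrary choices for the sketch, not the paper's settings): the generative fit maximises the joint likelihood p(x|y)p(y) in closed form, while the discriminative fit maximises the conditional likelihood p(y|x) by gradient ascent.

```python
import math, random

random.seed(0)

# Two classes in 1-D: class 1 ~ N(1, 1), class 2 ~ N(-1, 1), equal priors.
data = [(random.gauss(1.0, 1.0), 1) for _ in range(200)] + \
       [(random.gauss(-1.0, 1.0), 2) for _ in range(200)]
random.shuffle(data)

def fit_generative(data):
    # Maximise the joint likelihood p(x|y)p(y): closed-form estimates of
    # the class priors, class means and a pooled variance.
    xs = {k: [x for x, y in data if y == k] for k in (1, 2)}
    pi = {k: len(xs[k]) / len(data) for k in (1, 2)}
    mu = {k: sum(xs[k]) / len(xs[k]) for k in (1, 2)}
    var = sum((x - mu[y]) ** 2 for x, y in data) / len(data)  # pooled
    # The implied log-odds log{p(y=1|x)/p(y=2|x)} is linear: beta0 + beta*x.
    beta = (mu[1] - mu[2]) / var
    beta0 = math.log(pi[1] / pi[2]) - (mu[1] ** 2 - mu[2] ** 2) / (2 * var)
    return beta0, beta

def fit_discriminative(data, steps=2000, lr=0.05):
    # Maximise the conditional likelihood p(y|x) directly:
    # gradient ascent on the logistic log-likelihood.
    beta0 = beta = 0.0
    n = len(data)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in data:
            t = 1.0 if y == 1 else 0.0
            p = 1.0 / (1.0 + math.exp(-(beta0 + beta * x)))
            g0 += (t - p)
            g1 += (t - p) * x
        beta0 += lr * g0 / n
        beta += lr * g1 / n
    return beta0, beta

def error_rate(params, data):
    beta0, beta = params
    wrong = sum(1 for x, y in data if (1 if beta0 + beta * x > 0 else 2) != y)
    return wrong / len(data)

gen = fit_generative(data)
dis = fit_discriminative(data)
print(error_rate(gen, data), error_rate(dis, data))
```

On this well-specified toy problem both fits approach the same linear decision boundary; the two paradigms differ in which likelihood they optimise, not in the form of the resulting classifier.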

An important contribution to this topic is from Ref. [6], which presented some theoretical and empirical comparisons between linear logistic regression and the naïve Bayes classifier. The results in [6] suggested that, between the two classifiers, there were two distinct regimes of discriminant performance with respect to the training-set size. More precisely, they proposed that the discriminative classifier has a lower asymptotic error rate, while the generative classifier may approach its (higher) asymptotic error rate much faster. In other words, the discriminative classifier performs better with larger training sets, while the generative classifier does better with smaller training sets. However, Ref. [3] and [7] presented some theoretical and simulation studies showing that normal-based linear discriminant analysis (LDA), a generative classifier, has better asymptotic efficiency (i.e., performs better with larger training sets) when no model mis-specification occurs. Our empirical and simulation studies, as presented in this paper, suggest that it may not be so reliable to claim such an existence of the two distinct regimes. Furthermore, we suggest that the pairing of either LDA assuming a common diagonal covariance matrix Λ (denoted by LDA-Λ hereafter) or the naïve Bayes classifier with linear logistic regression may not be perfect, and hence it may not be reliable to generalise to all generative and discriminative classifiers any claim derived from the comparison between LDA-Λ or the naïve Bayes classifier and linear logistic regression.

2. Setting for Comparison

2.1. Setting used by Ref. [6]

The setting for the theoretical proof and empirical evidence in [6] includes a binary class label y, e.g., y ∈ {1, 2}, a p-dimensional feature vector x and the assumption of conditional independence amongst x|y, the features within a class. The naïve Bayes classifier, a generative classifier defined as in equation (4) in Section 2.2, assumes statistically independent features x within classes y and thus diagonal covariance matrices within classes. By contrast, linear logistic regression, a discriminative classifier defined as in equation (1), need not assume such conditional independence of the components of x. Both classifiers can be applied to discrete, continuous or mixed-valued features x. In the case of discrete features x, each feature x_i, i = 1, ..., p, independently of the other features within x, is assumed within a class to be a binomial variable such that its value x_i ∈ {0, 1}. However, this may not guarantee that the discriminant function λ(α) = log{p(y = 1|x)/p(y = 2|x)} of the naïve Bayes classifier, where α is a parameter vector, is linear. As linear logistic regression uses a linear discriminant function, the naïve Bayes classifier may not be a partner of linear logistic regression as a generative–discriminative pair (see Section 2.2 for more discussion of this pairing). In the case of continuous features x, x|y is assumed to follow Gaussian distributions with equal covariance matrices across the two classes, i.e., Σ₁ = Σ₂, and, in view of the conditional independence assumption,

both covariance matrices are equal to a diagonal matrix Λ. Some algebra shows that, under the assumption of a common diagonal covariance matrix Λ for normally distributed data, the naïve Bayes method is equivalent to LDA-Λ (defined as equation (2)), and, under the assumption of unequal diagonal within-class covariance matrices, it is equivalent to quadratic discriminant analysis. For the experiments in [6], all of the observed values of the features are rescaled so that x_i ∈ [0, 1]. Based on such a setting, Ref. [6] compared two so-called generative–discriminative pairs: one for the continuous case, comparing LDA assuming a common diagonal covariance matrix (LDA-Λ) vs. linear logistic regression, and the other for the discrete case, comparing the naïve Bayes classifier vs. linear logistic regression. We shall next make some comments on these pairings.

2.2. On the Pairing of LDA-Λ/Naïve Bayes and Linear Logistic Regression/GAM

As mentioned in Section 2.1, first, the naïve Bayes classifier cannot guarantee the linear form of the discriminant function λ(α) = log{p(y = 1|x)/p(y = 2|x)}, and, secondly, the conditional independence amongst the multiple features within a class is a necessary condition for the validity of the naïve Bayes classifier and LDA-Λ but not for linear logistic regression, although in the latter the discriminant function λ(α) is modelled as a linear combination of separate features. Therefore, the comparison between a generative–discriminative pair of LDA-Λ/naïve Bayes classifier vs. linear logistic regression should be interpreted with caution, in particular when the data do not support the assumption of conditional

independence of x|y, which may shed unfavourable light on the simplified generative version, LDA-Λ and the naïve Bayes classifier. In this section, we illustrate two such generative–discriminative pairs: one is LDA-Λ vs. linear logistic regression [6], and the other is the naïve Bayes classifier vs. the generalised additive model (GAM) [10].

2.2.1. LDA-Λ vs. Linear Logistic Regression

Consider a feature vector x = (x₁, ..., x_p)ᵀ and a binary class label y = 1, 2. Linear logistic regression, one of the discriminative classifiers that do not assume any distribution p(x|y) for the data, is modelled directly with a linear discriminant function

λ_dis(α) = log{p(y = 1|x)/p(y = 2|x)} = log{π₁ p(x|y = 1)} − log{π₂ p(x|y = 2)} = β₀ + βᵀx,   (1)

where p(y = k) = π_k, αᵀ = (β₀, βᵀ) and β is a parameter vector of p elements. By linear, we mean a scalar-valued function of a linear combination of the features x₁, ..., x_p of an observed feature vector x. By contrast, LDA-Λ, one of the generative classifiers, assumes that the data arise from two p-variate normal distributions with different means but the same diagonal covariance matrix, such that (x|y = k; θ) ~ N(μ_k, Λ), k = 1, 2, where θ = (μ_k, Λ); this implies an assumption of conditional independence between any two features x_i|y and x_j|y, i ≠ j, within a class. The density function of (x|y = k; θ) can be written as

p(x|y = k; θ) = {e^{μ_kᵀΛ⁻¹x}} {(1/√((2π)^p |Λ|)) e^{−½ μ_kᵀΛ⁻¹μ_k}} {e^{−½ xᵀΛ⁻¹x}},

which leads to a linear discriminant function,

λ_gen(α) = log{p(y = 1|x)/p(y = 2|x)} = log(π₁/π₂) + log{A(θ₁, η)/A(θ₂, η)} + (θ₁ − θ₂)ᵀx,   (2)

where θ_kᵀ = μ_kᵀΛ⁻¹, η = Λ⁻¹ and A(θ_k, η) = (1/√((2π)^p |Λ|)) e^{−½ μ_kᵀΛ⁻¹μ_k}. Similarly, by assuming that the data arise from two p-variate normal distributions with different means but the same full covariance matrix, such that (x|y = k; θ) ~ N(μ_k, Σ), k = 1, 2, we can obtain the same formula as λ_gen(α) but with θ_kᵀ = μ_kᵀΣ⁻¹, η = Σ⁻¹ and A(θ_k, η) = (1/√((2π)^p |Σ|)) e^{−½ μ_kᵀΣ⁻¹μ_k}, which leads to the linear discriminant function of LDA with a common full covariance matrix Σ (denoted by LDA-Σ hereafter). Therefore, we could rewrite θ as θ = (θ_k, η), where θ_k is a class-dependent parameter vector while η is a common parameter vector across the classes. It is clear that the assumption of conditional independence amongst the features within a class is not a necessary condition for a generative classifier to attain a linear λ_gen(α). In fact, as pointed out by [7], if the feature vector x follows a multivariate exponential family distribution with the density or probability mass function within a class being

p(x|y = k; θ_k) = e^{θ_kᵀx} A(θ_k, η) h(x, η),   k = 1, 2,

the generative classifier will attain a linear λ_gen(α).

2.2.2. Naïve Bayes vs. Generalised Additive Model (GAM)

As with logistic regression, a GAM does not assume any distribution p(x|y) for the data; it is modelled directly with a discriminant function that is a sum of p functions f(x_i), i = 1, ..., p, of the p features x_i

separately [10]; that is,

λ_dis(α) = log{p(y = 1|x)/p(y = 2|x)} = log(π₁/π₂) + Σᵢ₌₁ᵖ f(x_i).   (3)

Meanwhile, along with the assumption of the distribution of (x|y), a fundamental assumption underlying the naïve Bayes classifier is that of conditional independence amongst the p features within a class, so that the joint probability is p(x|y) = ∏ᵢ₌₁ᵖ p(x_i|y). It follows that the discriminant function λ(α) is

λ_gen(α) = log{p(y = 1|x)/p(y = 2|x)} = log(π₁/π₂) + Σᵢ₌₁ᵖ log{p(x_i|y = 1)/p(x_i|y = 2)}.   (4)

It is clear, as pointed out by [10], that the naïve Bayes classifier is a specialised case of a GAM, with f(x_i) = log{p(x_i|y = 1)/p(x_i|y = 2)}. Furthermore, GAMs may not necessarily assume conditional independence. One sufficient condition that leads to another specialised case of a GAM (we call it Q-GAM) is that p(x|y) = q(x) ∏ᵢ₌₁ᵖ q(x_i|y), where q(x) is common across the classes but cannot be further factorised into a product of functions of individual features as ∏ᵢ₌₁ᵖ q(x_i). In such a case, the assumption of conditional independence between x_i|y and x_j|y, i ≠ j, is invalid but we still have f(x_i) = log{q(x_i|y = 1)/q(x_i|y = 2)}, where q(x_i|y) is different from the marginal probability p(x_i|y) that is used by the naïve Bayes classifier.
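The additive structure in equation (4), namely that the naïve Bayes log-odds decompose into a sum of per-feature terms f(x_i), can be verified directly. A small Python check with hypothetical within-class Bernoulli parameters (illustrative values, not from the paper):

```python
import math

# Hypothetical within-class Bernoulli parameters for p = 3 binary features:
# theta[k][i] = p(x_i = 1 | y = k); equal class priors pi_1 = pi_2.
theta = {1: [0.8, 0.3, 0.6], 2: [0.4, 0.7, 0.5]}
pi = {1: 0.5, 2: 0.5}

def p_xi(k, i, xi):
    return theta[k][i] if xi == 1 else 1.0 - theta[k][i]

def joint_log_odds(x):
    # log{pi_1 p(x|y=1) / (pi_2 p(x|y=2))} with p(x|y) = prod_i p(x_i|y).
    def logjoint(k):
        return math.log(pi[k]) + sum(math.log(p_xi(k, i, xi))
                                     for i, xi in enumerate(x))
    return logjoint(1) - logjoint(2)

def additive_log_odds(x):
    # Equation (4): log(pi_1/pi_2) + sum_i f(x_i),
    # with f(x_i) = log{p(x_i|y=1)/p(x_i|y=2)}.
    f = [math.log(p_xi(1, i, xi) / p_xi(2, i, xi)) for i, xi in enumerate(x)]
    return math.log(pi[1] / pi[2]) + sum(f)

for x in [(0, 0, 0), (1, 0, 1), (1, 1, 1)]:
    assert abs(joint_log_odds(x) - additive_log_odds(x)) < 1e-12
```

The check confirms that, under conditional independence, the joint log-odds and the GAM-style additive form coincide exactly, which is the sense in which the naïve Bayes classifier is a special case of a GAM.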

2.2.3. Summary

First, the conditional independence amongst the features within a class is a necessary condition for the naïve Bayes classifier and LDA-Λ, but it is not a necessary condition for linear logistic regression. Therefore, the generative–discriminative pair of LDA with a common full covariance matrix Σ (LDA-Σ) vs. linear logistic regression also merits investigation. Secondly, given the parity between λ_gen(α) and λ_dis(α), and thus between the two pairs LDA-Σ vs. linear logistic regression and Q-GAM vs. GAM in terms of classification: neither classifier in these pairs assumes conditional independence of x|y amongst the features within a class, which is an elementary assumption underlying LDA-Λ and the naïve Bayes classifier. Therefore, it may not be reliable to generalise to all generative and discriminative classifiers any claim that is derived from the comparison between LDA-Λ or the naïve Bayes classifier and linear logistic regression. Thirdly, a comparison of quadratic normal discriminant analysis (QDA) with unequal diagonal matrices Λ₁ and Λ₂ (denoted by QDA-Λ_g hereafter) or with unequal full covariance matrices Σ₁ and Σ₂ (denoted by QDA-Σ_g hereafter) against quadratic logistic regression may provide an interesting extension of the work of [6].

2.3. Our implementation

Ref. [6] reported experimental results on 15 real-world datasets, 8 with only continuous and binary features and 7 with only discrete features, from the UCI machine learning repository [1]; this repository stores

more than 100 datasets contributed and widely used by the machine learning community as a benchmark for empirical studies of machine learning approaches. As pointed out in [6], there were a few cases (2 out of the 8 continuous cases and 4 out of the 7 discrete cases) that did not support the better asymptotic performance of the discriminative classifier, primarily because of the lack of sufficiently large training sets. However, it is known that the performance of a classifier varies to some extent with the features selected, and a generally valid empirical evaluation of classifiers is always an important but difficult problem [4]. In this context, we first replicate experiments on these 15 datasets, with and without stepwise variable selection being performed on the full linear logistic regression model using all the observations of each dataset. In the stepwise variable selection process, the decision to include or exclude a variable is based on the Akaike information criterion (AIC). Furthermore, in the 8 continuous cases, both LDA-Λ and LDA-Σ are compared with linear logistic regression. We then extend the comparison to QDA vs. quadratic logistic regression for the 8 continuous UCI datasets, and finally to simulated continuous datasets. The implementations in R of LDA and QDA are rewritten from a Matlab function cda for classical linear and quadratic discriminant analysis [13]. Logistic regression is implemented by the R function glm from the standard R package stats, and the naïve Bayes classifier is implemented by the R function naiveBayes from the contributed R package e1071. In addition, similarly to what was done by [6], for each sampled training-set size m, we perform 1000 random splits of each dataset into

a training set of size m and a test set of size N − m, where N is the number of observations in the whole dataset, and report the average of the misclassification rates over these 1000 test sets. The training set is required to have at least 1 observation from each of the two classes and, for discrete datasets, to have all the levels of the features presented by the training observations; otherwise the prediction for the test set may be asked to predict on some new levels for which no information has been provided in the training process. In order to have all the coefficients of the predictor variables in the model estimated in our implementation of logistic regression by glm, the number m of training observations should be larger than the number p′ of predictor variables, where p′ = p for the continuous cases if all p features are used for the linear model. More attention should be paid to the discrete cases with multinomial features in the model, where more dummy variables have to be used as the predictor variables, with the consequence that p′ could be much larger than p, e.g., p′ = 3p for the linear model if all the features have 4 levels. In other words, although we may report misclassification rates for logistic regression with small m, it is not reliable for us to base any general claim on those for m smaller than p′, the actual number of predictor variables used by the logistic regression model.
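The split-and-average evaluation protocol described above can be sketched as follows. This is an illustrative Python version (the paper's implementation is in R; the toy data, the stand-in nearest-class-mean classifier, and the reduced number of splits are hypothetical choices for the sketch):

```python
import random

random.seed(1)

# Toy dataset standing in for a UCI dataset: (feature, class) pairs.
dataset = [(random.gauss(1, 1), 1) for _ in range(60)] + \
          [(random.gauss(-1, 1), 2) for _ in range(60)]

def train_and_classify(train, x):
    # Stand-in classifier: assign x to the nearer class mean.
    mu = {k: sum(f for f, y in train if y == k) /
             sum(1 for _, y in train if y == k) for k in (1, 2)}
    return 1 if abs(x - mu[1]) < abs(x - mu[2]) else 2

def average_error(dataset, m, n_splits=200):
    # Repeated random splits into a training set of size m and a test set
    # of size N - m; splits lacking an observation from each class are
    # redrawn, mirroring the constraint described in the text.
    errors = []
    while len(errors) < n_splits:
        random.shuffle(dataset)
        train, test = dataset[:m], dataset[m:]
        if len({y for _, y in train}) < 2:
            continue
        wrong = sum(1 for x, y in test if train_and_classify(train, x) != y)
        errors.append(wrong / len(test))
    return sum(errors) / n_splits

e4 = average_error(dataset, m=4)
e40 = average_error(dataset, m=40)
print(e4, e40)
```

Plotting such averaged error rates against m yields curves of the kind compared throughout Sections 3 and 4.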

3. Linear/Quadratic Discrimination for Empirical Datasets

3.1. Linear Discrimination for Continuous Datasets

For the continuous datasets, as was done by [6], all the multinomial features are removed, so that only continuous and binary features x_i are kept, and their values x_i are rescaled into [0, 1]. Any observation with missing features is removed from the datasets, as is any feature with only a single value for all the observations. In addition, as Gaussian distributions and equal within-class covariance matrices are assumed for x|y by LDA-Λ and LDA-Σ, testing such assumptions can help the interpretation of the classification performance of the relevant classifiers. Therefore, before carrying out the classification, we perform the Shapiro–Wilk test of within-class normality for each feature x_i|y [11] and Levene's test of homogeneity of variance across the two classes [5]. For the datasets discussed below, the significance level is set at 0.05, and we observe that the null hypotheses of normality and homogeneity of variance are mostly rejected by the tests at that significance level. A brief description of the continuous datasets can be found in Table I, which lists, for each dataset, the total number N₀ of observations, the number N of observations that we use after the pre-processing mentioned above, the total number p of continuous or binary features, the number p_AIC of features selected by AIC, the number p_SW of features for which the null hypotheses were rejected by the Shapiro–Wilk test and the corresponding number p_L for Levene's test, the indicator 1_{2R-Λ} ∈ {1, 0} of whether or not the two regimes

are observed between LDA-Λ and linear logistic regression, and the indicator 1_{2R-Σ} ∈ {1, 0} with regard to LDA-Σ.

Table I. Description of continuous datasets (columns N₀, N, p, p_AIC, p_SW, p_L, 1_{2R-Λ} and 1_{2R-Σ} for the datasets Pima, Adult, Boston, Optdigits 0 vs. 1, Optdigits 2 vs. 3, Ionosphere, Liver disorders and Sonar).

Note that, for some large datasets such as Adult (and Sick in Section 3.3), in order to reduce the computational complexity without degrading the validity of the comparison between the classifiers, we randomly sample observations with the class prior probabilities kept unchanged. Our results are shown in Figure 1. Since with variable selection by AIC the results conform more to the claim of two regimes made by [6], we show such results only if they are different from those without variable selection. Meanwhile, in the figures hereafter we use the same annotation of the vertical and horizontal axes and the same line types as those in [6]. For the reason given at the end of Section 2.3, Figure 1 is only drawn for m > p′, with the intercept in λ(α) taken into account. In general, our study of these continuous datasets suggests the following conclusions. First, in the comparison of LDA-Λ vs. linear logistic regression, the pattern of our results can be said to be similar to that of [6].
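The linear form of the LDA-Λ discriminant in equation (2), with β = Λ⁻¹(μ₁ − μ₂), can be checked numerically. A minimal Python sketch (the paper's experiments use R; all parameter values here are hypothetical):

```python
import math

# Hypothetical 2-D Gaussian class-conditionals with a common diagonal
# covariance Lambda, as assumed by LDA-Lambda.
mu1, mu2 = (1.0, 0.5), (-1.0, 0.0)
lam = (1.0, 2.0)          # diagonal entries of Lambda
pi1, pi2 = 0.6, 0.4

def log_density(x, mu):
    # log N(x; mu, diag(lam)) evaluated coordinate-wise.
    return sum(-0.5 * math.log(2 * math.pi * l) - (xi - mi) ** 2 / (2 * l)
               for xi, mi, l in zip(x, mu, lam))

def log_odds(x):
    # log{p(y=1|x)/p(y=2|x)} computed from the generative model.
    return math.log(pi1 / pi2) + log_density(x, mu1) - log_density(x, mu2)

# Closed form from equation (2): beta0 + beta^T x,
# with beta = Lambda^{-1}(mu1 - mu2).
beta = [(m1 - m2) / l for m1, m2, l in zip(mu1, mu2, lam)]
beta0 = math.log(pi1 / pi2) - sum((m1 ** 2 - m2 ** 2) / (2 * l)
                                  for m1, m2, l in zip(mu1, mu2, lam))

for x in [(0.0, 0.0), (1.5, -2.0), (-0.3, 4.0)]:
    linear = beta0 + sum(b * xi for b, xi in zip(beta, x))
    assert abs(log_odds(x) - linear) < 1e-12
```

The quadratic terms in the two log-densities cancel because the covariance matrix is common across classes, which is exactly what makes the generative discriminant linear and comparable to linear logistic regression.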

Figure 1. Plots of misclassification rate vs. training-set size m (averaged over 1000 random training/test set splits) for the continuous UCI datasets, with regard to linear discrimination. Panels: pima; adult; boston; optdigits 0 vs. 1 (AIC); optdigits 2 vs. 3 (AIC); ionosphere (AIC); liver disorders; sonar (AIC). Each panel compares LDA-Λ, LDA-Σ and linear logistic regression.

Secondly, the performance of LDA-Σ is worse than that of LDA-Λ when the training-set size m is small, but better than that of the latter when m is large. Thirdly, the performance of LDA-Σ is better than that of linear logistic regression when m is small, but is more or less comparable with that of the latter when m is large. Fourthly, pre-processing with variable selection can reveal the distinction in performance between generative and discriminative classifiers with fewer training observations. Therefore, considering LDA-Λ vs. linear logistic regression, there is strong evidence to support the claim that the discriminative classifier has a lower asymptotic error rate while the generative classifier may approach its (higher) asymptotic error rate much faster. However, considering LDA-Σ vs. linear logistic regression, the evidence is not so strong, although the claim may still be made.

3.2. Quadratic Discrimination on Continuous Datasets

As a natural extension of the comparison between LDA-Λ (with a common diagonal covariance matrix Λ across the two classes), LDA-Σ (with a common full covariance matrix Σ) and linear logistic regression presented in Section 3.1, this section presents the comparison between QDA-Λ_g (with two unequal diagonal covariance matrices Λ₁ and Λ₂), QDA-Σ_g (with two unequal full covariance matrices Σ₁ and Σ₂) and quadratic logistic regression. Using the 8 continuous UCI datasets, all the settings are the same as those in Section 3.1 except for the following aspects.

First, considering that in the quadratic logistic regression model there are p(p − 1)/2 interaction terms between the features in a p-dimensional feature space (a large number of interactions when the dimensionality p is high), the model is constrained to contain only the intercept, the p features and their p squared terms, so as to make the estimation of the model more feasible and interpretable. Secondly, for the same reason as explained at the end of Section 2.3, in the reported plots of misclassification rate vs. m without variable selection, only the results for m > 2p are reliable for comparison, since there are 2p predictor variables in the quadratic logistic regression model. Hence, only the results for m > 2p are shown in Figure 2. Thirdly, the datasets are randomly split into training sets and test sets 100 times rather than 1000 times for each sampled training-set size m, because of the higher computational complexity of the quadratic models compared with that of the linear models. In general, our study of these continuous datasets, as shown in Figure 2, suggests quite similar conclusions to those in Section 3.1, through substituting QDA-Λ_g for LDA-Λ, QDA-Σ_g for LDA-Σ, and quadratic logistic regression for linear logistic regression.

3.3. Linear Discrimination on Discrete Datasets

For the discrete datasets, as was done by [6], all the continuous features are removed and only the discrete features are used. The results are entitled multinomial in the following figures if a dataset includes multinomial features, and otherwise are entitled binomial. Meanwhile,

Figure 2. Plots of misclassification rate vs. training-set size m (averaged over 100 random training/test set splits) for the continuous UCI datasets, with regard to quadratic discrimination. Panels: pima (AIC); adult; boston (AIC); optdigits 0 vs. 1 (AIC); optdigits 2 vs. 3 (AIC); ionosphere (AIC); liver disorders; sonar (AIC). Each panel compares QDA with Σ₁ = Λ₁, Σ₂ = Λ₂, QDA with full Σ₁, Σ₂, and quadratic logistic regression.
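The constrained quadratic logistic model of Section 3.2 amounts to a simple design-matrix expansion: each observation contributes its p features plus their p squared terms, giving 2p predictors (hence the m > 2p requirement). A minimal sketch (Python for illustration; the paper's implementation is in R):

```python
def quadratic_design(x):
    # x: list of p feature values -> [x_1, ..., x_p, x_1^2, ..., x_p^2].
    # Cross-terms x_i * x_j (i != j) are deliberately omitted, as in the
    # constrained model of Section 3.2.
    return list(x) + [xi ** 2 for xi in x]

def n_full_quadratic_terms(p):
    # For comparison: an unconstrained quadratic model would add
    # p(p - 1)/2 interaction terms on top of the 2p kept here.
    return 2 * p + p * (p - 1) // 2

row = quadratic_design([0.2, 0.5, 1.0])
assert len(row) == 6
assert row[3:] == [0.2 ** 2, 0.5 ** 2, 1.0 ** 2]
assert n_full_quadratic_terms(3) == 9
```

Fitting ordinary linear logistic regression on the expanded design then yields the quadratic decision boundary, while keeping the parameter count at 2p + 1 rather than growing quadratically in p.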

any observation with missing features is removed from the datasets, as is any feature with only a single value for all the observations.

Table II. Description of discrete datasets (columns N₀, N, p, p′, p_AIC, p′_AIC and 1_{2R-NB} for the datasets Promoters, Lymphography, Breast cancer, Voting records, Lenses, Sick and Adult).

A brief description of the discrete datasets can be found in Table II, which includes the indicator 1_{2R-NB} ∈ {1, 0} of whether or not the two regimes are observed between the naïve Bayes classifier and linear logistic regression. Our results are shown in Figure 3 for some m > p′ or m > p′_AIC, with dummy variables taken into account for the multinomial features. In general, our study of these discrete datasets suggests that, in the comparison of the naïve Bayes classifier vs. linear logistic regression, the pattern of our results can be said to be similar to that of [6].

4. Linear Discrimination on Simulated Datasets

In this section, 16 simulated datasets are used to compare the performance of LDA-Λ, LDA-Σ and linear logistic regression. The samples are simulated from bivariate normal distributions, bivariate Student's t-distributions, bivariate log-normal distributions and mixtures of two

Figure 3. Plots of misclassification rate vs. training-set size m (averaged over 1000 random training/test set splits) for the discrete UCI datasets, with regard to linear discrimination. Panels: promoters (multinomial, AIC); lymphography (multinomial, AIC); breast cancer (multinomial); voting records (binomial, AIC); lenses hard+soft vs. no (multinomial); sick (multinomial, AIC); adult (multinomial). Each panel compares the naïve Bayes classifier and linear logistic regression.
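The dummy-variable accounting behind p′ is mechanical: a multinomial feature with L levels requires L − 1 indicator variables in the logistic model, so p four-level features yield p′ = 3p predictors, as noted in Section 2.3. A minimal encoding sketch (Python for illustration; the level names are hypothetical):

```python
def dummies(value, levels):
    # Indicator coding of one multinomial value against the first level
    # as the baseline: L levels -> L - 1 dummy variables.
    return [1 if value == lev else 0 for lev in levels[1:]]

levels = ["a", "b", "c", "d"]            # a hypothetical 4-level feature
assert dummies("a", levels) == [0, 0, 0]  # baseline level: all zeros
assert dummies("c", levels) == [0, 1, 0]
assert len(dummies("d", levels)) == len(levels) - 1

# p four-level features therefore contribute 3p predictor variables.
p = 7
p_prime = p * (len(levels) - 1)
assert p_prime == 3 * p
```

This is why, for the discrete datasets, misclassification rates reported at small m are only trustworthy once m exceeds p′ rather than p.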

bivariate normal distributions, with 4 datasets for each of these 4 types of distribution. Within each dataset there are 1000 simulated samples, which are divided equally into 2 classes. The simulations from the bivariate log-normal distributions and normal mixtures are based on the R function mvrnorm for simulating from a multivariate normal distribution, from the contributed R package MASS, and the simulation from the bivariate Student's t-distribution is implemented by the R function rmvt from the contributed R package mvtnorm. Differently from the UCI datasets, the simulated data are not rescaled into the range [0, 1], and no variable selection is used since the feature space is only of dimension two.

4.1. Normally Distributed Data

Four simulated datasets are randomly generated from two bivariate normal distributions, N(μ₁, Σ₁) and N(μ₂, Σ₂), where μ₁ = (1, 0)ᵀ, μ₂ = (−1, 0)ᵀ and Σ₁ and Σ₂ are subject to four different types of constraint: equal diagonal or equal full covariance matrices (Σ₁ = Σ₂), and unequal diagonal or unequal full covariance matrices (Σ₁ ≠ Σ₂). Similarly to what was done for the UCI datasets, for each sampled training-set size m, we perform 1000 random splits of the 1000 samples of each simulated dataset into a training set of size m and a test set of size 1000 − m, and report the average misclassification rates over these 1000 test sets. The training set is required to have at least 1 sample from each of the two classes. In such a way, LDA-Λ and LDA-Σ are

compared with linear logistic regression, in terms of misclassification rate, with the results shown in Figure 4.

Figure 4. Plots of misclassification rate vs. training-set size m (averaged over 1000 random training/test set splits) for simulated bivariate normally distributed data for two classes. Panels: Normal: Σ₁ = Λ, Σ₂ = Λ; Normal: Σ₁ = Σ, Σ₂ = Σ, Σ ≠ Λ; Normal: Σ₁ = Λ₁, Σ₂ = Λ₂, Σ₁ ≠ Σ₂; Normal: Σ₁ ≠ Λ₁, Σ₂ ≠ Λ₂, Σ₁ ≠ Σ₂. Each panel compares LDA-Λ, LDA-Σ and linear logistic regression.

The dataset for the top-left panel of Figure 4 has Σ₁ = Σ₂ = Λ with the diagonal matrix Λ = Diag(1, 1), such that the data satisfy the assumptions underlying LDA-Λ. The dataset for the top-right panel has Σ₁ = Σ₂ = Σ with the full matrix Σ = (1, 0.5; 0.5, 1), such that the data satisfy the assumptions underlying LDA-Σ. The dataset for the bottom-left panel has Σ₁ = Λ₁, Σ₂ = Λ₂ with the diagonal matrices Λ₁ = Diag(1, 1) and Λ₂ = Diag(0.25, 0.75), such that the homogeneity of the covariance matrices is violated. The dataset for the bottom-right panel has

unequal full matrices Σ₁ and Σ₂, such that both the homogeneity of the covariance matrices and the conditional independence (uncorrelatedness) of the features within a class are violated.

4.2. Student's t-distributed Data

Four simulated datasets are randomly generated from two bivariate Student's t-distributions, both with ν = 3 degrees of freedom. The values of the class means μ₁ and μ₂, the four types of constraint on Σ₁ and Σ₂, and the other settings of the experiments are all the same as those in Section 4.1. The results are shown in Figure 5, where for each panel the constraint on Σ₁ and Σ₂ is the same as the corresponding one in Figure 4, except for a scalar multiplier ν/(ν − 2).

4.3. Log-normally Distributed Data

Four simulated datasets are randomly generated from two bivariate log-normal distributions, whose logarithms are normally distributed as N(μ₁, Σ₁) and N(μ₂, Σ₂), respectively. The values of μ₁ and μ₂, the four types of constraint on Σ₁ and Σ₂, and the other settings of the experiments are all the same as those in Section 4.1. By definition, if a p-variate random vector x ~ N(μ(x), Σ(x)), then the p-variate vector x̃ of the exponentials of the components of x follows a p-variate log-normal distribution, i.e., x̃ = exp(x) ~ log-N(μ(x̃), Σ(x̃)), where the i-th element μ^(i)(x̃) of the mean vector and the (i, j)-th

Figure 5. Plots of misclassification rate vs. training-set size m (averaged over 1000 random training/test set splits) for simulated bivariate Student's t-distributed data for two classes. Panels: Student's t: Σ₁ = Λ, Σ₂ = Λ; Student's t: Σ₁ = Σ, Σ₂ = Σ, Σ ≠ Λ; Student's t: Σ₁ = Λ₁, Σ₂ = Λ₂, Σ₁ ≠ Σ₂; Student's t: Σ₁ ≠ Λ₁, Σ₂ ≠ Λ₂, Σ₁ ≠ Σ₂. Each panel compares LDA-Λ, LDA-Σ and linear logistic regression.

element Σ^(i,j)(x̃) of the covariance matrix, i, j = 1, ..., p, are

μ^(i)(x̃) = e^{μ^(i)(x) + Σ^(i,i)(x)/2},
Σ^(i,j)(x̃) = (e^{Σ^(i,j)(x)} − 1) e^{μ^(i)(x) + μ^(j)(x) + (Σ^(i,i)(x) + Σ^(j,j)(x))/2}.

It follows that, if the components of its logarithm x are independent and normally distributed, the components of the log-normally distributed multivariate random variable x̃ are uncorrelated. In other words, if x ~ N(μ(x), Λ(x)), then x̃ = exp(x) ~ log-N(μ(x̃), Λ(x̃)). However, as shown by the equations above, Λ(x̃) is determined by both μ(x) and Λ(x), so that Σ₁(x) = Σ₂(x) may not mean Σ₁(x̃) = Σ₂(x̃).
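The two consequences just drawn from the moment formulas can be checked directly: a diagonal Λ(x) produces uncorrelated log-normal components, whereas a common Σ(x) with unequal class means produces unequal Σ(x̃). A small Python check of the displayed formulas (illustrative parameter values, not the paper's):

```python
import math

def lognormal_moments(mu, Sigma):
    # Mean vector and covariance matrix of exp(x) for x ~ N(mu, Sigma),
    # following the displayed formulas.
    p = len(mu)
    mean = [math.exp(mu[i] + Sigma[i][i] / 2) for i in range(p)]
    cov = [[(math.exp(Sigma[i][j]) - 1)
            * math.exp(mu[i] + mu[j] + (Sigma[i][i] + Sigma[j][j]) / 2)
            for j in range(p)] for i in range(p)]
    return mean, cov

# Diagonal Lambda(x) gives uncorrelated log-normal components
# (exp(0) - 1 = 0 kills every off-diagonal entry):
_, cov = lognormal_moments([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
assert cov[0][1] == 0.0

# Equal Sigma(x) across classes but unequal means mu_1 != mu_2
# gives unequal covariance matrices for the log-normal data:
S = [[1.0, 0.5], [0.5, 1.0]]
_, cov1 = lognormal_moments([1.0, 0.0], S)
_, cov2 = lognormal_moments([-1.0, 0.0], S)
assert cov1 != cov2
```

This is exactly why, for the log-normal datasets, the equal-vs-unequal covariance distinction in the underlying normal distributions is less meaningful than the diagonal-vs-full distinction.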

Therefore, if we consider that in our cases µ1 ≠ µ2, it can be expected that the pattern of performance of the classifiers for the datasets with equal covariance matrices Σ1 = Σ2 in the underlying normal distributions will be similar to that for the datasets with unequal covariance matrices Σ1 ≠ Σ2, since in both cases the covariance matrices of the log-normally distributed variables are in fact unequal. In this context, it makes more sense to compare the classifiers in situations with diagonal and full covariance matrices of the underlying normally distributed data, respectively, rather than in situations with equal and unequal covariance matrices.

[Figure 6: four panels, one per covariance constraint for the log-normal data (Σ1 = Λ, Σ2 = Λ; Σ1 = Σ2 = Σ, Σ ≠ Λ; Σ1 = Λ1, Σ2 = Λ2, Σ1 ≠ Σ2; Σ1 ≠ Λ1, Σ2 ≠ Λ2, Σ1 ≠ Σ2), each comparing the LDAs with linear logistic regression.]

Figure 6. Plots of misclassification rate vs. training-set size (averaged over 1000 random training/test set splits) for simulated bivariate log-normally distributed data for two classes.

The results are shown in Figure 6, where for each panel the constraint with regard to Σ1 and Σ2 is the same as the corresponding one in Figure 4.

4.4. Normal Mixture Data

Compared with the normal, Student's t- and log-normal distributions used in Sections 4.1, 4.2 and 4.3 for the comparison of the classifiers, a mixture of normal distributions is a better approximation to real data in a variety of situations. In this section, four simulated datasets, each consisting of 1000 samples, are randomly generated from two mixtures, each of two bivariate normal distributions, with 250 samples from each mixture component. The two components, A and B, of the mixture for Class 1 are normally distributed as N(µ1A, Σ1) and N(µ1B, Σ1), respectively, where µ1A = (1, 0)T and µ1B = (3, 0)T; the two components, C and D, of the mixture for Class 2 are normally distributed as N(µ2C, Σ2) and N(µ2D, Σ2), respectively, where µ2C = (−1, 0)T and µ2D = (−3, 0)T. In this way, when Σ1 and Σ2 are subject to the four different types of constraint discussed previously, the covariance matrices of the two mixtures will be subject to the same constraints. The other settings of the experiments are all the same as those in Section 4.1. The results are shown in Figure 7, where for each panel the constraint with regard to Σ1 and Σ2 is the same as the corresponding one in Figure 4.
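The sampling scheme just described can be sketched as follows (our own illustration, assuming NumPy; the function and variable names are ours, and a common diagonal covariance Λ is used, i.e., the first of the four constraints):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_mixture_class(means, Sigma, n_per_component, rng):
    """Draw n_per_component points from each N(mean, Sigma) component."""
    parts = [rng.multivariate_normal(m, Sigma, size=n_per_component) for m in means]
    return np.vstack(parts)

# Constraint of the first panel: Sigma1 = Sigma2 = Lambda (common, diagonal).
Lambda = np.diag([1.0, 1.0])

X1 = sample_mixture_class([(1.0, 0.0), (3.0, 0.0)], Lambda, 250, rng)    # Class 1: A, B
X2 = sample_mixture_class([(-1.0, 0.0), (-3.0, 0.0)], Lambda, 250, rng)  # Class 2: C, D

X = np.vstack([X1, X2])        # 1000 samples in total
y = np.repeat([1, 2], 500)
print(X.shape, y.shape)
```

Since each class is an equal-weight mixture of two components, the overall class means sit at (2, 0)T and (−2, 0)T, and the class covariance matrices inherit the constraint imposed on Σ1 and Σ2.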

[Figure 7: four panels, one per covariance constraint for the normal mixture data (Σ1 = Λ, Σ2 = Λ; Σ1 = Σ2 = Σ, Σ ≠ Λ; Σ1 = Λ1, Σ2 = Λ2, Σ1 ≠ Σ2; Σ1 ≠ Λ1, Σ2 ≠ Λ2, Σ1 ≠ Σ2), each comparing the LDAs with linear logistic regression.]

Figure 7. Plots of misclassification rate vs. training-set size (averaged over 1000 random training/test set splits) for simulated bivariate 2-component normal mixture data for two classes.

4.5. Summary of Linear Discrimination for Simulated Datasets

In general, our study of these simulated continuous datasets suggests the following conclusions.

First, when the data are consistent with the assumptions underlying LDA-Λ or LDA-Σ, as shown in the top-left and top-right panels of Figure 4, both methods can perform best among the LDAs and linear logistic regression, throughout the range of training-set sizes in our study. In these cases, there is no evidence to support the claim that the discriminative classifier has a lower asymptotic rate while the

generative classifier may approach its (higher) asymptotic rate much faster.

Secondly, when the data violate the assumptions underlying the LDAs, linear logistic regression generally performs better than the LDAs, in particular when m is large. This pattern is especially clear in Figure 6 for the log-normally distributed data, whose distributions are heavy-tailed, asymmetric and thus in some sense less Gaussian than the Student's t- and normal-mixture distributions in our experiments. In this case, there is strong evidence to support the claim that the discriminative classifier has a lower asymptotic rate, but no convincing evidence to support the claim that the generative classifier may approach its (higher) asymptotic rate much faster.

Finally, when the covariance matrices are non-diagonal, LDA-Σ performs remarkably better than LDA-Λ, and the more so when m is large; when the covariance matrices are diagonal, LDA-Λ generally performs better than LDA-Σ, and again the more so when m is large.

5. Comments on the Two Regimes of Performance regarding Training-Set Size

Based on their theoretical analysis and empirical comparison between LDA-Λ or the naïve Bayes classifier and linear logistic regression, Ref. [6] claim that there are two distinct regimes of performance with regard to the training-set size. However, our empirical results, as shown in Tables I and II, could not convincingly support this claim. Furthermore, our simulation studies, as presented in Section 4, failed to

find the two regimes when the data either conformed to the assumptions underlying the generative classifiers, as shown in Figure 4, or heavily violated those assumptions, as shown in Figure 6. Therefore, besides commenting on the pairing of the compared classifiers in Section 2.2, we shall clarify the claim further by commenting on the reliability of the two regimes.

Suppose we have a training set {(y(i)tr, x(i)tr)}m i=1 of m independent observations and a test set {(y(i)te, x(i)te)}N i=1 of N independent observations, where x(i) = (x(i)1, ..., x(i)p)T is the i-th observed p-variate feature vector and y(i) ∈ {1, 2} is its observed univariate class label. Let us also assume that each observation (y(i), x(i)) follows an identical distribution, so that testing based on the training results makes sense. In order to simplify the notation, let xtr denote {x(i)tr}m i=1, and similarly define xte, ytr and yte. Meanwhile, a discriminant function λ(α) = log{p(y = 1|x)/p(y = 2|x)}, which is equivalent to a Bayes classifier ŷ(x) = argmaxy p(y|x), is used for the 2-class classification.

5.1. For Discriminative Classifiers

Discriminative classifiers estimate the parameter α of the discriminant function λ(α) through α̂ = argmaxα p(ytr|xtr, α), the maximisation of a conditional probability; such an estimation procedure can be regarded as a kind of maximum likelihood estimation with p(ytr|xtr, α) as the likelihood function. It is well known that, if the 0-1 loss function is used, so that the misclassification rate is the total risk, the Bayes classifiers will attain the minimum rate [9]. This implies that, under such a loss function, the discriminative classifiers are in fact

using the same criterion to optimise both the estimation of the parameter α and the performance of classification. In this context, the following claims, supported by the simulation study in Section 4, can be proposed.

First, if the same dataset is used to train and to test, i.e., with xtr as xte and ytr as yte, then the discriminative classifiers should always provide the best performance, no matter how large the training-set size m is, provided that the 0-1 loss function is used and the modelling of p(y|x, α), such as the linearity of λ(α), is correctly specified for all the observations; the only work that then remains is to estimate the parameter α accurately.

Secondly, if m is large enough to make (ytr, xtr) representative of all the observations, including (yte, xte), then the discriminative classifiers should also provide the best prediction performance on (yte, xte), i.e., the best asymptotic performance, provided that the modelling of p(y|x, α) is correctly specified for all the observations.

Finally, if m is not large enough to make (ytr, xtr) representative of all the observations, and (yte, xte) is not exactly the same as (ytr, xtr), then the discriminative classifiers may not necessarily provide the best prediction performance on (yte, xte), even though the modelling of p(y|x, α) may be correct.

5.2. For Generative Classifiers

Generative classifiers estimate the parameter α of the discriminant function λ(α) by first maximising a joint probability function, i.e., θ̂ = argmaxθ p(ytr, xtr|θ), to obtain a maximum likelihood estimate

(MLE) θ̂ of θ, the parameter of the joint distribution of (y, x), and then calculating α̂ as the function α(θ) evaluated at θ̂. Under some regularity conditions, such as the existence of the first and second derivatives of the log-likelihood function and of the inverse of the Fisher information matrix I(θ), the MLE θ̂ is asymptotically unbiased, efficient and normally distributed. Accordingly, by the delta method, α̂ is also asymptotically normally distributed, unbiased and efficient, given the existence of the first derivative of the function α(θ). Therefore, the following claims, supported by the simulation study in Section 4, can be proposed.

First, asymptotically, the generative classifiers will provide the best prediction performance on (yte, xte), dependent on the premise that the modelling of p(y, x|θ), instead of p(y|x, α), is correctly specified for all the observations.

Secondly, if m is large enough to make (ytr, xtr) representative of all the observations, including (yte, xte), then the generative classifiers should also provide the best prediction performance on (yte, xte), i.e., the best asymptotic performance, given that the modelling of p(y, x|θ), instead of p(y|x, α), is correctly specified for all the observations.

Finally, if m is not large enough to make (ytr, xtr) representative of all the observations, then the generative classifiers may not necessarily provide the best prediction performance on (yte, xte).
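The two estimation routes can be made concrete with a small sketch (our own toy illustration, assuming NumPy; the data, step size and iteration count are ours, not the paper's). Both procedures end in the same form of linear discriminant λ(x) = α0 + α1x1 + α2x2: the discriminative route maximises the conditional likelihood p(ytr|xtr, α) directly by gradient ascent, while the generative route first obtains the MLE θ̂ = (π̂, µ̂1, µ̂2, Σ̂) of the joint normal model and then plugs it into α(θ).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class Gaussian data with a common covariance (m = 400 in total).
mu1, mu2, Sigma = np.array([1.0, 0.0]), np.array([-1.0, 0.0]), np.eye(2)
X = np.vstack([rng.multivariate_normal(mu1, Sigma, 200),
               rng.multivariate_normal(mu2, Sigma, 200)])
y = np.repeat([1.0, 0.0], 200)              # 1 encodes class 1, 0 encodes class 2

# Discriminative route: maximise p(y_tr | x_tr, alpha) by gradient ascent
# on the logistic log-likelihood (a plain linear logistic regression fit).
Z = np.hstack([np.ones((len(y), 1)), X])    # prepend an intercept column
alpha_disc = np.zeros(3)
for _ in range(4000):
    p = 1.0 / (1.0 + np.exp(-np.clip(Z @ alpha_disc, -30, 30)))
    alpha_disc += 0.1 * Z.T @ (y - p) / len(y)

# Generative route: MLE of theta = (pi, mu1, mu2, Sigma) for the joint model,
# then alpha(theta): the normal-based LDA plug-in discriminant coefficients.
pi1 = y.mean()
m1_hat, m2_hat = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
R1, R2 = X[y == 1] - m1_hat, X[y == 0] - m2_hat
S = (R1.T @ R1 + R2.T @ R2) / len(y)        # pooled MLE covariance
w = np.linalg.solve(S, m1_hat - m2_hat)
b = -0.5 * (m1_hat + m2_hat) @ w + np.log(pi1 / (1.0 - pi1))
alpha_gen = np.concatenate([[b], w])

print(alpha_disc, alpha_gen)                # similar, but not identical, estimates
```

On correctly specified normal data the two estimates of α agree asymptotically; on finite samples they differ, which is exactly where the efficiency and robustness arguments of Sections 5.1 and 5.2 apply.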

5.3. Summary

In summary, it may not be so reliable to claim the existence of two distinct regimes of performance between the generative and discriminative classifiers with regard to the training-set size. For real-world datasets such as those considered in Sections 3.1 and 3.3, so far there is no theoretically correct, general criterion for choosing between the discriminative and the generative classifiers; the choice depends on the relative confidence we have in the correctness of the specification of either p(y|x) or p(y, x) for the data. This can to some extent explain why Refs. [3] and [7] prefer LDA when no model mis-specification occurs but other empirical studies may prefer linear logistic regression instead.

Ref. [6] provided a theoretical proof that the discriminative classifiers need Ω(p) (i.e., M1 p, where M1 > 0) training observations to approach their asymptotic rate with high probability, whereas the generative classifiers need only Ω(log p) (i.e., M2 log p, where M2 > 0) training observations. We observe the following. First, for two distinct regimes to occur, it is necessary that M1 p ≫ M2 log p. Secondly, such higher efficiency of the generative classifiers might also be attained because of the bias induced by model mis-specification, such as using LDA-Λ or the naïve Bayes classifier in cases for which it would be better to adopt LDA-Σ or QDA-Σg. For real-world data, application of such a mis-specified model is likely; the bias-variance tradeoff may then play a role in determining the occurrence of the two distinct regimes.
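This bias-variance interaction is easy to exhibit in a small simulation in the spirit of Section 4 (our own toy setting, assuming NumPy; it is not one of the experiments above). The within-class features are correlated, so the diagonal model behind LDA-Λ is mis-specified: its plug-in discriminant is low-variance but biased, while logistic regression fitted by gradient ascent carries no such bias.

```python
import numpy as np

rng = np.random.default_rng(1)

Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])          # correlated within-class features
mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])

def sample(m):
    X = np.vstack([rng.multivariate_normal(mu1, Sigma, m // 2),
                   rng.multivariate_normal(mu2, Sigma, m // 2)])
    return X, np.repeat([1.0, 0.0], m // 2)

def fit_logreg(X, y, iters=3000, lr=0.5):           # discriminative fit
    Z = np.hstack([np.ones((len(y), 1)), X])
    a = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-np.clip(Z @ a, -30, 30)))
        a += lr * Z.T @ (y - p) / len(y)
    return a

def fit_lda_diag(X, y):                             # generative fit, diagonal model
    m1, m2 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    v = (((X[y == 1] - m1) ** 2).sum(axis=0)
         + ((X[y == 0] - m2) ** 2).sum(axis=0)) / len(y)
    w = (m1 - m2) / v
    return np.concatenate([[-0.5 * (m1 + m2) @ w], w])

def error(a, X, y):
    Z = np.hstack([np.ones((len(y), 1)), X])
    return float(((Z @ a > 0) != (y == 1)).mean())

results = {}
for m in (20, 400):
    e_log, e_diag = [], []
    for _ in range(30):
        Xtr, ytr = sample(m)
        Xte, yte = sample(2000)
        e_log.append(error(fit_logreg(Xtr, ytr), Xte, yte))
        e_diag.append(error(fit_lda_diag(Xtr, ytr), Xte, yte))
    results[m] = (float(np.mean(e_log)), float(np.mean(e_diag)))
print(results)   # {m: (logistic error, LDA-Lambda error)}
```

At m = 400 the mis-specification bias dominates and logistic regression wins clearly; at m = 20 the gap is governed by the tradeoff between LDA-Λ's low variance and its bias, which is the effect suggested above.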

In addition, a similar pattern of two distinct regimes with regard to m was also reported in Ref. [8], based on the performance of logistic regression and tree induction; they found that logistic regression performs better with smaller m and tree induction with larger m. Therefore, although tree induction and logistic regression are not a pair of generative and discriminative classifiers, it could be interesting to explore such a pattern for other pairs of classifiers.

Acknowledgements

The authors thank Andrew Y. Ng for communication about the implementation of the empirical studies in this paper. We are grateful to the associate editor, the anonymous referees and David J. Hand for advice about the structure, references, experimental illustration and interpretation of this manuscript. The work also benefited from our participation in the Research Programme on Statistical Theory and Methods for Complex, High-Dimensional Data at the Isaac Newton Institute for Mathematical Sciences in Cambridge.

References

1. Asuncion, A. and D. J. Newman: 2007, UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. mlearn/MLRepository.html.
2. Dawid, A. P.: 1976, Properties of diagnostic data distributions. Biometrics 32(3).

3. Efron, B.: 1975, The Efficiency of Logistic Regression Compared to Normal Discriminant Analysis. Journal of the American Statistical Association 70(352).
4. Hand, D. J.: 2006, Classifier technology and the illusion of progress (with discussion). Statistical Science 21.
5. Lim, T.-S. and W.-Y. Loh: 1996, A comparison of tests of equality of variances. Computational Statistics & Data Analysis 22(3).
6. Ng, A. Y. and M. I. Jordan: 2001, On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In: NIPS.
7. O'Neill, T. J.: 1980, The general distribution of the rate of a classification procedure with application to logistic regression discrimination. Journal of the American Statistical Association 75(369).
8. Perlich, C., F. Provost, and J. S. Simonoff: 2003, Tree induction vs. logistic regression: a learning-curve analysis. Journal of Machine Learning Research 4.
9. Ripley, B. D.: 1996, Pattern Recognition and Neural Networks. New York: Cambridge University Press.
10. Rubinstein, Y. D. and T. Hastie: 1997, Discriminative vs. informative learning. In: KDD.
11. Shapiro, S. S. and M. B. Wilk: 1965, An analysis of variance test for normality (complete samples). Biometrika 52(3-4).
12. Titterington, D. M., G. D. Murray, L. S. Murray, D. J. Spiegelhalter, A. M. Skene, J. D. F. Habbema, and G. J. Gelpke: 1981, Comparison of Discrimination Techniques Applied to a Complex Data Set of Head Injured Patients (with discussion). Journal of the Royal Statistical Society, Series A (General) 144(2).
13. Verboven, S. and M. Hubert: 2005, LIBRA: a MATLAB Library for Robust Analysis. Chemometrics and Intelligent Laboratory Systems 75(2).


The AGA Evaluating Model of Customer Loyalty Based on E-commerce Environment 6 JOURNAL OF SOFTWARE, VOL. 4, NO. 3, MAY 009 The AGA Evaluating Model of Custoer Loyalty Based on E-coerce Environent Shaoei Yang Econoics and Manageent Departent, North China Electric Power University,

More information

ASIC Design Project Management Supported by Multi Agent Simulation

ASIC Design Project Management Supported by Multi Agent Simulation ASIC Design Project Manageent Supported by Multi Agent Siulation Jana Blaschke, Christian Sebeke, Wolfgang Rosenstiel Abstract The coplexity of Application Specific Integrated Circuits (ASICs) is continuously

More information

An Approach to Combating Free-riding in Peer-to-Peer Networks

An Approach to Combating Free-riding in Peer-to-Peer Networks An Approach to Cobating Free-riding in Peer-to-Peer Networks Victor Ponce, Jie Wu, and Xiuqi Li Departent of Coputer Science and Engineering Florida Atlantic University Boca Raton, FL 33431 April 7, 2008

More information

Evaluating Inventory Management Performance: a Preliminary Desk-Simulation Study Based on IOC Model

Evaluating Inventory Management Performance: a Preliminary Desk-Simulation Study Based on IOC Model Evaluating Inventory Manageent Perforance: a Preliinary Desk-Siulation Study Based on IOC Model Flora Bernardel, Roberto Panizzolo, and Davide Martinazzo Abstract The focus of this study is on preliinary

More information

A magnetic Rotor to convert vacuum-energy into mechanical energy

A magnetic Rotor to convert vacuum-energy into mechanical energy A agnetic Rotor to convert vacuu-energy into echanical energy Claus W. Turtur, University of Applied Sciences Braunschweig-Wolfenbüttel Abstract Wolfenbüttel, Mai 21 2008 In previous work it was deonstrated,

More information

The Virtual Spring Mass System

The Virtual Spring Mass System The Virtual Spring Mass Syste J. S. Freudenberg EECS 6 Ebedded Control Systes Huan Coputer Interaction A force feedbac syste, such as the haptic heel used in the EECS 6 lab, is capable of exhibiting a

More information

Exploiting Hardware Heterogeneity within the Same Instance Type of Amazon EC2

Exploiting Hardware Heterogeneity within the Same Instance Type of Amazon EC2 Exploiting Hardware Heterogeneity within the Sae Instance Type of Aazon EC2 Zhonghong Ou, Hao Zhuang, Jukka K. Nurinen, Antti Ylä-Jääski, Pan Hui Aalto University, Finland; Deutsch Teleko Laboratories,

More information

Research Article Performance Evaluation of Human Resource Outsourcing in Food Processing Enterprises

Research Article Performance Evaluation of Human Resource Outsourcing in Food Processing Enterprises Advance Journal of Food Science and Technology 9(2): 964-969, 205 ISSN: 2042-4868; e-issn: 2042-4876 205 Maxwell Scientific Publication Corp. Subitted: August 0, 205 Accepted: Septeber 3, 205 Published:

More information

Online Appendix I: A Model of Household Bargaining with Violence. In this appendix I develop a simple model of household bargaining that

Online Appendix I: A Model of Household Bargaining with Violence. In this appendix I develop a simple model of household bargaining that Online Appendix I: A Model of Household Bargaining ith Violence In this appendix I develop a siple odel of household bargaining that incorporates violence and shos under hat assuptions an increase in oen

More information

COMBINING CRASH RECORDER AND PAIRED COMPARISON TECHNIQUE: INJURY RISK FUNCTIONS IN FRONTAL AND REAR IMPACTS WITH SPECIAL REFERENCE TO NECK INJURIES

COMBINING CRASH RECORDER AND PAIRED COMPARISON TECHNIQUE: INJURY RISK FUNCTIONS IN FRONTAL AND REAR IMPACTS WITH SPECIAL REFERENCE TO NECK INJURIES COMBINING CRASH RECORDER AND AIRED COMARISON TECHNIQUE: INJURY RISK FUNCTIONS IN FRONTAL AND REAR IMACTS WITH SECIAL REFERENCE TO NECK INJURIES Anders Kullgren, Maria Krafft Folksa Research, 66 Stockhol,

More information

Partitioning Data on Features or Samples in Communication-Efficient Distributed Optimization?

Partitioning Data on Features or Samples in Communication-Efficient Distributed Optimization? Partitioning Data on Features or Saples in Counication-Efficient Distributed Optiization? Chenxin Ma Industrial and Systes Engineering Lehigh University, USA ch54@lehigh.edu Martin Taáč Industrial and

More information

Construction Economics & Finance. Module 3 Lecture-1

Construction Economics & Finance. Module 3 Lecture-1 Depreciation:- Construction Econoics & Finance Module 3 Lecture- It represents the reduction in arket value of an asset due to age, wear and tear and obsolescence. The physical deterioration of the asset

More information

A CHAOS MODEL OF SUBHARMONIC OSCILLATIONS IN CURRENT MODE PWM BOOST CONVERTERS

A CHAOS MODEL OF SUBHARMONIC OSCILLATIONS IN CURRENT MODE PWM BOOST CONVERTERS A CHAOS MODEL OF SUBHARMONIC OSCILLATIONS IN CURRENT MODE PWM BOOST CONVERTERS Isaac Zafrany and Sa BenYaakov Departent of Electrical and Coputer Engineering BenGurion University of the Negev P. O. Box

More information

Factored Models for Probabilistic Modal Logic

Factored Models for Probabilistic Modal Logic Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008 Factored Models for Probabilistic Modal Logic Afsaneh Shirazi and Eyal Air Coputer Science Departent, University of Illinois

More information

International Journal of Management & Information Systems First Quarter 2012 Volume 16, Number 1

International Journal of Management & Information Systems First Quarter 2012 Volume 16, Number 1 International Journal of Manageent & Inforation Systes First Quarter 2012 Volue 16, Nuber 1 Proposal And Effectiveness Of A Highly Copelling Direct Mail Method - Establishent And Deployent Of PMOS-DM Hisatoshi

More information

MINIMUM VERTEX DEGREE THRESHOLD FOR LOOSE HAMILTON CYCLES IN 3-UNIFORM HYPERGRAPHS

MINIMUM VERTEX DEGREE THRESHOLD FOR LOOSE HAMILTON CYCLES IN 3-UNIFORM HYPERGRAPHS MINIMUM VERTEX DEGREE THRESHOLD FOR LOOSE HAMILTON CYCLES IN 3-UNIFORM HYPERGRAPHS JIE HAN AND YI ZHAO Abstract. We show that for sufficiently large n, every 3-unifor hypergraph on n vertices with iniu

More information

Software Quality Characteristics Tested For Mobile Application Development

Software Quality Characteristics Tested For Mobile Application Development Thesis no: MGSE-2015-02 Software Quality Characteristics Tested For Mobile Application Developent Literature Review and Epirical Survey WALEED ANWAR Faculty of Coputing Blekinge Institute of Technology

More information

On Computing Nearest Neighbors with Applications to Decoding of Binary Linear Codes

On Computing Nearest Neighbors with Applications to Decoding of Binary Linear Codes On Coputing Nearest Neighbors with Applications to Decoding of Binary Linear Codes Alexander May and Ilya Ozerov Horst Görtz Institute for IT-Security Ruhr-University Bochu, Gerany Faculty of Matheatics

More information

Real Time Target Tracking with Binary Sensor Networks and Parallel Computing

Real Time Target Tracking with Binary Sensor Networks and Parallel Computing Real Tie Target Tracking with Binary Sensor Networks and Parallel Coputing Hong Lin, John Rushing, Sara J. Graves, Steve Tanner, and Evans Criswell Abstract A parallel real tie data fusion and target tracking

More information

An Optimal Task Allocation Model for System Cost Analysis in Heterogeneous Distributed Computing Systems: A Heuristic Approach

An Optimal Task Allocation Model for System Cost Analysis in Heterogeneous Distributed Computing Systems: A Heuristic Approach An Optial Tas Allocation Model for Syste Cost Analysis in Heterogeneous Distributed Coputing Systes: A Heuristic Approach P. K. Yadav Central Building Research Institute, Rooree- 247667, Uttarahand (INDIA)

More information

Modeling Parallel Applications Performance on Heterogeneous Systems

Modeling Parallel Applications Performance on Heterogeneous Systems Modeling Parallel Applications Perforance on Heterogeneous Systes Jaeela Al-Jaroodi, Nader Mohaed, Hong Jiang and David Swanson Departent of Coputer Science and Engineering University of Nebraska Lincoln

More information

Implementation of Active Queue Management in a Combined Input and Output Queued Switch

Implementation of Active Queue Management in a Combined Input and Output Queued Switch pleentation of Active Queue Manageent in a obined nput and Output Queued Switch Bartek Wydrowski and Moshe Zukeran AR Special Research entre for Ultra-Broadband nforation Networks, EEE Departent, The University

More information

ADJUSTING FOR QUALITY CHANGE

ADJUSTING FOR QUALITY CHANGE ADJUSTING FOR QUALITY CHANGE 7 Introduction 7.1 The easureent of changes in the level of consuer prices is coplicated by the appearance and disappearance of new and old goods and services, as well as changes

More information

The Model of Lines for Option Pricing with Jumps

The Model of Lines for Option Pricing with Jumps The Model of Lines for Option Pricing with Jups Claudio Albanese, Sebastian Jaiungal and Ditri H Rubisov January 17, 21 Departent of Matheatics, University of Toronto Abstract This article reviews a pricing

More information

Evaluating Software Quality of Vendors using Fuzzy Analytic Hierarchy Process

Evaluating Software Quality of Vendors using Fuzzy Analytic Hierarchy Process IMECS 2008 9-2 March 2008 Hong Kong Evaluating Software Quality of Vendors using Fuzzy Analytic Hierarchy Process Kevin K.F. Yuen* Henry C.W. au Abstract This paper proposes a fuzzy Analytic Hierarchy

More information

Information Processing Letters

Information Processing Letters Inforation Processing Letters 111 2011) 178 183 Contents lists available at ScienceDirect Inforation Processing Letters www.elsevier.co/locate/ipl Offline file assignents for online load balancing Paul

More information

Binary Embedding: Fundamental Limits and Fast Algorithm

Binary Embedding: Fundamental Limits and Fast Algorithm Binary Ebedding: Fundaental Liits and Fast Algorith Xinyang Yi The University of Texas at Austin yixy@utexas.edu Eric Price The University of Texas at Austin ecprice@cs.utexas.edu Constantine Caraanis

More information

Markovian inventory policy with application to the paper industry

Markovian inventory policy with application to the paper industry Coputers and Cheical Engineering 26 (2002) 1399 1413 www.elsevier.co/locate/copcheeng Markovian inventory policy with application to the paper industry K. Karen Yin a, *, Hu Liu a,1, Neil E. Johnson b,2

More information

Impact of Processing Costs on Service Chain Placement in Network Functions Virtualization

Impact of Processing Costs on Service Chain Placement in Network Functions Virtualization Ipact of Processing Costs on Service Chain Placeent in Network Functions Virtualization Marco Savi, Massio Tornatore, Giacoo Verticale Dipartiento di Elettronica, Inforazione e Bioingegneria, Politecnico

More information

Design of Model Reference Self Tuning Mechanism for PID like Fuzzy Controller

Design of Model Reference Self Tuning Mechanism for PID like Fuzzy Controller Research Article International Journal of Current Engineering and Technology EISSN 77 46, PISSN 347 56 4 INPRESSCO, All Rights Reserved Available at http://inpressco.co/category/ijcet Design of Model Reference

More information

REQUIREMENTS FOR A COMPUTER SCIENCE CURRICULUM EMPHASIZING INFORMATION TECHNOLOGY SUBJECT AREA: CURRICULUM ISSUES

REQUIREMENTS FOR A COMPUTER SCIENCE CURRICULUM EMPHASIZING INFORMATION TECHNOLOGY SUBJECT AREA: CURRICULUM ISSUES REQUIREMENTS FOR A COMPUTER SCIENCE CURRICULUM EMPHASIZING INFORMATION TECHNOLOGY SUBJECT AREA: CURRICULUM ISSUES Charles Reynolds Christopher Fox reynolds @cs.ju.edu fox@cs.ju.edu Departent of Coputer

More information

Capacity of Multiple-Antenna Systems With Both Receiver and Transmitter Channel State Information

Capacity of Multiple-Antenna Systems With Both Receiver and Transmitter Channel State Information IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 49, NO., OCTOBER 23 2697 Capacity of Multiple-Antenna Systes With Both Receiver and Transitter Channel State Inforation Sudharan K. Jayaweera, Student Meber,

More information

Towards Change Management Capability Assessment Model for Contractors in Building Project

Towards Change Management Capability Assessment Model for Contractors in Building Project Middle-East Journal of Scientific Research 23 (7): 1327-1333, 2015 ISSN 1990-9233 IDOSI Publications, 2015 DOI: 10.5829/idosi.ejsr.2015.23.07.120 Towards Change Manageent Capability Assessent Model for

More information

Preference-based Search and Multi-criteria Optimization

Preference-based Search and Multi-criteria Optimization Fro: AAAI-02 Proceedings. Copyright 2002, AAAI (www.aaai.org). All rights reserved. Preference-based Search and Multi-criteria Optiization Ulrich Junker ILOG 1681, route des Dolines F-06560 Valbonne ujunker@ilog.fr

More information

The Fundamentals of Modal Testing

The Fundamentals of Modal Testing The Fundaentals of Modal Testing Application Note 243-3 Η(ω) = Σ n r=1 φ φ i j / 2 2 2 2 ( ω n - ω ) + (2ξωωn) Preface Modal analysis is defined as the study of the dynaic characteristics of a echanical

More information

Dynamic Placement for Clustered Web Applications

Dynamic Placement for Clustered Web Applications Dynaic laceent for Clustered Web Applications A. Karve, T. Kibrel, G. acifici, M. Spreitzer, M. Steinder, M. Sviridenko, and A. Tantawi IBM T.J. Watson Research Center {karve,kibrel,giovanni,spreitz,steinder,sviri,tantawi}@us.ib.co

More information

Cross-Domain Metric Learning Based on Information Theory

Cross-Domain Metric Learning Based on Information Theory Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence Cross-Doain Metric Learning Based on Inforation Theory Hao Wang,2, Wei Wang 2,3, Chen Zhang 2, Fanjiang Xu 2. State Key Laboratory

More information

Note on a generalized wage rigidity result. Abstract

Note on a generalized wage rigidity result. Abstract Note on a generalized age rigidity result Ariit Mukheree University of Nottingha Abstract Considering Cournot copetition, this note shos that, if the firs differ in labor productivities, the equilibriu

More information

A quantum secret ballot. Abstract

A quantum secret ballot. Abstract A quantu secret ballot Shahar Dolev and Itaar Pitowsky The Edelstein Center, Levi Building, The Hebrerw University, Givat Ra, Jerusale, Israel Boaz Tair arxiv:quant-ph/060087v 8 Mar 006 Departent of Philosophy

More information

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

More information

Comparison of Gamma and Passive Neutron Non-Destructive Assay Total Measurement Uncertainty Using the High Efficiency Neutron Counter

Comparison of Gamma and Passive Neutron Non-Destructive Assay Total Measurement Uncertainty Using the High Efficiency Neutron Counter Coparison o Gaa and Passive Neutron Non-Destructive Assay Total Measureent Uncertainty Using the High Eiciency Neutron Counter John M. Veilleux Los Alaos National Laboratory Los Alaos, NM 87545, USA 505/667-7434

More information

CLOSED-LOOP SUPPLY CHAIN NETWORK OPTIMIZATION FOR HONG KONG CARTRIDGE RECYCLING INDUSTRY

CLOSED-LOOP SUPPLY CHAIN NETWORK OPTIMIZATION FOR HONG KONG CARTRIDGE RECYCLING INDUSTRY CLOSED-LOOP SUPPLY CHAIN NETWORK OPTIMIZATION FOR HONG KONG CARTRIDGE RECYCLING INDUSTRY Y. T. Chen Departent of Industrial and Systes Engineering Hong Kong Polytechnic University, Hong Kong yongtong.chen@connect.polyu.hk

More information

Calculation Method for evaluating Solar Assisted Heat Pump Systems in SAP 2009. 15 July 2013

Calculation Method for evaluating Solar Assisted Heat Pump Systems in SAP 2009. 15 July 2013 Calculation Method for evaluating Solar Assisted Heat Pup Systes in SAP 2009 15 July 2013 Page 1 of 17 1 Introduction This docuent describes how Solar Assisted Heat Pup Systes are recognised in the National

More information

No. 2004/12. Daniel Schmidt

No. 2004/12. Daniel Schmidt No. 2004/12 Private equity-, stock- and ixed asset-portfolios: A bootstrap approach to deterine perforance characteristics, diversification benefits and optial portfolio allocations Daniel Schidt Center

More information

Salty Waters. Instructions for the activity 3. Results Worksheet 5. Class Results Sheet 7. Teacher Notes 8. Sample results. 12

Salty Waters. Instructions for the activity 3. Results Worksheet 5. Class Results Sheet 7. Teacher Notes 8. Sample results. 12 1 Salty Waters Alost all of the water on Earth is in the for of a solution containing dissolved salts. In this activity students are invited to easure the salinity of a saple of salt water. While carrying

More information

Insurance Spirals and the Lloyd s Market

Insurance Spirals and the Lloyd s Market Insurance Spirals and the Lloyd s Market Andrew Bain University of Glasgow Abstract This paper presents a odel of reinsurance arket spirals, and applies it to the situation that existed in the Lloyd s

More information

Incorporating Complex Substitution Patterns and Variance Scaling in Long Distance Travel Choice Behavior

Incorporating Complex Substitution Patterns and Variance Scaling in Long Distance Travel Choice Behavior Incorporating Coplex Substitution Patterns and Variance Scaling in Long Distance Travel Choice Behavior Frank S. Koppelan Professor of Civil Engineering and Transportation Northwestern University Evanston,

More information

Calculating the Return on Investment (ROI) for DMSMS Management. The Problem with Cost Avoidance

Calculating the Return on Investment (ROI) for DMSMS Management. The Problem with Cost Avoidance Calculating the Return on nvestent () for DMSMS Manageent Peter Sandborn CALCE, Departent of Mechanical Engineering (31) 45-3167 sandborn@calce.ud.edu www.ene.ud.edu/escml/obsolescence.ht October 28, 21

More information