Weighted Pseudo-Metric Discriminatory Power Improvement Using a Bayesian Logistic Regression Model Based on a Variational Method


R. Ksantini^1, D. Ziou^1, B. Colin^2, and F. Dubeau^2
^1 Département d'Informatique, Université de Sherbrooke, Sherbrooke (Qc), Canada, J1K 2R1
^2 Département de Mathématiques, Université de Sherbrooke, Sherbrooke (Qc), Canada, J1K 2R1
{riadh.ksantini, djemel.ziou}@usherbrooke.ca, {bernard.colin, francois.dubeau}@usherbrooke.ca
Corresponding author. Tel.: x2853; fax:
Technical Report No

Abstract

Distance measures such as the nearest neighbor rule distance and the Euclidean distance have been the most widely used to measure similarity between feature vectors in content-based image retrieval (CBIR) systems. However, these similarity measures make no assumption about the probability distributions or the local relevances of the feature vectors, so irrelevant features can hurt retrieval performance. Probabilistic approaches have proven to be an effective solution to this CBIR problem. In this paper, we use a Bayesian logistic regression model based on a variational method to compute the weights of a pseudo-metric, in order to improve its discriminatory capacity and thereby increase image retrieval accuracy. This pseudo-metric makes use of the compressed and quantized versions of the Daubechies-8 wavelet decomposed feature vectors, and its weights were previously adjusted by classical logistic regression. The evaluation and comparison of the Bayesian logistic regression model and the classical logistic regression model are performed both independently of and within the image retrieval context. Experimental results show that the Bayesian logistic regression model is a significantly better tool than the classical logistic regression model for computing the pseudo-metric weights and for improving retrieval and classification performance.

Keywords: Image Retrieval, Logistic Regression, Variational Method, Weighted Pseudo-Metric.

Contents

List of Tables 3
List of Figures 3
1 Introduction 4
2 The pseudo-metric and the fast feature vector querying algorithm
  2.1 The pseudo-metric
  2.2 The fast feature vector querying algorithm
3 Pseudo-metric weight adjustment
  3.1 The classical logistic regression model
  3.2 The Bayesian logistic regression model
  3.3 Training
4 Color image retrieval method
  4.1 Color image database preprocessing
  4.2 The querying algorithm
5 Experimental results
  5.1 Decontextualized classification performance evaluation of both models
  5.2 Contextualized evaluation and comparison of both models
    5.2.1 Used feature vectors
    5.2.2 The choices of q_0 and q_1
    5.2.3 Querying evaluation and comparison
6 Conclusion 27
Bibliography 29

List of Tables

5.1 Comparison between the CLR and the BLR (synthetic data 1)
5.2 Comparison between the CLR and the BLR (synthetic data 2)
5.3 Comparison between the CLR and the BLR (extracted data sets from the Iris data)

List of Figures

5.1 Synthetic data 1
5.2 Synthetic data 2
5.3 Extracted data sets from the Iris data
5.4 Evaluation (ZuBuD database)
5.5 Evaluation (WANG database)
5.6 Comparison (ZuBuD database) (m = 30)
5.7 Comparison (WANG database) (m = 30)

1 Introduction

The rapid expansion of the Internet and the wide use of digital data in many real world applications in fields such as medicine, weather prediction, communications, commerce and academia have increased the need for both efficient image database creation and efficient retrieval procedures. For this reason, the content-based image retrieval (CBIR) approach was proposed [25], [3]. In this approach, the first step is to compute for each database image a feature vector capturing certain visual features of the image such as color, texture and shape. This feature vector is stored in a featurebase; then, given a query image chosen by a user, its feature vector is computed, compared to the featurebase feature vectors by a similarity measure, and finally the database images most similar to the query image are returned to the user. Distance measures like the nearest neighbor rule distance and the Euclidean distance have been the most widely used for feature vector comparison in CBIR systems. However, these similarity measures are based only on the distances between feature vectors in the feature space. Therefore, because of the lack of information about the relative relevances of the featurebase feature vectors and because of the noise in these vectors, distance measures can fail and irrelevant features can hurt retrieval performance. Probabilistic approaches are a promising solution to this CBIR problem [15]; compared to standard CBIR methods based on distance measures, they can lead to a significant gain in retrieval accuracy. In fact, these approaches are capable of generating probabilistic similarity measures and highly customized metrics for computing image similarity, based on the consideration and distinction of the relative feature vector relevances. In the following we review some of the CBIR systems based on these probabilistic approaches. J. Peng et al. [11] used a binary classification to classify the database color image feature vectors as relevant or irrelevant. The classified feature vectors and the query image feature vector constitute training data, from which relevance weights of the different features are computed. The components of the weight vector represent the local relevance of each feature and are adaptive to the location of the query image feature vector in the feature space. After the feature relevance has been determined, a weighted similarity metric is selected using reinforcement learning based on classical logistic regression. Three different metrics were considered: the weighted Euclidean metric, the weighted city-block metric and the weighted dominance metric. G. Caenen and E. J. Pauwels [8] used the classical quadratic logistic regression model in order to classify database image feature vectors as relevant or irrelevant. Based on this classification, a total relevance probability is assigned to each image in the database. This total relevance probability is a linear combination, with weights used to fine-tune the influence of each individual feature, of the natural logarithms of the logistic relevance probabilities of the feature vector components. Database images are ranked according to their total relevance probabilities. S. Aksoy et al. [21] used weighted L1 and L2 distances in order to measure the similarity degree between two images, where the weights are the ratios of the standard deviations of the feature values computed both over the whole database and over the images selected as relevant.
Each weight vector component represents the local relevance of a feature, and assigns more importance to that feature if it is relevant. S. Aksoy and R. M. Haralick [20] investigated the effectiveness of five different normalization methods in combination with two different likelihood-based similarity measures that compute the likelihood of two images being similar or dissimilar, one being the query image and the other being an image in the database. First, two classes are defined, the relevance class and the irrelevance class, and then the likelihood values are derived from the Bayesian classifier. Two different methods were used to estimate the conditional probabilities used in the classifier. The first method used a multivariate normal assumption and the second one used independently fitted distributions for each feature. The similarity degree between a query image and a database image is measured by the likelihood ratio.

In this paper, we investigate the effectiveness of a Bayesian logistic regression model based on a variational method [24], in order to adjust the weights of the pseudo-metric used in [18], and thus to improve its discriminatory capacity and to increase image retrieval accuracy. This pseudo-metric makes use of the compressed and quantized versions of the Daubechies-8 wavelet decomposed feature vectors, and its weights were previously adjusted by classical logistic regression. The basic principle of using the Bayesian logistic regression model and the classical logistic regression one is to allow a good separation between a relevance class and an irrelevance class extracted from the featurebase representing the color image database, and then to compute the pseudo-metric weights, which represent the local relevances of the scaling factor differences and of the resolution level coefficient mismatches of the Daubechies-8 wavelet decomposed, compressed and quantized feature vectors. We will show that, thanks to the variational method and to the incorporation of prior information, the Bayesian logistic regression model is a significantly better tool than the classical logistic regression model for computing the pseudo-metric weights and for improving the classification performance and the querying results. The classification performance evaluation and comparison of both models are performed independently of the image retrieval context, using a cumulative distance error and a classification accuracy used in [14], and the evaluation of the retrieval method using each model separately is performed using precision and scope curves as defined in [12]. In the next section, we briefly redefine the pseudo-metric and explain the fast feature vector querying algorithm. In section 3, we briefly describe the pseudo-metric weight adjustment using the classical logistic regression model, while showing its limitations and why the Bayesian logistic regression model based on a variational method is more appropriate for the pseudo-metric weight computation. Then, we detail the Bayesian logistic regression model based on a variational method and we present the weight computation algorithm. Finally, we describe the data training performed for both models, in order to validate the pseudo-metric weight computation. The color image retrieval method is briefly presented in section 4. Finally, in section 5, the decontextualized classification performance evaluation and comparison of both models are performed using a cumulative distance error and a classification accuracy [14]. Then, the feature vectors that we use to represent the database color images are summarized.
Moreover, we perform experiments to validate the Bayesian logistic regression model, and we use the precision and scope [12] in order to show the advantage of the Bayesian logistic regression model over the classical logistic regression one in terms of querying results.

2 The pseudo-metric and the fast feature vector querying algorithm

2.1 The pseudo-metric

Let us consider Q and T as the query and the target feature vectors, respectively, having 2^J components each. The pseudo-metric is given by the following expression

\|Q, T\| = \widetilde{w}_0 \, |Q[0] - T[0]| \; - \sum_{i:\, \widetilde{Q}^c_q[i] \neq 0} w_{bin(i)} \, \big( \widetilde{Q}^c_q[i] = \widetilde{T}^c_q[i] \big), \qquad (2.1)

where

\big( \widetilde{Q}^c_q[i] = \widetilde{T}^c_q[i] \big) =
\begin{cases}
1 & \text{if } \widetilde{Q}^c_q[i] = \widetilde{T}^c_q[i], \\
0 & \text{otherwise,}
\end{cases} \qquad (2.2)

Q[0] and T[0] are the scaling factors of Q and T, \widetilde{Q}^c_q[i] and \widetilde{T}^c_q[i] represent the i-th coefficients of their Daubechies-8 wavelet decomposed, compressed to m coefficients and quantized versions, \widetilde{w}_0 and the w_{bin(i)}'s are the weights to compute, and the bucketing function bin(\cdot) groups these weights according to the J resolution levels, such that

bin(i) = \lfloor \log_2(i) \rfloor \quad \text{with } i = 1, \ldots, 2^J - 1. \qquad (2.3)

2.2 The fast feature vector querying algorithm

In order to optimize the pseudo-metric computation process, we introduce two search arrays denoted by \Theta^+ and \Theta^- as defined in [18]. After these arrays have been created and the weights \widetilde{w}_0 and \{w_j\}_{j=0}^{J-1} computed, the retrieval procedure for a query feature vector Q in the featurebase of feature vectors T_k (k = 1, \ldots, |DB|), where |DB| denotes the featurebase size, is defined as follows.

Procedure Retrieval(Q: array [1..2^J] of reals, m: integer, \Theta^-, \Theta^+)
  \widetilde{Q} <- FastDaubechiesWaveletsDecomposition(Q)
  Initialize Score[k] = 0, for each k in {1, ..., |DB|}
  For each k in {1, ..., |DB|} do
    Score[position of T_k in the DB] = \widetilde{w}_0 |\widetilde{Q}[0] - \widetilde{T}_k[0]|
  End for
  \widetilde{Q}^c <- Compress(\widetilde{Q}, m)
  \widetilde{Q}^c_q <- Quantify(\widetilde{Q}^c)
  For each i such that \widetilde{Q}^c_q[i] != 0 do
    If \widetilde{Q}^c_q[i] > 0 then
      List <- \Theta^+[i]
    Else
      List <- \Theta^-[i]
    End if
    For each element l of List do
      Score[position of l in the DB] = Score[position of l in the DB] - w_{bin(i)}
    End for
  End for
  Return Score
End procedure

This procedure returns an array Score such that Score[k] = \|Q, T_k\| for each k in {1, ..., |DB|}. The elements of the array Score, which are the similarity degrees between the query Q and the featurebase feature vectors T_k (k in {1, ..., |DB|}), can be negative or positive. The most negative similarity degree corresponds to the closest target to the query Q.
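To make the scoring logic concrete, the following Python sketch computes the same scores with a dense double loop instead of the search arrays of [18]. It is an illustration only, not the authors' implementation: it assumes each vector is already Daubechies-8 decomposed, compressed and quantized to length 2^J, with index 0 holding the scaling factor and discarded coefficients set to 0.

import numpy as np

def retrieval_scores(Q, targets, w_tilde0, w):
    """Score every target against the query with the weighted pseudo-metric.

    Q, targets[k] : decomposed, compressed and quantized vectors of length 2**J.
    w_tilde0      : weight of the scaling-factor difference.
    w             : array of J resolution-level weights w[bin(i)].
    """
    scores = np.array([w_tilde0 * abs(Q[0] - T[0]) for T in targets], dtype=float)
    for i in np.flatnonzero(Q[1:]) + 1:          # retained query coefficients
        b = int(np.floor(np.log2(i)))            # resolution level bin(i)
        for k, T in enumerate(targets):
            if T[i] == Q[i]:                     # quantized coefficients match
                scores[k] -= w[b]
    return scores                                # most negative = best match

In the actual procedure the inner loop over all targets is avoided: for each index i, the search arrays \Theta^+[i] and \Theta^-[i] directly list the targets whose i-th quantized coefficient is positive or negative, so only those targets are updated.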

3 Pseudo-metric weight adjustment

In this section, we briefly describe the pseudo-metric weight adjustment using the classical logistic regression model, while showing its limitations and why the variational method based Bayesian logistic regression model is more appropriate for the pseudo-metric weight computation. Then, we detail the Bayesian logistic regression model based on a variational method and we present the weight computation algorithm. Finally, we describe the data training performed for both models.

3.1 The classical logistic regression model

The weights \widetilde{w}_0 and \{w_k\}_{k=0}^{J-1} are adjusted in such a way that the pseudo-metric is efficient enough to match similar feature vectors as well as to discriminate dissimilar ones. The weight \widetilde{w}_0 represents the relative relevance of the scaling factor differences of the featurebase feature vectors, and a weight w_k represents the relative relevance of the mismatches of the k-th resolution level coefficients of these vectors.

We define two classes, the relevance class denoted by \Omega_0 and the irrelevance class denoted by \Omega_1, in order to classify the feature vector pairs as similar or dissimilar. In the classical logistic regression model, each feature vector pair is represented by an explanatory vector and a binary target variable. Specifically, for the i-th feature vector pair we have

S_i = \begin{cases} 1 & \text{in case of dissimilarity,} \\ 0 & \text{in case of similarity,} \end{cases}

X_i = (\widetilde{X}_{0,i}, X_{0,i}, \ldots, X_{J-1,i}, 1) \in \mathbb{R}^{J+1} \times \{1\} : \text{explanatory vector},

where \widetilde{X}_{0,i} is the absolute value of the difference between the scaling factors of the Daubechies-8 wavelet decomposed, compressed and quantized versions of the two feature vectors, \{X_{k,i}\}_{k=0}^{J-1} are the numbers of mismatches between their J resolution level coefficients, and S_i is the binary target variable, equal to 0 or 1 depending on whether or not the two feature vectors are intended to be similar. We suppose that we have n_0 pairs of similar feature vectors and n_1 pairs of dissimilar ones. Thus, the class \Omega_0 contains n_0 explanatory vectors and their associated binary target variables \{X_i, S_i = 0\}_{i=1}^{n_0}, representing the pairs of similar feature vectors, and the class \Omega_1 contains n_1 explanatory vectors and their associated binary target variables \{X_j, S_j = 1\}_{j=1}^{n_1}, representing the pairs of dissimilar feature vectors. Given the explanatory vector X_i, we model the binary target variable S_i by an irrelevance probability defined as follows

p_i = P(S_i = 1 \mid X_i) = F\Big( \widetilde{w}_0 \widetilde{X}_{0,i} + \sum_{k=0}^{J-1} w_k X_{k,i} + v \Big), \qquad (3.1)

where F(x) is given by

F(x) = \frac{e^x}{1 + e^x}, \qquad (3.2)

and

\bar{p}_i = 1 - p_i = P(S_i = 0 \mid X_i) = F\Big( -\widetilde{w}_0 \widetilde{X}_{0,i} - \sum_{k=0}^{J-1} w_k X_{k,i} - v \Big), \qquad (3.3)

where v is an unknown intercept. It is treated as a weight when we compute the pseudo-metric weights, but it is not used when the pseudo-metric compares feature vectors, since it is a constant for all query-target pairs and F(-x) is a decreasing function. The weights and the intercept are then determined by maximum likelihood estimation, i.e. they maximize the probability of the observed configuration. More precisely, if we take the relevance and irrelevance class explanatory vectors with their associated binary target variable values and use equation (3.1) to compute the probabilities p_i, then the weights and the intercept are chosen to maximize the following conditional log-likelihood

\log\big( L(\widetilde{w}_0, w_0, \ldots, w_{J-1}, v) \big) = \sum_{d=1}^{n} \big( (1 - S_d)\log(1 - p_d) + S_d \log(p_d) \big) \qquad (3.4)
= \sum_{i=1}^{n_0} \log(1 - p_i) + \sum_{j=1}^{n_1} \log(p_j), \qquad (3.5)

where n = n_0 + n_1.
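As an illustration, the conditional log-likelihood (3.4)-(3.5) can be maximized numerically with a generic gradient-based optimizer. The short Python sketch below does this with scipy; it is a plain maximum-likelihood fit, not the Fisher scoring implementation used later in the paper, and the variable names (explanatory matrix X with the constant 1 already appended as the last column, targets S) are illustrative.

import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # F(x) = e^x / (1 + e^x)

def fit_clr_weights(X, S):
    """Maximum-likelihood fit of the CLR weights and intercept.

    X : (n, J+2) array of explanatory vectors built from Omega_0 and Omega_1.
    S : (n,) array of binary targets (0 = similar pair, 1 = dissimilar pair).
    Returns the vector (w_tilde0, w_0, ..., w_{J-1}, v).
    """
    def neg_log_likelihood(theta):
        p = expit(X @ theta)                  # irrelevance probabilities p_i
        eps = 1e-12                           # guard against log(0)
        return -np.sum(S * np.log(p + eps) + (1 - S) * np.log(1 - p + eps))

    theta0 = np.zeros(X.shape[1])             # start from all-zero weights
    return minimize(neg_log_likelihood, theta0, method="BFGS").x

As discussed next, this maximization can converge to a local maximum or diverge when the classes are separable, which is precisely the motivation for the Bayesian treatment.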

Standard optimization algorithms such as the Fisher scoring and Newton-Raphson algorithms [5], [4] can be invoked to maximize the likelihood with respect to the weights and the intercept. However, according to [19], [7] and [4], in several cases, especially because of the exponential in the likelihood function, or because the likelihood contains several local maxima, or because of the existence of many zero explanatory vectors, these standard optimization algorithms converge to local maxima instead of global ones and may even have difficulty converging. Thus, maximum likelihood can fail, and estimates of the parameters of interest (weights and intercept) may not be optimal, may not exist, or may lie on the boundary of the parameter space. These problems can be solved by using a Bayesian logistic regression model instead of the classical one. In the Bayesian logistic regression framework, there are three main components: a chosen prior distribution over the parameters of interest, the likelihood function and the posterior distribution. These three components are formally combined by Bayes' rule [1]. The posterior distribution contains all the available knowledge about the parameters of interest in the model. Therefore, in the Bayesian logistic regression model, the estimates of the parameters of interest, which are considered as random variables, are the components of the posterior distribution mean or mode [7]. Thus, the problems of nonexistence and overflow of the maximum likelihood estimates can be solved by smoothing them through a suitable prior distribution over the parameters of interest [7], and the problem of the non-optimality of these estimates can be solved by choosing a posterior distribution having a unique mode. In the literature, many priors with different distributional forms have been chosen for different applications based on Bayesian logistic regression. Examples include the Dirichlet prior, Jeffreys prior, the Uniform prior and the Normal prior. A Dirichlet prior was chosen for the log-linear analysis of sparse frequency tables in [7]. In that application the likelihood function is a multinomial density, which is conjugate to the Dirichlet prior, and therefore the posterior distribution has an analytically tractable Dirichlet form. This has the effect of smoothing the estimates toward a specific model [5]. Jeffreys prior is based on a structural rule and has a good theoretical justification [7], [13]. However, in larger problems, when the number of parameters of interest is large, it is difficult to apply because of its computational intensity. Using Uniform priors leads back to the classical logistic regression model [13], [7], and in this case we again face the problems of nonexistence and overflow of the maximum likelihood estimates. The Normal prior has become popular in logit modelling [7], [16], [19], [9], [17]. It has the advantage of low computational intensity and of smoothing the estimates toward a fixed mean and away from unreasonable extremes. However, when the likelihood function is not conjugate to the Gaussian prior, the posterior distribution has no tractable form, and its mean and mode computations become inaccurate or expensive in terms of computational intensity.
For example, F. Galindo-Garre et al. [7] used a modified Newton-Raphson algorithm to compute the posterior mode and then Markov Chain Monte Carlo sampling to compute the posterior mean, which has a high computational complexity, while R. Weiss et al. [19] used the maximum likelihood estimate and an asymptotic covariance matrix to approximate the posterior mean, which is an inaccurate approximation. According to [24], it is possible to use accurate variational transformations and Jensen's inequality in order to approximate the likelihood function by a simpler, tractable exponential form. In this case, thanks to conjugacy, combining a Gaussian prior distribution over the parameters of interest with the likelihood approximation yields a closed Gaussian form approximation to the posterior distribution. This Gaussian form is called the variational posterior approximation and is both a proper distribution and an adjustable lower bound on the posterior distribution. The mean of the variational posterior approximation is easily calculated by an iterative algorithm based on Bayesian learning, while the EM algorithm is used to maximize the variational posterior approximation with respect to the variational parameters, in order to obtain an optimal approximation. The iterative algorithm has low computational complexity and converges rapidly. Also, in this Bayesian logistic regression model, which is detailed in the next subsection, the explanatory vectors of the relevance and irrelevance classes are not observed directly but are instead distributed according to specific distributions, and this prior knowledge is incorporated in order to optimize the computation of the mean of the variational posterior approximation.

3.2 The Bayesian logistic regression model

Let us denote by X_0 = (\widetilde{X}_{0,0}, X_{0,0}, \ldots, X_{J-1,0}, 1) and X_1 = (\widetilde{X}_{0,1}, X_{0,1}, \ldots, X_{J-1,1}, 1) the random vectors whose realizations are the explanatory vectors \{X_i\}_{i=1}^{n_0} of the relevance class \Omega_0 and the explanatory vectors \{X_j\}_{j=1}^{n_1} of the irrelevance class \Omega_1, respectively. We suppose that X_0 \sim q_0(X_0) and X_1 \sim q_1(X_1), where q_0 and q_1 are two chosen distributions. To X_0 we associate a binary random variable S_0 whose realizations are the target variables \{S_i = 0\}_{i=1}^{n_0}, and to X_1 we associate a binary random variable S_1 whose realizations are the target variables \{S_j = 1\}_{j=1}^{n_1}. We set S_0 equal to 0 for similarity and S_1 equal to 1 for dissimilarity. The parameters of interest (weights and intercept) are considered as random variables and are denoted by the random vector W = (\widetilde{w}_0, w_0, \ldots, w_{J-1}, v). We assume that W \sim \pi(W), where \pi is a Gaussian prior with prior mean \mu and prior covariance matrix \Sigma. Using Bayes' rule, the posterior distribution over W is given by

P(W \mid S_0 = 0, S_1 = 1) = \frac{P(S_0 = 0, S_1 = 1 \mid W)\, \pi(W)}{P(S_0 = 0, S_1 = 1)}, \qquad (3.6)

where

P(S_0 = 0, S_1 = 1 \mid W) = \prod_{i=0}^{1} P(S_i = i \mid W)
= \sum_{x_0 \in \Omega_0,\, x_1 \in \Omega_1} \frac{1}{(\pi(W))^2} \prod_{i=0}^{1} P(S_i = i, X_i = x_i, W)
= \sum_{x_0 \in \Omega_0,\, x_1 \in \Omega_1} \prod_{i=0}^{1} \left[ \frac{P(S_i = i, X_i = x_i, W)}{P(W, X_i = x_i)\, \pi(W)} \right] P(W, X_i = x_i).

Since in the Bayesian approach we generally suppose that the space of unknown parameters is independent of the space of observations, we assume that W and X_i are independent for each i \in \{0, 1\}, and then the joint probability P(W, X_i = x_i) = \pi(W)\, q_i(X_i = x_i) for each i \in \{0, 1\}. So we obtain

P(S_0 = 0, S_1 = 1 \mid W) = \sum_{x_0 \in \Omega_0,\, x_1 \in \Omega_1} \prod_{i=0}^{1} \left[ \frac{P(S_i = i, X_i = x_i, W)}{(\pi(W))^2} \right] \frac{\pi(W)\, q_i(X_i = x_i)}{q_i(X_i = x_i)}
= \sum_{x_0 \in \Omega_0,\, x_1 \in \Omega_1} \prod_{i=0}^{1} P(S_i = i \mid X_i = x_i, W)\, q_i(X_i = x_i),

where P(S_i = i \mid X_i = x_i, W) = F((2i - 1) W^t x_i) for each i \in \{0, 1\}. Therefore

P(W \mid S_0 = 0, S_1 = 1) = \frac{\left[ \sum_{x_0 \in \Omega_0,\, x_1 \in \Omega_1} \prod_{i=0}^{1} P(S_i = i \mid X_i = x_i, W)\, q_i(X_i = x_i) \right] \pi(W)}{P(S_0 = 0, S_1 = 1)}, \qquad (3.7)

where

P(S_0 = 0, S_1 = 1) = \int \left[ \sum_{x_0 \in \Omega_0,\, x_1 \in \Omega_1} \prod_{i=0}^{1} P(S_i = i \mid X_i = x_i, W)\, q_i(X_i = x_i) \right] \pi(W)\, dW. \qquad (3.8)

The computation of the posterior distribution P(W \mid S_0 = 0, S_1 = 1) is intractable. For this reason, we approximate it by a variational posterior approximation of Gaussian form, whose mean and covariance matrix can be computed. To obtain this variational posterior approximation, we perform two successive approximations to the numerator term \left[ \sum_{x_0 \in \Omega_0,\, x_1 \in \Omega_1} \prod_{i=0}^{1} P(S_i = i \mid X_i = x_i, W)\, q_i(X_i = x_i) \right] of formula (3.7), in order to bound it by an exponential form which is conjugate to the Gaussian prior \pi(W).

First approximation: This approximation is based on a variational transformation of the sigmoid function F(x) of the logistic regression. Let us consider its logarithm

\log(F(x)) = -\log(1 + e^{-x}) = \frac{x}{2} - \log(e^{x/2} + e^{-x/2}). \qquad (3.9)

The function g(x) = -\log(e^{x/2} + e^{-x/2}) is a convex function of the variable x^2, as its second derivative with respect to x^2 is strictly positive. The tangent line to the convex function g(x) in the variable x^2 is therefore a global lower bound for it. Consequently, we can bound the function g(x) globally with a first order Taylor expansion in the variable x^2 in the neighborhood of a variational parameter \epsilon > 0. Explicitly,

g(x) \geq g(\epsilon) + (x^2 - \epsilon^2)\, \frac{\partial g(x)}{\partial x^2}(\epsilon^2) \qquad (3.10)
= -\frac{\epsilon}{2} + \log(F(\epsilon)) - \frac{1}{4\epsilon}(x^2 - \epsilon^2)\tanh\!\left(\frac{\epsilon}{2}\right), \qquad (3.11)

where \tanh(\epsilon/2) = \frac{e^{\epsilon/2} - e^{-\epsilon/2}}{e^{\epsilon/2} + e^{-\epsilon/2}}. So we obtain

\log(F(x)) \geq \log(F(\epsilon)) + \frac{x - \epsilon}{2} - \frac{1}{4\epsilon}(x^2 - \epsilon^2)\tanh\!\left(\frac{\epsilon}{2}\right). \qquad (3.12)
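The inequality (3.12) is the variational lower bound of [24] on the logistic sigmoid. As a quick numerical sanity check (not part of the original development), the following Python snippet evaluates both sides for a few values of x and \epsilon and verifies that the bound always holds and is tight at x = \pm\epsilon:

import numpy as np

def log_sigmoid(x):
    # log F(x) = -log(1 + exp(-x)), computed stably
    return -np.logaddexp(0.0, -x)

def variational_lower_bound(x, eps):
    # Right-hand side of (3.12): log F(eps) + (x - eps)/2 - phi(eps) * (x^2 - eps^2)
    phi = np.tanh(eps / 2.0) / (4.0 * eps)
    return log_sigmoid(eps) + (x - eps) / 2.0 - phi * (x**2 - eps**2)

for eps in (0.5, 1.0, 3.0):
    for x in (-4.0, -1.0, 0.0, eps, 2.0, 5.0):
        lhs, rhs = log_sigmoid(x), variational_lower_bound(x, eps)
        assert lhs >= rhs - 1e-12          # the bound never exceeds log F(x)
        if abs(x) == eps:
            assert abs(lhs - rhs) < 1e-12  # ... and is exact at x = +/- eps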

Substituting x by H_i = (2i - 1) W^t x_i and \epsilon by \epsilon_i, and exponentiating, yields the desired variational approximation to the sigmoid function

P(S_i = i \mid X_i = x_i, W) = F(H_i) \geq F(\epsilon_i)\, e^{\frac{H_i - \epsilon_i}{2} - \varphi(\epsilon_i)(H_i^2 - \epsilon_i^2)} = \bar{P}(S_i = i \mid X_i = x_i, W, \epsilon_i), \qquad (3.13)

where \varphi(\epsilon_i) = \frac{\tanh(\epsilon_i / 2)}{4 \epsilon_i}. So the numerator of the posterior distribution in formula (3.7) can be bounded as follows

\left[ \sum_{x_0 \in \Omega_0,\, x_1 \in \Omega_1} \prod_{i=0}^{1} P(S_i = i \mid X_i = x_i, W)\, q_i(X_i = x_i) \right] \pi(W) \geq \left[ \sum_{x_0 \in \Omega_0,\, x_1 \in \Omega_1} \prod_{i=0}^{1} \bar{P}(S_i = i \mid X_i = x_i, W, \epsilon_i)\, q_i(X_i = x_i) \right] \pi(W). \qquad (3.14)

Second approximation: In order to approximate the term \sum_{x_0 \in \Omega_0,\, x_1 \in \Omega_1} \prod_{i=0}^{1} \bar{P}(S_i = i \mid X_i = x_i, W, \epsilon_i)\, q_i(X_i = x_i) by an exponential form, the first approximation is insufficient. For this reason, we perform a second approximation, based on Jensen's inequality, which uses the convexity of the function e^x. Using Jensen's inequality, we obtain

\sum_{x_0 \in \Omega_0,\, x_1 \in \Omega_1} \prod_{i=0}^{1} \bar{P}(S_i = i \mid X_i = x_i, W, \epsilon_i)\, q_i(X_i = x_i)
= \left[ \prod_{i=0}^{1} F(\epsilon_i) \right] \sum_{x_0 \in \Omega_0,\, x_1 \in \Omega_1} \prod_{i=0}^{1} e^{\left[ \frac{H_i - \epsilon_i}{2} - \varphi(\epsilon_i)(H_i^2 - \epsilon_i^2) \right]} q_i(X_i = x_i)
\geq \left[ \prod_{i=0}^{1} F(\epsilon_i) \right] \prod_{i=0}^{1} e^{\left[ \frac{E_{q_i}[H_i] - \epsilon_i}{2} - \varphi(\epsilon_i)\left( E_{q_i}[H_i^2] - \epsilon_i^2 \right) \right]}
= \bar{P}\big( S_0 = 0, S_1 = 1 \mid W, \{\epsilon_i\}_{i=0}^{1}, \{q_i\}_{i=0}^{1} \big),

where E_{q_0} and E_{q_1} are the expectations with respect to the distributions q_0 and q_1, respectively. Finally, thanks to the two approximations above, the numerator of the posterior distribution in formula (3.7) can be bounded as follows

\left[ \sum_{x_0 \in \Omega_0,\, x_1 \in \Omega_1} \prod_{i=0}^{1} P(S_i = i \mid X_i = x_i, W)\, q_i(X_i = x_i) \right] \pi(W) \geq \bar{P}\big( S_0 = 0, S_1 = 1 \mid W, \{\epsilon_i\}, \{q_i\} \big)\, \pi(W).

Thus, the variational posterior approximation is given by

\hat{P}\big( W \mid S_0 = 0, S_1 = 1, \{\epsilon_i\}, \{q_i\} \big) = \frac{\bar{P}\big( S_0 = 0, S_1 = 1 \mid W, \{\epsilon_i\}, \{q_i\} \big)\, \pi(W)}{P(S_0 = 0, S_1 = 1)}.

Since P(S_0 = 0, S_1 = 1) is a constant which does not affect the form of the variational posterior approximation, we can ignore it and obtain

\hat{P}\big( W \mid S_0 = 0, S_1 = 1, \{\epsilon_i\}, \{q_i\} \big) \propto \bar{P}\big( S_0 = 0, S_1 = 1 \mid W, \{\epsilon_i\}, \{q_i\} \big)\, \pi(W).

Finally, the posterior distribution is approximated as follows

P(W \mid S_0 = 0, S_1 = 1) \approx \hat{P}\big( W \mid S_0 = 0, S_1 = 1, \{\epsilon_i\}, \{q_i\} \big) \qquad (3.15)
\propto \bar{P}\big( S_0 = 0, S_1 = 1 \mid W, \{\epsilon_i\}, \{q_i\} \big)\, \pi(W). \qquad (3.16)

Since \pi(W) is a Gaussian, which is conjugate to \bar{P}\big( S_0 = 0, S_1 = 1 \mid W, \{\epsilon_i\}, \{q_i\} \big), which has an exponential form, the variational posterior approximation is a Gaussian with posterior mean \mu_{post} and posterior covariance matrix \Sigma_{post}. Substituting \pi(W) and \bar{P}\big( S_0 = 0, S_1 = 1 \mid W, \{\epsilon_i\}, \{q_i\} \big) by their forms in formula (3.16), we obtain

e^{-\frac{1}{2}(W - \mu_{post})^t \Sigma_{post}^{-1} (W - \mu_{post})} \propto \bar{P}\big( S_0 = 0, S_1 = 1 \mid W, \{\epsilon_i\}, \{q_i\} \big)\, e^{-\frac{1}{2}(W - \mu)^t \Sigma^{-1} (W - \mu)}.

Thus, omitting the algebra, \Sigma_{post} and \mu_{post} are given by the following Bayesian update equations

(\Sigma_{post})^{-1} = \Sigma^{-1} + 2 \sum_{i=0}^{1} \varphi(\epsilon_i)\, E_{q_i}\big[ x_i (x_i)^t \big], \qquad (3.17)

\mu_{post} = \Sigma_{post} \left[ \Sigma^{-1} \mu + \sum_{i=0}^{1} \Big( i - \frac{1}{2} \Big) E_{q_i}[x_i] \right]. \qquad (3.18)

According to equation (3.17), \Sigma_{post} depends on the variational parameters \{\epsilon_i\}_{i=0}^{1}. For this reason, we have to specify them.

We have to find the values of \{\epsilon_i\} that yield a tight lower bound in equation (3.15), and thus an optimal approximation to the posterior distribution. This can be done by an EM algorithm, which is derived in Appendix A. The variational parameters are given by

\epsilon_i^2 = E_{\hat{P}\left( W \mid S_0 = 0, S_1 = 1, (\{\epsilon_i\})^{old}, \{q_i\} \right)}\Big[ E_{q_i}\big[ (W^t x_i)^2 \big] \Big] \qquad (3.19)
= E_{q_i}\big[ (x_i)^t \Sigma_{post}\, x_i \big] + (\mu_{post})^t\, E_{q_i}\big[ x_i (x_i)^t \big]\, \mu_{post}, \qquad i \in \{0, 1\},

where \hat{P}\big( W \mid S_0 = 0, S_1 = 1, (\{\epsilon_i\})^{old}, \{q_i\} \big) is the variational posterior approximation based on the previous values of \{\epsilon_i\}. The weight and intercept computation algorithm has two phases. The first phase is the initialization; the second phase is iterative and computes \Sigma_{post} and \mu_{post} through the Bayesian update equations (3.17) and (3.18), respectively, while using equation (3.19) to find the variational parameters at each iteration. The components of \mu_{post} are the desired estimates of the pseudo-metric weights \widetilde{w}_0 and \{w_k\}_{k=0}^{J-1} and of the intercept v.

Initialization:
1. Compute the parameters of the distributions q_0 and q_1 which model the relevance class \Omega_0 and irrelevance class \Omega_1 explanatory vectors, respectively.
2. Initialize the covariance matrix \Sigma_{old} to the identity matrix and the mean \mu_{old} to a vector with components equal to
3. Initialize the variational parameters as follows:
   For each i \in \{0, 1\} do
     \epsilon_i^2 <- E_{q_i}[(x_i)^t \Sigma_{old}\, x_i] + (\mu_{old})^t E_{q_i}[x_i (x_i)^t]\, \mu_{old}
   End for

Computation of \Sigma_{post} and \mu_{post}:
1. Do
     (\Sigma_{post}^{new})^{-1} <- (\Sigma_{old})^{-1} + 2 \sum_{i=0}^{1} \varphi(\epsilon_i)\, E_{q_i}[x_i (x_i)^t]
     \mu_{post}^{new} <- \Sigma_{post}^{new} \Big[ (\Sigma_{old})^{-1} \mu_{old} + \sum_{i=0}^{1} \big( i - \tfrac{1}{2} \big) E_{q_i}[x_i] \Big]
     For each i \in \{0, 1\} do
       \epsilon_i^2 <- E_{q_i}[(x_i)^t \Sigma_{post}^{new}\, x_i] + (\mu_{post}^{new})^t E_{q_i}[x_i (x_i)^t]\, \mu_{post}^{new}
     End for
   While ( \|\Sigma_{old} - \Sigma_{post}^{new}\| > threshold or \|\mu_{old} - \mu_{post}^{new}\| > threshold )
   Return \mu_{post}^{new}
2. Assign the \mu_{post} component values to the pseudo-metric weights \widetilde{w}_0 and \{w_k\}_{k=0}^{J-1} and to the intercept v.
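A compact Python sketch of this iterative scheme is given below. It is a simplified reading of the update equations (3.17)-(3.19) under assumptions that go beyond what is stated above: the expectations E_{q_i}[x_i] and E_{q_i}[x_i x_i^t] are passed in precomputed, and the posterior obtained at each iteration is fed back as the prior of the next one until the mean stabilizes.

import numpy as np

def variational_blr(mu0, Sigma0, Ex, Exx, n_iter=100, tol=1e-6):
    """Sketch of the variational updates (3.17)-(3.19).

    mu0, Sigma0 : prior mean (d,) and covariance (d, d) of W.
    Ex          : [E_{q_0}[x_0], E_{q_1}[x_1]], each of shape (d,).
    Exx         : [E_{q_0}[x_0 x_0^t], E_{q_1}[x_1 x_1^t]], each of shape (d, d).
    """
    def phi(e):                                    # phi(eps) = tanh(eps/2) / (4 eps)
        return np.tanh(e / 2.0) / (4.0 * e)

    mu, Sigma = mu0.astype(float), Sigma0.astype(float)
    # Initialize the variational parameters from the prior, as in the algorithm above.
    eps = [np.sqrt(np.trace(Exx[i] @ Sigma) + mu @ Exx[i] @ mu) for i in range(2)]
    for _ in range(n_iter):
        prec = np.linalg.inv(Sigma)
        # (3.17): posterior precision (the factor 2 comes from the quadratic term of the bound).
        prec_post = prec + 2.0 * sum(phi(eps[i]) * Exx[i] for i in range(2))
        Sigma_post = np.linalg.inv(prec_post)
        # (3.18): posterior mean; (i - 1/2) is -1/2 for the relevance class, +1/2 otherwise.
        mu_post = Sigma_post @ (prec @ mu + sum((i - 0.5) * Ex[i] for i in range(2)))
        # (3.19): re-estimate the variational parameters from the new posterior.
        eps = [np.sqrt(np.trace(Exx[i] @ Sigma_post) + mu_post @ Exx[i] @ mu_post)
               for i in range(2)]
        converged = np.abs(mu_post - mu).max() < tol
        mu, Sigma = mu_post, Sigma_post            # posterior re-used as next prior (assumption)
        if converged:
            break
    return mu, Sigma                               # mu = estimates of the weights and intercept

The returned mean plays the role of (\widetilde{w}_0, w_0, \ldots, w_{J-1}, v) in the pseudo-metric.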

3.3 Training

Let us consider a color image database which consists of several color image sets, such that each set contains color images which are perceptually close to each other in terms of object shapes and colors. In order to compute the pseudo-metric weights and the intercept by the classical logistic regression model, we have to create the relevance class \Omega_0 and the irrelevance class \Omega_1. To create \Omega_0, we draw all possible pairs of feature vectors representing color images belonging to the same database color image sets, and for each pair we compute an explanatory vector and associate with it a binary target variable equal to 0. For example, for the i-th drawn pair we compute an explanatory vector X_i and associate with it S_i = 0. Similarly, to create \Omega_1, we draw all possible pairs of feature vectors representing color images belonging to different database color image sets, and for each pair we compute an explanatory vector and associate with it a binary target variable equal to 1. For example, for the j-th drawn pair we compute an explanatory vector X_j and associate with it S_j = 1. For the Bayesian logistic regression model, we create \Omega_0 and \Omega_1 in the same way, but instead of associating a binary target variable value to each explanatory vector of \Omega_0 and \Omega_1, we associate a binary target variable S_0 equal to 0 with all \Omega_0 explanatory vectors and a binary target variable S_1 equal to 1 with all \Omega_1 explanatory vectors.
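The construction of the two classes can be summarized by a short sketch. The Python snippet below is illustrative only: it assumes each featurebase entry is already the Daubechies-8 decomposed, compressed and quantized vector of length 2^J (index 0 being the scaling factor), that each image carries a set label, and that a mismatch is counted whenever the two quantized coefficients of a pair differ.

import numpy as np
from itertools import combinations

def explanatory_vector(u, v, J):
    """Build (|scaling difference|, mismatches per level 0..J-1, 1) for a pair."""
    x = np.zeros(J + 2)
    x[0] = abs(u[0] - v[0])                      # scaling factor difference
    for i in range(1, 2**J):
        level = int(np.floor(np.log2(i)))        # resolution level bin(i)
        x[1 + level] += (u[i] != v[i])           # count coefficient mismatches
    x[-1] = 1.0                                  # constant for the intercept
    return x

def build_classes(featurebase, labels, J):
    """Omega_0: pairs from the same image set; Omega_1: pairs from different sets."""
    omega0, omega1 = [], []
    for (a, la), (b, lb) in combinations(zip(featurebase, labels), 2):
        (omega0 if la == lb else omega1).append(explanatory_vector(a, b, J))
    return np.array(omega0), np.array(omega1)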

4 Color image retrieval method

The querying method is in two phases. The first phase is a preprocessing phase done once for the entire database containing |DB| color images. The second phase is the querying phase.

4.1 Color image database preprocessing

We detail the preprocessing phase, done once for all the database color images before the querying, in a general case by the following steps.

1. Choose N feature vectors for comparison.
2. Compute the N feature vectors T_{li} (l \in \{1, \ldots, N\}) for the i-th color image of the database, for each i \in \{1, \ldots, |DB|\}.
3. The feature vectors representing the database color images are Daubechies-8 wavelet decomposed, compressed to m coefficients each and quantized.
4. Organize the decomposed, compressed and quantized feature vectors into search arrays \Theta_+^l and \Theta_-^l (l = 1, \ldots, N).
5. Adjust the pseudo-metric weights \widetilde{w}_0^l and \{w_k^l\}_{k=0}^{J-1} for each featurebase \{T_{li}\}_{i=1}^{|DB|} representing the database color images, where l \in \{1, \ldots, N\}.

4.2 The querying algorithm

We detail the querying algorithm in a general case by the following steps.

1. Given a query color image, we denote the feature vectors representing the query image by Q_l (l = 1, \ldots, N).
2. The feature vectors representing the query image are Daubechies-8 wavelet decomposed, compressed to m coefficients each and quantized.
3. The similarity degrees between Q_l (l = 1, \ldots, N) and the database color image feature vectors T_{li} (l = 1, \ldots, N; i = 1, \ldots, |DB|) are represented by the arrays Score_l (l = 1, \ldots, N), such that Score_l[i] = \|Q_l, T_{li}\| for each i \in \{1, \ldots, |DB|\}. These arrays are returned by the procedure Retrieval(Q_l, m, \Theta_+^l, \Theta_-^l) (l = 1, \ldots, N), respectively.
4. The similarity degrees between the query color image and the database color images are represented by a resulting array TotalScore, such that TotalScore[i] = \sum_{l=1}^{N} \gamma_l\, Score_l[i] for each i \in \{1, \ldots, |DB|\}, where \{\gamma_l\}_{l=1}^{N} are weight factors used to fine-tune the influence of each individual feature (a sketch of this combination step is given after the list).
5. Organize the database color images in order of increasing resulting similarity degrees of the array TotalScore. The most negative resulting similarity degrees correspond to the closest target images to the query image. Finally, return to the user the closest target color images to the query color image, whose number, denoted by RI, is chosen by the user.
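The following short Python sketch illustrates steps 3-5 of the querying algorithm. It assumes a per-feature scoring function is available (for instance a wrapper around the Retrieval procedure or the earlier retrieval_scores sketch); the weight factors gamma and the number RI of returned images are supplied by the user.

import numpy as np

def query_database(query_features, per_feature_scores, gamma, RI):
    """Combine per-feature similarity degrees and return the RI best images.

    query_features     : list of N query feature vectors Q_l (already decomposed).
    per_feature_scores : callable(l, Q_l) -> Score_l, an array of length |DB|.
    gamma              : sequence of N weight factors gamma_l.
    RI                 : number of images to return, chosen by the user.
    """
    scores = [per_feature_scores(l, Q) for l, Q in enumerate(query_features)]
    total_score = sum(g * s for g, s in zip(gamma, scores))   # TotalScore[i]
    ranking = np.argsort(total_score)        # most negative scores come first
    return ranking[:RI]                      # indices of the closest images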

5 Experimental results

In this section, we present the decontextualized classification performance evaluation and comparison of the Bayesian logistic regression model and the classical logistic regression one. Then, we perform an evaluation and comparison of both models in the image retrieval context.

5.1 Decontextualized classification performance evaluation of both models

The aim of this section is to evaluate and compare the classification performances of the Bayesian logistic regression (BLR) model and the classical logistic regression (CLR) one. To achieve this, we use a classification accuracy used in [14] and a cumulative distance error. Given a K-dimensional success class and failure class having N_p^s points \{x_i = (x_1^i, \ldots, x_K^i)\}_{i=1}^{N_p^s} and N_p^f points \{y_j = (y_1^j, \ldots, y_K^j)\}_{j=1}^{N_p^f}, respectively, in the CLR model we associate to each point x_i of the success class a binary target variable S_i = 1, and to each point y_j of the failure class a binary target variable S_j = 0, while in the BLR model we associate to all of the success class points a binary target variable S_1 = 1 and to all of the failure class points a binary target variable S_0 = 0, and we model the failure class and the success class by the distributions q_0 and q_1, respectively. The success classification accuracy (SCA) is defined as the percentage of the points correctly classified in the success class. To verify whether a point x_i is correctly classified or not, we compute its success probability P_i = F\big( \sum_{k=1}^{K} w_k x_k^i + v \big), where \{w_k\}_{k=1}^{K} and v are respectively the weights and the intercept to adjust. Precisely, if P_i > 0.5, x_i is considered correctly classified, and otherwise it is considered incorrectly classified. The failure classification accuracy (FCA) is defined like the success one, but it is computed over the failure class, and the verification of whether a point y_j is correctly classified or not is performed by computing its failure probability \bar{P}_j = F\big( -\sum_{k=1}^{K} w_k y_k^j - v \big). Precisely, if \bar{P}_j > 0.5, y_j is considered correctly classified, and otherwise it is considered incorrectly classified. The success and failure cumulative distance errors are denoted by D_E and \bar{D}_E, respectively, and are defined as follows

D_E = \frac{\sum_{i=1}^{N_p^s} (P_i - 1)^2}{N_p^s}, \qquad \bar{D}_E = \frac{\sum_{j=1}^{N_p^f} (\bar{P}_j - 1)^2}{N_p^f}.

Thus, for both classes, we compute the failure and success classification accuracies and the failure and success cumulative distance errors using weights and intercept adjusted by the CLR model, and then we compute them again using weights and intercept adjusted by the BLR model. The model whose weights and intercept lead to greater failure and success classification accuracies and to lower failure and success cumulative distance errors is considered better in terms of classification performance. The validation of both models and the computation of the classification accuracies and the cumulative distance errors are performed on synthetic data and on two data sets extracted from the well-known Iris data [6]. The synthetic data consists of two groups of artificial data sets. The first group, labelled synthetic data 1 and shown in Figure 5.1, is a collection of three two-dimensional data sets. Each set has two clusters with a total of 1000 points generated from two Gaussians, 500 points per Gaussian. In the first set the two clusters are well separated, in the second set they are slightly overlapped, and in the third set they are overlapped. The overlap between two clusters is controlled by moving one cluster towards the other after translating its mean. The second group, labelled synthetic data 2 and shown in Figure 5.2, is a collection of three three-dimensional data sets generated in the same way as the synthetic data 1, but in a higher dimension. The Iris data consists of 50 samples for each of the three classes present in the data, Iris Versicolor, Iris Virginica and Iris Setosa. Each datum is four-dimensional and consists of measurements of the plant's morphology. The overlap here is very high between the Versicolor class and the Virginica class, while the Setosa class is well separated.
In order to allow the validation of the BLR and CLR models, we extract two data sets from the Iris data.

The first set is shown in Figure 5.3(b) and consists of the Setosa class and the Virginica class, and the second set is shown in Figure 5.3(c) and consists of the classes Versicolor and Virginica. In order to ease the representation, both sets are represented in a three-dimensional space after discarding the fourth dimension. For each data set of the synthetic data 1 and synthetic data 2, we consider the cluster which is closer to the origin of the reference system as the failure class, while the other cluster is considered as the success class. For the two data sets extracted from the Iris data, we consider the Virginica class as the failure class in the first data set, while the Setosa class is considered as the success class, and we consider the Versicolor class as the failure class in the second data set, while the Virginica class is considered as the success class. The computation of the weights and the intercept by the CLR model for each pair of failure and success classes is performed by optimizing the likelihood function using the Fisher scoring algorithm. Before the computation of the weights and the intercept by the BLR model, for each data set of the synthetic data 1 the distributions q_0 and q_1 are chosen as two-dimensional Gaussians whose parameters are computed directly from the failure and success class points, while for each data set of the synthetic data 2 the distributions q_0 and q_1 are chosen as three-dimensional Gaussians whose parameters are also computed directly from the failure and success class points. For each data set extracted from the Iris data, the distributions q_0 and q_1 are chosen as three-dimensional Gaussians whose parameters are computed by the expectation-maximization (EM) algorithm [2]. Tables 5.1, 5.2 and 5.3 report the failure and success classification accuracies and the failure and success cumulative distance errors computed for the synthetic data 1, the synthetic data 2 and the data sets extracted from the Iris data, using weights and intercepts adjusted by the CLR model and the BLR model.
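As a minimal illustration (not from the paper), the quantities reported in these tables can be computed as follows, assuming the cumulative distance errors are read as the mean squared distances of the probabilities to 1, and given the adjusted weights and intercept and the two classes of points:

import numpy as np
from scipy.special import expit  # F(x)

def classification_scores(success_pts, failure_pts, w, v):
    """Return (SCA, FCA, D_E, D_E_bar) for weights w and intercept v."""
    P = expit(success_pts @ w + v)            # success probabilities P_i
    P_bar = expit(-(failure_pts @ w) - v)     # failure probabilities P_bar_j
    sca = 100.0 * np.mean(P > 0.5)            # % of success points correctly classified
    fca = 100.0 * np.mean(P_bar > 0.5)        # % of failure points correctly classified
    d_e = np.sum((P - 1.0) ** 2) / len(success_pts)        # success cumulative distance error
    d_e_bar = np.sum((P_bar - 1.0) ** 2) / len(failure_pts) # failure cumulative distance error
    return sca, fca, d_e, d_e_bar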

Figure 5.1: Synthetic data 1: (a) well separated classes, (b) slightly overlapped classes and (c) overlapped classes.

Figure 5.2: Synthetic data 2: (a) well separated classes, (b) slightly overlapped classes and (c) overlapped classes.

Figure 5.3: Extracted data sets from the Iris data: (a) Iris data represented in a three-dimensional space, (b) the Virginica class and the Setosa class and (c) the Virginica class and the Versicolor class.

                         Well separated classes   Slightly overlapped classes   Overlapped classes
SCA (CLR model)                    75%                        71%                       67%
FCA (CLR model)                    74%                        68%                       64%
D_E (CLR model)
\bar{D}_E (CLR model)
SCA (BLR model)                    100%                       99%                       97%
FCA (BLR model)                    100%                       99%                       96%
D_E (BLR model)
\bar{D}_E (BLR model)

Table 5.1: Comparison between the classification performances of the CLR model and the BLR model which are validated on the synthetic data 1.

                         Well separated classes   Slightly overlapped classes   Overlapped classes
SCA (CLR model)                    73%                        69%                       65%
FCA (CLR model)                    71%                        67%                       61%
D_E (CLR model)
\bar{D}_E (CLR model)
SCA (BLR model)                    100%                       98%                       95%
FCA (BLR model)                    100%                       97%                       95%
D_E (BLR model)
\bar{D}_E (BLR model)

Table 5.2: Comparison between the classification performances of the CLR model and the BLR model which are validated on the synthetic data 2.

                         Setosa-Virginica   Versicolor-Virginica
SCA (CLR model)                 70%                  46%
FCA (CLR model)                 69%                  41%
D_E (CLR model)
\bar{D}_E (CLR model)
SCA (BLR model)                 100%                 72%
FCA (BLR model)                 99%                  69%
D_E (BLR model)
\bar{D}_E (BLR model)

Table 5.3: Comparison between the classification performances of the CLR model and the BLR model which are validated on the extracted data sets from the Iris data.

If we analyze Table 5.1, we can see that for the well separated classes, the average of the SCA (BLR model) and the FCA (BLR model) drops from 100% to 74.5%, which is the average of the SCA (CLR model) and the FCA (CLR model), while the average of the D_E (BLR model) and the \bar{D}_E (BLR model) grows from to , which is the average of the D_E (CLR model) and the \bar{D}_E (CLR model). Then, for the slightly overlapped classes, the average of the SCA (BLR model) and the FCA (BLR model) drops from 99% to 69.5%, which is the average of the SCA (CLR model) and the FCA (CLR model), while the average of the D_E (BLR model) and the \bar{D}_E (BLR model) grows from to 0.2, which is the average of the D_E (CLR model) and the \bar{D}_E (CLR model). Finally, for the overlapped classes, the average of the SCA (BLR model) and the FCA (BLR model) drops from 96.5% to 65.5%, which is the average of the SCA (CLR model) and the FCA (CLR model), while the average of the D_E (BLR model) and the \bar{D}_E (BLR model) grows from 0.13 to , which is the average of the D_E (CLR model) and the \bar{D}_E (CLR model). In Table 5.2, for each data set, the average of the SCA (BLR model) and FCA (BLR model) drops to the CLR model one, and the average of the D_E (BLR model) and \bar{D}_E (BLR model) grows to the CLR model one, at almost the same rates as in Table 5.1. Moreover, comparing Table 5.1 and Table 5.2, we notice that for each data set the former's SCA (BLR model) and FCA (BLR model) average is slightly higher and its D_E (BLR model) and \bar{D}_E (BLR model) average is slightly lower, which is also the case for the CLR model. In Table 5.3, we can see that for the Setosa and Virginica classes, the average of the SCA (BLR model) and the FCA (BLR model) drops from 99.5% to 69.5%, which is the average of the SCA (CLR model) and the FCA (CLR model), while the average of the D_E (BLR model) and the \bar{D}_E (BLR model) grows from to 0.16, which is the average of the D_E (CLR model) and the \bar{D}_E (CLR model). Then, for the Versicolor and Virginica classes, the average of the SCA (BLR model) and the FCA (BLR model) drops from 70.5% to 43.5%, which is the average of the SCA (CLR model) and the FCA (CLR model), while the average of the D_E (BLR model) and the \bar{D}_E (BLR model) grows from 0.18 to 0.36, which is the average of the D_E (CLR model) and the \bar{D}_E (CLR model). Moreover, comparing Table 5.3 Setosa-Virginica and Table 5.2 Well separated classes, we notice that the former's SCA (BLR model) and FCA (BLR model) average is slightly lower and its D_E (BLR model) and \bar{D}_E (BLR model) average is slightly higher. In all three tables, we notice that the average of the SCA (BLR model) and the FCA (BLR model) decreases and the average of the D_E (BLR model) and the \bar{D}_E (BLR model) increases as the overlap between classes becomes higher, which is also the case for the CLR model. Finally, from these tables, we can draw three main conclusions. First, the BLR model is significantly better than the CLR model in terms of classification performance. Second, the BLR model classification performance decreases as the overlap between classes becomes higher, and it is slightly degraded as the dimension increases. Third, the validation of the BLR model on real data (the Iris data) has a negligible effect on its classification performance.

5.2 Contextualized evaluation and comparison of both models

In this subsection, we briefly present the feature vectors that we use for color image representation.
Then, we discuss the choices of the distributions q_0 and q_1, in order to validate the Bayesian logistic regression model in the image retrieval context. Finally, we use the precision and scope as defined in [12] to evaluate the querying method using each model separately.

The choices of the distributions q_0 and q_1 and the querying evaluation are conducted on the WANG and ZuBuD color image databases proposed by [23]. The WANG database is a subset of the Corel database of |DB| = 1000 color images which were selected manually to form 10 sets (e.g. Africa, beach, ruins, food) of 100 images each. The Zurich Building Image Database (ZuBuD) was created by the Swiss Federal Institute of Technology in Zurich; it contains a training part of |DB| = 1005 color images and a query part of 115 color images. The training part consists of 201 building image sets, where each set contains 5 color images of the same building taken from different positions, using different cameras, under different lighting conditions. Before the feature vector extraction, we represent the WANG and ZuBuD database color images in the perceptually uniform LAB color space.

5.2.1 Used feature vectors

The luminance histogram and the weighted histograms, which were described in detail in [18], are used for image color description in this paper. Moreover, image texture description is performed by kurtosis and skewness histograms [10]. Given an M x N pixel LAB color image, its luminance histogram h_L contains the number of pixels of each luminance value, and can be written as follows

h_L(c) = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \delta(I_L(i, j) - c) \quad \text{for each } c \in \{0, \ldots, 255\}, \qquad (5.1)

where I_L is the luminance image and \delta is the Kronecker symbol at 0. The weighted histograms are the color histogram constructed after edge region elimination and the multispectral gradient module mean histogram. The former is given by

h^h_k(c) = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \delta(I_k(i, j) - c)\, \chi_{[0, \eta]}\big(\lambda_{max}(i, j)\big), \qquad (5.2)

and the latter is given by

\bar{h}^e_k(c) = \frac{h^e_k(c)}{N_{p,k}(c)}, \qquad (5.3)

where N_{p,k}(c) is the number of the edge region pixels and is defined as

N_{p,k}(c) = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \delta(I_k(i, j) - c)\, \chi_{]\eta, +\infty[}\big(\lambda_{max}(i, j)\big), \qquad (5.4)

and

h^e_k(c) = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \delta(I_k(i, j) - c)\, \lambda_{max}(i, j)\, \chi_{]\eta, +\infty[}\big(\lambda_{max}(i, j)\big), \qquad (5.5)

for each c \in \{0, \ldots, 255\} and k = a, b, where \lambda_{max} represents the multispectral gradient module, \eta is a threshold defined by the mean of the multispectral gradient modules computed over all image pixels, I_a and I_b are the images of the chrominances a (red/green) and b (yellow/blue), respectively, and \chi is the characteristic function.
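For illustration, a minimal Python sketch of how these color description histograms could be computed is given below. It assumes 8-bit channel images I_L, I_a, I_b and a precomputed multispectral gradient module map lambda_max, which are stand-ins for the quantities defined above rather than code from the paper.

import numpy as np

def luminance_histogram(I_L):
    """Equation (5.1): count pixels per luminance value c = 0..255."""
    return np.bincount(I_L.ravel(), minlength=256)

def weighted_histograms(I_k, lam_max):
    """Equations (5.2)-(5.5) for one chrominance image I_k (k = a or b)."""
    eta = lam_max.mean()                      # threshold: mean gradient module
    flat_I, flat_lam = I_k.ravel(), lam_max.ravel()
    non_edge = flat_lam <= eta                # pixels kept after edge elimination
    edge = ~non_edge
    # (5.2): color histogram restricted to non-edge pixels.
    h_h = np.bincount(flat_I[non_edge], minlength=256)
    # (5.4): number of edge pixels per color, and (5.5): summed gradient modules.
    N_p = np.bincount(flat_I[edge], minlength=256)
    h_e = np.bincount(flat_I[edge], weights=flat_lam[edge], minlength=256)
    # (5.3): gradient module mean histogram (0 where no edge pixel has color c).
    h_e_mean = np.divide(h_e, N_p, out=np.zeros(256), where=N_p > 0)
    return h_h, h_e_mean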

The multispectral gradient module mean histogram provides information about the overall contrast in the chrominance, and the edge region elimination avoids overlaps or noise between the color histogram populations caused by the edge pixels. The LAB color image kurtosis and skewness histograms are given by

h^\kappa_k(c) = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \delta(I^\kappa_k(i, j) - c), \qquad (5.6)

and

h^s_k(c) = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \delta(I^s_k(i, j) - c), \qquad (5.7)

respectively, for each c \in \{0, \ldots, 255\} and k = L, a, b, where I^\kappa_L, I^\kappa_a and I^\kappa_b are the kurtosis images of the luminance L and the chrominances a and b, respectively, and I^s_L, I^s_a and I^s_b are their skewness images. They are obtained by local computations of the kurtosis and skewness values at the luminance and chrominance image pixels. Then, a linear interpolation is used to map the kurtosis and skewness values to the range between 0 and 255. Since each used feature vector is a histogram having 256 components, we set J equal to 8 in the following subsections.

5.2.2 The choices of q_0 and q_1

Since from each color image of the WANG and ZuBuD databases we extract N = 11 feature vectors, namely the histograms h_L, h^h_a, h^h_b, \bar{h}^e_a, \bar{h}^e_b, h^\kappa_L, h^\kappa_a, h^\kappa_b, h^s_L, h^s_a and h^s_b given by (5.1), (5.2), (5.3), (5.6) and (5.7), respectively, each database is represented by eleven histogram featurebases. The choices of q_0 and q_1 are performed separately for each of these featurebases. According to the training, there are no mutual direct effects between the realizations of the random variable \widetilde{X}_{0,0} and the realizations of the random vector (X_{0,0}, \ldots, X_{J-1,0}). Therefore, we assume that \widetilde{X}_{0,0} and (X_{0,0}, \ldots, X_{J-1,0}) are independent, and then q_0 is constructed as a product of two independent distributions q_{0,0} and q_{0,1}, i.e. q_0(\widetilde{X}_{0,0}, X_{0,0}, \ldots, X_{J-1,0}) = q_{0,0}(\widetilde{X}_{0,0})\, q_{0,1}(X_{0,0}, \ldots, X_{J-1,0}). Analogously, for the same reason, we assume that q_1(\widetilde{X}_{0,1}, X_{0,1}, \ldots, X_{J-1,1}) = q_{1,0}(\widetilde{X}_{0,1})\, q_{1,1}(X_{0,1}, \ldots, X_{J-1,1}), where q_{1,0} and q_{1,1} are two independent distributions. For each histogram featurebase, the realizations of each random variable of the random vector (X_{0,0}, \ldots, X_{J-1,0}) are positive integers. For this reason, and taking into account that they do not represent frequencies following a multinomial distribution, we suppose that they follow a Poisson distribution, i.e., after considering the independence between the random variables of the random vector (X_{0,0}, \ldots, X_{J-1,0}), we have

q_{0,1}(X_{0,0}, \ldots, X_{J-1,0}) = \prod_{j=0}^{J-1} \frac{\lambda_{j,0}^{X_{j,0}}}{X_{j,0}!}\, e^{-\lambda_{j,0}},

where \lambda_{j,0} \in \mathbb{R}^+ for each j \in \{0, \ldots, J-1\}. Analogously, with the same assumptions, we have

q_{1,1}(X_{0,1}, \ldots, X_{J-1,1}) = \prod_{j=0}^{J-1} \frac{\lambda_{j,1}^{X_{j,1}}}{X_{j,1}!}\, e^{-\lambda_{j,1}},

where \lambda_{j,1} \in \mathbb{R}^+ for each j \in \{0, \ldots, J-1\}. The parameters \{\lambda_{j,0}\}_{j=0}^{J-1} and \{\lambda_{j,1}\}_{j=0}^{J-1}


Reject Inference in Credit Scoring. Jie-Men Mok Reject Inference in Credit Scoring Jie-Men Mok BMI paper January 2009 ii Preface In the Master programme of Business Mathematics and Informatics (BMI), it is required to perform research on a business

More information

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives

More information

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler Data Mining: Exploring Data Lecture Notes for Chapter 3 Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler Topics Exploratory Data Analysis Summary Statistics Visualization What is data exploration?

More information

3F3: Signal and Pattern Processing

3F3: Signal and Pattern Processing 3F3: Signal and Pattern Processing Lecture 3: Classification Zoubin Ghahramani zoubin@eng.cam.ac.uk Department of Engineering University of Cambridge Lent Term Classification We will represent data by

More information

These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop Music and Machine Learning (IFT6080 Winter 08) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher

More information

Bayes and Naïve Bayes. cs534-machine Learning

Bayes and Naïve Bayes. cs534-machine Learning Bayes and aïve Bayes cs534-machine Learning Bayes Classifier Generative model learns Prediction is made by and where This is often referred to as the Bayes Classifier, because of the use of the Bayes rule

More information

Basics of Statistical Machine Learning

Basics of Statistical Machine Learning CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

More information

Probabilistic Latent Semantic Analysis (plsa)

Probabilistic Latent Semantic Analysis (plsa) Probabilistic Latent Semantic Analysis (plsa) SS 2008 Bayesian Networks Multimedia Computing, Universität Augsburg Rainer.Lienhart@informatik.uni-augsburg.de www.multimedia-computing.{de,org} References

More information

Pattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University

Pattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University Pattern Analysis Logistic Regression 12. Mai 2009 Joachim Hornegger Chair of Pattern Recognition Erlangen University Pattern Analysis 2 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision

More information

Robert Collins CSE598G. More on Mean-shift. R.Collins, CSE, PSU CSE598G Spring 2006

Robert Collins CSE598G. More on Mean-shift. R.Collins, CSE, PSU CSE598G Spring 2006 More on Mean-shift R.Collins, CSE, PSU Spring 2006 Recall: Kernel Density Estimation Given a set of data samples x i ; i=1...n Convolve with a kernel function H to generate a smooth function f(x) Equivalent

More information

A Learning Based Method for Super-Resolution of Low Resolution Images

A Learning Based Method for Super-Resolution of Low Resolution Images A Learning Based Method for Super-Resolution of Low Resolution Images Emre Ugur June 1, 2004 emre.ugur@ceng.metu.edu.tr Abstract The main objective of this project is the study of a learning based method

More information

Factorial experimental designs and generalized linear models

Factorial experimental designs and generalized linear models Statistics & Operations Research Transactions SORT 29 (2) July-December 2005, 249-268 ISSN: 1696-2281 www.idescat.net/sort Statistics & Operations Research c Institut d Estadística de Transactions Catalunya

More information

Subspace Analysis and Optimization for AAM Based Face Alignment

Subspace Analysis and Optimization for AAM Based Face Alignment Subspace Analysis and Optimization for AAM Based Face Alignment Ming Zhao Chun Chen College of Computer Science Zhejiang University Hangzhou, 310027, P.R.China zhaoming1999@zju.edu.cn Stan Z. Li Microsoft

More information

CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen

CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 3: DATA TRANSFORMATION AND DIMENSIONALITY REDUCTION Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major

More information

Exact Inference for Gaussian Process Regression in case of Big Data with the Cartesian Product Structure

Exact Inference for Gaussian Process Regression in case of Big Data with the Cartesian Product Structure Exact Inference for Gaussian Process Regression in case of Big Data with the Cartesian Product Structure Belyaev Mikhail 1,2,3, Burnaev Evgeny 1,2,3, Kapushev Yermek 1,2 1 Institute for Information Transmission

More information

Local outlier detection in data forensics: data mining approach to flag unusual schools

Local outlier detection in data forensics: data mining approach to flag unusual schools Local outlier detection in data forensics: data mining approach to flag unusual schools Mayuko Simon Data Recognition Corporation Paper presented at the 2012 Conference on Statistical Detection of Potential

More information

15.062 Data Mining: Algorithms and Applications Matrix Math Review

15.062 Data Mining: Algorithms and Applications Matrix Math Review .6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

More information

Nominal and ordinal logistic regression

Nominal and ordinal logistic regression Nominal and ordinal logistic regression April 26 Nominal and ordinal logistic regression Our goal for today is to briefly go over ways to extend the logistic regression model to the case where the outcome

More information

Probabilistic user behavior models in online stores for recommender systems

Probabilistic user behavior models in online stores for recommender systems Probabilistic user behavior models in online stores for recommender systems Tomoharu Iwata Abstract Recommender systems are widely used in online stores because they are expected to improve both user

More information

Introduction to Logistic Regression

Introduction to Logistic Regression OpenStax-CNX module: m42090 1 Introduction to Logistic Regression Dan Calderon This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution License 3.0 Abstract Gives introduction

More information

Content-Based Image Retrieval

Content-Based Image Retrieval Content-Based Image Retrieval Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr Image retrieval Searching a large database for images that match a query: What kind

More information

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

More information

Logistic Regression (1/24/13)

Logistic Regression (1/24/13) STA63/CBB540: Statistical methods in computational biology Logistic Regression (/24/3) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is model for regression used

More information

Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

More information

MVA ENS Cachan. Lecture 2: Logistic regression & intro to MIL Iasonas Kokkinos Iasonas.kokkinos@ecp.fr

MVA ENS Cachan. Lecture 2: Logistic regression & intro to MIL Iasonas Kokkinos Iasonas.kokkinos@ecp.fr Machine Learning for Computer Vision 1 MVA ENS Cachan Lecture 2: Logistic regression & intro to MIL Iasonas Kokkinos Iasonas.kokkinos@ecp.fr Department of Applied Mathematics Ecole Centrale Paris Galen

More information

Statistical Models in Data Mining

Statistical Models in Data Mining Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

More information

Cluster Analysis: Advanced Concepts

Cluster Analysis: Advanced Concepts Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means

More information

Introduction to Online Learning Theory

Introduction to Online Learning Theory Introduction to Online Learning Theory Wojciech Kot lowski Institute of Computing Science, Poznań University of Technology IDSS, 04.06.2013 1 / 53 Outline 1 Example: Online (Stochastic) Gradient Descent

More information

Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

More information

Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014 Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don t have to worry about

More information

Gamma Distribution Fitting

Gamma Distribution Fitting Chapter 552 Gamma Distribution Fitting Introduction This module fits the gamma probability distributions to a complete or censored set of individual or grouped data values. It outputs various statistics

More information

Machine Learning Logistic Regression

Machine Learning Logistic Regression Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.

More information

MACHINE LEARNING IN HIGH ENERGY PHYSICS

MACHINE LEARNING IN HIGH ENERGY PHYSICS MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!

More information

Analysis of Bayesian Dynamic Linear Models

Analysis of Bayesian Dynamic Linear Models Analysis of Bayesian Dynamic Linear Models Emily M. Casleton December 17, 2010 1 Introduction The main purpose of this project is to explore the Bayesian analysis of Dynamic Linear Models (DLMs). The main

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Math 541: Statistical Theory II Lecturer: Songfeng Zheng Maximum Likelihood Estimation 1 Maximum Likelihood Estimation Maximum likelihood is a relatively simple method of constructing an estimator for

More information

Content-Based Recommendation

Content-Based Recommendation Content-Based Recommendation Content-based? Item descriptions to identify items that are of particular interest to the user Example Example Comparing with Noncontent based Items User-based CF Searches

More information

Image Segmentation and Registration

Image Segmentation and Registration Image Segmentation and Registration Dr. Christine Tanner (tanner@vision.ee.ethz.ch) Computer Vision Laboratory, ETH Zürich Dr. Verena Kaynig, Machine Learning Laboratory, ETH Zürich Outline Segmentation

More information

Statistics Graduate Courses

Statistics Graduate Courses Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

More information

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features

More information

THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE. Alexander Barvinok

THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE. Alexander Barvinok THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE Alexer Barvinok Papers are available at http://www.math.lsa.umich.edu/ barvinok/papers.html This is a joint work with J.A. Hartigan

More information

Supplement to Call Centers with Delay Information: Models and Insights

Supplement to Call Centers with Delay Information: Models and Insights Supplement to Call Centers with Delay Information: Models and Insights Oualid Jouini 1 Zeynep Akşin 2 Yves Dallery 1 1 Laboratoire Genie Industriel, Ecole Centrale Paris, Grande Voie des Vignes, 92290

More information

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based

More information

1 Prior Probability and Posterior Probability

1 Prior Probability and Posterior Probability Math 541: Statistical Theory II Bayesian Approach to Parameter Estimation Lecturer: Songfeng Zheng 1 Prior Probability and Posterior Probability Consider now a problem of statistical inference in which

More information

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

More information

Algebra 1 2008. Academic Content Standards Grade Eight and Grade Nine Ohio. Grade Eight. Number, Number Sense and Operations Standard

Algebra 1 2008. Academic Content Standards Grade Eight and Grade Nine Ohio. Grade Eight. Number, Number Sense and Operations Standard Academic Content Standards Grade Eight and Grade Nine Ohio Algebra 1 2008 Grade Eight STANDARDS Number, Number Sense and Operations Standard Number and Number Systems 1. Use scientific notation to express

More information

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

More information

Document Image Retrieval using Signatures as Queries

Document Image Retrieval using Signatures as Queries Document Image Retrieval using Signatures as Queries Sargur N. Srihari, Shravya Shetty, Siyuan Chen, Harish Srinivasan, Chen Huang CEDAR, University at Buffalo(SUNY) Amherst, New York 14228 Gady Agam and

More information

Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

More information

GLM, insurance pricing & big data: paying attention to convergence issues.

GLM, insurance pricing & big data: paying attention to convergence issues. GLM, insurance pricing & big data: paying attention to convergence issues. Michaël NOACK - michael.noack@addactis.com Senior consultant & Manager of ADDACTIS Pricing Copyright 2014 ADDACTIS Worldwide.

More information

Bayesian Image Super-Resolution

Bayesian Image Super-Resolution Bayesian Image Super-Resolution Michael E. Tipping and Christopher M. Bishop Microsoft Research, Cambridge, U.K..................................................................... Published as: Bayesian

More information

Component Ordering in Independent Component Analysis Based on Data Power

Component Ordering in Independent Component Analysis Based on Data Power Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals

More information

Data Exploration Data Visualization

Data Exploration Data Visualization Data Exploration Data Visualization What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include Helping to select

More information

Package EstCRM. July 13, 2015

Package EstCRM. July 13, 2015 Version 1.4 Date 2015-7-11 Package EstCRM July 13, 2015 Title Calibrating Parameters for the Samejima's Continuous IRT Model Author Cengiz Zopluoglu Maintainer Cengiz Zopluoglu

More information

Classification by Pairwise Coupling

Classification by Pairwise Coupling Classification by Pairwise Coupling TREVOR HASTIE * Stanford University and ROBERT TIBSHIRANI t University of Toronto Abstract We discuss a strategy for polychotomous classification that involves estimating

More information

Predict the Popularity of YouTube Videos Using Early View Data

Predict the Popularity of YouTube Videos Using Early View Data 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Automatic 3D Reconstruction via Object Detection and 3D Transformable Model Matching CS 269 Class Project Report

Automatic 3D Reconstruction via Object Detection and 3D Transformable Model Matching CS 269 Class Project Report Automatic 3D Reconstruction via Object Detection and 3D Transformable Model Matching CS 69 Class Project Report Junhua Mao and Lunbo Xu University of California, Los Angeles mjhustc@ucla.edu and lunbo

More information

MATH4427 Notebook 2 Spring 2016. 2 MATH4427 Notebook 2 3. 2.1 Definitions and Examples... 3. 2.2 Performance Measures for Estimators...

MATH4427 Notebook 2 Spring 2016. 2 MATH4427 Notebook 2 3. 2.1 Definitions and Examples... 3. 2.2 Performance Measures for Estimators... MATH4427 Notebook 2 Spring 2016 prepared by Professor Jenny Baglivo c Copyright 2009-2016 by Jenny A. Baglivo. All Rights Reserved. Contents 2 MATH4427 Notebook 2 3 2.1 Definitions and Examples...................................

More information

CONTENT-BASED IMAGE RETRIEVAL FOR ASSET MANAGEMENT BASED ON WEIGHTED FEATURE AND K-MEANS CLUSTERING

CONTENT-BASED IMAGE RETRIEVAL FOR ASSET MANAGEMENT BASED ON WEIGHTED FEATURE AND K-MEANS CLUSTERING CONTENT-BASED IMAGE RETRIEVAL FOR ASSET MANAGEMENT BASED ON WEIGHTED FEATURE AND K-MEANS CLUSTERING JUMI¹, AGUS HARJOKO 2, AHMAD ASHARI 3 1,2,3 Department of Computer Science and Electronics, Faculty of

More information

A LOGNORMAL MODEL FOR INSURANCE CLAIMS DATA

A LOGNORMAL MODEL FOR INSURANCE CLAIMS DATA REVSTAT Statistical Journal Volume 4, Number 2, June 2006, 131 142 A LOGNORMAL MODEL FOR INSURANCE CLAIMS DATA Authors: Daiane Aparecida Zuanetti Departamento de Estatística, Universidade Federal de São

More information

Local classification and local likelihoods

Local classification and local likelihoods Local classification and local likelihoods November 18 k-nearest neighbors The idea of local regression can be extended to classification as well The simplest way of doing so is called nearest neighbor

More information

Simultaneous Gamma Correction and Registration in the Frequency Domain

Simultaneous Gamma Correction and Registration in the Frequency Domain Simultaneous Gamma Correction and Registration in the Frequency Domain Alexander Wong a28wong@uwaterloo.ca William Bishop wdbishop@uwaterloo.ca Department of Electrical and Computer Engineering University

More information

Chapter ML:XI (continued)

Chapter ML:XI (continued) Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained

More information

Chapter 3 RANDOM VARIATE GENERATION

Chapter 3 RANDOM VARIATE GENERATION Chapter 3 RANDOM VARIATE GENERATION In order to do a Monte Carlo simulation either by hand or by computer, techniques must be developed for generating values of random variables having known distributions.

More information

Data Preprocessing. Week 2

Data Preprocessing. Week 2 Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.

More information

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

More information

The Probit Link Function in Generalized Linear Models for Data Mining Applications

The Probit Link Function in Generalized Linear Models for Data Mining Applications Journal of Modern Applied Statistical Methods Copyright 2013 JMASM, Inc. May 2013, Vol. 12, No. 1, 164-169 1538 9472/13/$95.00 The Probit Link Function in Generalized Linear Models for Data Mining Applications

More information

Pa8ern Recogni6on. and Machine Learning. Chapter 4: Linear Models for Classifica6on

Pa8ern Recogni6on. and Machine Learning. Chapter 4: Linear Models for Classifica6on Pa8ern Recogni6on and Machine Learning Chapter 4: Linear Models for Classifica6on Represen'ng the target values for classifica'on If there are only two classes, we typically use a single real valued output

More information

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S. AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree

More information

The Optimality of Naive Bayes

The Optimality of Naive Bayes The Optimality of Naive Bayes Harry Zhang Faculty of Computer Science University of New Brunswick Fredericton, New Brunswick, Canada email: hzhang@unbca E3B 5A3 Abstract Naive Bayes is one of the most

More information

Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC

Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC 1. Introduction A popular rule of thumb suggests that

More information

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

More information

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP ABSTRACT In data mining modelling, data preparation

More information