Hybrids of Generative and Discriminative Methods for Machine Learning

Hybrids of Generative and Discriminative Methods for Machine Learning
Julia Aurélie Lasserre
Queens' College
University of Cambridge
This dissertation is submitted for the degree of Doctor of Philosophy.
March 2008

ABSTRACT

In machine learning, probabilistic models are described as belonging to one of two categories: generative or discriminative. Generative models are built to understand how samples from a particular category were generated. The category chosen for a new data-point is the category whose model fits the point best. Discriminative models are concerned with defining the boundaries between the categories. The category chosen for a new data-point then depends on which side of the boundary it belongs to. While the two approaches have different and complementary advantages, they cannot be merged in a straightforward way. The challenge we wish to undertake in this thesis is to find rigorous models blending these two approaches, and to show that doing so can help find good solutions to various problems.

Firstly, we will describe an original hybrid model [50] that allows an elegant blend of generative and discriminative approaches. By applying this framework to various data-sets to perform semi-supervised classification, we will show: how a hybrid approach can lead to better classification performance when most of the available data is unlabelled, how to make the optimal trade-off between the generative and discriminative extremes, and how the amount of labelled data influences the best model.

Secondly, we will present a hybrid approximation of the belief propagation algorithm [51] that helps optimise a Markov random field of high cardinality. The contributions of this thesis on this issue are twofold: a hybrid generative / discriminative method, Hybrid BP, that significantly reduces the state space of each node in the Markov random field, and an effective method for learning the parameters which exploits the memory savings provided by Hybrid BP. We will then see how reducing the memory needs and allowing the Markov random field to take higher cardinality proved useful.

DECLARATION

This dissertation is my own work and contains nothing which is the outcome of work done in collaboration with others, except as specified in the text. Most of chapter 2 can be found in [50] and most of chapter 4 can be found in [51]. No part has been previously accepted and presented for the award of any degree or diploma from any university.

STATEMENT OF LENGTH

This dissertation contains fewer than words and exactly 78 figures excluding the university crest on the title page. Therefore it stands within the authorised limits of words and 150 figures.

ACKNOWLEDGEMENTS

Firstly, I must express my deepest gratitude to my supervisor Prof Christopher Bishop. I have greatly benefitted from his experience, enthusiasm and advice. I also want to thank him for giving me the opportunity to experience Cambridge academic life under his wing.

I want to thank Prof Roberto Cipolla from the Department of Engineering for adopting me in his group half-way through my studies and for the help he has given me; Prof John Daugman from the Computer Laboratory for originally accepting me in his group (and for managing my studentship!); Prof Zoubin Ghahramani and Prof Christopher Williams for agreeing to examine me, and for giving me a gentle viva with many useful comments.

I am also immensely grateful to Microsoft Research Cambridge for their generous financial support, which has allowed me to do my studies in the best conditions possible.

I need to extend my sincere thanks to Dr Thomas Minka for his exciting ideas; to my friends Dr John Winn and Dr Anitha Kannan for a very educational, intense and fun internship at Microsoft Research Cambridge; and to all the people from my group whose help and company have made my working days so much more pleasant, in particular Gabe, Julien, Jamie, Tae-Kyun, Fabio, Bjorn, Carlos, George, and Matt J.

I also want to mention all the people who have made my general experience in Cambridge so enjoyable: Matthias, Jim, Mair, Keltie, Jane, Sam, Matt D., my housemates Ben, Andreas, Roz, James, Catherine, Millie and Andrew, the Rhinos members, and the volleyball women's Blues.

Last but not least, I would like to send my very tender thoughts to my close family in France. This thesis is dedicated to them. I want to send my encouragement to my adventurous sister Sarah - may she enjoy her dream-come-true of experiencing life in Japan; to my courageous sister Sophie - may she keep fighting hard in these difficult times, I am confident she will find happiness on whichever path she chooses to pursue; and to my passionate brother Philippe - may he succeed in his wish to become a professional goalkeeper.

TABLE OF CONTENTS

Abstract
Declaration
Statement of length
Acknowledgements
List of figures
Introduction

1 Generative or discriminative?
Generative models and generative learning
Discriminative models and discriminative learning
Generative vs discriminative models
Different advantages
Generative vs conditional learning
Classification: Bayesian decision, point estimate
Hybrid methods
Hybrid algorithms
Learning discriminative models on generative features
Learning generative models on discriminative features
Refining generative models with discriminative components
Refining generative classifiers with a discriminative classifier
Feature splitting
When generative and discriminative models help each other out
The wake-sleep-like algorithms
Hybrid learning
Discriminative training of generative models
Convex combination of objective functions
Multi-Conditional learning
A new hybrid objective function for semi-supervised learning

2 A principled hybrid framework
The hybrid framework
A new view of discriminative training
Blending generative and discriminative models
Illustration
Comparison with the convex combination framework
Visually
Pareto fronts
The Bayesian version
Exact inference
The true MAP
Successive Laplace approximations
First step
Second step
Results
Conclusion

3 Application to semi-supervised learning
The data-sets
The features
The underlying generative model
The learning method
Influence of the amount of labelled data
On the generative / discriminative models
General experimental set-up
CSB data-set
Wildcats data-set
On the choice of model
General experimental set-up
CSB data-set - HF
CSB data-set - CC
Wildcats data-set - HF
Wildcats data-set - CC
On the choice between HF and CC
CSB data-set
Wildcats data-set
Summary of the experimental results
Discussion

4 Hybrid belief propagation
Inference with belief propagation
Standard belief propagation
Approximations
The Jigsaw model
Efficient inference
4.3.1 The issues
Sparse belief propagation
Hybrid belief propagation
Efficiency of Sparse BP and Hybrid BP
Hybrid learning of jigsaws
Analysis of hybrid learning
Image segmentation
Discussion

Conclusions

A Complements to chapter
A.1 Average plot for the toy example of section
A.2 Average plot for the toy example of section
A.3 Average plot for the toy example of section
A.4 Average plot for the toy example of section

B Complements to chapter
B.1 The trick
B.2 The derivatives
B.3 Results on the wildcats data-set with a mixture of 2 Gaussians
B.3.1 Generative / discriminative models
B.3.2 On the choice of model
B.3.3 HF and CC

Bibliography

LIST OF FIGURES

1.1 Graphical representation of a basic generative model
Training of a mixture of spherical Gaussians
Graphical representation of a basic discriminative model
Example of boundary obtained with an SVM
Generative model vs discriminative model
The additional set of parameters θ
The communication channel is now open
The discriminative case
The generative case
Hybrid models may do better
Influence of the unlabelled data
Synthetic training data
Classification performance of HF on the toy example for every run
Results of fitting an isotropic Gaussian model to the synthetic data
Classification performance of HF and CC on the toy example for every run
CC - run presenting similarities
HF - run presenting differences
CC - run presenting differences
Visualisation of Pareto-optimal points
Pareto fronts for both the HF and CC frameworks
Pareto fronts for both the HF and CC frameworks
Integrating over σ
MAP approximation - b = a - run presented in figure
MAP approximation - b = 1/2 - run presented in figure
Classification performance - MAP approximation - Gamma prior
Laplace approximation - Gaussian prior - run presented in figure
Laplace approximation - Gamma prior - run presented in figure
Classification performance - Laplace approximation
Sample images from the CSB data-set
Sample images from the wildcats data-set
The generative model for object recognition
Other graphical models
CSB data-set - Generative and discriminative models performance
3.6 Wildcats data-set - Generative and discriminative models performance
Performance on the different runs
Schematic plots for the cumulative distribution function
Schematic plots for the models probability
CSB data-set - HF - Choice of model vs percentage of fully labelled data
CSB data-set - HF - Choice of model vs percentage of weakly labelled data
CSB data-set - CC - Choice of model vs percentage of fully labelled data
CSB data-set - CC - Choice of model vs percentage of weakly labelled data
Wildcats data-set - HF - Choice of model vs percentage of fully labelled data
Wildcats data-set - HF - Choice of model vs percentage of weakly labelled data
Wildcats data-set - CC - Choice of model vs percentage of fully labelled data
Wildcats data-set - CC - Choice of model vs percentage of weakly labelled data
CSB data-set - HF versus CC
Wildcats data-set - HF versus CC
Performance on the different runs
A pair-wise 4-connected MRF
Message passing
The principle of sparse belief propagation
Example the Jigsaw model applied to face images
Diagram of the Jigsaw model
Structure of the likelihood function
Structure of the distribution of the label of a particular pixel
Sparse structure of the messages
The use of local evidence to favour mappings
The use of local evidence to split similar pixels
Role of the classifier
Construction of the hybrid likelihood
Segmentation on building images
Various plots
Recognition accuracy against memory use
Memory needed for different jigsaw sizes
Comparison of algorithms for learning jigsaws
Accuracy of the hybrid learning against memory use
Jigsaws of various sizes learnt from 100 images
Recognition accuracy against jigsaw size
A.1 Average classification performance of HF on the toy example using HF
A.2 Average classification performance of CC on the toy example
A.3 Classification performance for the MAP approximation
A.4 Classification performance for the Laplace approximation
B.1 Wildcats data-set - Generative and discriminative models performance
B.2 Wildcats data-set - HF - Choice of model vs percentage of fully labelled data
B.3 Wildcats data-set - HF - Choice of model vs percentage of weakly labelled data
B.4 Wildcats data-set - CC - Choice of model vs percentage of fully labelled data
B.5 Wildcats data-set - CC - Choice of model vs percentage of weakly labelled data
B.6 Wildcats data-set - HF versus CC

INTRODUCTION

Traditional artificial intelligence (AI) models, such as rule-based systems, try to give a reasonable description of the environment needed for a particular problem. This means that a lot of existing knowledge can be explicitly encoded, and that this knowledge is refined during the learning process by using available observations. However, these models are usually too rigid, and are therefore limited to very specific tasks. Many modern problems are now so ambitious and wide in scope, and involve such sophisticated mechanisms, that it has become almost impossible to use traditional AI methods. There has therefore been a massive shift from traditional models towards probabilistic models, and probabilistic inference has become very popular. Indeed, by their very nature, probabilistic approaches have the great advantage of explicitly modelling uncertainty, and they can benefit from the increasingly large amount of available data. Probabilistic models are split into two categories, generative and discriminative, whose formalisms will be detailed in chapter 1.

Generative

If we take the example of object recognition, generative models are built to understand how an object from a particular category was generated. A typical example is the constellation model [24], which describes an object as a particular spatial configuration of N specific-looking parts. A face would be represented as a spatial arrangement of the eyes, the nose and the mouth. The category chosen for a new object depends on which model fits the object best. Note that the constellation model is generative in the feature space only, i.e. it does not generate the pixels across the whole image.

More formally, if the set of all the variables (hidden and observed) of a system is denoted $z$, then a generative model will describe the system with a joint probability distribution over all the variables in $z$, written $p(z)$, thereby providing a framework to model all of the interactions between the variables. This is called generative because we now have a way of sampling different possible states of the system. Since we have a distribution over all the variables of the system, inference can be performed for any variable by marginalising out the others. Note that inference in this setting is able to give not only a solution, but also a confidence measure.

Generative probabilistic models provide a rigorous platform to define the prior knowledge experts have about the problem at hand, and to combine it with observed data. Their ability to model uncertainty while still absorbing prior knowledge is what prompted the shift from traditional techniques, mostly towards generative probabilistic modelling. This is true for a number of application fields. In general computer vision, geometry has been put aside for probabilistic models: naïve Bayes and numerous variants of the constellation model are common practice in object recognition, while techniques like Markov random fields have revolutionised segmentation, and there are numerous generative models for pose estimation. In natural language processing, traditional rule-based systems have been overtaken by Markov models. In bio-informatics, to represent regulatory networks, the original dynamic models using differential equations have been challenged by Bayesian networks. In artificial intelligence, the list of applications where traditional methods have been augmented with probabilistic generative models is long: path planning, control systems, navigation systems, etc.

However, a different shift has recently appeared, from generative to discriminative models, which have enjoyed great success in many scientific fields.

Discriminative

If we take the example of object recognition, discriminative models are concerned with modelling the boundaries between the categories of objects we have at hand. A typical example is the use of a softmax function on appearance histograms [21]. Here, appearance histograms are distributions over learnt visual parts, and there is one histogram per image. The category chosen for a new object depends on which side of the boundary the object sits.

More formally, if the set of input variables in the system is denoted $x$, and the set of output variables is denoted $c$, then a discriminative model is a conditional probability distribution over the output variables in $c$, i.e. $p(c \mid x)$, and therefore provides a framework to model the boundaries between the possible output states. This is called discriminative because we now have a way of discriminating directly between the different output states. Again, note that the prediction is given with a confidence measure.

It is worth stressing that the process runs in the opposite direction to that of generative models. Rather than sketching a solution using prior knowledge and refining it with the data available, discriminative models are generally data-driven. As a consequence, more effort is usually put into preprocessing the data.

Discriminative models have shown enormous potential in many scientific areas. Object recognition, economics, bio-informatics and text recognition have seen huge improvements with the use of support vector machines over generative approaches, while conditional random fields are the state of the art in segmentation, and discriminative variants of hidden Markov models have proved to be superior to their generative counterparts for speech recognition.

Discriminative probabilistic models are very efficient classifiers, since that is what they are defined for; however, they have no modelling power. It is almost impossible to inject prior knowledge into a discriminative model (the best example of a model presenting this difficulty is probably the Neural Network). This makes them act like black boxes: training data and computation time go in, and boundaries come out. "Why?" and "how?" are not questions a discriminative model will answer. Ironically though, in recent years discriminative models have performed so well that they have been preferred to their generative cousins for their ability to classify, at the expense of clarity and modelling.

The mere observation of these complementary properties motivates combining generative and discriminative models. However, their formalisms are so different that they cannot be merged straightforwardly. This is the challenge we wish to undertake in this thesis.

Objective

The objective of this PhD is to explore new frameworks that allow the paradigms of generative and discriminative approaches to be combined. In particular, we will study two different models that use different properties of the generative / discriminative methods.

The first hybrid model that we will present was originally outlined in [61] and later studied in [50]. It is able to blend generative and discriminative training through the use of a hybrid objective function, and allows us to benefit from the good classification performance of discriminative models while keeping the modelling power of generative models. This framework will be studied in the context of semi-supervised learning, a natural application.

The second hybrid model that we will study is a combination of a global generative model, including a Markov random field (MRF) of very high cardinality, with a bottom-up classifier used to reduce the state space for the variables in the MRF [51]. This hybrid model is quite elegant in the way it marries two different types of models that try to mirror one another in order to give each other feedback. It is able to achieve substantial reductions in memory usage while keeping the loss in accuracy reasonable.

The thesis is organised as follows:

Chapter 1 will describe at length the properties of generative and discriminative approaches, and how they are usually trained. It will also provide a solid motivation for our two new frameworks, and a substantial review of various relevant hybrid methods.

Chapter 2 will study in detail a hybrid framework that allows us to blend generative and discriminative learning. This hybrid model will be illustrated in a semi-supervised context, and some extensions to make the whole framework Bayesian will be presented.

Chapter 3 will apply this hybrid model to real images in the context of semi-supervised object recognition. We will study the impact of the amount of labelled data on the resulting choice of model (generative, hybrid or discriminative), and the chapter will end with a discussion of the limitations of this hybrid framework.

Chapter 4 will present the combination of advantages of generative and discriminative models from a different perspective. We will introduce a new learning framework we call hybrid belief propagation. This method will be illustrated on the problem of reducing the state space of Markov random fields (MRFs) of high cardinality. Again, the chapter will end with a discussion of the advantages and limitations of the hybrid framework.

The Conclusions chapter will close this thesis with conclusions and final remarks, and will outline possible future directions.

CHAPTER 1
GENERATIVE OR DISCRIMINATIVE?

Generative and discriminative approaches are two different schools of thought in probabilistic machine learning. In this chapter, we will define their characteristics more formally, in terms of what they do, what criterion they optimise and how they perform inference. We will first study the generative models, then the discriminative models. Once their formalisms have been described, we will also try to provide a substantial overview of their intrinsic differences. We will close the chapter with a review of typical examples of hybrid approaches.

In many applications of machine learning, the goal is to take a vector $x$ of input features and to assign it to one of a number of alternative classes labelled by a variable $c$ (for instance, if we have $C$ classes, then $c$ might be a $C$-dimensional vector of zeros whose entry corresponding to the right class is set to 1). This task is referred to as classification. Throughout this thesis, we will have in mind the problem of classification.

In the simplest scenario, we are given a training data-set comprising $N$ data-points $X = \{x_1, \dots, x_N\}$ together with the corresponding labels $C = \{c_1, \dots, c_N\}$, in which we assume that the data-points, and their labels, are drawn independently from the same fixed distribution. Our goal is to predict the class $\hat{c}$ for a new input vector $x$. This is given by

$$\hat{c} = \arg\max_{c} \; p(c \mid x, X, C) \qquad (1.1)$$

To determine this distribution we introduce a parametric model governed by a set of parameters $\theta$ such that, under a Bayesian setting, we generally have

$$p(c \mid x, X, C) = \int p(c \mid x, \theta) \, p(\theta \mid X, C) \, d\theta$$
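As a rough illustration of the integral above, the sketch below approximates it by averaging $p(c \mid x, \theta)$ over a finite set of parameter samples, and then applies the arg max of (1.1). Everything specific here is an assumption made for illustration and is not taken from the thesis: the linear-softmax form chosen for $p(c \mid x, \theta)$, the function names, and the random draws that merely stand in for samples from $p(\theta \mid X, C)$.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of real-valued scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_probs(x, theta):
    """p(c | x, theta) for one parameter setting.

    Assumption: theta holds one weight vector per class and the model
    is a linear softmax; any other conditional model could be used.
    """
    return softmax([sum(w_d * x_d for w_d, x_d in zip(w, x)) for w in theta])


def predictive(x, theta_samples):
    """Monte Carlo estimate of p(c | x, X, C): average of p(c | x, theta_s)."""
    n_classes = len(theta_samples[0])
    avg = [0.0] * n_classes
    for theta in theta_samples:
        for c, p_c in enumerate(class_probs(x, theta)):
            avg[c] += p_c / len(theta_samples)
    return avg


random.seed(0)
# Hypothetical stand-in for 200 posterior samples: 3 classes, 2 features.
theta_samples = [
    [[random.gauss(0.0, 1.0) for _ in range(2)] for _ in range(3)]
    for _ in range(200)
]
x = [0.5, -1.0]
p = predictive(x, theta_samples)
c_hat = max(range(len(p)), key=p.__getitem__)  # arg max of equation (1.1)
```

Replacing the list of samples by a single point estimate of $\theta$ reduces `predictive` to the plain $p(c \mid x, \theta)$ of one fitted model.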

How the general $p(c \mid x, \theta)$ is written is what makes models different in essence (generative or discriminative), whereas how $p(\theta \mid X, C)$ is obtained is what makes the training / learning process different (note that the words training and learning refer to the same activity). These differences will be the subject of this chapter.

1.1 Generative models and generative learning

Probabilistic generative models are built to capture the interactions between all the variables of a system, in order to be able to synthesise possible states of this system. This is achieved by designing a probability distribution $p$ modelling inputs, hidden variables and outputs $z$ jointly. This is written $p(z \mid \theta)$, where $\theta$ represents the parameters of the model. Note that the different variables involved in $z$ can be heterogeneous. To make the joint distribution simpler to estimate, conditional independencies can be added so as to factorise $p$, or we can define a prior distribution over $\theta$ to prevent it from taking undesirable values.

Because of their modelling power, generative models are usually chosen when we want to inject prior knowledge into the system. Experts use their experience of the problem to choose reasonable distributions and reasonable conditional independencies.

In the context of classification, the system's input is the descriptor $x$ of a data-point, and the output is the label $c$ of this data-point. Probabilistically speaking, it means that $p(x, c \mid \theta)$ is defined, as shown in figure 1.1. If the categories are cats and dogs, the question generative models will then answer is: "what makes a cat a cat?" or "why is a dog a dog?". Because the label is modelled by the joint distribution, generative models can easily perform classification by computing $p(c \mid x, \theta)$ using the common probability rules.

Figure 1.1: Graphical representation of a basic generative model.

Popular generative models include naïve Bayes, hidden Markov models [67], Markov random fields [68] and so on. However, the simplest one is probably the Gaussian mixture model (GMM) [7]. For the cat category, and a Gaussian mixture of $K$ components, this would give:

$$p(x \mid \text{cat}, \theta^{\text{cat}}) = \sum_{k=1}^{K} \pi_k^{\text{cat}} \, \mathcal{N}(x \mid \mu_k^{\text{cat}}, \sigma_k^{\text{cat}})$$

where $\forall k \in [1, K]$, $\pi_k^{\text{cat}} \geq 0$ and $\sum_{k=1}^{K} \pi_k^{\text{cat}} = 1$. Here $\theta^{\text{cat}} = \{\pi^{\text{cat}}, \mu^{\text{cat}}, \sigma^{\text{cat}}\}$. The Gaussian $\mathcal{N}(x \mid \mu_k^{\text{cat}}, \sigma_k^{\text{cat}})$ gives the likelihood for the data-point $x$ to have been generated from the $k$th component of the cat mixture, and $p(x \mid \text{cat}, \theta^{\text{cat}})$ the likelihood of $x$ to have been generated from the cat category in general.

Figure 1.2 has been taken from the web, and shows an example of GMMs. Clockwise, we start from a thick oval distribution shaped like a "0". Then we can see the initial spherical Gaussians (represented by black circles with green dots to mark the centres) with the various training points (red dots). Finally, we can see the spherical Gaussians after training, and the last image shows the reconstructed distribution. Notice how the distribution has been captured as closely as possible. In particular, note how the Gaussians, initially grouped on the right side, have moved and split to cover as much as possible. The imperfections come from the fact that the model is too limited (the Gaussians are spherical, and there are few of them) and cannot cover everything. A greater number of spherical Gaussians, or fully specified Gaussians, would have done a better job.

Figure 1.2: Training of a mixture of spherical Gaussians. Taken from [35]. Top-left: the original distribution (shape of a 0). Top-right: initial spherical Gaussians represented by circles. Bottom-right: the components after training. The Gaussians have moved and split to cover as much as possible. Bottom-left: the resulting distribution.

Machine learning is intimately linked to optimisation, for most machine learning problems come down to optimising what is called an objective function. Most generative probabilistic models are trained using what we will call generative learning (although this is not always the case). Generative learning can optimise the joint likelihood of the complete training data (i.e. $\{X, C\}$, where $X$ refers to the data and $C$ to their corresponding labels), written $p(X, C \mid \theta)$. However, as we discussed above, we typically have priors on the parameters, so we usually take generative learning's objective function to be the full joint distribution of the complete data and the parameters, and we want to maximise it with respect to the parameters $\theta$. The joint distribution is written:

$$L_G(\theta) = p(X, C, \theta) = p(\{x_n\}, \{c_n\}, \theta) = p(\theta) \prod_{n=1}^{N} p(x_n, c_n \mid \theta_{c_n}) \qquad (1.2)$$

The term $p(x_n, c_n \mid \theta_{c_n})$ is crucial here. It describes the modelling of the data's distribution, i.e. the modelling of what the data looks like. Note that only the $c_n$ part of $\theta$ is used for each image.

1.2 Discriminative models and discriminative learning

Probabilistic discriminative models are built to capture the boundaries between the different possible output states of a system, without taking any interest in modelling the distribution of the inputs.
This is achieved by designing a probability distribution $p$ over the outputs $c$, and conditioning it on the inputs $x$. This is written $p(c \mid x, \theta)$, where $\theta$ represents the parameters of the model, and is shown in figure 1.3. Note that sometimes there is not even a probability distribution. Instead, a function $f_\theta(x)$ is designed that simply returns one of the possible states of $c$. This is called a discriminant function. However, in this thesis, we will focus on discriminative models that use probability densities, and if we talk about discriminant functions, we will assume that there exists a way to map $f_\theta(x)$ to $p(f_\theta(x) \mid x, \theta)$.

Figure 1.3: Graphical representation of a basic discriminative model.

The difference between $p(x, c \mid \theta)$ and $p(c \mid x, \theta)$ has an important impact. In the context of classification, $x$ is the descriptor of a data-point and $c$ the label of this data-point. Therefore, instead of focusing on recovering the distribution the data came from, the model now only concentrates on approximating the shape of the boundary between classes, so resources are much more targeted. If the categories are cats and dogs, the question discriminative models will answer is then: "is it a dog or a cat?".

Popular discriminative models include logistic regression [7] and Gaussian processes [70]. However, the most popular discriminative model is probably the Support Vector Machine (SVM) [19], which happens to be a discriminant function. For $c_n \in \{-1, 1\}$, SVMs try to separate the data linearly with a hyperplane defined by its normal vector $w$ such that

$$c_n (w^\top x_n + b) \geq 1$$

where $b$ is called the bias and needs to be learnt too. It can be shown that this reduces to a quadratic programming optimisation problem: minimise $\frac{\|w\|^2}{2}$ under the aforementioned constraints. These classifiers are called SVMs because we can prove that the hyperplane only relies on the few points that are close to the boundary, i.e. the ambiguous points, which are referred to as the support vectors. Figure 1.4 shows an example of classification provided by an SVM.

Figure 1.4: Example of boundary obtained with an SVM. Taken from [37]. The points that define the boundary are circled in white and are called the support vectors. They define the margin, i.e. the distance between the actual boundary (thick white line) and the manifold they create (thin white line).

As a discriminant function, an SVM does not directly provide the posterior distribution over the labels; its output $R(w)$ simply says class blue ($R(w) > 0$) or class yellow ($R(w) < 0$). However, it does give a confidence measure in the sense that the higher $|R(w)|$ is, the more reliable the answer is. A trick can then be used to obtain an estimate of $p(\text{blue})$, typically applying a sigmoid function to $R(w)$, such that

$$p(\text{blue} \mid x, w) = \frac{1}{1 + \exp(-R(w))}$$

Another very popular discriminative model is the Neural Network [6], also called multilayer perceptron (MLP). MLPs are discriminant functions too, but have been widely coupled with the sigmoid function to provide a probability distribution instead. A typical MLP contains an input layer where the user plugs in the features, a series of hidden layers where the core of the computation happens, and an output layer where the user can read $p(c \mid x, \theta)$, $\theta$ being in this case the weights between units of different layers. Note that the layers can be connected to one another in different ways.

Most discriminative models are trained using what we will call discriminative learning. This type of training is fundamentally different from the generative learning defined in (1.2). With our complete training set $\{X, C\}$ at hand, the function to maximise with respect to the parameters $\theta$ is written:

$$L_D(\theta) = p(C \mid X, \theta) = p(\{c_n\} \mid \{x_n\}, \theta) = \prod_{n=1}^{N} p(c_n \mid x_n, \theta)$$

This is referred to as the conditional likelihood, since we optimise one part of the training set's likelihood, conditioned on the other part. It is also sometimes loosely called the discriminative likelihood. Indeed, it can be seen as the likelihood of the system where the entities to model are now the labels, and the data acts as pure information. We now see the distinction from generative models. The data distribution has disappeared (no form of $p(x)$) and has been replaced by the posterior distribution of the labels $p(C \mid X, \theta)$, which is now the quantity being maximised. However, this definition of $L_D(\theta)$ is unsatisfactory, as we would like to be able to inject a prior over $\theta$, so we will write instead

$$L_D(\theta) = p(\theta) \, p(C \mid X, \theta) = p(\theta) \prod_{n=1}^{N} p(c_n \mid x_n, \theta) \qquad (1.3)$$

1.3 Generative vs discriminative models

In the previous sections, we have described the principle of generative and discriminative models. This section will be primarily concerned with what makes them so different, and how they complement each other.

1.3.1 Different advantages

One of the interesting particularities that generative models have over discriminative ones is that they are learnt independently for each category. Indeed, this comes out in equation (1.2), where we saw that only the $c_n$ part of $\theta$ was used for each image. This one-to-one mapping between model and category makes it very easy to add categories. It also makes it easier to have different types of model for every category. Conversely, because discriminative methods are concerned with boundaries, all the categories need to be considered jointly. Therefore, if we want to add a category, we have to start everything again from scratch. In the generative case, we can keep the previous models we had and just train an additional one for the new category.
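The point about adding categories can be illustrated with a small sketch. It fits one isotropic Gaussian per category by maximum likelihood, classifies with the largest $p(x, c \mid \theta_c)$, and then adds a third category without refitting the first two. The isotropic Gaussian class model, the toy data, the empirical class priors and all function names are assumptions made for illustration; they are not the models used in the thesis.

```python
import math


def fit_class(points):
    """Maximum-likelihood fit of an isotropic Gaussian to one category."""
    n, d = len(points), len(points[0])
    mean = [sum(p[i] for p in points) / n for i in range(d)]
    var = sum((p[i] - mean[i]) ** 2 for p in points for i in range(d)) / (n * d)
    return {"mean": mean, "var": max(var, 1e-6), "n": n}


def log_joint(x, model, n_total):
    """log p(x, c | theta_c) = log p(c) + log N(x | mean_c, var_c * I)."""
    d = len(x)
    sq_dist = sum((x_i - m_i) ** 2 for x_i, m_i in zip(x, model["mean"]))
    log_lik = (-0.5 * d * math.log(2 * math.pi * model["var"])
               - 0.5 * sq_dist / model["var"])
    return math.log(model["n"] / n_total) + log_lik


def classify(x, models):
    """Pick the category whose model gives the largest joint probability."""
    n_total = sum(m["n"] for m in models.values())
    return max(models, key=lambda c: log_joint(x, models[c], n_total))


# Toy data, invented for this example.
data = {
    "cat": [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]],
    "dog": [[3.0, 3.1], [2.9, 3.0], [3.2, 2.8]],
}
models = {c: fit_class(points) for c, points in data.items()}
before = classify([0.1, 0.0], models)

# Adding a category: only the new model is fitted. The cat and dog
# parameters are left untouched; only the class priors, derived from
# the counts, change.
models["bird"] = fit_class([[-3.0, 3.0], [-3.1, 2.9], [-2.8, 3.2]])
after = classify([-3.0, 3.0], models)
```

A discriminative classifier trained on cats and dogs has no such shortcut: its boundary parameters are shared across categories, so a new category means optimising $L_D(\theta)$ again over all of the data.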
However, the most important characteristic of generative models is their modelling power.

Generative models are designed to absorb experts' beliefs about the system's environment, i.e. prior knowledge about how some of its variables interact, prior knowledge about which variables do not interact, and prior knowledge about the range of values of its parameters. Conversely, discriminative models are classification-oriented and therefore often lack the flexibility needed to model such knowledge. This is why they tend to be black boxes. A data-point is given as an input, and p(category | input) is returned, but without a clear understanding of how or why.

Another fundamental particularity of generative models, which naturally follows from their modelling power, is their ability to deal with missing data. Indeed, when a model is determined, reconstructions of the missing values are also obtained. Conversely, discriminative models cannot easily handle incompleteness since the distribution of the observed data is not explicitly modelled. This particularity is crucial because it allows generative models to be used with different kinds of data: complete or labelled data (that comes with labels), and incomplete or unlabelled data (that has no labels). There are other kinds of incompleteness generative models can deal with (for example a particular value missing in the feature vector x) but here we will only focus on missing labels.

In classification problems, training is called supervised when the labels c_n are known (observed), and unsupervised when the c_n are unknown (unobserved). Very often, both labelled data and unlabelled data are available. If a mixture of both is used in the training process, we call it semi-supervised training. Abusing the language, we will also refer to the problem and to the data as supervised, semi-supervised or unsupervised according to the type of training we perform. Unlabelled data are very easy to get, and we want to be able to use them.
Any generative model can make use of all kinds of data within the same framework. Indeed, if we consider L = {X_L, C_L} to be the set of labelled data, and U = X_U the set of unlabelled data, we can rewrite L_G(θ) as

\[
\begin{aligned}
L_G(\theta) &= p(L, U, \theta) = p(X_L, C_L, X_U, \theta) \\
&= p(\theta)\, p(X_L, C_L, X_U \mid \theta) \\
&= p(\theta)\, p(X_L, C_L \mid \theta)\, p(X_U \mid \theta) \\
&= p(\theta) \prod_{n \in L} p(x_n, c_n \mid \theta_{c_n}) \prod_{m \in U} p(x_m \mid \theta) \\
&= p(\theta) \prod_{n \in L} p(x_n, c_n \mid \theta_{c_n}) \prod_{m \in U} \Big( \sum_{c} p(x_m, c \mid \theta_c) \Big) \qquad (1.4)
\end{aligned}
\]

where we assume that the labels are missing at random. Conversely, a straightforward discriminative model requires the labels.

On the other hand, discriminative models do not waste any resources trying to model the joint distribution; instead they focus on the boundary between classes. Indeed, the joint distribution may contain a lot of structure that has little effect on the posterior probabilities, as illustrated in Figure 1.5. Therefore, it is not always desirable to compute the joint distribution. This is the reason why discriminative models have been widely (and successfully) used. In particular, neural networks and SVMs are probably the most common examples because they gave excellent results for commercial applications such as the recognition of handwritten digits [53; 54].

Figure 1.5: Generative model vs discriminative model. Taken from [7]. Example of the class-conditional densities for 2 classes having a single input variable x (left plot) together with the corresponding posterior probabilities (right plot). Note that the left-hand mode of the class-conditional density p(x | C_1), shown in blue on the left plot, has no effect on the posterior probabilities. The vertical green line in the right plot shows the decision boundary in x that gives the minimum misclassification rate.

Another popular characteristic of discriminative models is speed. Indeed, classifying new points is usually faster since p(c | x, θ) is directly obtained.
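To make the semi-supervised likelihood (1.4) concrete, the following is a minimal numerical sketch rather than the method used in this thesis. It assumes one-dimensional Gaussian class-conditional densities and a flat prior p(θ), so the log p(θ) term is dropped; the function names (log_joint, log_LG) and the toy data are hypothetical and chosen only for illustration.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian N(x | mean, var)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def log_joint(x, c, params):
    """log p(x, c | theta_c) = log p(c) + log p(x | c)."""
    prior, mean, var = params[c]
    return math.log(prior) + log_gauss(x, mean, var)

def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_LG(labelled, unlabelled, params):
    """Log of (1.4) under the assumed flat prior on theta."""
    # Labelled points contribute log p(x_n, c_n | theta_{c_n}).
    total = sum(log_joint(x, c, params) for x, c in labelled)
    # Unlabelled points contribute log sum_c p(x_m, c | theta_c).
    for x in unlabelled:
        total += logsumexp([log_joint(x, c, params) for c in range(len(params))])
    return total

# Hypothetical parameters: (class prior, mean, variance) for each class.
params = [(0.5, -1.0, 1.0), (0.5, 2.0, 1.0)]
labelled = [(-1.2, 0), (1.9, 1)]
unlabelled = [-0.8, 2.3, 0.4]
print(log_LG(labelled, unlabelled, params))
```

Both kinds of data enter the same objective: a labelled point uses only the parameters of its own class, whereas an unlabelled point sums the joint over all classes.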

1.3.2 Generative vs conditional learning

As we have just described, generative and discriminative methods have different advantages. As a first step towards combining them, an interesting and widely used approach is to use a generative model and train it in a discriminative fashion, i.e. using L_D(θ) as an objective function [13; 25; 15; 84; 30]. Indeed, we can apply Bayes' rule:

\[ p(c \mid x, \theta) = \frac{p(x, c \mid \theta_c)}{p(x \mid \theta)} = \frac{p(x, c \mid \theta_c)}{\sum_{c'} p(x, c' \mid \theta_{c'})} \]

which allows us to rewrite (1.3) as

\[ L_C(\theta) = p(\theta) \prod_{n=1}^{N} p(c_n \mid x_n, \theta) = p(\theta) \prod_{n=1}^{N} \frac{p(x_n, c_n \mid \theta_{c_n})}{\sum_{c} p(x_n, c \mid \theta_c)} \qquad (1.5) \]

and then to use a generative model to explicitly model p(x, c | θ_c). With this method, hopefully we keep the attraction of a generative model (distribution of the observed data) while keeping the power of classification of discriminative models. This approach is called conditional learning but is also commonly referred to as discriminative training, although it is not really fully discriminative since the data is still modelled.

To gain further insight into the differences between generative and conditional learning, we can detail some mathematical derivations to see what would happen if we were to use a technique such as conjugate gradients to optimise L_G(θ) and L_C(θ). Let us start with L_G(θ):

\[
\begin{aligned}
\frac{\partial \log L_G(\theta)}{\partial \theta_k}
&= \frac{\partial}{\partial \theta_k} \log \Big( p(\theta) \prod_{n=1}^{N} p(x_n, c_n \mid \theta_{c_n}) \Big) \\
&= \frac{\partial}{\partial \theta_k} \Big( \log p(\theta) + \sum_{n=1}^{N} \log p(x_n, c_n \mid \theta_{c_n}) \Big) \\
&= \frac{\partial \log p(\theta_k)}{\partial \theta_k} + \sum_{n=1}^{N} \delta_{c_n k} \frac{\partial}{\partial \theta_k} \log p(x_n, k \mid \theta_k) \qquad (1.6)
\end{aligned}
\]

Now let us consider L_C(θ):

\[
\begin{aligned}
\frac{\partial \log L_C(\theta)}{\partial \theta_k}
&= \frac{\partial}{\partial \theta_k} \log \Big( p(\theta) \prod_{n=1}^{N} \frac{p(x_n, c_n \mid \theta_{c_n})}{\sum_{c} p(x_n, c \mid \theta_c)} \Big) \\
&= \frac{\partial}{\partial \theta_k} \log \Big( p(\theta) \prod_{n=1}^{N} p(x_n, c_n \mid \theta_{c_n}) \Big) - \frac{\partial}{\partial \theta_k} \Big[ \sum_{n=1}^{N} \log \Big( \sum_{c} p(x_n, c \mid \theta_c) \Big) \Big] \\
&= \frac{\partial \log L_G(\theta)}{\partial \theta_k} - \sum_{n=1}^{N} \frac{\partial}{\partial \theta_k} \log \Big( \sum_{c} p(x_n, c \mid \theta_c) \Big) \\
&= \frac{\partial \log L_G(\theta)}{\partial \theta_k} - \sum_{n=1}^{N} \frac{1}{\sum_{c} p(x_n, c \mid \theta_c)} \frac{\partial}{\partial \theta_k} p(x_n, k \mid \theta_k) \\
&= \frac{\partial \log L_G(\theta)}{\partial \theta_k} - \sum_{n=1}^{N} \frac{p(x_n, k \mid \theta_k)}{\sum_{c} p(x_n, c \mid \theta_c)} \frac{\partial}{\partial \theta_k} \log p(x_n, k \mid \theta_k) \\
&= \frac{\partial \log p(\theta_k)}{\partial \theta_k} + \sum_{n=1}^{N} \Big( \delta_{c_n k} - \frac{p(x_n, k \mid \theta_k)}{\sum_{c} p(x_n, c \mid \theta_c)} \Big) \frac{\partial}{\partial \theta_k} \log p(x_n, k \mid \theta_k) \\
&= \frac{\partial \log p(\theta_k)}{\partial \theta_k} + \sum_{n=1}^{N} \big( \delta_{c_n k} - p(k \mid x_n, \theta) \big) \frac{\partial}{\partial \theta_k} \log p(x_n, k \mid \theta_k) \qquad (1.7)
\end{aligned}
\]

Equations (1.6) and (1.7) are useful to understand how both learning approaches operate. In the generative case, the parameters θ_k are influenced by the training samples of class k only. In the conditional case, the parameters θ_k are influenced by those training samples that have a high absolute value for the quantity δ_{c_n k} − p(k | x_n, θ). These are the training samples that are of class k but have a low probability of being classified as k, or the samples that are not of class k but have a high probability of being classified as k. Here we can clearly see how conditional learning, similarly to discriminative learning, is concerned with boundaries, i.e. with those data-points that are problematic, and very little with the rest, as opposed to generative learning whose sole purpose is to capture the distribution of the data belonging to class k, regardless of the data from other classes. This behaviour is even more pronounced in discriminative learning since it never has access to the distribution of the data.

Another interesting case is the use of unlabelled data in L_G(θ). From (1.4), and using similar steps, we have:

\[
\frac{\partial \log L_G(\theta)}{\partial \theta_k} = \frac{\partial \log p(\theta_k)}{\partial \theta_k} + \sum_{n \in L} \delta_{c_n k} \frac{\partial}{\partial \theta_k} \log p(x_n, k \mid \theta_k) + \sum_{m \in U} p(k \mid x_m, \theta) \frac{\partial}{\partial \theta_k} \log p(x_m, k \mid \theta_k) \qquad (1.8)
\]

The last term of the equation shows that the model is able, for each unlabelled data-point, to make a prediction on which class it comes from, and then uses these predictions to regulate the influence of each unlabelled data-point on each class. Note that unlabelled data cannot be used with L_C as defined in (1.5) since L_C relies on the posterior distribution of the labels.
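The per-sample weights that multiply the gradient of log p(x_n, k | θ_k) in (1.6), (1.7) and (1.8) can be compared on a toy example. The sketch below is only illustrative: the joint values are made-up numbers, and the function names are hypothetical.

```python
def posterior(joints):
    """p(k | x, theta) from the joint values p(x, k | theta_k) via Bayes' rule."""
    total = sum(joints)
    return [j / total for j in joints]

def generative_weights(label, num_classes):
    """Weights delta_{c_n k} appearing in (1.6)."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]

def conditional_weights(label, joints):
    """Weights delta_{c_n k} - p(k | x_n, theta) appearing in (1.7)."""
    post = posterior(joints)
    return [(1.0 if k == label else 0.0) - post[k] for k in range(len(joints))]

def unlabelled_weights(joints):
    """Weights p(k | x_m, theta) of an unlabelled point in (1.8)."""
    return posterior(joints)

# Hypothetical joint values p(x, k | theta_k) for two classes.
confident = [0.30, 0.01]   # point of class 0, well inside its class
borderline = [0.06, 0.05]  # point of class 0, close to the boundary

print(generative_weights(0, 2))
print(conditional_weights(0, confident))
print(conditional_weights(0, borderline))
print(unlabelled_weights(borderline))
```

In this toy setting the generative weights are identical for both points, while the conditional weights are close to zero for the confidently classified point and much larger in magnitude for the borderline one, which is the boundary-focused behaviour described above.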

1.3.3 Classification: Bayesian decision, point estimate

Remember that the category ĉ of a new image x is decided using (1.1). For generative probabilistic models, we rewrite p(c | x, X, C) ∝ p(x, c | X, C). Under a Bayesian setting, we then have

\[ p(\hat{c} \mid x, X, C) \propto \int p(x, \hat{c} \mid \theta)\, p(\theta \mid X, C)\, d\theta \]

In practice though, both p(θ | X, C) and the integral are intractable, so we use either approximations (such as variational inference or Markov Chain Monte Carlo methods), or a point estimate θ̂. Optimising the full joint distribution L_G(θ) leads to a point estimate commonly called the maximum a posteriori estimate θ_MAP. Indeed, rewriting the optimisation gives

\[ \theta_{MAP} = \arg\max_{\theta} L_G(\theta) = \arg\max_{\theta} p(X, C, \theta) = \arg\max_{\theta} p(\theta \mid X, C) \]

We can see that θ_MAP is actually a mode of the true posterior distribution p(θ | X, C) that we are trying to approximate, hence the name. If data is plentiful, the distribution should be highly peaked around its mode(s), and it makes sense to consider a mode θ_MAP only. If data is really plentiful, then we can approximate further by removing the prior p(θ) from L_G(θ). This method is commonly called the maximum likelihood approach, because it now optimises the joint likelihood of the data and the labels. Indeed, rewriting the optimisation gives

\[ \theta_{ML} = \arg\max_{\theta} \frac{L_G(\theta)}{p(\theta)} = \arg\max_{\theta} \frac{p(X, C, \theta)}{p(\theta)} = \arg\max_{\theta} p(X, C \mid \theta) \]

θ_MAP is equivalent to θ_ML for a uniform prior p(θ) and, under some mild conditions on the prior distribution, tends to θ_ML when the number of data-points grows to infinity. Whichever point estimate technique we use, we have:

\[ p(c \mid x, X, C) \propto p(x, c \mid X, C) \approx \int p(x, c \mid \theta)\, \delta(\theta - \hat{\theta})\, d\theta = p(x, c \mid \hat{\theta}) \]

so that

\[ \hat{c} = \arg\max_{c}\, p(x, c \mid \hat{\theta}). \]

For discriminative probabilistic models, or for generative models trained using conditional learning, a Bayesian setting gives

\[ p(c \mid x, X, C) = \int p(c \mid x, \theta)\, p(\theta \mid X, C)\, d\theta \]

Both p(θ | X, C) and the integral are usually intractable, so we use either approximations (such as Markov Chain Monte Carlo methods), or a point estimate θ̂. We now aim at maximising the objective function (or discriminative likelihood) L_D(θ) so that

\[ \hat{\theta} = \arg\max_{\theta} L_D(\theta) = \arg\max_{\theta} p(\theta)\, p(C \mid X, \theta) \qquad (1.9) \]

Again, if we have lots of labelled data, then we can also consider dropping the prior over θ, in which case we have

\[ \hat{\theta} = \arg\max_{\theta} \frac{L_D(\theta)}{p(\theta)} = \arg\max_{\theta} p(C \mid X, \theta) \]

which is equivalent to (1.9) for a uniform prior p(θ) and, under some mild conditions on the prior distribution, tends to (1.9) when the number of data-points tends to infinity. Note, however, that in this case θ̂ no longer gives θ_MAP or θ_ML. Whichever point estimate technique we use, we have:

\[ p(c \mid x, X, C) \approx \int p(c \mid x, \theta)\, \delta(\theta - \hat{\theta})\, d\theta = p(c \mid x, \hat{\theta}) \]

so that

\[ \hat{c} = \arg\max_{c}\, p(c \mid x, \hat{\theta}). \]

At this point, we ought to say that, for practical reasons, we do not usually optimise L_G or L_D directly. Indeed, because most of the probability densities that are used are from the exponential family, it is much more convenient analytically and more stable numerically to maximise log L_G or log L_D. This is fine though, since the logarithm function is strictly increasing, so the location of the maximum is preserved.

As we have mentioned in our introduction, and as we have confirmed in this chapter so far, generative and discriminative models are highly complementary. However, there is no straightforward way to combine them. We have discussed discriminative training, but this is still limited as we do not fully take advantage of the modelling power of the generative model, since unlabelled data cannot be included for instance. Many attempts

have been made, within the various scientific communities that use or develop machine learning, to reach a better combination of generative and discriminative approaches. We will now review a few of them.

1.4 Hybrid methods

Generative versus discriminative models has been a hot topic for the last decade. Many authors have published empirical comparisons of the two approaches [43; 21; 77], often reporting different conclusions, which tends to invalidate the common belief that, with the same (or comparable) training data, discriminative models should perform better at classification. Nonetheless, Ng et al. [63] have published a very interesting formal comparison of logistic regression and naïve Bayes in which they prove that logistic regression performs better (i.e. has a lower asymptotic error) when training data is sufficiently abundant.

The previous sections have shown how different the advantages of generative and discriminative approaches are. The complementary properties of these two approaches have understandably encouraged a number of authors to seek methods which combine their strengths. This has led to two subtly different types of hybrid frameworks, which we will call hybrid algorithms and hybrid learning. Hybrid algorithms are algorithms involving two or more models (each with its own objective function) that are trained one after the other (sometimes in an iterative process) and that influence each other. Hybrid learning (or more exactly hybrid objective functions), on the other hand, refers to multi-criteria optimisation problems. Typically, a single objective function is optimised that contains different terms, at least one for the generative component and one for the discriminative component.

1.4.1 Hybrid algorithms

There are infinitely many ways to obtain a hybrid algorithm, many of which have been explored in the modern literature.
However, we will try to give an overview of selected typical examples, that are quite different from one another and give a good idea of the different approaches.

Learning discriminative models on generative features

A first approach is to learn interesting features using a generative model, and to use the derived feature vectors to train a discriminative classifier. Note that we have a generative step followed by a discriminative step.

The most popular example of this type of algorithm is Fisher's kernel, suggested by Jaakkola et al. in [41]. The idea is to build a logistic regression model, but using feature vectors coming from a generative model, typically the derivative of the log likelihood of the data-point with respect to the different parameters of the generative model. If x is the set of basic features of data-point I, then the features used by the discriminative model for data-point I will be φ = ∇_θ log p(x | θ). Using p(x | θ) = Σ_c p(x, c | θ) and the trick defined in (3.2), we can write

\[ \phi = \sum_{c} p(c \mid x, \theta) \frac{\partial}{\partial \theta} \log p(x, c \mid \theta) \]

Various kernels can now be defined, for instance:

\[ K_1(\phi_m, \phi_n) = \phi_m^{T} \phi_n, \qquad K_2(\phi_m, \phi_n) = \phi_m^{T} \big[ E(\phi \phi^{T}) \big]^{-1} \phi_n \]

the latter being known as Fisher's kernel. These kernels can be used with any kind of discriminative classifier such as logistic regression or SVMs. Fisher kernels have probably been the most used hybrid method. They have been applied successfully in domains as varied as biology [41; 39] using hidden Markov models (HMMs) and SVMs, speech recognition [52; 62] using Gaussian mixture models and SVMs, or object recognition [31] using constellation models [24] and SVMs ([32] even explores semi-supervised learning).

To perform scene recognition, Bosch et al. [9] use a probabilistic latent semantic analysis (pLSA) model (see [29]), quite popular in the text literature. Similarly to Fisher kernels, their idea is to learn the image features with this generative model, and then to classify new images discriminatively with an SVM.
If we call patches of a visual vocabulary words, they define an image d as a collection of words w conditioned, not on the image category, but on a latent topic z, therefore p(w, d) = p(d) Σ_z p(w | z) p(z | d), which gives p(w | d) = Σ_z p(w | z) p(z | d). Each image is therefore modelled as a mixture of topics. The

likelihood to maximise is then L = Σ_d Σ_w n(w, d) log p(w, d), where n(w, d) is the number of times the visual word w appears in image d. This simple model is trained using expectation-maximisation. Note that training is unsupervised. This gives them a distribution p(z | d) over the topics. Subsequently, a multi-class SVM is learnt, using the feature vector p(z | d) and the class label of each training image as an input. This concept of learning the features generatively and classifying discriminatively was also developed for image classification in [57].

Learning generative models on discriminative features

Interestingly, the reverse approach can also be found. In [56] for example, Lester et al. model human activities through a two-stage learning algorithm: firstly they run a boosting algorithm to discriminatively select useful features and learn a set of static classifiers (one static classifier per activity category). The predictions of these classifiers are then combined into one feature vector per training sample, which is used to train a hidden Markov model (HMM) in order to capture the temporal regularities. Recognition is then performed using the HMM.

Refining generative models with discriminative components

Yet another approach is to learn a generative model for each category, and then to perform inference using these generative models but giving different weights to different components of the model. These weights are learnt discriminatively. Note that in this case again, we have a generative step followed by a discriminative step.

A good example of this can be found in text classification, developed by Raina et al. [69]. They consider that a document is split into regions, like the header or the signature for an email for example. Each region has a different set of parameters corresponding to a generative model, but the influence of this region (i.e. its weight) is learnt discriminatively.
Their generative model is written p(x | y) = ∏_j p(w = x_j | y), with x_j being the j-th word of document x, and p(w | y) being estimated by simple counting:

\[ p(\text{word } w \mid y) = \frac{\text{number of occurrences of } w \text{ in class } y \text{ documents}}{\text{number of words in class } y \text{ documents}} \]
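The counting estimate above can be sketched in a few lines. This is an illustration on a made-up corpus, not the implementation of [69]: the function names and the toy documents are hypothetical, and no smoothing is applied since the text only mentions simple counting, so a word never seen in a class has no entry in the table.

```python
import math
from collections import Counter

def word_probabilities(documents, labels, target):
    """Relative frequency of each word among all words of class `target` documents."""
    counts = Counter()
    for words, label in zip(documents, labels):
        if label == target:
            counts.update(words)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def log_likelihood(words, probs):
    """log p(x | y) = sum_j log p(w = x_j | y); every word must appear in `probs`."""
    return sum(math.log(probs[w]) for w in words)

# Hypothetical toy corpus of tokenised documents.
documents = [["cheap", "pills", "cheap"], ["meeting", "today"], ["cheap", "meeting"]]
labels = ["spam", "ham", "spam"]

spam_probs = word_probabilities(documents, labels, "spam")
print(spam_probs)
print(log_likelihood(["cheap", "pills"], spam_probs))
```

Here the class "spam" contains five words in total, three of which are "cheap", so the estimate of p(cheap | spam) is 3/5.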

Inference is then performed using

\[ p(y \mid x) = \frac{p(x \mid y)\, p(y)}{\sum_{y'} p(x \mid y')\, p(y')}. \]

Splitting the document x into regions {x^s} and assigning normalised weights θ_s / N_s to these regions gives

\[ p(y \mid x) = \frac{p(y) \prod_{s} p(x^{s} \mid y)^{\theta_s / N_s}}{\sum_{y'} p(y') \prod_{s} p(x^{s} \mid y')^{\theta_s / N_s}} \]

where N_s is the length of the s-th section of the document. For a binary problem (y ∈ {0, 1}), this can be rewritten

\[ p(y = 1 \mid x) = \frac{1}{1 + \exp(-\theta^{T} b)} \]

where

\[ b_0 = \log \Big( \frac{p(y = 1)}{p(y = 0)} \Big), \qquad b_s = \frac{1}{N_s} \log \Big( \frac{p(x^{s} \mid y = 1)}{p(x^{s} \mid y = 0)} \Big), \]

and θ_0 = 1. Now the best θ can be learnt using

\[ \hat{\theta} = \arg\max_{\theta} \prod_{i} p(y_i \mid x_i). \]

In the same vein, Subramanya et al. [75] use a hybrid learning algorithm for speaker verification based on user-specific passphrases. They first build a non-speaker-specific generative model λ, and then they train one speaker-specific generative model λ_u per user U. When confronted with the query "X claims to be user S", a passphrase spoken by X is needed to answer. Classification (X is S or not) could be performed by simply evaluating the likelihood score

\[ \frac{p(\text{passphrase from X was generated by } \lambda_s)}{p(\text{passphrase from X was generated by } \lambda)} \]

However, they want to be able to assign a different relevance to the different words, depending on their ability to discriminate users. To learn the weights discriminatively, they use a boosting algorithm. They claim a much better verification performance (more than 35% improvement) compared to the usual likelihood score.

Refining generative classifiers with a discriminative classifier

A different approach is to use one generative model per class to perform inference on new data-points, and if some classes are ambiguous, to use a discriminative classifier to separate them. Note that in this case again, we have a generative step followed by a discriminative step.

In handwriting recognition, Prevost et al. [66] use a generative classifier to preselect

the two most likely categories of a handwritten character, and a discriminative classifier to choose between the two. They build a confusion matrix on their training set, and learn pairwise discriminative classifiers (neural networks) to further separate each pair of categories that has a high confusion score. For a test character, they first run the generative models to get the two most likely categories C_1 and C_2. If C_1 and C_2 do not have a high confusion score, C_1 is picked as the category; otherwise the discriminative classifier for (C_1, C_2) is run to pick the category of the test character.

This kind of two-stage classification was also studied by Holub et al. in [30]: they first learn generative constellation models [24] for every class, build a confusion matrix between the different classes, identify the ambiguous subgroups of classes, and further discriminate between the classes of an ambiguous subgroup by retraining the constellation model discriminatively on the classes of this subgroup only.

Quite a different example can be found with deep belief networks [28]. The weights between layers are trained so as to maximise the joint distribution of the complete training set p(x, c). They are learnt in an unsupervised, greedy, layer-by-layer fashion: the input of the first layer is the data, the input of the second layer is the output of the first layer, and so on. Once training has converged, the resulting weights, combined with a final layer of variables that represent the desired outputs, form a solid initialisation for a more conventional supervised algorithm such as back-propagation. A closely related approach uses the same type of greedy, layer-by-layer learning with a different kind of learning module: an auto-encoder that simply tries to reproduce each data vector from the feature activations that it causes [4].

Feature splitting

Another approach was given by Kang et al. [45].
In this work, they essentially relax the independence assumption of naïve Bayes by splitting the features into two sets X_1 and X_2, and assuming conditional independence for the features in X_2. The nodes of X_1 will be the ancestors of the label c, and the nodes of X_2 will be its children. So now the quantities to maximise are p(c | X_1) (optimised discriminatively with logistic regression) and p(X_2 | c) (optimised generatively with naïve Bayes). The split between X_1 and X_2 is learnt through an iterative algorithm based on the classification accuracy, that increases X_1 until adding a

feature makes no significant improvement. Classification is performed using the posterior distribution p(c | X) ∝ p(c | X_1) ∏_{A ∈ X_2} p(A | c).

When generative and discriminative models help each other out

In [26], Hinton et al. have introduced a very elegant algorithm mixing neural networks and a generative model to recognise handwritten digits. They use a hand-coded generative model taking a set of spring stiffnesses (a motor program) as an input, and giving the corresponding drawing as an output. To come up with the appropriate motor programs for each image, they use one MLP per digit category (10 in total) that takes an image as an input and returns a set of spring stiffnesses. However, because they do not have access to the true motor programs needed to learn the MLPs, they hand-create a prototype program and, for each MLP, they bootstrap their training data-set by adding various perturbations to this prototype and by using the hand-coded generative model to get the corresponding images. They do so until all the training images are well represented. They claim good object recognition can be achieved by 1) inferring the motor programs of the new image using the 10 class-specific MLPs, 2) scoring these programs by computing the reconstruction error using the generative model, and 3) using these 10-dimensional score feature vectors to perform recognition, using softmax classification for example. This work is a very elegant way of combining generative and discriminative approaches. However, the generative model is not jointly learnt with the MLPs; it is hand-coded beforehand.

Fujino et al. [1] have designed a set of different models that they optimise in turn. First of all, they train a generative model on the labelled samples {x_n, y_n} they have available, by optimising the class-conditional likelihood J(θ) = p(θ) ∏_n p(x_n | y_n, θ).
Once they have a MAP estimate of θ, they define a bias correction model, using a similar likelihood function J(ψ) = p(ψ) ∏_m ∏_k p(x_m | k, ψ_k)^{u_mk}, now augmented with weights u_mk that will be learnt discriminatively. Moreover, a new conditional distribution is defined by:

\[ R(k \mid x, \theta, \psi, \lambda, \mu) = \frac{p(x \mid k, \theta_k)^{\lambda_1}\, p(x \mid k, \psi_k)^{\lambda_2} \exp(\mu_k)}{\sum_{z} p(x \mid z, \theta_z)^{\lambda_1}\, p(x \mid z, \psi_z)^{\lambda_2} \exp(\mu_z)} \]

This gives a discriminative classifier consisting of the trained generative model and the bias

correction model. Learning is performed as follows: at iteration t,

\[
\begin{aligned}
\psi^{t} &= \arg\max_{\psi} J(\psi), \\
(\lambda, \mu)^{t} &= \arg\max_{(\lambda, \mu)} \Big[ p(\lambda, \mu) \prod_{n} R(y_n \mid x_n, \theta_n, \psi, \lambda, \mu) \Big], \\
u^{t}_{mk} &= R(k \mid x_m, \theta, \psi^{t}, \lambda^{t}, \mu^{t}),
\end{aligned}
\]

where θ_n is the MAP estimate of θ learnt on the labelled samples excluding x_n. In other words, once the generative model θ is learnt, they define a similar bias model ψ, and combine both θ and ψ in a discriminative model (λ, µ). The discriminative model is updated and its predictions u_mk for each (data-point m, class k) pair are used to give weights to training points in the bias model. The bias model is then updated and is used by the discriminative model, and so on. They report better results on text classification, mostly in the case where generative and discriminative classifiers have a similar performance. However, learning θ and every θ_n is not tractable when models are complicated and samples numerous.

The wake-sleep-like algorithms

Going several steps further, Hinton et al. [27] have offered the machine learning community the typical example of how generative and discriminative models, or what they call generative and recognition models, can help each other and be learnt iteratively. The goal is to learn representations that are economical to describe but that still allow the input to be reconstructed accurately. Their framework is developed in the context of unsupervised training of a neural network. The network is given two sets of weights: the generative weights, and the recognition weights. During the wake phase, training samples x are presented to the recognition model, which produces a representation of x in the first hidden layer, a representation of this representation in the second hidden layer, and so on. The activities of the units in the top hidden layer are communicated through distributions, then the activities of the units in each lower layer are communicated.
The recognition weights determine a conditional probability distribution q(y | x) over the global representation of x, so that each generative weight can be adjusted to minimise the expected cost Σ_y q(y | x) C(y, x), where C is a particular cost function computed using the generative weights. The learning makes each layer of the global representation better at reconstructing the activities in the layer below. During the sleep phase, q(y | x) is made as close to p(y | x) as possible, where p(y | x) is

the posterior distribution obtained using the generative weights. The recognition weights are now replaced by the generative weights, starting at the top-most hidden layer, down to the input units. Due to the stochasticity of the units, repeating this process provides different fantasised unbiased samples that are used to adjust the recognition weights so as to maximise the log probability of recovering the hidden activities that caused the fantasised samples.

The wake-sleep algorithm has also been applied to Helmholtz machines in [20], but more importantly it has been the precursor of most models based on iterative learning of generative and discriminative methods. The few we will discuss in the remainder of this section are derived from this algorithm.

For motion reconstruction, Sminchisescu et al. [74] have interleaved top-down and bottom-up approaches, which are actually generative and discriminative approaches, through what they call a generative / recognition model. They define x as the hidden state of the human joint angles, and r as the observation of the body position given by SIFT descriptors over edge detections. The class-conditional density of the body position can be written p(r | x, θ) = exp(−E(r | x, θ)). The recognition model is a conditional mixture of experts:

\[ Q(x \mid r, \nu) = \sum_{i=1}^{M} g_i(r)\, \mathcal{N}(x \mid W_i r, \Omega_i) \qquad \text{with} \qquad g_i(r) = \frac{\exp(\lambda_i^{T} r)}{\sum_{k=1}^{M} \exp(\lambda_k^{T} r)} \]

and ν = {W, Ω, λ}. To include feedback from the generative model, the weights are transformed into

\[ g_i(r) = \frac{\exp\big(\lambda_i^{T} r - E(r \mid W_i r, \theta)\big)}{\sum_{k=1}^{M} \exp\big(\lambda_k^{T} r - E(r \mid W_k r, \theta)\big)} \]

The optimisation is performed with variational expectation-maximisation (EM): at iteration k, the E-step finds

\[ \nu^{k} = \arg\max_{\nu}\, -KL\big( Q(x \mid r, \nu) \,\|\, p(x, r \mid \theta^{k-1}) \big) \]

and the M-step finds

\[ \theta^{k} = \arg\max_{\theta}\, -KL\big( Q(x \mid r, \nu^{k}) \,\|\, p(x, r \mid \theta) \big) \]

Q acts as an approximating variational distribution for the generative model p.
Learning therefore progresses in alternating stages that optimise the probability of the image evidence: the recognition model is tuned using samples from the generative model, which is in

turn optimised to produce predictions close to the ones provided by the current recognition model. A similar kind of algorithm (reconstruction / recognition) for the task of pose estimation was developed by Rosales et al. in [71].

In [87], Zhang et al. perform object detection and develop what they call a random attributed relational graph (RARG), which is a slightly more general version of the constellation model [24]. The vertices of the graph are random variables, the edges are their relations, and each vertex a_i / edge a_ij is associated with a particular distribution f_i or f_ij. If O is the "presence of the object" hypothesis then, following their graph, they define

\[ p(\text{image graph} \mid O) = \sum_{X} p(X \mid O) \prod_{\text{vertices } u} p(y_u \mid X, O) \prod_{\text{edges } uv} p(y_{uv} \mid X, O) \]

where X represents the correspondence between a node of the image graph and a node of the RARG, such that x_iu = 1 if the RARG node a_i matches the image part y_u, and x_iu = 0 otherwise. Of course p(y_u | x_11 = 0, …, x_iu = 1, …, O) = f_i(y_u), and the same is true for p(y_uv | X). After a few equations, they want to optimise

\[ \sum_{(iu)} q(x_{iu}) \log \eta_{iu}(x_{iu}) + \sum_{(iu,jv)} q(x_{iu}, x_{jv}) \log \varsigma_{iu,jv}(x_{iu}, x_{jv}) + \text{entropy}(q(X)) \]

where q(x_iu) and q(x_iu, x_jv) are the approximated marginals of p(X | image graph, O), and η_iu(x_iu) and ς_iu,jv(x_iu, x_jv) are likelihood ratios estimating how well part y_u (edge y_uv) matches node a_i (edge a_ij). Now the trick is to see all the η_iu and the ς_iu,jv as classifiers obtained generatively, and the idea is to replace them by discriminative classifiers exp(C_i) and exp(C_ij). Therefore the q are learnt generatively in a first pass (E-step) and the C are learnt discriminatively with an SVM (M-step). The process is repeated until convergence. They report better results than with the purely generative model.

Hybrid models based on the wake-sleep framework can also be found for newsgroup categorisation [23], or for biology [78].
We have seen seven different ways of using generative and discriminative models in a hybrid algorithm. All these methods (except generative training of HMMs on discriminatively learnt features) have in common that they first use a generative model that they

either 1) refine with a discriminative model, or 2) use to learn and refine a discriminative model. This kind of learning process is very elegant, because it uses the properties of one type of model to help the other type of model. Chapter 4 will present a framework belonging to this type of hybrid algorithm, more specifically belonging to the class of wake-sleep algorithms. We will deal with a global generative model containing a Markov random field (MRF) of high cardinality, and we will show that a discriminative classifier can help reduce the state space of this MRF.

1.4.2 Hybrid learning

In this section, we will investigate another kind of hybrid technique that we will call hybrid learning. There are far fewer frameworks in this category. We will try to explain each of them, going from the simplest to the most general. As far as we know, these are the only hybrid objective functions that have been studied and used, interestingly most of them in text classification.

Discriminative training of generative models

As discussed in section 1.3.2, we can use a generative model and train it in a discriminative fashion, i.e. using (1.5) as an objective function. With this method, hopefully we keep the attraction of a generative model (distribution of the observed data) while keeping the power of classification of discriminative models. This has been done in speech recognition [13; 25; 15; 84], but also in image classification [30].

Much like conditional learning, Jaakkola et al. [40] train generative models using a discriminative criterion, only it is not the conditional likelihood. Taking the example of a binary classification problem, and given two corresponding generative models parameterised by θ_+ for class 1 and θ_− for class 2, the likelihood ratio is given by

\[ LR(x) = \frac{p(x \mid \theta_{+})}{p(x \mid \theta_{-})} \]

A discriminant function can easily be defined as f(x | θ) = sign(log LR(x) + b).
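As a small illustration of such a likelihood-ratio discriminant, the sketch below assumes one-dimensional Gaussian class models; the parameter values and function names are hypothetical, and the convention of returning −1 when log LR(x) + b is exactly zero is an arbitrary choice made here.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian N(x | mean, var)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def discriminant(x, theta_plus, theta_minus, b):
    """f(x | theta) = sign(log LR(x) + b), returned as +1 or -1."""
    log_lr = log_gauss(x, *theta_plus) - log_gauss(x, *theta_minus)
    return 1 if log_lr + b > 0 else -1

# Hypothetical (mean, variance) pairs for the two generative models.
theta_plus = (2.0, 1.0)
theta_minus = (-1.0, 1.0)

print(discriminant(1.5, theta_plus, theta_minus, b=0.0))
print(discriminant(-0.5, theta_plus, theta_minus, b=0.0))
```

With equal variances and b = 0 the boundary lies halfway between the two means; changing b shifts it, which is why b is treated as a parameter to learn alongside the two generative models.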
Therefore $\theta = [\theta^{+}, \theta^{-}, b]$ is now learnt so as to maximise the classification accuracy $\sum_n y_n f(x_n \mid \theta)$, and yet parameterises generative models. Pushing further, and using the concept of maximum

entropy, a (posterior) distribution over $\theta$ can be obtained by maximising its own entropy $H(p(\theta))$ subject to $\int p(\theta)\, y_n f(x_n, \theta)\, d\theta \geq \gamma$, where $\gamma > 0$ means a classification margin is desired. A thorough study of this framework can be found in [42].

Convex combination of objective functions

Going a step further from conditional learning, Bouchard et al. [11] suggest a compromise between the two objective functions $L_G$ and $L_C$. This was later followed by Chen et al. [14] for face recognition. They still deal with a generative model, but the idea is to use a convex combination of the logarithms of $L_G(\theta)$ and $L_C(\theta)$. The logarithm of the new objective function $L(\theta)$ is now written

\[ \log L(\theta) = \alpha \log L_C(\theta) + (1 - \alpha) \log L_G(\theta) \quad \text{with } \alpha \in [0, 1] \tag{1.10} \]

Using equations (1.6), (1.7) and (1.10), we can differentiate $\log L$ with respect to $\theta_k$:

\begin{align*}
\frac{\partial \log L(\theta)}{\partial \theta_k}
&= \alpha \frac{\partial \log L_C(\theta)}{\partial \theta_k} + (1 - \alpha) \frac{\partial \log L_G(\theta)}{\partial \theta_k} \\
&= \alpha \left[ \frac{\partial \log p(\theta_k)}{\partial \theta_k} + \sum_{n=1}^{N} \left( \delta_{c_n k} - p(k \mid x_n, \theta) \right) \frac{\partial}{\partial \theta_k} \log p(x_n, k \mid \theta_k) \right] \\
&\quad + (1 - \alpha) \left[ \frac{\partial \log p(\theta_k)}{\partial \theta_k} + \sum_{n=1}^{N} \delta_{c_n k} \frac{\partial}{\partial \theta_k} \log p(x_n, k \mid \theta_k) \right] \\
&= \frac{\partial \log p(\theta_k)}{\partial \theta_k} + \sum_{n=1}^{N} \left( \delta_{c_n k} - \alpha\, p(k \mid x_n, \theta) \right) \frac{\partial}{\partial \theta_k} \log p(x_n, k \mid \theta_k)
\end{align*}

We can see that, similarly to the discriminative training property, the data-points that influence the gradient most are the ones that are of class k but are not classified as k, or images of other classes that are classified as k, but now the influence of the classification term $p(k \mid x_n, \theta)$ is modulated by the factor $\alpha$. By varying $\alpha$, we can work our way from pure generative training ($\alpha = 0$) to pure discriminative training ($\alpha = 1$).

If we want to see what happens with a mixture of labelled and unlabelled data, then we

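can recompute the derivative of $\log L$, but this time using equations (1.7), (1.8) and (1.10):

\begin{align*}
\frac{\partial \log L(\theta)}{\partial \theta_k}
&= \alpha \frac{\partial \log L_C(\theta)}{\partial \theta_k} + (1 - \alpha) \frac{\partial \log L_G(\theta)}{\partial \theta_k} \\
&= \alpha \left[ \frac{\partial \log p(\theta_k)}{\partial \theta_k} + \sum_{n \in L} \left( \delta_{c_n k} - p(k \mid x_n, \theta) \right) \frac{\partial}{\partial \theta_k} \log p(x_n, k \mid \theta_k) \right] \\
&\quad + (1 - \alpha) \left[ \frac{\partial \log p(\theta_k)}{\partial \theta_k} + \sum_{n \in L} \delta_{c_n k} \frac{\partial}{\partial \theta_k} \log p(x_n, k \mid \theta_k) \right] \\
&\quad + (1 - \alpha) \left[ \sum_{m \in U} p(k \mid x_m, \theta) \frac{\partial}{\partial \theta_k} \log p(x_m, k \mid \theta_k) \right] \\
&= \frac{\partial \log p(\theta_k)}{\partial \theta_k} + \sum_{n \in L} \left( \delta_{c_n k} - \alpha\, p(k \mid x_n, \theta) \right) \frac{\partial}{\partial \theta_k} \log p(x_n, k \mid \theta_k) \\
&\quad + (1 - \alpha) \sum_{m \in U} p(k \mid x_m, \theta) \frac{\partial}{\partial \theta_k} \log p(x_m, k \mid \theta_k) \tag{1.11}
\end{align*}

Again, (1.11) shows us how the different components come into play. As with the gradient obtained from (1.10) in the fully labelled case, the discriminative component, represented by the $p(k \mid x_n, \theta)$ term, is modulated by $\alpha$, but now the influence of unlabelled data is modulated by $(1 - \alpha)$. This is attractive because it shows that, for any $\alpha \in (0, 1)$, we can make use of unlabelled data and still have some discriminant power. Unfortunately, the criterion (1.10) was not derived by maximising the distribution of a well-defined model, thereby making it an ad-hoc procedure.

Multi-Conditional learning

A different approach called multi-conditional learning has been studied by McCallum et al. [60]. They suggest a framework whereby generative and discriminative components have different weights $\alpha$ and $\beta$. Their objective function is

\[ \log L(\theta) = \alpha \log p(C \mid X, \theta) + \beta \log p(X \mid C, \theta) \]

In fact, it should be noted that their objective function is originally more general: it involves defining $N_s$ sets $S_j = \{S_j^A, S_j^B\}$, where $S_j^A$ and $S_j^B$ are disjoint for every j, and is written

\[ \log L(\theta) = \sum_{j=1}^{N_s} \alpha_j \log p(S_j^A \mid S_j^B, \theta). \]

The name multi-conditional now makes sense. Every $\log p(S_j^A \mid S_j^B, \theta)$ is a criterion to optimise based on a conditional distribution with relative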

importance $\alpha_j$. The generative / discriminative approach they have chosen is a particular case of this general framework with $S_1 = \{C, X\}$ and $S_2 = \{X, C\}$. Note that there are no constraints on the various $\alpha_j$. This is the same in their generative / discriminative particular case, since neither $\alpha$ nor $\beta$ is constrained. They have applied their framework to supervised pixel classification [48] and to semi-supervised text classification [22] by changing $S_2$ to $\{X, \emptyset\}$. This relates to the convex combination framework of Bouchard et al. [11], by using $S_1 = \{C, X\}$ and $S_2 = \{\{X, C\}, \emptyset\}$, dropping the prior $p(\theta)$ in (1.10), and setting $\beta$ to $1 - \alpha$, i.e. normalising the coefficients. It also relates to feature splitting as suggested in Kang et al. [45] if we choose $S_1 = \{C, X_1\}$ and $S_2 = \{X_2, C\}$; however, training in [45] involves multiple runs whereas multi-conditional learning trains both distributions simultaneously. Again, note that multi-conditional learning is an ad-hoc procedure.

A new hybrid objective function for semi-supervised learning

The next chapter will present a new hybrid objective function studied in [50], and will detail its properties and how we can gain from using such a function. This framework will be studied in the context of semi-supervised learning, as it is one of its straightforward applications. Indeed, when labelled training data is plentiful, discriminative techniques are widely used since they give excellent generalisation performance. However, for large-scale applications, hand-labelling data is expensive, and in complex problems such as object recognition, where there is huge variability in the range of possible input vectors, it may be difficult or impossible to provide enough labelled training examples to generalise well. Therefore we cannot always rely on discriminative models. On the other hand, as discussed in section 1.3.1, generative models can handle missing data.
In particular, as demonstrated in equations (1.4) and (1.8), they perform semi-supervised learning very naturally. However, generative models are not the best classifiers, and although their generalisation performance can often be improved by training them discriminatively (see section 1.3.2), they can then no longer make use of unlabelled data. Therefore, blending the modelling power of generative models with the classifying power

of discriminative models in a hybrid objective function should be a promising avenue for semi-supervised learning. This is investigated further in chapters 2 and 3.

Semi-supervised learning is an active branch of machine learning and many authors have sought good solutions to this problem. Below is a sample of the popular techniques. For a more thorough survey, please refer to [72; 88]. Note that our approach is rather different from these methods.

The most common method is probably to use a generative model with any number of training techniques (expectation-maximisation, gradient ascent, etc.), like Nigam et al. in [64]. However, this only works well if the generative model captures the decision boundary, which is not guaranteed at all. In fact, if the model is wrong, it can be shown that the unlabelled data might even hurt rather than help [18].

Another, perhaps more intuitive, approach is self-training, used for example by Yarowsky in text [85]. The few labelled samples are used to train a classifier, which in turn is used to classify the unlabelled data-points. Points classified with a high confidence are inserted in the supervised training set, a new classifier is trained, and the process is repeated. This is very simple and can be applied to any sort of classifier; however, it is easy to see how mistakes are self-reinforced.

Zhu et al. [89] build a graph connecting all the data-points depending on their similarity, and propagate the labels essentially by performing a Markovian random walk from one vertex (data-point) to another. This works well when there is little or no ambiguity between the different classes, which is often not the case. In other words, when the classes are not well separated, building the graph (or rather the similarity measure) is a key skill, which makes the process more difficult.

Blum et al. [8] have chosen an approach they call co-training (multi-view learning).
Several classifiers are trained over multiple feature splits of the same labelled examples, and are penalised if they do not make the same prediction on unlabelled data. However, this does not rely on any natural / real phenomenon, and a good split of features may not exist.

Finally, discriminative classifiers such as SVMs have also been generalised to cope with some unlabelled data [5]. This is good news, especially as they keep a clean mathematical framework (a key aspect of SVMs). However, they are not probabilistic models, they are difficult to optimise, and they assume little overlap in the classes.

CHAPTER 2

A PRINCIPLED HYBRID FRAMEWORK

Here we develop a novel viewpoint introduced in [61] which says that, for a given probabilistic model, there is a unique full joint distribution and hence there is only one correct way to train it. The conditional / discriminative training of a generative model (see section 1.3.2) is instead interpreted in terms of standard training of a different model, corresponding to a different choice of distribution. This removes the apparently ad-hoc choice for the training criterion, so that all models are learnt according to the principles of statistical inference. Furthermore, by introducing a constraint between the parameters of this model, through the choice of a prior, the original generative model can be recovered.

As well as giving a novel interpretation for the discriminative training of generative models, this viewpoint opens the door to principled blending of generative and discriminative approaches by introducing priors having a soft constraint amongst the parameters. The strength of this constraint therefore governs the balance between generative and discriminative.

In this chapter, we will explain how the framework operates, especially in the context of semi-supervised learning. The key ideas here are the blend of generative and discriminative models, and the possibility to learn where in between these two approaches we ought to be. We will start in section 2.1 with a thorough description of the framework, and a demonstration of how we can blend generative and discriminative models. We will also illustrate this with a toy example. In section 2.2, we will compare our approach to the convex combination framework of Bouchard et al. [11] discussed in section 1.4.2, using two different visualisation methods: the conditional densities and Pareto surfaces. We will then show in section 2.3 how to make the framework Bayesian in principle, and we will close this chapter in section

2.4 with a few brief remarks.

We will now always refer to conditional learning as discriminative learning. It is slightly abusive since real discriminative learning would not have to deal with the distribution of the input variables and would not be applied to generative models; however, it has the advantage of being a more intuitive description.

2.1 The hybrid framework

A parametric generative model is defined by specifying the joint distribution $p(x, c \mid \theta)$ of the input vector x and the class label c, conditioned on a set of parameters $\theta$. Typically, this is done by defining a prior probability for the classes $p(c \mid \pi)$ along with a class-conditional density for each class $p(x \mid c, \lambda)$, so that

\[ p(x, c \mid \theta) = p(c \mid \pi)\, p(x \mid c, \lambda) \tag{2.1} \]

where $\theta = \{\pi, \lambda\}$. Since the data-points are assumed to be independent, the joint distribution is given by (1.2). This can be maximised to determine the most probable value of $\theta$, which we denote $\theta_{MAP}$.

In order to improve the predictive performance of generative models, it has been proposed to use discriminative learning [84] which involves maximising (1.5). As discussed in section 1.3.3, note that (1.5) is not the joint distribution for the original model defined by (1.2), and so does not correspond to MAP for this model. The terminology of discriminative training is therefore misleading, since for a given model there is only one correct way to train it. It is not the training method which has changed, but the model itself. Consequently, we adopt an alternative view of discriminative learning, which will lead to an elegant framework for blending generative and discriminative learning approaches.

2.1.1 A new view of discriminative training

Let us take a generative model whose objective function is the conditional likelihood written in (1.5), i.e. a model whose parameters $\theta$ are directly concerned with $p(C \mid X, \theta)$. Now consider a new model which contains, in addition to the parameters $\theta = \{\pi, \lambda\}$ modelling

$p(C \mid X, \theta)$, an extra independent set of parameters $\tilde{\theta} = \{\tilde{\pi}, \tilde{\lambda}\}$ modelling $p(X \mid \tilde{\theta})$, as shown in figure 2.1. The joint distribution to optimise is now $q(X, C \mid \theta, \tilde{\theta}) = p(C \mid X, \theta)\, p(X \mid \tilde{\theta})$. Note that adding $\tilde{\theta}$ should not change the resulting model as it has no influence on $\theta$.

Figure 2.1: The additional set of parameters $\tilde{\theta}$. Note that in a genuine discriminative model, we would only have the set of parameters $\theta$, and X would not be modelled. The joint distribution is $q(X, C \mid \theta, \tilde{\theta}) = p(C \mid X, \theta)\, p(X \mid \tilde{\theta})$.

Now let us imagine we want to enable some sort of communication between $\theta$ and $\tilde{\theta}$, i.e. we open a channel between them. $\theta$ and $\tilde{\theta}$ therefore become random variables and we can place an arrow between them (arbitrarily oriented towards $\theta$), as in figure 2.2.

Figure 2.2: The communication channel is now open. The resulting distribution is $q(X, C, \theta, \tilde{\theta}) = p(\tilde{\theta})\, p(\theta \mid \tilde{\theta})\, p(C \mid X, \theta)\, p(X \mid \tilde{\theta})$. Note that the channel (right-most link) could also be directed the other way round.

The joint

distribution to optimise becomes $q(X, C, \theta, \tilde{\theta}) = p(\tilde{\theta})\, p(\theta \mid \tilde{\theta})\, p(C \mid X, \theta)\, p(X \mid \tilde{\theta})$. Keeping the prior as general as possible so that $q(X, C, \theta, \tilde{\theta}) = p(\theta, \tilde{\theta})\, p(C \mid X, \theta)\, p(X \mid \tilde{\theta})$, we finally obtain

\[ q(X, C, \theta, \tilde{\theta}) = p(\theta, \tilde{\theta}) \prod_{n=1}^{N} p(c_n \mid x_n, \theta) \prod_{n=1}^{N} p(x_n \mid \tilde{\theta}) \tag{2.2} \]

Let us consider the special case in which $\theta$ and $\tilde{\theta}$ are independent, i.e. the prior factorises, so that

\[ p(\theta, \tilde{\theta}) = p(\theta)\, p(\tilde{\theta}) \tag{2.3} \]

This actually corresponds to figure 2.1 where we have cut the link between $\theta$ and $\tilde{\theta}$. We then determine optimal values for the parameters $\theta$ and $\tilde{\theta}$ in the usual way by maximising (2.2), which can now be written

\begin{align*}
q(X, C, \theta, \tilde{\theta})
&= p(\theta)\, p(\tilde{\theta}) \prod_{n=1}^{N} p(c_n \mid x_n, \theta) \prod_{n=1}^{N} p(x_n \mid \tilde{\theta}) \\
&= \left[ p(\theta) \prod_{n=1}^{N} p(c_n \mid x_n, \theta) \right] \left[ p(\tilde{\theta}) \prod_{n=1}^{N} p(x_n \mid \tilde{\theta}) \right]
\end{align*}

We see that the resulting value of $\theta$ will be identical to that found by maximising (1.5), since it is the same function which is being maximised. Since it is $\theta$ and not $\tilde{\theta}$ which determines the predictive distribution $p(c \mid x, \theta)$, we see that this model is equivalent in its predictions to the original generative model trained discriminatively. This gives a consistent view of training in which we still maximise the joint distribution of q, and the distinction between generative and discriminative training lies in the choice of model. As a consequence, we will now refer to the model defined by (2.2) and (2.3) as the discriminative model. Figure 2.3 shows the relationship between the generative model and the model that leads to its discriminative training.

Now let us suppose instead that $\theta$ and $\tilde{\theta}$ are constrained to be equal, i.e. we consider a prior of the form

\[ p(\theta, \tilde{\theta}) = p(\theta)\, \delta(\theta - \tilde{\theta}) \tag{2.4} \]

Figure 2.3: The discriminative case. Probabilistic directed graphs, showing on the left the original generative model, and on the right the corresponding discriminative model.

We can easily integrate $\tilde{\theta}$ out in (2.2), which leads to:

\begin{align*}
q(X, C, \theta)
&= \int p(\theta)\, \delta(\theta - \tilde{\theta}) \prod_{n=1}^{N} p(c_n \mid x_n, \theta) \prod_{n=1}^{N} p(x_n \mid \tilde{\theta})\, d\tilde{\theta} \\
&= p(\theta) \prod_{n=1}^{N} p(c_n \mid x_n, \theta) \prod_{n=1}^{N} p(x_n \mid \theta) \\
&= p(\theta) \prod_{n=1}^{N} p(x_n, c_n \mid \theta)
\end{align*}

We recover the original generative model $p(X, C, \theta)$, so we see that the resulting value of $\theta$ will be identical to that found by maximising (1.2), which is summarised in figure 2.4.

Figure 2.4: The generative case. Probabilistic directed graphs, showing on the left the original generative model, and on the right its equivalent in the new framework.

We

will now refer to the model defined by (2.2) and (2.4) as the generative model. In order not to confuse it with the original generative model defined by (1.2), we will refer to the latter as the original (generative) model, or the underlying (generative) model; however, they are strictly equivalent.

We have just described a single class of distributions in which the discriminative model corresponds to independence in the prior, and the generative model corresponds to an equality constraint in the prior. This is a very elegant framework as training is now always consistent. There is only one quantity to be maximised, the joint distribution (2.2), and what has to change is the model, through the prior over $\theta$ and $\tilde{\theta}$.

2.1.2 Blending generative and discriminative models

First of all, we note that the reason why discriminative training might give better results than direct use of the generative model is that (2.2) is more flexible than (1.2), since it relaxes the implicit constraint $\theta = \tilde{\theta}$. Of course, if the generative model were a perfect representation of reality (in other words the data really came from the model), then increasing the flexibility of the model would lead to poorer results. Any improvement from the discriminative approach must therefore be the result of a mismatch between the model and the true distribution of the (process which generates the) data. In other words, the benefit of discriminative training is dependent on model mis-specification. Conversely, the benefit of the generative approach is, as discussed in section 1.3.1, that it can make use of unlabelled data to augment the labelled training set. This paradigm is summarised in figure 2.5.

Figure 2.5: Hybrid models may do better. An example of a situation where hybrid models may do better is when the generative model is weak and we have insufficient labelled data.

In general, if the original model is not a perfect representation of reality, and if we have unlabelled data available, then we would expect the optimal balance to lie neither at the purely generative extreme nor at the purely discriminative extreme. Ideally then, we would like to have access to an approach in between. Using the framework of section 2.1.1, clearly we can blend generative and discriminative extremes by considering priors which impose a soft constraint between $\theta$ and $\tilde{\theta}$.

Now let us consider how a combination of labelled and unlabelled data can be exploited from the perspective of our new approach defined by (2.2). If we split $\{X, C\}$ into $\{X_L, C_L\}$, the set of labelled data, and $X_U$, the set of unlabelled data, the joint distribution becomes

\begin{align*}
q(X, C, \theta, \tilde{\theta}) &= q(X_L, C_L, X_U, \theta, \tilde{\theta}) \\
&= p(\theta, \tilde{\theta})\, p(C_L \mid X_L, \theta)\, p(X_L, X_U \mid \tilde{\theta}) \\
&= p(\theta, \tilde{\theta}) \prod_{n \in L} p(c_n \mid x_n, \theta) \prod_{m \in L \cup U} p(x_m \mid \tilde{\theta}) \tag{2.5}
\end{align*}

We represent this case in figure 2.6.

Figure 2.6: Influence of the unlabelled data. Note how the unlabelled data $x_m$ affects $\theta$ through $\tilde{\theta}$. If we close the channel, $\theta$ has no contact except with the $c_n$.

We see that the unlabelled data (as well as the labelled data) influence the parameters $\tilde{\theta}$ which in turn influence $\theta$ via the soft constraint imposed by the prior. Note that if we cut the link between $\theta$ and $\tilde{\theta}$, $\theta$ has no more access to the unlabelled data.
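To make the roles of the three factors in (2.5) concrete, here is a minimal sketch of its logarithm. It rests on several simplifying assumptions that are not part of the framework itself: one-dimensional inputs, unit-variance Gaussian class-conditional densities, equal class priors, parameters reduced to the class means, and a Gaussian coupling between $\theta$ and $\tilde{\theta}$ with variance `sigma2` as the soft constraint. All function and variable names are hypothetical.

```python
import math

def log_gauss(x, mean, var=1.0):
    """Log density of a 1-D Gaussian N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def log_joint(theta, theta_tilde, labelled, unlabelled, sigma2):
    """Log of (2.5), up to an additive constant, for class means only."""
    # Soft constraint between theta and theta_tilde (assumed Gaussian coupling).
    sq_dist = sum((a - b) ** 2 for a, b in zip(theta, theta_tilde))
    log_prior = -0.5 * math.log(sigma2) - sq_dist / (2.0 * sigma2)
    log_pi = math.log(1.0 / len(theta))  # equal class priors (assumption)
    # Discriminative factor: labelled points only, governed by theta.
    disc = 0.0
    for x, c in labelled:
        joint = [log_pi + log_gauss(x, m) for m in theta]
        disc += joint[c] - log_sum_exp(joint)
    # Generative factor: every input, labelled or not, governed by theta_tilde.
    gen = 0.0
    for x in [x for x, _ in labelled] + list(unlabelled):
        gen += log_sum_exp([log_pi + log_gauss(x, m) for m in theta_tilde])
    return log_prior + disc + gen

labelled = [(-1.2, 0), (0.9, 1)]        # (input, class index) pairs
unlabelled = [-1.0, -0.8, 1.1, 1.3]
theta = (-1.0, 1.0)
tight = log_joint(theta, (-0.5, 0.5), labelled, unlabelled, sigma2=1e-4)
loose = log_joint(theta, (-0.5, 0.5), labelled, unlabelled, sigma2=1e4)
print(tight < loose)
```

In this sketch the unlabelled points enter only through `theta_tilde`, so they can reach `theta` only via the coupling term: a small `sigma2` heavily penalises any disagreement between the two parameter sets, while a large one leaves them almost free.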

As a simple example of a prior which interpolates smoothly between the generative and discriminative limits, we can consider the class of priors of the form

\[ p(\theta, \tilde{\theta}) \propto f(\theta)\, g(\tilde{\theta})\, \frac{1}{\sigma} \exp\left( -\frac{1}{2\sigma^2} \|\theta - \tilde{\theta}\|^2 \right) \tag{2.6} \]

where f and g are two non-negative differentiable functions such that the right hand side of (2.6) is integrable both over $\theta$ and over $\tilde{\theta}$. If desired, we can relate $\sigma^2$ to an $\alpha$-like parameter (as for the convex combination framework described in section 1.4.2) by defining a mapping from $[0, 1]$ to $\mathbb{R}^{+}$, for example using

\[ \sigma^2(\alpha) = \left( \frac{\alpha}{1 - \alpha} \right)^{r} \tag{2.7} \]

In this thesis, we will take r = 2. To keep the notation as simple as possible, we will refer to $\sigma^2(\alpha)$ as $\sigma^2$. For $\alpha \to 0$, we have $\sigma^2 \to 0$ and we obtain a hard constraint of the form (2.4) which corresponds to the generative model. Conversely, for $\alpha \to 1$, we have $\sigma^2 \to \infty$ and we obtain an independence prior of the form (2.3) which corresponds to the discriminative model.

2.1.3 Illustration

We now illustrate this new framework for blending generative and discriminative training, or rather generative and discriminative models from our new point of view, using an example based on synthetic data. This is chosen to be as simple as possible, and so involves data vectors $x_n$ which live in a 2-dimensional Euclidean space for easy visualisation, and which belong to one of two classes. Data from each class is generated from a Gaussian distribution as illustrated in figure 2.7. Here the scales on the axes are equal, and so we see that the class-conditional densities are elongated in the horizontal direction.

We now consider a continuum of models, which interpolate between purely generative and purely discriminative. To define these models, we consider the generative limit, and represent each class-conditional density using an isotropic Gaussian distribution. Since this does not capture the horizontally elongated nature of the true class distributions, this represents a form of model mis-specification.
The parameters of the model are the means and variances of the Gaussians for each class, along with the class prior probabilities.
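Returning briefly to the mapping (2.7) between $\alpha$ and the prior variance, a few lines of code show how unevenly it spreads $\alpha \in [0, 1]$ over $\sigma^2$. This is only a sketch: the function name `sigma2` and the choice to return infinity at $\alpha = 1$ (standing in for the independence prior (2.3)) are assumptions made for illustration.

```python
import math

def sigma2(alpha, r=2):
    """Map alpha in [0, 1] to the prior variance sigma^2 of (2.7)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if alpha == 1.0:
        # Limit alpha -> 1: infinite variance, i.e. the independence prior.
        return math.inf
    return (alpha / (1.0 - alpha)) ** r

for alpha in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(alpha, sigma2(alpha))
```

With r = 2, $\alpha = 0.5$ gives $\sigma^2 = 1$, and values of $\alpha$ near the two ends push $\sigma^2$ rapidly towards 0 (hard constraint, generative limit) or towards infinity (no constraint, discriminative limit).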

Figure 2.7: Synthetic training data shown with red and blue crosses, together with contours of probability density for each of the 2 classes.

We consider a prior of the form (2.6) in which $\sigma^2$ is defined by (2.7). Here we choose $p(\theta, \tilde{\theta} \mid \alpha) = p(\theta)\, \mathcal{N}(\tilde{\theta} \mid \theta, \sigma^2(\alpha))$, where $p(\theta)$ are the usual conjugate priors (a Gaussian prior for the means, a Gamma prior for the precisions, and a Dirichlet for the class priors). This results in a proper prior.

The training data-set comprises 200 points per class, of which just 2 from each class are labelled, and the test set comprises 200 points per class (all of which are labelled). Experiments are run 15 times with different random initialisations: for each of the 15 runs, we randomly select 2 points per class that we label, we initialise the class-conditional means to the mean of their 2 corresponding labelled points, both variances to 1, and both class priors to 1/2. This gives $\theta_0$, and we simply set $\tilde{\theta}_0$ to $\theta_0$. We run conjugate gradients to optimise (2.5) with various fixed values of $\alpha$: {0, 0.005, 0.025, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.975, 0.985, 0.995, 1}. For each value of $\alpha$, we therefore obtain a MAP estimate of $\theta$ and $\tilde{\theta}$. Note that classification is then performed using $\theta_{MAP}$ only. To avoid scaling issues due to the difference of magnitude between conditional and marginal densities, training is stopped when both $p(C \mid X, \theta)$ and $p(X \mid \tilde{\theta})$ have converged.

The results are used to draw a curve of the performance against $\alpha$ for each run, using aligned axes (i.e. the axes limits are the same for every plot). These curves are plotted in

figure 2.8. Their only difference is the split between training set and test set (which implies a different location for the labelled points).

Figure 2.8: Classification performance of HF on the toy example for every run. Plot of the percentage of correctly classified points on the test set versus $\alpha$ for the synthetic data problem, for every run. $\alpha = 0$ corresponds to the generative model, and $\alpha = 1$ to the discriminative model. The axes are aligned. The average performance is plotted in figure A.1 in section A.1 of the appendix.

We see that the best generalisation occurs for values of $\alpha$ intermediate between the generative and discriminative extremes. We can also compute a mean and standard deviation over the test set classification, which are shown respectively by the blue curve and the red error bars in figure A.1 in section A.1 of the appendix.

To gain insight into this behaviour we can plot the contours of density for each class corresponding to different values of $\alpha$, as shown in figure 2.9. We see that a purely generative model is strongly influenced by modelling the density of the data and so gives a decision boundary which is orthogonal to the correct one. Conversely, a purely discriminative model attends only to the labelled data-points and so misses

useful information about the horizontal elongation of the true class-conditional densities which is present in the unlabelled data. Hybrid models, however, are able to make use of all the information available, and eventually reach the correct decision boundary.

Figure 2.9: Results of fitting an isotropic Gaussian model to the synthetic data for various values of $\alpha$. The top left shows $\alpha = 0$ (generative case) while the bottom right shows $\alpha = 1$ (discriminative case). The green area corresponds to points that are assigned to the blue class, while the orange area corresponds to points that are assigned to the red class. In the generative case, the model tries to explain all the data so it stretches the Gaussians, leading to a decision boundary orthogonal to the correct one. In the discriminative case, the data is too scarce and the model misses information about the elongated nature of the densities. In the hybrid case of $\alpha = 0.7$, a good solution is obtained.

We have just presented a detailed study of our hybrid framework. We cannot help but notice the striking similarities it shares with the convex combination framework described in section 1.4.2. In the next section, we will try to compare the results they produce, first visually, then using Pareto fronts.

2.2 Comparison with the convex combination framework

From now on, we will refer to our hybrid framework as HF. We would like to gain some insight about the link between our framework HF and the convex combination framework

explored by Bouchard et al. in [11], and detailed in section 1.4.2, that we will refer to as CC. Indeed, recall their respective objective functions, this time taking the logarithm:

for HF, we take the log of (2.2):

\[ L_{HF}(\theta, \tilde{\theta}) = \log p(\theta, \tilde{\theta}) + \sum_n \log p(c_n \mid x_n, \theta) + \sum_n \log p(x_n \mid \tilde{\theta}) \tag{2.8} \]

for CC, we redefine (1.10):

\[ L_{CC}(\theta) = \log p(\theta) + \alpha \sum_n \log p(c_n \mid x_n, \theta) + (1 - \alpha) \sum_n \log p(x_n, c_n \mid \theta) \tag{2.9} \]

Although (2.8) is more general, (2.8) and (2.9) share common ground. The first thing we would like to understand is whether the blending of generative and discriminative approaches is different using either framework, or if they have a similar effect. For this, we will run both methods and look at what happens empirically. We will use two graphical views: first of all, we will look at the resulting model as we did in section 2.1.3, and we will also build Pareto fronts.

2.2.1 Visually

We run the same experiments described in section 2.1.3, but using the convex combination framework CC. $\theta_0$ is defined as in section 2.1.3, and we use the following values of $\alpha$: {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.93, 0.97, 0.99, 0.992, 0.994, 0.996, 0.998, 0.999, , , , , , , 1}.

In figure 2.10, we have plotted the performance against $\alpha$ for each run, on top of the (blue) curves we have obtained for HF (seen in figure 2.8), again using aligned axes (i.e. the axes limits are the same for every plot). However, due to the very different nature of $\alpha$ in both frameworks, we had to re-scale the CC x-axis so that we would see the interesting parts. Indeed, for CC, the interesting values of $\alpha$ are very close to 1, too close to see anything. The re-scaling is done as follows: we have run 25 values of $\alpha$ for CC, so instead of plotting the performance against the 25 values of $\alpha$, we plot it against the vector [0 : 24]/24 in Matlab. This essentially allows us to zoom in on what happens for $\alpha \in [0.9, 1]$.
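The re-scaling of the CC x-axis amounts to plotting each result at its rank in the list of $\alpha$ values rather than at $\alpha$ itself. A small Python equivalent of the Matlab expression [0 : 24]/24 is sketched below; the helper name `rescaled_axis` is hypothetical.

```python
def rescaled_axis(n_values):
    """Evenly spaced plotting positions in [0, 1], like [0:n-1]/(n-1) in Matlab."""
    return [i / (n_values - 1) for i in range(n_values)]

positions = rescaled_axis(25)
# Plotting positions of the 10th, 13th and 25th values of alpha in the CC list.
print(positions[9], positions[12], positions[24])
```

Since $\alpha = 0.9$ is the 10th value in the CC list and $\alpha = 0.99$ the 13th, they are drawn at 0.375 and 0.5 respectively, so more than half of the re-scaled axis is devoted to $\alpha \in [0.9, 1]$.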
There seems to be very little difference, as both curves achieve similar peak performances. Note that we should not be deceived by the fact that the curves seem to be on top of each other, or that HF's performance seems to be higher sometimes, since the x-axis

was re-scaled for CC. It means that we cannot compare the curves point-wise, only the range of the performance. Again, the results are used to compute a mean and standard deviation over the test set classification, which are shown respectively by the blue curve and the red error bars in figure A.2 in section A.2 of the appendix.

Figure 2.10: Classification performance of HF and CC on the toy example for every run. Plot of the percentage of correctly classified points on the test set versus $\alpha$ for the synthetic data problem, for every run. Blue curve: HF, red curve: CC. $\alpha = 0$ corresponds to the generative model, and $\alpha = 1$ to the discriminative model. The axes are aligned, but the red CC x-axis had to be re-scaled (essentially a zoom on $\alpha \in [0.9, 1]$) to see anything. The average performances are plotted in figure A.2 in section A.2 of the appendix.

A similar classification performance does not ensure a similar solution model. In figure 2.11, we have plotted the contours of density for each class corresponding to different values of $\alpha$, as done in figure 2.9, but this time for the CC framework.

For this particular run, there seems to be no major difference in the route that both frameworks take. Indeed, in the generative case, the model tries to explain all the data so it stretches the Gaussians, leading to a decision boundary orthogonal to the correct one. In the

discriminative case, the data is too scarce and the model misses information about the elongated nature of the densities. In the hybrid case of $\alpha$ = , a good solution is obtained. The only difference is that the variance of the red class tends to be higher so that the decision boundary is much more curved than in figure 2.9, indicating a different local minimum, but this phenomenon disappears as we become more discriminative.

Figure 2.11: CC - run presenting similarities. Results of fitting an isotropic Gaussian model to the synthetic data for various values of $\alpha$. The top left shows $\alpha = 0$ (generative case) while the bottom right shows $\alpha = 1$ (discriminative case). The green area corresponds to points that are assigned to the blue class, while the orange area corresponds to points that are assigned to the red class. The same scenario as in figure 2.9 happens: the generative model leads to a decision boundary orthogonal to the correct one, the discriminative model misses information about the elongated nature of the densities, and in the hybrid case of $\alpha$ = , a good solution is obtained.

Figures 2.12 and 2.13 show another run, for which the results are different for the HF and CC frameworks.

Again, for both methods, in the generative case, the model tries to explain all the data so it stretches the Gaussians, leading to a decision boundary orthogonal to the correct one. In the discriminative case, the data is too scarce and the model misses information about the elongated nature of the densities. Note however, that the best boundary achieved by HF in figure 2.12 is still not ideal. In most cases, the blue density has a higher variance and rounds

the boundary too much. Nevertheless, we can see in this case that HF does a better job at predicting than CC. Indeed, instead of moving the centres around like HF, CC increases the variance of the blue density. However, nothing particular suggests why CC should fail in that case. This may point to the difficulty of local minima.

Figure 2.12: HF - run presenting differences. Results of fitting an isotropic Gaussian model to the synthetic data for various values of $\alpha$. The top left shows $\alpha = 0$ (generative case) while the bottom right shows $\alpha = 1$ (discriminative case). The green area corresponds to points that are assigned to the blue class, while the orange area corresponds to points that are assigned to the red class. In the generative case, the model tries to explain all the data so it stretches the Gaussians, leading to a decision boundary orthogonal to the correct one. In the discriminative case, the data is too scarce and the model misses information about the elongated nature of the densities. In the hybrid case of $\alpha = 0.5$, a more decent solution is obtained.

Here we should stress the very different roles that $\alpha$ plays in each framework, making it very awkward to compare them. However, we can emphasise the asymmetry of the roles by noting the different order of magnitude of $\alpha$ in each case. The values in figures 2.11 and 2.13 are extremely close to 1. Indeed, in CC, $\alpha$ measures the importance of the discriminative component in $L_{CC}(\theta)$ (2.9). Because the ratio (labelled data / total data) is very small, the discriminative component is very weak compared to the generative one, therefore a very large $\alpha$ is needed to balance both components out. In HF however, $\alpha$ measures the degree of flexibility of $\theta$

with respect to $\tilde{\theta}$, i.e. how much it can differ from it. Intuitively, it should also depend on the ratio (labelled data) / (total data). Indeed, the smaller the ratio is, the less impact θ has on $L_{HF}(\theta, \tilde{\theta})$ (2.8), and the more freedom it needs to differ from $\tilde{\theta}$. But this dependency is much more subtle since even little flexibility in the prior gives access to many more models.

Figure 2.13: CC - run presenting differences. Results of fitting an isotropic Gaussian model to the synthetic data for various values of α. The top left shows α = 0 (generative case) while the bottom right shows α = 1 (discriminative case). The green area corresponds to points that are assigned to the blue class, while the orange area corresponds to points that are assigned to the red class. The same scenario as in figure 2.12 happens: the generative model leads to a decision boundary orthogonal to the correct one, the discriminative model misses information about the elongated nature of the densities. However, no intermediate decision boundary looks correct. They are slightly better than the generative and discriminative extremes but fail to get close to the true boundary.

Pareto fronts

Pareto fronts are a well known object in multi-criteria (or multi-objective) optimisation of the kind seen in section . A brief introduction is given in [34], but we will try to give a sufficient explanation of the principle here. Let $\{f_1, f_2, \ldots, f_M\}$ be a set of M criteria (objectives) to minimise with respect to a variable x lying in the input space I we work on. Pareto-optimal points $x^*$ are defined so that there are no points $\tilde{x}$ in their neighbourhood

$V(x^*)$ that minimise all the criteria at once, i.e. such that

$$\exists V(x^*),\ \forall \tilde{x} \in I \cap V(x^*),\quad [\exists m,\ f_m(\tilde{x}) < f_m(x^*)] \Rightarrow [\exists p,\ f_p(\tilde{x}) > f_p(x^*)] \qquad (2.10)$$

In other words, $x^*$ is Pareto-optimal means that there exists no point $\tilde{x}$ in $V(x^*)$ such that $\tilde{x}$ makes:
- at least 1 criterion strictly better ($\exists m,\ f_m(\tilde{x}) < f_m(x^*)$), and
- no criterion strictly worse ($\neg(\exists p,\ f_p(\tilde{x}) > f_p(x^*))$).

The images $[f_1(x^*), f_2(x^*), \ldots, f_M(x^*)]$ of Pareto-optimal points $x^*$ lie on what is called the Pareto front / surface, whose shape determines the nature of the trade-off between the different criteria. As a consequence of (2.10), Pareto plots have one particularity, shown in figure 2.14 for the 2-dimensional case: if the goal is to minimise a multi-criteria objective function, for any point on the front, there is no other point in the bottom-left quadrant (shown in grey), i.e. one criterion cannot be minimised without degrading the other. If the goal is maximisation, the same is true using the top-right quadrant.

Figure 2.14: Visualisation of Pareto-optimal points. The image was taken from [33]. The preferred solution is plotted in yellow/red but it is not attainable. The different attainable trade-offs are plotted in light blue and form the blue Pareto front. Note that, for any point on the front, there is no other point in its bottom-left quadrant (grey area). The grey points show non-Pareto-optimal solutions.
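The bottom-left quadrant property can be checked mechanically on a finite set of candidate points. The following is a minimal sketch, not code from the thesis: the function names `dominates` and `pareto_front` are hypothetical, all criteria are assumed to be minimised, and the test is global over the given set, whereas (2.10) is stated only over a neighbourhood $V(x^*)$.

```python
def dominates(q, p):
    """Return True if q is no worse than p on every criterion and
    strictly better on at least one (all criteria are minimised)."""
    no_worse = all(a <= b for a, b in zip(q, p))
    strictly_better = any(a < b for a, b in zip(q, p))
    return no_worse and strictly_better


def pareto_front(points):
    """Keep the points whose bottom-left quadrant contains no other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Each tuple is (f_1, f_2) evaluated at one candidate solution.
candidates = [(1, 5), (2, 3), (3, 4), (4, 1), (2, 6)]
front = pareto_front(candidates)
```

Here `(3, 4)` is discarded because `(2, 3)` improves both criteria, and `(2, 6)` is discarded because `(1, 5)` does. The remaining points cannot be improved on one criterion without degrading the other.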

One standard way to solve a multi-objective optimisation problem is to define a global objective function as the convex combination of the different criteria [58], as seen in section . Of course, different weightings give different Pareto-optimal points. It is then a matter of drawing a parametric curve: each axis represents one criterion, and the parameter is a vector containing a normalised set of weights. Each set of weights gives a different $x^*$, and therefore a different point in the space defined by the criteria. Pareto fronts are relatively easy to build this way and provide a useful tool to understand how much changing the weight of each criterion affects the optimisation of this particular criterion. Indeed, they show how fast a criterion is degraded as another one is improved.

In our case, we only have two criteria: the logarithms of (1.2) and (1.5). We will rewrite them as:

$$f_1(\theta) = \log p(\theta) + \sum_n \log p(c_n \mid x_n, \theta) \qquad (2.11)$$

and

$$f_2(\theta) = \log p(\theta) + \sum_n \log p(x_n, c_n \mid \theta) \qquad (2.12)$$

For the CC framework, it is straightforward as it is exactly the standard way of finding a Pareto-optimal point for the criteria $\{f_1, f_2\}$. It is slightly more difficult for our HF framework, as there is no guarantee it will find a Pareto-optimal point. Indeed, θ is mainly concerned with $f_1$ (2.11), and is only subject to $f_2$ (2.12) through the prior between θ and $\tilde{\theta}$.

Figure 2.15 shows the parametric curves for both HF and CC, for the run where both methods achieve similar pathways and results. As could be expected, the curves follow a similar path. For CC, we see a neat curve that starts from the top-left corner, decays rapidly to the bottom-left corner, and then goes on to the bottom-right corner. We know that this curve is an actual Pareto front as all points on it are Pareto-optimal by construction, at least locally (probably not globally since (2.9) is not convex). What is interesting now is to see where the HF curve lies.
It seems to overlap at both generative and discriminative ends but in the middle it is consistently above-right compared to the CC curve. This means three things:

- the HF curve itself keeps the bottom-left quadrant property, so it would seem that we do indeed trade off optimally,
- if we take the two curves together, we are violating the bottom-left quadrant property for HF, so we have only found a local sub-optimum that is not as good as the real optimum (in fact, it is not even as good as the local optimum found by CC),
- the two methods are not equivalent for our particular prior $p(\theta, \tilde{\theta})$.

This does not mean that they cannot be very similar. Indeed most of the 15 runs present similar paths for HF and CC. It is also interesting to note that the best result is not achieved for the best trade-off. Indeed it does not correspond to the closest point to the origin (although this statement could be slightly controversial for the HF curve). This means that we are not necessarily interested in minimising both criteria equally.

Figure 2.15: Pareto fronts for both the HF and CC frameworks for the run where HF and CC give similar results. CC achieves slightly better trade-offs.

Figure 2.16 shows the parametric curves for both HF and CC, for the run where the two methods give different results.

Figure 2.16: Pareto fronts for both the HF and CC frameworks for the run where HF and CC give different results. Although it performs worse, CC achieves better trade-offs.

Here the curves differ more, but interestingly still follow a similar path. In this particular case, it is worth noting that, although CC achieves a much better trade-off, it actually gives worse classification performance.

We have tried to compare our framework HF with the convex combination framework CC. The comparison is not straightforward due to the different interpretation of α. However, on the toy data-set, we have found little variation in the classification performance and in the MAP estimates. A major difference between these two approaches however, is that HF can be made Bayesian quite naturally. This will be the object of the following section.

2.3 The Bayesian version

A principled approach to combining generative and discriminative models not only gives a more satisfying foundation for the development of new models, but it also brings practical benefits. In particular, the parameter α which governs the trade-off between generative and discriminative is now a hyper-parameter within a well defined probabilistic model which is trained using the (unique) correct joint distribution. In a Bayesian setting, the value of

this hyper-parameter can therefore be optimised by maximising the marginal likelihood in which the model parameters have been integrated out, thereby allowing the optimal trade-off between generative and discriminative limits to be determined entirely from the training data, without having to scan through all the possible values of α. We will present three different possible methods that would allow us to learn our model in one shot only.

In this section, we will sometimes talk about the non-Bayesian version. This will refer to the case when we have no prior over α, we obtain a MAP estimate of θ and $\tilde{\theta}$ for a fixed value of α, and we obtain the curves by testing different possible values of α, as was done in sections and 2.2.

We would also like to note that α was introduced artificially to make a parallel with the convex combination framework. Here it is much more useful to consider directly the quantity $\sigma^2$ as described in (2.7), or even better the quantity $\sigma^{-2}$, i.e. the precision of the prior over θ and $\tilde{\theta}$. If needed, α can easily be retrieved from $\sigma^{-2}$ using (2.7), which gives

$$\alpha = \frac{1}{1 + \sigma^{-2/r}}$$

Of course $\sigma^{-2} \in \mathbb{R}^+$.

Exact inference

We would like to find the true posterior distribution of θ given {X,C}. Using Bayes formula and the usual probability rules for parametric models, we have

$$p(\theta \mid X, C) = \frac{\iint p(X, C, \theta, \tilde{\theta}, \sigma^{-2})\, d\tilde{\theta}\, d\sigma^{-2}}{\int \left[ \iint p(X, C, \theta, \tilde{\theta}, \sigma^{-2})\, d\tilde{\theta}\, d\sigma^{-2} \right] d\theta}$$

Unfortunately, this equation contains several integrals we do not know how to solve exactly. Similarly, if we want to find the true posterior distribution of $\sigma^{-2}$ given {X,C}, we can write

$$p(\sigma^{-2} \mid X, C) = \frac{\iint p(X, C, \theta, \tilde{\theta}, \sigma^{-2})\, d\theta\, d\tilde{\theta}}{\int \left[ \iint p(X, C, \theta, \tilde{\theta}, \sigma^{-2})\, d\theta\, d\tilde{\theta} \right] d(\sigma^{-2})}$$

However, again, we cannot solve this equation exactly. Therefore, we have little choice but to give up exact inference and seek a good approximation technique.

The true MAP

Another possible idea is to integrate $\sigma^{-2}$ out. First let us compute the true joint prior of θ and $\tilde{\theta}$:

$$p(\theta, \tilde{\theta}) = \int p(\theta, \tilde{\theta}, \sigma^{-2})\, d\sigma^{-2} = \int p(\theta)\, p(\tilde{\theta} \mid \theta, \sigma^{-2})\, p(\sigma^{-2})\, d\sigma^{-2} = p(\theta) \int p(\tilde{\theta} \mid \theta, \sigma^{-2})\, p(\sigma^{-2})\, d\sigma^{-2}$$

We know that $p(\tilde{\theta} \mid \theta, \sigma^{-2})$ is a Gaussian $N(\tilde{\theta} \mid \theta, \sigma^2) = N(\tilde{\theta} - \theta \mid 0, \sigma^2)$, so we choose $p(\sigma^{-2})$ to be a Gamma distribution $G(\sigma^{-2} \mid a, b)$. Using standard results, we can say that $p(\tilde{\theta} \mid \theta) = S(\tilde{\theta} - \theta \mid 0, ab^{-1}, 2a)$, where S refers to a Student's t-distribution. Of course, the result would not be so neat if we had not chosen a Gamma prior for $\sigma^{-2}$. However, we could always pick a different distribution: the integral would no longer be tractable, but we could use the Laplace approximation (see section 2.3.3) on $\log(\sigma^{-2})$ to estimate $p(\theta, \tilde{\theta})$.

We would like to deal with one degree of freedom only, typically a. We could set the mean of our prior on $\sigma^{-2}$ to 1 by enforcing $ab^{-1} = 1 \Leftrightarrow b = a$. This means that $p(\tilde{\theta} \mid \theta) = S(\tilde{\theta} - \theta \mid 0, 1, 2a)$. Note that $\sigma^{-2} = 1$ refers to the case when $p(\tilde{\theta} \mid \theta, \sigma^{-2}) = N_0$, where $N_0$ is the canonical normal distribution of mean 0 and variance 1. Incidentally, it is also the non-Bayesian case where α = 0.5. Alternatively, we could simply set the scale value $(2b)^{-1}$ to 1 (i.e. b = 1/2), which leads to a chi-square distribution for $\sigma^{-2}$. This means that $p(\tilde{\theta} \mid \theta) = S(\tilde{\theta} - \theta \mid 0, 2a, 2a)$.

Figure 2.17 shows the resulting prior distribution over $\tilde{\theta}$ for various values of a, and for the two different cases. The left plot corresponds to b = a, and the right plot to b = 1/2. In each case, the dashed black curve shows the prior over $\tilde{\theta}$ before integration, $p(\tilde{\theta} \mid \theta, \sigma^{-2}) = N_0$ with $\sigma^{-2} = 1$, while the coloured curves show the prior over $\tilde{\theta}$ after integration, $p(\tilde{\theta} \mid \theta)$, for various values of a. In figure 2.17a, note that, although they keep a peaked look around 0, all the resulting distributions are flatter than $N_0$, and they tend to $N_0$ as a increases.
This means that we will only span hybrid models from an $N_0$ prior (non-Bayesian case where α = 0.5) to an infinitely flat prior (discriminative case). Looking at figure 2.17b, setting b to 1/2 and varying a for

$p(\sigma^{-2})$ allows different types of distributions to be explored, from very flat to very peaked. Of course this would be true for every fixed value of b, but we only need to study one case.

Figure 2.17: Integrating over $\sigma^{-2}$. (a) b = a. The prior over $\sigma^{-2}$ has its mean at 1. All the resulting Student's t-distributions (in colour) are flatter than the canonical normal distribution (in black) and they tend to it as a increases. (b) b = 1/2. A much wider range of Student's t-distributions is accessible.

Now we would like to compute the MAP values for θ and $\tilde{\theta}$. This is done by simply maximising the joint likelihood $q(X, C, \theta, \tilde{\theta})$:

$$[\theta, \tilde{\theta}]_{MAP} = \arg\max_{\theta, \tilde{\theta}}\ p(X, C \mid \theta, \tilde{\theta})\, p(\theta, \tilde{\theta})$$

If the posterior distribution of θ and $\tilde{\theta}$ is needed, we can use

$$p(\theta, \tilde{\theta} \mid X, C) = \frac{p(X, C \mid \theta, \tilde{\theta})\, p(\theta, \tilde{\theta})}{p(X, C)}$$

and the quadratic Taylor expansion of $\log\left[p(X, C \mid \theta, \tilde{\theta})\, p(\theta, \tilde{\theta})\right]$ at $[\theta, \tilde{\theta}]_{MAP}$, so as to obtain a Gaussian approximation of $p(\theta, \tilde{\theta} \mid X, C)$, with mean $[\theta, \tilde{\theta}]_{MAP}$ and with precision H, where

$$H_{ij} = -\left[ \frac{\partial^2}{\partial t_i\, \partial t_j} \log\left[p(X, C \mid t)\, p(t)\right] \right]_{t_{MAP}}$$

and $t = [\theta, \tilde{\theta}]$. For more information about this approximation, please refer to section . To make predictions however, we will simply use $p(\hat{c} \mid \hat{x}, X, C) \approx p(\hat{c} \mid \hat{x}, \theta_{MAP})$.

Figure 2.18 shows the same run as figure 2.9, but this time using the MAP technique we just described, varying a instead of α, and keeping b = a.
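The standard result used for the true MAP, namely that a Gaussian whose precision has a Gamma prior integrates to a Student's t-distribution $S(\cdot \mid 0, ab^{-1}, 2a)$, can be checked numerically in one dimension. The sketch below is an illustration only, not the thesis implementation. The function names are hypothetical, the Gamma distribution is assumed to be parameterised by shape a and rate b, and the Student's t-distribution is assumed to be parameterised by a mean, a precision-like parameter and a number of degrees of freedom.

```python
import math


def gaussian_pdf(x, mean, precision):
    """Density of N(x | mean, 1/precision)."""
    return math.sqrt(precision / (2.0 * math.pi)) * math.exp(
        -0.5 * precision * (x - mean) ** 2)


def gamma_pdf(tau, a, b):
    """Density of a Gamma distribution with shape a and rate b (assumed convention)."""
    return math.exp(a * math.log(b) + (a - 1.0) * math.log(tau)
                    - b * tau - math.lgamma(a))


def student_t_pdf(x, mean, lam, nu):
    """Density of S(x | mean, lam, nu): precision-like lam, nu degrees of freedom."""
    log_norm = (math.lgamma(0.5 * (nu + 1.0)) - math.lgamma(0.5 * nu)
                + 0.5 * math.log(lam / (math.pi * nu)))
    return math.exp(log_norm
                    - 0.5 * (nu + 1.0) * math.log1p(lam * (x - mean) ** 2 / nu))


def marginal_by_quadrature(x, mean, a, b, lo=-40.0, hi=10.0, steps=5000):
    """Integrate N(x | mean, 1/tau) * Gamma(tau | a, b) over tau.

    The trapezoid rule is applied in u = log(tau), hence the extra factor tau.
    """
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        tau = math.exp(lo + i * h)
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * gaussian_pdf(x, mean, tau) * gamma_pdf(tau, a, b) * tau
    return total * h


# Case b = a: the marginal should match S(x | 0, 1, 2a).
a = 2.0
numeric = marginal_by_quadrature(0.7, 0.0, a, a)
closed_form = student_t_pdf(0.7, 0.0, a / a, 2.0 * a)
```

Setting `b = 0.5` instead compares the quadrature against $S(\cdot \mid 0, 2a, 2a)$, the second case discussed above.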

Figure 2.18: MAP approximation - b = a - run presented in figure 2.9. Results of fitting an isotropic Gaussian model to the synthetic data for various values of a. The green area corresponds to points that are assigned to the blue class, while the orange area corresponds to points that are assigned to the red class. We start at the non-Bayesian solution given by α = 0.5 and stop at the discriminative end. Note that for intermediate values of a, we capture the correct decision boundary.

Remember that setting b to a only allows us to go from the non-Bayesian case where α = 0.5 to the discriminative case. Indeed, we can see on the plots that we start with the solution given by α = 0.5 in the non-Bayesian version of figure 2.9 for very high values of a, and end with the discriminative case for a → 0. Nonetheless, here we were fortunate enough to find the best decision boundary since the hybrid model leading to it, as we could see in the non-Bayesian version, lives between the α = 0.5 case and the discriminative case. However, this is not satisfying, as we do not want to risk missing the best model.

Figure 2.19 shows the same run as figure 2.18, but this time varying a and setting b to 1/2. Here we do come across very flat and very peaked priors, which leads to a plot very similar to the one we had in figure 2.9, where we start at the generative end and finish at the discriminative end, finding the best decision boundary in the process. It looks like we have changed the problem of varying α into the problem of varying a, which was to be expected. This is acceptable though, as varying a gives more stable results. Indeed, let us look at figure . The blue curves show the classification performance for each of the 15 runs using a Gamma prior with b = a (top) or with b = 1/2 (bottom), on aligned axes (i.e. the axes limits are the same for every plot).
The dashed green thresholds represent the best performance achieved with the non-bayesian version, i.e. by scanning through all the possible values of α. First of all, notice how both cases perform almost equally on the [-2.5 7] portion. The

differences at the beginning make sense since the two priors start at different regions of α (0.5 for the first prior, 0 for the second). This means that it makes little difference which prior we use, as long as it covers all the possible values for α, such as our Gamma prior with b = 1/2. Additionally, it can also be noted that most of the curves reach a peak similar to the one of their corresponding non-Bayesian version. If not, they tend to perform much better (the Bayesian version performs significantly worse in 1 run only). Moreover, looking closely, we can see that the plots in figure 2.20b are very similar to the ones in figure 2.8, only shifted and re-scaled. Finally, most curves now peak at very similar values of log(2a) (around -2.5), which may (and does) make the average plot much more peaked. The average performance and the standard deviations for both kinds of priors are plotted in figure A.3 in section A.3 of the appendix.

Figure 2.19: MAP approximation - b = 1/2 - run presented in figure 2.9. Results of fitting an isotropic Gaussian model to the synthetic data for various values of a. The top shows high values of a (very peaked priors) while the bottom shows low values of a (flat priors). The green area corresponds to points that are assigned to the blue class, while the orange area corresponds to points that are assigned to the red class. For very high values of a, the prior is highly peaked and we find the generative case again, with an orthogonal boundary. For very low values of a, the prior is almost flat, leading to a discriminative model that does not benefit from the unlabelled data. We capture the correct decision boundary for intermediate values of a.

Figure 2.20: Classification performance for the MAP approximation using a Gamma prior. (a) with b = a. (b) with b = 1/2. Plots of the percentage of correctly classified points on the test set versus log(2a) for the MAP method. Each curve corresponds to a different run, i.e. to a different subset of labelled training points. The dashed green thresholds represent the best performance achieved by the non-Bayesian version. The axes are aligned. The average curves are plotted in figure A.3 in section A.3 of the appendix.

The MAP approximation method seems very promising as it works very well on the toy data-set. However, MacKay [59] demonstrates that this approach can fail when the problem is ill-posed, i.e. when some of the measurements are of poor quality. Indeed, although the MAP method locates the true posterior maximum $[\theta, \tilde{\theta}]_{MAP}$, it may fail to capture most of the probability mass. This is especially true in high dimensions where non-desirable peaks of the distribution are so high that they dominate the optimisation, instead of the lesser modes that contain most of the mass. Instead, MacKay recommends the use of what he refers to as the evidence framework, which we will call successive Laplace approximations.

Successive Laplace approximations

An alternative approach is to apply several times what is called a Laplace approximation [7]. The core idea of the Laplace method is to approximate a distribution p that we do not know how to normalise by a Gaussian distribution centred on one of p's modes $\hat{x}$ with precision H, where H is the Hessian matrix of $(-\log p)$ at $\hat{x}$. Indeed, simply taking a quadratic Taylor expansion of $\log p$ at $\hat{x}$, we have

$$\log p(x) \approx \log p(\hat{x}) + \underbrace{(x - \hat{x})^T\, \nabla_x \log p(x)\big|_{\hat{x}}}_{=\,0} + \frac{1}{2}\, (x - \hat{x})^T\, \underbrace{\nabla\nabla_x \log p(x)\big|_{\hat{x}}}_{=\,-H}\, (x - \hat{x})$$

Removing the term with the null gradient and taking the exponential, we obtain

$$p(x) \approx p(\hat{x}) \exp\left( -\frac{1}{2} (x - \hat{x})^T H (x - \hat{x}) \right) = p(\hat{x}) \sqrt{\frac{(2\pi)^d}{|H|}}\; N(x \mid \hat{x}, H^{-1}) \qquad (2.13)$$

where d is the dimension of x. This allows us to approximate the normalisation factor with

$$\int p(x)\, dx \approx p(\hat{x}) \sqrt{\frac{(2\pi)^d}{|H|}} \qquad (2.14)$$

We will use equations (2.13) and (2.14) very often in this section. Note that $\hat{x}$ is a minimum of $-\log p$, so H is positive definite, which means that it is suitable as a precision matrix.

In this section, it will be more convenient to change notation again and to refer to ξ, where $\xi = \log(\sigma^{-2})$, so that $\xi \in \mathbb{R}$.
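Equations (2.13) and (2.14) can be illustrated on a one-dimensional density whose normaliser is known. The sketch below is a toy illustration under stated assumptions, not the procedure used in the experiments: the helper `laplace_normaliser` is hypothetical, it locates the mode by Newton's method with finite-difference derivatives, and the test density $p(x) = x^k e^{-x}$ is chosen only because its integral is $\Gamma(k+1)$.

```python
import math


def laplace_normaliser(log_p, x0, eps=1e-3, iters=100, tol=1e-8):
    """1-D Laplace approximation of the integral of exp(log_p(x)).

    Returns (mode, precision, estimate), where precision plays the role of H
    in (2.13) and estimate is the right-hand side of (2.14) with d = 1.
    """
    x = x0
    for _ in range(iters):
        # Central finite differences for the first and second derivatives.
        grad = (log_p(x + eps) - log_p(x - eps)) / (2.0 * eps)
        curv = (log_p(x + eps) - 2.0 * log_p(x) + log_p(x - eps)) / eps ** 2
        step = grad / curv
        x -= step  # Newton update towards a stationary point of log_p
        if abs(step) < tol:
            break
    # H is the second derivative of -log p at the mode.
    precision = -(log_p(x + eps) - 2.0 * log_p(x) + log_p(x - eps)) / eps ** 2
    if precision <= 0.0:
        raise ValueError("stationary point is not a maximum of log_p")
    estimate = math.exp(log_p(x)) * math.sqrt(2.0 * math.pi / precision)
    return x, precision, estimate


k = 20.0
mode, precision, estimate = laplace_normaliser(
    lambda x: k * math.log(x) - x, x0=15.0)
ratio = estimate / math.gamma(k + 1.0)  # exact normaliser is Gamma(k + 1)
```

For this density the mode is k and the precision is 1/k, so the estimate reduces to Stirling's formula. The ratio to the exact value is close to, but not exactly, 1, which shows that (2.14) is an approximation even for a well-behaved unimodal density.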

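First step

For our problem, we first need to infer the posterior distribution of $[\theta, \tilde{\theta}]$ for a particular value of ξ:

$$p(\theta, \tilde{\theta} \mid X, C, \xi) = \frac{p(X, C, \theta, \tilde{\theta} \mid \xi)}{p(X, C \mid \xi)} = \frac{p(X, C \mid \theta, \tilde{\theta})\, p(\theta, \tilde{\theta} \mid \xi)}{p(X, C \mid \xi)}$$

The issue here is to compute the first tricky integral

$$p(X, C \mid \xi) = \iint p(X, C, \theta, \tilde{\theta} \mid \xi)\, d\theta\, d\tilde{\theta} = \iint p(C \mid X, \theta)\, p(X \mid \tilde{\theta})\, p(\theta, \tilde{\theta} \mid \xi)\, d\theta\, d\tilde{\theta}$$

which is exactly what the Laplace approximation does. Therefore, we follow [7] and use a quadratic Taylor expansion of $\log\left[p(C \mid X, \theta)\, p(X \mid \tilde{\theta})\, p(\theta, \tilde{\theta} \mid \xi)\right]$ around one of its maxima $[\theta, \tilde{\theta}]_{MAP}$. We obtain:
- a Gaussian approximation of $p(\theta, \tilde{\theta} \mid X, C, \xi)$:

$$p(\theta, \tilde{\theta} \mid X, C, \xi) \approx N\left([\theta, \tilde{\theta}] \mid [\theta, \tilde{\theta}]_{MAP}, H(\xi)^{-1}\right)$$

- an approximation of the normalising factor:

$$p(X, C \mid \xi) \approx p(C \mid X, \theta_{MAP})\, p(X \mid \tilde{\theta}_{MAP})\, p(\theta_{MAP}, \tilde{\theta}_{MAP} \mid \xi)\, \sqrt{\frac{(2\pi)^d}{|H(\xi)|}}$$

where $H(\xi)$ is the Hessian matrix of $-\log\left[p(C \mid X, \theta)\, p(X \mid \tilde{\theta})\, p(\theta, \tilde{\theta} \mid \xi)\right]$ at $[\theta, \tilde{\theta}]_{MAP}$. In other words, if $t = [\theta, \tilde{\theta}]$, then

$$H_{ij}(\xi) = -\left[ \frac{\partial^2}{\partial t_i\, \partial t_j} \log\left[p(X, C \mid t)\, p(t \mid \xi)\right] \right]_{t_{MAP}}$$

Second step

Additionally, we need to infer the posterior distribution of ξ:

$$p(\xi \mid X, C) = \frac{p(X, C, \xi)}{p(X, C)} = \frac{p(X, C \mid \xi)\, p(\xi)}{p(X, C)}$$

Again, following [7] and using a quadratic Taylor expansion of $\log\left[p(X, C \mid \xi)\, p(\xi)\right]$ at one of its maxima $\xi_{MAP}$, we finally obtain the Gaussian approximation

$$p(\xi \mid X, C) \approx N(\xi \mid \xi_{MAP}, \Lambda^{-1})$$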

where

$$\Lambda = -\left[ \frac{\partial^2}{\partial \xi^2} \log\left[p(X, C \mid \xi)\, p(\xi)\right] \right]_{\xi_{MAP}}$$

The only worrying thing here is that it is not clear how well defined the optimisation of $\log\left[p(X, C \mid \xi)\, p(\xi)\right]$ with respect to ξ is. Indeed $H(\xi)$ is positive definite for the initial value of ξ (the one we computed $[\theta, \tilde{\theta}]_{MAP}$ with), but it does not necessarily stay positive definite as we vary ξ during the optimisation.

Now we repeat the process (step 1 + step 2), using newly found values of ξ to start step 1, for as long as ξ or $[\theta, \tilde{\theta}]$ change.

Results

For our experiments, we choose a Gaussian prior $N(\xi \mid 0, w^{-1})$ over ξ, centred on 0 (corresponding to $\sigma^{-2} = 1$). For very high values of w, ξ will be forced to be 0, which corresponds to the non-Bayesian case where α = 0.5. However, as soon as we give flexibility to ξ by decreasing w, θ tries to separate from $\tilde{\theta}$, forcing ξ to decrease to join the discriminative end. Therefore, we can only span from $\xi = -\infty$ to $\xi = 0$. This is not new, we saw the same problem appear in section when we used a Gamma prior over $\sigma^{-2}$ and we forced it to be centred on 1. As a result, we will also study a Gamma prior $G(\exp(\xi) \mid a, 1/2)$ on $\exp(\xi) = \sigma^{-2}$, where we have set b to 1/2 (i.e. the scale $(2b)^{-1}$ to 1). This is the second prior we studied in the previous section.

Figure 2.21 shows the same run as figure 2.9, but this time varying w in $N(\xi \mid 0, w^{-1})$. Remember that using a Gaussian prior centred on 0 only allows us to go from the non-Bayesian case where α = 0.5 to the discriminative case. Indeed, we can see on the plots that we start with the solution given by α = 0.5 in the non-Bayesian version for very high values of w and end with the discriminative case of figure 2.9 for w → 0. Again, this is not satisfying, as we do not want to risk missing the best model. Note that the posterior distribution is always more peaked than the prior distribution, but the difference between the two decreases significantly, as the data loses its influence.

Figure 2.21: Laplace approximation - Gaussian prior - run presented in figure 2.9. The top shows high values of w (peaked priors), while the bottom shows low values of w (flat priors). Left column: results of fitting an isotropic Gaussian model to the synthetic data for various values of w. The green area corresponds to points that are assigned to the blue class, while the orange area corresponds to points that are assigned to the red class. Right column: the prior distribution of ξ is shown in blue, the approximate posterior distribution of ξ in red. We start at the non-Bayesian solution given by α = 0.5 and stop at the discriminative end. Note that for intermediate values of w, we capture the correct decision boundary.

Figure 2.22: Laplace approximation - Gamma prior - run presented in figure 2.9. The top shows high values of a (peaked priors), while the bottom shows low values of a (flat priors). Left column: results of fitting an isotropic Gaussian model to the synthetic data for various values of a. The green area corresponds to points that are assigned to the blue class, while the orange area corresponds to points that are assigned to the red class. Right column: the prior distribution of ξ is shown in blue, the approximate posterior distribution of ξ in red. For very high values of a, the prior is highly peaked and we find the generative case again, with an orthogonal boundary. For very low values of a, the prior is almost flat, therefore θ is trained discriminatively and does not benefit from the unlabelled data. We capture the correct decision boundary for intermediate values of a.

Figure 2.22 shows the same run as figure 2.9, but this time varying a in $G(\exp(\xi) \mid a, 1/2)$. Here we do span all the possible values for ξ, which leads to a plot very similar to the one we had in figure 2.9, where we start at the generative end and finish at the discriminative end, finding the best model in the process. Again, note that the difference between the posterior distribution and the prior distribution decreases significantly, until the prior distribution becomes more peaked.

Figure 2.23 shows the classification performance after training with the Laplace approximation using a Gaussian prior (top) or a Gamma prior (bottom). The blue curves show the classification performance for each of the 15 runs, again using aligned axes (i.e. the axes limits are the same for every plot). The dashed green thresholds represent the best performance achieved with the non-Bayesian version, i.e. by scanning through the possible values of α.

Looking at figures 2.20 and 2.23, we can see that, for the toy example, the approximation technique has little importance since both the true MAP and the successive Laplace approximations give almost exactly the same plots. This suggests that our problem is well-posed, with no singularities. It may be different in real-size problems. It follows that we can repeat what has been said for figure 2.20:
- it makes little difference which prior we use, as long as it covers all the possible values for α, such as our Gamma prior with b = 1/2.
- most of the curves reach a peak that is very similar to the one of their corresponding non-Bayesian version. If not, they tend to perform much better (the Bayesian version performs significantly worse in 1 run only).
- most curves now peak at very similar values of log(2a) (around -2.5), which may make the average plot much more peaked.

The average performance and the standard deviations for both kinds of priors are plotted in figure A.4 in section A.4 of the appendix.

Figure 2.23: Classification performance for the Laplace approximation using (a) a Gaussian prior, (b) a Gamma prior with b = 1/2. Plots of the percentage of correctly classified points on the test set (a) versus log(w), (b) versus log(2a), for the Laplace method. Each curve corresponds to a different run, i.e. to a different subset of labelled training points. The dashed green thresholds represent the best performance achieved by the non-Bayesian version. The axes are aligned. The average curves are plotted in figure A.4 in section A.4 of the appendix.

2.4 Conclusion

In this chapter, we have studied a new framework and shown that, adopting its point of view, the discriminative training of generative models can be re-cast in terms of standard training applied to a modified model we qualify as being discriminative. This new viewpoint opens the door to a wide range of new models which interpolate smoothly between generative and discriminative approaches and which can benefit from the advantages of both. We have given insights on how the framework allows for the use of unlabelled data as well as the discriminant information, and we have compared it to the convex combination framework.

The main drawback of this framework is that the number of parameters in the model is doubled, leading to greater computational cost and less robust optimisation. Unfortunately, this is especially true for intermediate values of α, which are the ones we are interested in. It is not entirely clear what happens with conjugate gradients in high-dimensional spaces, and we are strongly subject to local minima.

The next chapter will focus on applying this model to semi-supervised object recognition in static images. We will try to understand if the amount of labelled data affects the nature of the best model, and we will close the chapter with much more thorough conclusions on this framework.

CHAPTER 3 APPLICATION TO SEMI-SUPERVISED LEARNING

In the previous chapter, we have presented a framework that allows the blending of generative and discriminative models, and we have discussed the advantages of such an approach in the context of semi-supervised learning. We now apply our method to a realistic application involving object recognition in static images. The value of such a study is to understand how the amount of labelling affects the value of α that gives the best classification performance.

Object recognition aims at classifying images, depending on which kind of objects they contain, in pre-learnt categories. Generative models are designed to explain how an image x is generated. In practice though, such models are applied to a set of feature vectors $\{x_j\}$ (like patches for instance) extracted from the image rather than to the image x itself, as it is much easier to explain patches than entire images. It also makes more sense since patches are shared across images, whereas images will never be repeated. In the vast majority of papers, the authors use the assumption that the patches are independent and identically distributed (i.i.d.) given the image class, which leads to the extremely useful property that

$$p(x \mid c, \theta) = \prod_{j \in \text{patches}} p(x_j \mid c, \theta) \qquad (3.1)$$

where $p(x_j \mid c, \theta)$ will be the generative model. Equation (3.1) allows most inference to be tractable. The assumption is debatable, but it has been used widely and successfully in the community.
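In log space, the factorisation (3.1) turns the image likelihood into a sum of patch log-likelihoods, which is what makes a generative image classifier cheap to evaluate. The sketch below illustrates this under loud assumptions: the isotropic Gaussian patch model, the two-dimensional patch features and all function names are placeholders chosen for illustration, and are not the generative model or the features used in this chapter.

```python
import math


def log_patch_density(patch, mean, variance):
    """log p(x_j | c, theta) for a placeholder isotropic Gaussian patch model."""
    d = len(patch)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(patch, mean))
    return -0.5 * (d * math.log(2.0 * math.pi * variance) + sq_dist / variance)


def image_log_likelihood(patches, mean, variance):
    """log p(x | c, theta): by (3.1), a sum over the patches of the image."""
    return sum(log_patch_density(p, mean, variance) for p in patches)


def classify_image(patches, class_models, log_priors):
    """Pick the class maximising log p(c) + sum_j log p(x_j | c, theta)."""
    scores = {
        c: log_priors[c] + image_log_likelihood(patches, mean, variance)
        for c, (mean, variance) in class_models.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical two-class example with 2-dimensional patch features.
class_models = {"a": ((0.0, 0.0), 1.0), "b": ((3.0, 3.0), 1.0)}
log_priors = {"a": math.log(0.5), "b": math.log(0.5)}
patches = [(2.9, 3.1), (3.2, 2.8), (2.5, 3.5)]
label, scores = classify_image(patches, class_models, log_priors)
```

Because the scores are sums over patches, an image with many patches produces very large differences between class scores, which is one practical consequence of treating correlated patches as independent.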

Let us assume we have our usual training set $\{X, C\} = \{x_n, c_n\}$, where X are the images, and C their labels. We know that image $x_n$ belongs to the category $c_n$, however that does not mean that all its patches $\{x_{nj}\}$ do (indeed some patches could belong to the background), so we augment the data-set with $T = \{\tau_{nj}\}$, where $\tau_{nj}$ is the label of patch $x_{nj}$. Therefore we have to add an extra axis of supervision that does not depend on the proportion of labelled data, but on the quality of the label:
- an image is fully labelled (and the problem fully supervised) if we know all the $\{\tau_{nj}\}$. This means that we have to segment all the training images, in order to know the class of each training patch. This operation is very tedious and time-consuming.
- an intermediate step is to give the global image label $\{c_n\}$ only. The image is said to be weakly labelled and the problem to be weakly supervised.
- the image is unlabelled and the problem unsupervised if no $c_n$, $\tau_{nj}$ are known.

Note that semi-supervised training can now be achieved using any sort of combination of fully, weakly and non-labelled images.

We will first describe the data-sets (section 3.1), the features (section 3.2), the underlying generative model (section 3.3), and the learning process (section 3.4). However, the core of this chapter is section 3.5 where the experiments will be described and the results reported. We will study the influence of the amount of labels, firstly on the performance of generative and discriminative models, secondly on the choice of model type: generative, discriminative or hybrid, and thirdly on the performance of HF versus CC. Note that, as far as we know, this is the first time that CC is used for semi-supervised learning. Finally, section 3.6 will close this chapter with a detailed discussion on the limitations of this model and on possible extensions.

3.1 The data-sets

For our experiments, we will use two different data-sets.
The first one will be referred to as CSB and will contain cows, sheep and bikes. The cows and sheep images come from Microsoft Research [38], while the bikes images were downloaded from the Technical University of Graz [36]. Together these images exhibit a wide variety of poses, colours, and illumination, as illustrated by the sample images shown in figure 3.1. However, it should

89 3.1. The data-sets Figure 3.1: Sample images from the CSB data-set after re-scaling. Taken from [38; 36]. not be too hard to discriminate between these 3 classes. All the images were re-scaled to , and raw patches of size were extracted on a regular grid of size (i.e. every 24 th pixel). The second data-set will contain cheetahs, lions and tigers. All the images were downloaded from the web, and segmented manually. They are all fairly similar in poses, colour and background, as illustrated by the sample images shown in figure 3.2. This data-set Figure 3.2: Sample images from the wildcats data-set after re-scaling. 71

90 CHAPTER 3. APPLICATION TO SEMI-SUPERVISED LEARNING should be much harder. All images were re-scaled to , and raw patches of size were extracted on a regular grid of size (i.e. every 21 st pixel). Each image contains one or several objects from a particular class, and the goal is to build a true multi-class classifier in which each image is assigned to one of the classes (rather than simply classifying each class separately versus the rest, which would be a much simpler problem). It should be stressed here that we are using overlapping patches, which is not entirely correct: from a generative point of view: we are now generating the image 4 times (indeed each quarter patch is extracted 4 times except for the border patches). However, this allows us to be more robust to where we start extracting patches. from the independence assumption point of view seen in (3.1): our patches are not independent since they share common parts. However, non-overlapping patches violate this assumption too since adjacent patches are not independent. This problem is not new, and Williams [79] suggests the use of a mixture of experts model to give a coherent treatment to the overlapping patches. However, here we will keep things simple for the sake of illustration and because it is fairly common practice in vision to consider overlapping patches for robustness. Finally, a patch is considered a part of the object if the ratio number of pixels in the patch segmented as part of the object total number of pixels in the patch exceeds a certain threshold, otherwise it is labelled as background. Our data-sets containing 1 class only per image, we do not need to worry about patches containing competing classes. In our experiments, the threshold is set to The features Our features are taken from [82], in which the original RGB images are first converted into the International CIE (International Commission on Illumination) LAB colour space [47]. 72

Each training image is then convolved with 17 filters, and the set of corresponding pixels from each of the filtered images represents a 17-dimensional vector. The filters are quite standard: the first 3 are obtained by scaling a Gaussian filter, and are applied to each channel of the colour image, which gives 3 × 3 = 9 response images. Then a Laplacian filter is applied to the L channel, at 4 different scales, which gives 4 more response images. Finally 2 DoG (derivative of Gaussians) filters (one along each direction) are applied to the L channel, at 2 different scales, giving another 4 responses.

From these response images, we extract every 17-dimensional pixel on a 2 × 2 grid. This gives many more pixels than we can process, so we randomly pick 10% of them. We apply K-means to obtain K 17-dimensional textons. Each patch is then represented by a histogram of these textons, i.e. by a K-dimensional vector containing the proportion of each texton. Note that the texton features are found without using any label.

Here K = 100. Since this large value of K is computationally costly in later stages of processing, PCA (principal component analysis) is used to give a 15-dimensional feature vector. 15 is still too large a dimension, so we further reduce it by applying LDA (linear discriminant analysis). Since there are only 3 classes in each data-set, we obtain a 2-dimensional feature vector per patch. This additional reduction is not ideal, as we now use labels to create our features, and when labels are rare it may not do anything sensible. However, we need to reduce the dimension as much as possible to reduce the training time, and to help the conjugate gradients algorithm. For a more intuitive description, we will refer to our feature vectors as patches, even though they are in fact processed patches.

3.3 The underlying generative model

We consider the generative model introduced in [77], which we now briefly describe.
As usual, each image is represented by a vector $x_n$, where $n = 1, \dots, N$, and $N$ is the total number of images. Each vector comprises a set of $J$ feature vectors $x_n = \{x_{nj}\}$ (patches) where $j = 1, \dots, J$. We assume that each patch belongs to one and only one of the classes, or to a separate background class, so that it can be characterised by a binary vector $\tau_{nj}$ coded so that all elements of $\tau_{nj}$ are zero except the element corresponding to the class. We use $c_n$ to denote the image label vector for image $n$, with independent components $c_{nk} \in \{0, 1\}$ in which $k = 1, \dots, C$ labels the class. The overall joint distribution for the model can be represented as a directed graph, as shown in figure 3.3. We can therefore characterise the model completely in terms of the conditional probabilities $p(c)$, $p(\tau \mid c)$ and $p(x \mid \tau)$.

Figure 3.3: The generative model for object recognition expressed as a directed acyclic graph, for unlabelled images, in which the boxes denote plates (i.e. independent replicated copies). Only the patches $\{x_{nj}\}$ are observed, corresponding to the shaded node. The image class labels $c_n$ and patch class labels $\tau_{nj}$ are latent variables.

This model is most easily explained generatively, that is, we describe the procedure for generating a set of observed patches from the model. First we choose the overall label of the image according to some prior probability parameters $\psi_{\mathrm{label}}$, where the subscript ranges over all the possible labels present in the training images, and $\psi_{\mathrm{label}} \in [0, 1]$, with $\sum_{\mathrm{label}} \psi_{\mathrm{label}} = 1$, so that

$$p(c \mid \psi) = \prod_{\mathrm{label}} \psi_{\mathrm{label}}^{\,\delta_{c,\mathrm{label}}}$$

where $\delta_{c,\mathrm{label}}$ equals 1 if $c = \mathrm{label}$ and 0 otherwise. It is important to stress here that the background is always switched on in the image label, so that $c$ always has at least 2 entries set to 1: the appropriate class(es) and the background.

Given the overall label for the image, each patch is then drawn from either one of the foreground classes or the background ($k = C + 1$) class. The probability of generating a patch from a particular class is governed by a set of parameters $\pi_k$, one for each class, such that $\pi_k \ge 0$, constrained by the subset of classes actually present in the image. Thus

$$p(\tau_j \mid c, \pi) = \prod_{k=1}^{C+1} \left( \frac{c_k \pi_k}{\sum_{l=1}^{C+1} c_l \pi_l} \right)^{\tau_{jk}} = \prod_{k \in c} \left( \frac{\pi_k}{\sum_{l \in c} \pi_l} \right)^{\tau_{jk}}$$

where $k \in c$ means here that $c_k = 1$. Note that there is an overall undetermined scale to these parameters, which may be removed by fixing one of them, e.g. $\pi_{C+1} = 1$.

For each class, the distribution of the patch $x_j$ is governed by a separate mixture of Gaussians which we denote

$$p(x_j \mid \tau_j, \lambda) = \prod_{k=1}^{C+1} \Phi_k(x_j; \lambda_k)^{\tau_{jk}}$$

where $\lambda_k$ denotes the set of parameters (means, covariances and mixing coefficients) associated with the mixture $\Phi_k$.

Finally, if we assume $N$ independent images, and for image $n$ we have $J$ patches drawn independently, then the joint distribution of all random variables is

$$p(X, C, T \mid \theta) = \prod_{n=1}^{N} p(c_n \mid \psi) \prod_{j=1}^{J} p(x_{nj} \mid \tau_{nj}, \lambda)\, p(\tau_{nj} \mid c_n, \pi)$$

with $\theta = \{\psi, \pi, \lambda\}$. Here we are assuming that each image has the same number $J$ of patches, though this restriction is easily relaxed if required.

The graph shown in figure 3.3 corresponds to unlabelled images in which only the patches $\{x_{nj}\}$ are observed, with both the image category and the classes of each of the patches being latent variables. It is also possible to consider images which are weakly labelled, that is, each image is labelled according to the category of object present in the image. This corresponds to the graphical model of figure 3.4 in which the node $c_n$ is shaded. Of course, for a given size of data-set, better performance is expected if all of the images are strongly labelled, that is, segmented images in which the region occupied by the object or objects is known, so that the patch labels $\tau_{nj}$ become observed variables.
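The three-step generative procedure described above (image label, then patch classes, then patch features) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `sample_image`, the toy parameter values, and the representation of each mixture as a `(weights, means, covariances)` tuple are assumptions made for the example and are not taken from [77].

```python
import numpy as np

def sample_image(rng, psi, label_sets, pi, gmms, n_patches):
    """Sample (label index, patch classes, patch features) for one image."""
    # Step 1: choose the image label, i.e. which classes are switched on.
    label_idx = rng.choice(len(label_sets), p=psi)
    present = np.array(label_sets[label_idx])
    # Step 2: choose a class for every patch among the classes present,
    # with probabilities proportional to pi restricted to those classes.
    weights = pi[present] / pi[present].sum()
    patch_classes = rng.choice(present, size=n_patches, p=weights)
    # Step 3: draw each patch from the Gaussian mixture of its class.
    dim = gmms[0][1].shape[1]
    patches = np.empty((n_patches, dim))
    for j, k in enumerate(patch_classes):
        mix, means, covs = gmms[k]
        comp = rng.choice(len(mix), p=mix)
        patches[j] = rng.multivariate_normal(means[comp], covs[comp])
    return label_idx, patch_classes, patches

# Toy configuration (all values hypothetical): C = 2 object classes,
# index 2 plays the role of the background class C + 1.
label_sets = [(0, 2), (1, 2)]          # the background is always switched on
psi = np.array([0.5, 0.5])             # prior over the admissible label vectors
pi = np.array([2.0, 0.5, 1.0])         # last entry fixed to 1 to set the scale
gmms = [
    (np.array([1.0]), np.array([[0.0, 0.0]]), np.array([np.eye(2)])),
    (np.array([0.3, 0.7]), np.array([[3.0, 3.0], [-3.0, 3.0]]),
     np.array([np.eye(2), 0.5 * np.eye(2)])),
    (np.array([1.0]), np.array([[0.0, -4.0]]), np.array([2.0 * np.eye(2)])),
]

rng = np.random.default_rng(0)
label_idx, patch_classes, patches = sample_image(
    rng, psi, label_sets, pi, gmms, n_patches=6)
```

Observing only `patches` corresponds to the unlabelled case, additionally observing `label_idx` to the weakly labelled case, and observing `patch_classes` as well to the strongly labelled case.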
The graphical model for a set of strongly labelled images is also shown in figure

Figure 3.4: Other graphical models corresponding to figure 3.3 for weakly labelled images (left) and strongly labelled images (right).

Strong labelling requires hand segmentation of images, and so is a time-consuming and expensive process compared with the collection of the images themselves. For a given level of effort it will always be possible to collect many unlabelled or weakly labelled images for the same cost as a single strongly labelled image. Since the variability of natural images and objects is so vast, we will always be operating in a regime in which the size of our data-sets is statistically small (though they will often be computationally large). For this reason there is great interest in augmenting expensive strongly labelled images with lots of cheap weakly labelled or unlabelled images in order to better characterise the different forms of variability.

3.4 The learning method

Although the two-stage hierarchical model shown in figure 3.3 appears to be more complicated than the simple toy example described in chapter 2, it does in fact fall within the same framework. If we let $\theta = \{\psi, \pi, \lambda\}$ denote the full set of parameters in the model, then we can consider a model of the form (2.5) in which the prior is given by (2.6) with $\sigma^2$ defined by (2.7), and the terms $f(\theta)$ and $g(\tilde{\theta})$ taken to be constant.

We use conjugate gradients to optimise the parameters. Conjugate gradients is the most widely used technique when it comes to blending generative and discriminative models, thanks to its flexibility. Indeed, because of the discriminative component $p(c_n \mid x_n, \theta)$, which contains a normalising factor, an algorithm such as EM would require much more work, as nothing is directly tractable anymore.

Before doing any mathematics, we need to distinguish between three different sorts of data: the set of fully labelled data will be denoted $F = \{X_F, C_F, T_F\}$, the set of weakly labelled data $W = \{X_W, C_W\}$, and the set of unlabelled data $U = \{X_U\}$. Extending (2.5), the full objective function to maximise is then:

$$
\begin{aligned}
q(X, C, \theta, \tilde{\theta}) &= q(X_F, C_F, T_F, X_W, C_W, X_U, \theta, \tilde{\theta}) \\
&= p(\theta, \tilde{\theta})\, p(T_F \mid C_F, X_F, \theta)\, p(C_F \mid X_F, \theta)\, p(C_W \mid X_W, \theta)\, p(X_F, X_W, X_U \mid \tilde{\theta}) \\
&= p(\theta, \tilde{\theta}) \prod_{n \in F} p(\{\tau_{nj}\} \mid c_n, x_n, \theta) \prod_{n \in F \cup W} p(c_n \mid x_n, \theta) \prod_{m \in F \cup W \cup U} p(x_m \mid \tilde{\theta})
\end{aligned}
$$

In addition, in all the following derivations, we will use this trick, proved in section B.1: for any observed $x$ and latent $y$,

$$\frac{\partial}{\partial \theta} \log p(x \mid \theta) = \sum_{y} p(y \mid x, \theta)\, \frac{\partial}{\partial \theta} \log p(x, y \mid \theta) \qquad (3.2)$$

It is useful because it allows us to avoid differentiating $\log p(x \mid \theta)$ directly and to deal instead with $\frac{\partial}{\partial \theta} \log p(x, y \mid \theta)$, which is usually much easier. If it is not sufficient, we can repeat the process several times. Here we will use it twice: first with $y$ being the image label, and then with $y$ being the patch label.

We can now write down the derivatives of the logarithm of $L(\theta, \tilde{\theta})$, where we define $L(\theta, \tilde{\theta}) = q(X_F, C_F, T_F, X_W, C_W, X_U, \theta, \tilde{\theta})$, with respect to all the $\theta_k$, $k = 1, \dots, C + 1$. The entire calculation is quite long, so we will not show it here, but it is provided in the appendix in section B.2. The final result is

$$
\begin{aligned}
\frac{\partial \log L(\theta, \tilde{\theta})}{\partial \theta} ={}& \frac{\partial \log p(\theta, \tilde{\theta})}{\partial \theta} + \sum_{n \in F \cup W} \sum_{c} \big( \delta_{c_n c} - p(c \mid x_n, \theta) \big) \frac{\partial \log p(c \mid \theta)}{\partial \theta} \\
&+ \sum_{n \in F} \sum_{j} \frac{\partial}{\partial \theta} \log p(x_{nj}, \tau_{nj} \mid c_n, \theta) + \sum_{m \in W} \sum_{j} \sum_{k \in c_m} p(k \mid x_{mj}, c_m, \theta)\, \frac{\partial}{\partial \theta} \log p(x_{mj}, k \mid c_m, \theta) \\
&- \sum_{n \in F \cup W} \sum_{c} p(c \mid x_n, \theta) \sum_{j} \sum_{k \in c} p(k \mid x_{nj}, c, \theta)\, \frac{\partial}{\partial \theta} \log p(x_{nj}, k \mid c, \theta)
\end{aligned}
$$

which can be rewritten as

$$
\begin{aligned}
\frac{\partial \log L(\theta, \tilde{\theta})}{\partial \theta} ={}& \frac{\partial \log p(\theta, \tilde{\theta})}{\partial \theta} + \sum_{c} \sum_{n \in F \cup W} \underbrace{\big( \delta_{c_n c} - p(c \mid x_n, \theta) \big)}_{(a)} \frac{\partial \log p(c \mid \theta)}{\partial \theta} \\
&+ \sum_{c} \sum_{k \in c} \sum_{n \in F} \sum_{j} \underbrace{\big( \delta_{c_n c}\, \tau_{njk} - p(c \mid x_n, \theta)\, p(k \mid x_{nj}, c, \theta) \big)}_{(b)} \frac{\partial}{\partial \theta} \log p(x_{nj}, k \mid c, \theta) \\
&+ \sum_{c} \sum_{k \in c} \sum_{m \in W} \sum_{j} \underbrace{\big( \delta_{c_m c} - p(c \mid x_m, \theta) \big)\, p(k \mid x_{mj}, c, \theta)}_{(c)} \frac{\partial}{\partial \theta} \log p(x_{mj}, k \mid c, \theta)
\end{aligned}
$$

We rewrote it this way to show how the different gradients influence the total one. First of all, note how the unlabelled images do not explicitly appear; instead they will influence $\theta$ through $\tilde{\theta}$. As shown in section 1.3.2, the gradient of $\log p(c \mid \theta)$ is important:

- for images whose label is $c$, but that have a low probability of being labelled $c$ (a), and
- for images whose label is not $c$, but that have a high probability of being labelled $c$ (a).

Also, the gradient of $\log p(x_{nj}, k \mid c, \theta)$ is important:

- for patches of class $k$ that have a low probability of being classified as $k$ (b),
- for patches of class $k' \neq k$ that nevertheless have a high probability of being classified as $k$ (b), and
- for unlabelled patches that have a high probability of being classified as $k$ and that come from weakly labelled images with a low probability of being correctly classified (these images of $W$ influencing the gradient of $\log p(c \mid \theta)$) (c).

Similarly,

$$
\begin{aligned}
\frac{\partial \log L(\theta, \tilde{\theta})}{\partial \tilde{\theta}} ={}& \frac{\partial \log p(\theta, \tilde{\theta})}{\partial \tilde{\theta}} + \sum_{n=1}^{N} \sum_{c} p(c \mid x_n, \tilde{\theta})\, \frac{\partial \log p(c \mid \tilde{\theta})}{\partial \tilde{\theta}} \\
&+ \sum_{n=1}^{N} \sum_{c} p(c \mid x_n, \tilde{\theta}) \sum_{j} \sum_{k \in c} p(k \mid x_{nj}, c, \tilde{\theta})\, \frac{\partial}{\partial \tilde{\theta}} \log p(x_{nj}, k \mid c, \tilde{\theta})
\end{aligned}
$$

Note here how all the images have an equally important role. There is no longer any difference between fully / weakly labelled images and unlabelled ones. Another interesting point is that the gradients involved are the same, but the coefficients are not. As shown in section 1.3.2, the gradient of $\log p(c \mid \tilde{\theta})$ is important for images that are believed by the system to belong to this class, i.e. the system uses its early predictions to modify the influence of every image. Also, the gradient of $\log p(x_{nj}, k \mid c, \tilde{\theta})$ is important for patches that have a high probability of being classified as belonging to $k$. This is typical generative behaviour: for a label $c$, we now only care about the data that are most relevant to this particular label.

So far these derivations are very general and do not make use of the model. Hence they could be reused for any generative model that follows the structure of figures 3.3 or 3.4. Now we ought to differentiate further by making use of the model, i.e. by explicitly writing the derivatives of $\log p(\theta, \tilde{\theta})$, $\log p(c \mid \theta)$ and $\log p(x_{nj}, k \mid c, \theta)$ with respect to $\theta$ and $\tilde{\theta}$, for all the possible values of $c$ and $k$. The calculation is now fairly straightforward, as we are only dealing with Gaussians and mixing coefficients. Therefore we will only give the final results.
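The identity (3.2) used throughout these derivations can be checked numerically on a small latent-variable model. The sketch below is only an illustration under stated assumptions: the model (a two-component mixture with fixed weights `W` and component means `theta * M`), the function names, and the test point are all made up for the check and are unrelated to the image model of section 3.3.

```python
import numpy as np

W = np.array([0.3, 0.7])     # fixed prior p(y) over the latent y in {0, 1}
M = np.array([1.0, -2.0])    # component y has mean theta * M[y] and unit variance

def log_joint(x, theta):
    """log p(x, y | theta) for y = 0 and y = 1, as a length-2 array."""
    mean = theta * M
    return np.log(W) - 0.5 * np.log(2.0 * np.pi) - 0.5 * (x - mean) ** 2

def log_marginal(x, theta):
    """log p(x | theta), obtained by summing the joint over y."""
    return np.logaddexp.reduce(log_joint(x, theta))

def grad_via_identity(x, theta):
    """Right-hand side of (3.2): posterior-weighted gradient of the log joint."""
    lj = log_joint(x, theta)
    posterior = np.exp(lj - np.logaddexp.reduce(lj))   # p(y | x, theta)
    grad_log_joint = (x - theta * M) * M               # d/dtheta log p(x, y | theta)
    return float(np.sum(posterior * grad_log_joint))

def grad_numeric(x, theta, eps=1e-6):
    """Left-hand side of (3.2) by central finite differences."""
    return (log_marginal(x, theta + eps) - log_marginal(x, theta - eps)) / (2.0 * eps)

x_test, theta_test = 0.8, 0.5
lhs = grad_numeric(x_test, theta_test)
rhs = grad_via_identity(x_test, theta_test)
```

The two values agree up to finite-difference error, which is the property that lets the derivations above work with the (easy) gradient of the log joint instead of the gradient of the log marginal.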

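The main prior is a Gaussian over $\theta - \tilde{\theta}$ with zero mean and variance $\sigma^2$. Its derivatives are:

$$\frac{\partial}{\partial \theta} \log p(\theta, \tilde{\theta}) = \frac{\partial}{\partial \theta} \log \mathcal{N}(\theta - \tilde{\theta} \mid 0, \sigma^2) = -\sigma^{-2} (\theta - \tilde{\theta})$$

$$\frac{\partial}{\partial \tilde{\theta}} \log p(\theta, \tilde{\theta}) = \frac{\partial}{\partial \tilde{\theta}} \log \mathcal{N}(\theta - \tilde{\theta} \mid 0, \sigma^2) = -\sigma^{-2} (\tilde{\theta} - \theta)$$

The label priors are multinomial distributions, not over the possible classes, but over the possible label vectors. The parameters are $\psi$, with $\psi_k \ge 0$ for all $k$, such that $p(c \mid \psi) = \prod_{\mathrm{label}} \left( \frac{\psi_{\mathrm{label}}}{\sum_z \psi_z} \right)^{\delta_{c,\mathrm{label}}}$, where $\delta_{c,\mathrm{label}}$ equals 1 if $c = \mathrm{label}$ and 0 otherwise. Its derivatives are:

$$\frac{\partial}{\partial \psi_c} \log p(c \mid \psi) = \frac{1}{\psi_c} - \frac{1}{\sum_z \psi_z}$$

$$\frac{\partial}{\partial \psi_w} \log p(c \mid \psi) = -\frac{1}{\sum_z \psi_z} \quad \text{when } w \neq c$$

Before any further processing, $\log p(x_{nj}, k \mid c, \theta)$ should be decomposed:

$$\frac{\partial}{\partial \theta} \log p(x_{nj}, k \mid c, \theta) = \frac{\partial}{\partial \theta} \log p(k \mid c, \theta) + \frac{\partial}{\partial \theta} \log p(x_{nj} \mid k, \theta)$$

The patch label priors are multinomial distributions over the possible classes, with parameters $\pi$, and $\pi_l \ge 0$ for all $l$, such that $p(\tau \mid c, \pi) = \prod_l \left( \frac{c_l \pi_l}{\sum_z c_z \pi_z} \right)^{\tau_l}$. To fix the scale, the value for the last class, $\pi_{C+1}$, is actually fixed. The derivatives are written:

$$\frac{\partial}{\partial \pi_k} \log p(k \mid c, \pi) = c_k \left( \frac{1}{\pi_k} - \frac{1}{\sum_z c_z \pi_z} \right) = c_k \left( \frac{1}{\pi_k} - \frac{1}{\sum_{z \in c} \pi_z} \right)$$

$$\frac{\partial}{\partial \pi_w} \log p(k \mid c, \pi) = -\frac{c_w}{\sum_z c_z \pi_z} = -\frac{c_w}{\sum_{z \in c} \pi_z} \quad \text{when } w \neq k$$

The data likelihood is a mixture of Gaussians $\Phi$, with parameters $\lambda = \{\rho, \mu, \Sigma\}$, such that $p(x_j \mid k, \lambda_k) = \Phi_k(x_j) = \sum_p \frac{\rho_{kp}}{\sum_z \rho_{kz}} \mathcal{N}(x_j \mid \mu_{kp}, \Sigma_{kp})$. The derivatives are then:

$$\frac{\partial}{\partial \rho_{kp}} \log p(x_{nj} \mid k, \lambda_k) = \frac{1}{\sum_z \rho_{kz}} \left( \frac{\mathcal{N}(x_{nj} \mid \mu_{kp}, \Sigma_{kp})}{\Phi_k(x_{nj})} - 1 \right)$$

$$\frac{\partial}{\partial \mu_{kp}} \log p(x_{nj} \mid k, \lambda_k) = \frac{\rho_{kp}}{\sum_z \rho_{kz}}\, \frac{\mathcal{N}(x_{nj} \mid \mu_{kp}, \Sigma_{kp})}{\Phi_k(x_{nj})}\, \Sigma_{kp}^{-1} (x_{nj} - \mu_{kp})$$

$$\frac{\partial}{\partial \Sigma_{kp}} \log p(x_{nj} \mid k, \lambda_k) = \frac{1}{2}\, \frac{\rho_{kp}}{\sum_z \rho_{kz}}\, \frac{\mathcal{N}(x_{nj} \mid \mu_{kp}, \Sigma_{kp})}{\Phi_k(x_{nj})} \left( \Sigma_{kp}^{-1} (x_{nj} - \mu_{kp})(x_{nj} - \mu_{kp})^T \Sigma_{kp}^{-1} - \Sigma_{kp}^{-1} \right)$$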
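Derivatives of this kind are easy to get wrong, so a finite-difference check is a useful safeguard before handing them to a conjugate gradients routine. The sketch below does this for the mixing-coefficient and mean derivatives of a one-dimensional mixture with unnormalised weights $\rho$, with the normalisation by $\Phi(x)$ written explicitly. The function names, the restriction to one dimension, and the numerical values are assumptions made for the illustration.

```python
import numpy as np

def normal_pdf(x, mu, var):
    """Density of N(x | mu, var) for arrays of component means and variances."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def log_phi(x, rho, mu, var):
    """log of the mixture density with mixing weights rho / sum(rho)."""
    return np.log(np.sum(rho / rho.sum() * normal_pdf(x, mu, var)))

def grad_rho(x, rho, mu, var):
    """Analytic d log Phi / d rho_p = (N_p / Phi - 1) / sum(rho)."""
    comp = normal_pdf(x, mu, var)
    phi = np.sum(rho / rho.sum() * comp)
    return (comp / phi - 1.0) / rho.sum()

def grad_mu(x, rho, mu, var):
    """Analytic d log Phi / d mu_p = (rho_p / sum(rho)) (N_p / Phi) (x - mu_p) / var_p."""
    comp = normal_pdf(x, mu, var)
    phi = np.sum(rho / rho.sum() * comp)
    return rho / rho.sum() * comp / phi * (x - mu) / var

def finite_diff(f, v, eps=1e-6):
    """Central finite-difference gradient of the scalar function f at v."""
    grad = np.zeros_like(v)
    for i in range(v.size):
        step = np.zeros_like(v)
        step[i] = eps
        grad[i] = (f(v + step) - f(v - step)) / (2.0 * eps)
    return grad

# Hypothetical two-component mixture and test point.
x0 = 0.4
rho0 = np.array([1.0, 2.5])
mu0 = np.array([-1.0, 1.5])
var0 = np.array([0.5, 2.0])

num_rho = finite_diff(lambda r: log_phi(x0, r, mu0, var0), rho0)
num_mu = finite_diff(lambda m: log_phi(x0, rho0, m, var0), mu0)
```

Comparing `num_rho` with `grad_rho(x0, rho0, mu0, var0)` and `num_mu` with `grad_mu(x0, rho0, mu0, var0)` gives agreement up to finite-difference error; the same pattern extends to the covariance derivative.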

We have just detailed the four elements we needed to start our experiments: the data, the features, the underlying generative model, and the learning procedure. In the next section, we will describe the experiments we have conducted on the influence of the amount of labels, and the results they have produced on the performance of the generative and discriminative models, on the choice of model, and on the choice between HF and CC.

3.5 Influence of the amount of labelled data

This section is concerned with the core of the experiments. We want to understand whether the amount of labelled data has any influence on the choice of model, more precisely on the choice of the prior over $\theta$ and $\tilde{\theta}$. To study our model in the context of semi-supervised learning, we have chosen to apply it to object recognition. However, note that object recognition is not the goal here: our primary concern is the variation of the classifier's performance, rather than its absolute value.

Each test has a different level of supervision, going from almost totally unsupervised to fully supervised. We have three parameters here: the percentage of fully labelled images, the percentage of weakly labelled images and the percentage of unlabelled images, which actually leaves only two degrees of freedom. This means that each test has a fixed proportion of fully labelled data, and a fixed proportion of weakly labelled data. Each test is run 10 times. For every run, we split training set / test set differently, but we make sure to keep the levels of full / weak supervision constant. For every test run, using conjugate gradients, we obtain a MAP estimate of $\theta$ and $\tilde{\theta}$. To avoid scaling problems due to the different magnitudes of the posterior and marginal distributions, training is stopped when both $p(C \mid X, \theta)$ and $p(X \mid \tilde{\theta})$ have converged.
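The two degrees of freedom mentioned above can be made concrete with a small helper that turns the two percentages into three disjoint sets of training images. This is a hedged sketch: the function name `split_supervision` and the use of a seeded shuffle are hypothetical, and rounding the counts up is an assumption, chosen because it is consistent with counts quoted in this chapter such as 33% of 80 training images being 27 images.

```python
import math
import random

def split_supervision(image_ids, pct_full, pct_weak, seed=0):
    """Split training images into fully labelled, weakly labelled and unlabelled sets.

    pct_full and pct_weak are integer percentages; the unlabelled share is
    whatever remains, which is why only two of the three are free.
    """
    n = len(image_ids)
    n_full = math.ceil(pct_full * n / 100)                    # assumed rounding: up
    n_weak = min(math.ceil(pct_weak * n / 100), n - n_full)   # never exceed what is left
    shuffled = list(image_ids)
    random.Random(seed).shuffle(shuffled)                     # a different seed gives a different run
    full = shuffled[:n_full]
    weak = shuffled[n_full:n_full + n_weak]
    unlabelled = shuffled[n_full + n_weak:]
    return full, weak, unlabelled

# Example with 80 training images in one category.
full, weak, unlabelled = split_supervision(range(80), pct_full=33, pct_weak=50)
```

Keeping the percentages fixed while changing `seed` mirrors the protocol of repeating each test with different splits at a constant level of supervision.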
Inference is then performed using the resulting $\theta_{MAP}$ only, which means that when we report results using a particular type of model, they are given through the predictive accuracy of $\theta_{MAP}$, and always on test images.

The rest of this chapter reports the experimental results on the two data-sets described in section 3.1. We first report the evolution of the performance of the generative and discriminative models when adding labels, and then study what the best model is (generative, discriminative or hybrid) depending on the level of supervision. The experiments are also run on the CC framework, and an empirical comparison between HF and CC is then reported. Finally, a summary of the experimental results is given.

On the generative / discriminative models

General experimental set-up

Reminder: each test has a fixed percentage of fully labelled images, a fixed percentage of weakly labelled images, and is repeated 10 times using 10 different splits of the data into training / test sets.

Initialisation of $\theta$: for the Gaussian mixture associated with the (patch-)class $k$, we run the EM algorithm on the patches that are known to belong to class $k$ (background patches have their GMM initialised the same way). This implies that we always need at least a couple of images to be labelled. Finally, we set the class priors to $1/C$, and all the $\pi_k$ to 1, to obtain a complete $\theta_0$.

We then learn:

- our generative model described in equations (2.2) and (2.4), starting with $\theta_0$, and
- our discriminative model described in equations (2.2) and (2.3), starting with $\theta_0$.

CSB data-set

We run experiments on the CSB data-set. Each category has 170 images, including at least 80 segmented images. For each run of a particular test, we use a random split of these images into 80 training images and 90 test images per category. For the amount of supervision, we use the following percentages: 2%, 15%, 33%, 50%, 67%, 85% and 100% for fully labelled images, and 0%, 15%, 33%, 50%, 67%, 85% or 98% for weakly labelled images, which means there are always at least 2 fully labelled images. This is for initialisation purposes.

Figure 3.5 plots the results of these tests (averaged over 10 runs) in terms of classification performance, for the purely generative and discriminative models. The first row shows the generative case, while the second row shows the discriminative case. In the first column, the abscissa is the percentage of fully labelled images, and each curve represents a fixed proportion of weakly labelled images. In the second column, the abscissa is the percentage of weakly labelled images, and each curve represents a fixed proportion of fully labelled images.

Figure 3.5: CSB data-set - Generative and discriminative models performance on test sets against the percentages of fully labelled and weakly labelled data. The training sets contain 80 images per category, the test sets 90 images per category. Left column: performance of the generative model (top) and the discriminative model (bottom) when varying the percentage of fully labelled data. Right column: performance of the generative model (top) and the discriminative model (bottom) when varying the percentage of weakly labelled data.

Discriminative case (bottom row): in both plots, the performance increases, and levels off at the same classification performance whether we increase the fully or the weakly labelled images. This is slightly surprising, as we would have expected the performance to be worse when we have lots of weakly labelled images and few fully labelled images. This success may be due to the fact that, even though we have a discriminative model, the basis is still a generative model. Note that the performance stabilises quite quickly (more than 33% (=27) of fully labelled images seems to make little difference). This is also true, although slightly more slowly, when increasing the proportion of weakly labelled images (more than 50% (=40) weakly labelled images has very little effect).

Generative case (top row): the results are rather interesting. When we fix the percentage of fully labelled images and increase the proportion of weakly labelled images, the performance increases, which is what we would expect. However, when we fix the weakly labelled data ratio and increase the proportion of fully labelled images, the performance improves at first but rapidly drops after 33% (=27 images) (15% (=12) in some cases). This may be due to two things: the more obvious reason is that the model is too limited and underfits; the other reason is that LDA is performed using image labels rather than patch labels, so the generative model should do better at the image level than at the patch level.

Wildcats data-set

We run the same experiments on the wildcats data-set. Each category has 100 images, all of them segmented. For each run of a particular test, we use a random split of these images into 75 training images and 25 test images per category. As for the CSB data-set, for the amount of supervision, we use the following percentages: 2%, 15%, 33%, 50%, 67%, 85% and 100% for fully labelled images, and 0%, 15%, 33%, 50%, 67%, 85% or 98% for weakly labelled images, which means there are always at least 2 fully labelled images for initialisation purposes.

Figure 3.6 plots the results of these tests (averaged over 10 runs) in terms of classification performance, for the purely generative and discriminative models.
The first row shows the generative case, while the second row shows the discriminative case. In the first column, we vary the percentage of fully labelled images, and each curve represents a fixed proportion of weakly labelled images. In the second column, we vary the percentage of weakly labelled images, and each curve represents a fixed proportion of fully labelled images.

Figure 3.6: Wildcats data-set - Generative and discriminative models performance on test sets against the percentages of fully labelled and weakly labelled data. The training sets contain 75 images per category, the test sets 25 images per category. Left column: performance of the generative model (top) and the discriminative model (bottom) when varying the percentage of fully labelled data. Right column: performance of the generative model (top) and the discriminative model (bottom) when varying the percentage of weakly labelled data.

The results are very different from the former data-set. First, note the drop in absolute performance. The number of classes is the same in each case (3), so a random system would have 33% accuracy on both data-sets. While the average performance on CSB is around 75%, the performance on this data-set is 50% at its best. Clearly, the underlying generative model described in section 3.3 is not well suited. However, the task is also more difficult. Indeed, wildcats are very similar in appearance and are much more difficult to discriminate than cows and bikes.

Generative case (top row): when fully labelled images are added, the performance of the generative model increases almost linearly. No curve stabilises, so it seems that the system has not been fed enough data to reach the performance peak, which confirms that the task is harder. The curves are close to one another, which suggests that weakly labelled images have little influence. Indeed, as expected, when weak labels are added (right plot), the performance also increases, but more slowly. In addition, the curves are not on top of each other like they were in the left plot, but are clearly separated, showing the stronger influence of fully labelled images. Again, the peak is never attained.

Discriminative case (bottom row): when adding full labels, the performance increases much more quickly and stabilises around 49%. When adding weak labels, not much seems to happen, since most of the curves are on top of each other. Again, this shows the stronger influence of full labels. In addition, it seems that the discriminative model has not reached its limit yet, i.e. that more data could help.

In order to try and improve the performance, we have also run experiments on the wildcats data-set with a mixture of 2 Gaussians instead of a single one. The results did not change much, so we will not show them here, but in figure B.1 in section B.3.1 of the appendix.

We have discussed the effect of the amount of labelled data on the generative and discriminative models. We will now focus on the hybrid models.

On the choice of model

More interestingly perhaps, we wish to study the effect of the amount of labelled data on the hybrid models. The original hope was to study the progression of $\hat{\alpha}$, the $\alpha$ maximising the classification performance, when increasing the percentage of labelled images. However, there is a lot of variability in the training runs. As a consequence, sometimes (often in some cases), the average curve of the classification performance against $\alpha \in [0, 1]$ is not peaked for intermediate values of $\alpha$, even though the best result is given by an intermediate value of $\alpha$ in most of the 10 runs. This problem did happen for the toy example as well.
If we look at figure A.1 in section A.1 of the appendix, the peak is for $\alpha \approx 0.8$, but it is not a really strong peak. However, if we take the 15 curves one by one as shown in figure 2.8, most of them have a really strong peak for an intermediate value of $\alpha$. In this section, the problem being more difficult (two levels of labelling, no clear separation between classes), this effect is even more important. Figure 3.7 shows an example of this phenomenon.
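The way averaging can hide per-run peaks is easy to reproduce on a made-up example. In the sketch below, the accuracy values are invented purely for illustration (they are not experimental results), and the variable names are hypothetical: each of the two runs peaks at a different intermediate value of α, yet the averaged curve is highest at the two extremes.

```python
import numpy as np

alphas = np.array([0.0, 0.3, 0.7, 1.0])

# Invented accuracies, one row per run and one column per value of alpha.
runs = np.array([
    [0.60, 0.70, 0.40, 0.60],   # this run peaks at alpha = 0.3
    [0.60, 0.40, 0.70, 0.60],   # this run peaks at alpha = 0.7
])

# Best alpha of each individual run: always an intermediate (hybrid) value here.
best_per_run = alphas[runs.argmax(axis=1)]

# Averaging over runs first and then maximising tells a different story.
mean_curve = runs.mean(axis=0)
best_of_mean = alphas[mean_curve.argmax()]
```

Here `best_per_run` contains only intermediate values while `best_of_mean` is one of the two end points, which is why counting per-run winners can be more informative than reading the peak of an averaged curve.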

Figure 3.7: Performance on the different runs. $\hat{\alpha}$ is the $\alpha$ giving the best performance. Note that no two curves look the same and that, although most of them peak at intermediate values of $\alpha$, the average performance peaks at the generative and discriminative ends.

Perhaps a more appropriate measure is to study the number of times, out of 10 runs, the best result was obtained from the generative model, from the discriminative model, or from a hybrid model. There are two quantities we may want to look at:

- the cumulative distribution function $p(\hat{\alpha} \le \alpha) = \sum_{\alpha' \in S,\ \alpha' \le \alpha} p(\hat{\alpha} = \alpha')$, where $\alpha \in [0, 1]$ and $S$ is the set of studied values for $\alpha$. Of course $p(\hat{\alpha} \le \alpha) \in [0, 1]$, and $p(\hat{\alpha} \le 1) = 1$. This is calculated for each studied value of $\alpha$, by a simple count of how many times (out of our 10 runs) the best solution was found for $\hat{\alpha} \le \alpha$.
- the probability of each type of model to give the best solution. This is obtained by a simple count of how many times (out of 10 runs) the best solution comes from the generative model, how many times it comes from a hybrid model, and how many times it comes from the discriminative model. It will be plotted for 3 values of $\alpha$: 0 will represent the generative case, 0.5 any hybrid case, and 1 the discriminative case.

We hope to observe an evolution of these quantities when we vary the respective amounts of fully and weakly labelled data. We will study these variations on our two data-sets, using both our hybrid framework HF and the convex combination framework CC.
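Both quantities are simple counts over runs, as the following sketch shows. The function names and the list of per-run winners `best_alphas` are hypothetical (the ten values are invented to exercise the code, not taken from the experiments); the grid `ALPHAS` matches the set of α values studied in the hybrid experiments of this chapter.

```python
import numpy as np

# Grid of studied alpha values (0 = generative, 1 = discriminative).
ALPHAS = np.array([0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1])

def best_alpha_cdf(best_alphas, grid=ALPHAS):
    """Empirical p(best alpha <= a) for every a in the grid, by counting runs."""
    best = np.asarray(best_alphas, dtype=float)
    return np.array([np.mean(best <= a) for a in grid])

def model_type_probabilities(best_alphas):
    """Fraction of runs won by the generative, a hybrid, or the discriminative model."""
    best = np.asarray(best_alphas, dtype=float)
    return {
        "generative": float(np.mean(best == 0.0)),
        "hybrid": float(np.mean((best > 0.0) & (best < 1.0))),
        "discriminative": float(np.mean(best == 1.0)),
    }

# Invented winners for 10 runs of one test.
best_alphas = [0.0, 0.3, 0.3, 0.5, 0.7, 0.9, 1.0, 1.0, 0.2, 0.6]
cdf = best_alpha_cdf(best_alphas)
probs = model_type_probabilities(best_alphas)
```

By construction the first entry of `cdf` equals the generative fraction, the last entry equals 1, and the three entries of `probs` sum to 1.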

Since all the experimental results will be reported using the same figure format, it is worth explaining in detail once what the plots represent:

- the top row (for example 3.10a) shows the cumulative distribution functions of the best-performing value $\hat{\alpha}$ for various discrete values of $\alpha$. Each curve in a single plot represents a fixed percentage of fully labelled images, while each column represents a different proportion of weakly labelled images. Templates of various situations are shown in figure 3.8.

Figure 3.8: Schematic plots for the cumulative distribution function. (a) The generative model always gives the best performance, so $p(\hat{\alpha} \le 0) = 1$. The function jumps for $\alpha = 0$. (b) The discriminative model always gives the best performance, so $p(\hat{\alpha} < 1) = 0$ and $p(\hat{\alpha} \le 1) = 1$. The function jumps for $\alpha = 1$. (c) The generative or discriminative models give the best solution but never the hybrid model. The function jumps at $\alpha = 0$ and $\alpha = 1$. (d) The best solution comes from all kinds of models. The distribution function increases regularly. (e) The best solution comes from different hybrid models. The distribution function increases regularly, but starts at 0 (no generative model performs best) and reaches 1 before $\alpha = 1$ (no discriminative model performs best). (f) One particular value of $\alpha$ gives the best performance. The curve jumps from 0 to 1 for this $\alpha$.

In the ideal case, we would like to have plots of the kind seen in figure 3.8f. Unfortunately, as previously discussed, most of the curves peak for a different value of $\alpha$, so we do not expect plot 3.8f to happen, but we can still hope to see plot 3.8e, or at least plot 3.8d with low values at $\alpha = 0$ and $\alpha = 1$. Both types of plots indicate that hybrid models often lead to the best result. Any of the plots in the top row of figure 3.8 would mean that generative or discriminative models always perform best.

- the middle row (for example 3.10b) shows the probability of the best solution being the generative model, the discriminative model, or a hybrid model. Here, the probability of an event E refers to the number of times E occurred out of 10 runs. Again, each curve in a single plot represents a fixed proportion of fully labelled images, while each column represents a different proportion of weakly labelled images. Templates of various situations are shown in figure 3.9.

Figure 3.9: Schematic plots for the models' probability. (a) The generative model always gives the best performance, so $p(\hat{\alpha} = 0) = 1$. (b) The discriminative model always gives the best performance, so $p(\hat{\alpha} = 1) = 1$. (c) The generative or discriminative models give the best solution but never the hybrid model. The curve has a sharp V shape. (d) The best solution comes from all kinds of models, but with a preference for the extremes. The curve has a flatter V shape. (e) The best solution comes from all kinds of models, but with a preference for hybrid models. The curve has a hilly shape. (f) The best performance is always achieved by hybrid models. The curve has a sharply peaked hill shape.

In the ideal case, we would like to have plots of the kind seen in figures 3.9e and 3.9f. Both types of plots indicate that hybrid models often lead to the best result. Any of the plots in the top row of figure 3.9 would mean that generative or discriminative models are to be preferred.

- the bottom row (for example 3.10c) shows how the probability of the best solution being a particular model (generative, discriminative, or hybrid) evolves when we add full labels. Each curve in a single plot represents a fixed percentage of weakly labelled images. Each column refers to one of the model types.

General experimental set-up

Reminder: each test has a fixed percentage of fully labelled images and a fixed percentage of weakly labelled images, and is run for different values of α. Each test is repeated 10 times using 10 different splits of the data into training and test sets. For each run, we learn a model for various values of α between 0 (generative) and 1 (discriminative). The actual values are {0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1}. In fact, the 0 and 1 cases have already been detailed in the previous section. We start the training with the learnt discriminative model (α = 1 case) as our initial θ and the learnt generative model (α = 0 case) as our initial θ̃.

CSB data-set - HF

We run experiments on the CSB data-set. Each category has 170 images, including at least 80 segmented images. For each run of a particular test, we use a random split of these images into 80 training images and 90 test images per category. For the amount of supervision, we use the following percentages: 2%, 15%, 33%, 50%, 67%, 85% and 100% for fully labelled images, and 0%, 15%, 33%, 50%, 67%, 85% or 98% for weakly labelled images, which means there are always at least 2 fully labelled images. This is for initialisation purposes.

Figure 3.10 plots the results of these tests (averaged over 10 runs), not only for the purely generative and discriminative models, but also for the hybrid ones. The top row shows that many of the best solutions are actually hybrid models, since the cumulative distribution functions go smoothly from a low p(α = 0) (probability of preferring the generative model) to 1. This is true except for low percentages of fully labelled data. Indeed, the magenta and dark blue curves (2% (=2) and 15% (=12) fully labelled images) show a strong preference for the generative model, since they start from much higher up.
Because the curves go up so smoothly, the plots also show that there is no typical α that would always give the best solution. Interestingly, there seems to be an order in the curves: the higher the percentage of fully labelled data, the lower the curve stands compared to the others, and the more it has to catch up for high values of α (high meaning close to 1). This

Figure 3.10: CSB data-set - HF - Choice of model when varying the percentage of fully labelled data. The training sets contain 80 images. The statistics are computed on 90 test images. (a) Cumulative distribution functions. Each column stands for a different percentage of weakly labelled data. (b) Number of times (out of 10) each training has given the best answer. Each column stands for a different percentage of weakly labelled data. (c) Evolution, when varying the percentage of fully labelled data, of the number of times a particular training has given the best answer. Left: generative, middle: hybrid, right: discriminative. From these plots, we can deduce that the best performance is given by the generative model for under 33% (=27) fully labelled training images, then the best performance is achieved with a hybrid model.

suggests that the preference is shifting from low values of α to high values. This is not true for every curve, but it seems to be the tendency.

The middle row confirms the general preference for hybrid models, since many of the curves peak in the middle.

The bottom row is very interesting. It summarises what we were saying about the shift from α close to 0 to α close to 1. Indeed, the preference for the generative model decreases, while the preference for hybrid models increases until we reach enough fully labelled data for the discriminative model to be preferred.

Note that the tendencies we have been discussing are the same in every column, i.e. for any percentage of weakly labelled data, which suggests that weakly labelled data has little influence on the choice of the type of model. Indeed, figure 3.11 shows the same results as figure 3.10 (statistics computed on the CSB data-set using the HF framework), but each column now represents a different percentage of fully labelled data, and each curve a different percentage of weakly labelled data. We can observe that all the comments we have made are only valid for the percentage of fully labelled data. If we look at figure 3.11a (the top row), the left plot has all its curves high up, but they go down as we increase the percentage of fully labelled data until they are low down in the right plot. Most curves are on top of each other. The steepness of the curves increases as we add full labels, meaning that more and more hybrid models are preferred. The middle and bottom rows show that hybrid models are preferred from about 33% (=27) fully labelled images, while the generative model is preferred for 2% (=2) and 15% (=12) fully labelled images.

Figure 3.11: CSB data-set - HF - Choice of model when varying the percentage of weakly labelled data. The training sets contain 80 images. The statistics are computed on 90 test images. (a) Cumulative distribution functions. Each column stands for a different percentage of fully labelled data. (b) Number of times (out of 10) each type of model has given the best answer. Each column stands for a different percentage of fully labelled data. (c) Evolution, when varying the percentage of weakly labelled data, of the number of times a particular type of model has given the best answer. Left: generative, middle: hybrid, right: discriminative. From these plots, we can deduce that the best performance is given by the generative model for under 33% (=27) fully labelled training images, then the best performance is achieved with a hybrid model.

CSB data-set - CC

Figure 3.12 shows exactly the same quantities as figure 3.10 (statistics computed on the CSB data-set plotted against the percentage of fully labelled data), but this time they were obtained using the CC framework described earlier.

The top row shows that many of the best solutions are actually hybrid models, but many other best solutions are generative. Indeed, the cumulative distribution functions go smoothly from a much higher p(α = 0) (probability of preferring the generative training) to 1. This is not the case for high percentages of fully labelled data though, for which hybrid models are preferred. Again, there does not seem to be a typical α that would always give the best solution. Note that the curves corresponding to higher percentages of fully labelled data still dominate the others, which suggests that the preference is shifting from low values of α to high values. For this data-set too, the percentage of weakly labelled data seems to have little influence on the pattern.

The bottom row is quite different from 3.10c. Indeed, the probability that the generative model gives the best classifier decreases, but so does the probability that a hybrid model gives the best solution. This time, the shift occurs not from generative to hybrid to discriminative as it did in figure 3.10, but from generative / hybrid to discriminative. As a result, the probability that the discriminative model is picked is much higher. Note that the generative and discriminative ends are exactly the same in figures 3.10 and 3.12, so if the generative and discriminative models are more often preferred with the combination framework, it means the CC hybrid models perform worse.
This will be further discussed in the section on the choice between HF and CC.

Figure 3.13 shows the same results as figure 3.12 (statistics computed on the CSB data-set using the CC framework), but each column now represents a different percentage of fully labelled data, and each curve a different percentage of weakly labelled data. As foreseen, no obvious pattern emerges when adding weak labels. However, all the plots confirm the preference for the discriminative model when the percentage of fully labelled data is over 50% (i.e. more than 40 fully labelled images). Indeed, the cumulative distribution function is flat all along and jumps at α = 1. The jump gets higher as the percentage increases.

Figure 3.12: CSB data-set - CC - Choice of model when varying the percentage of fully labelled data. The training sets contain 80 images. The statistics are computed on 90 test images. (a) Cumulative distribution functions. Each column stands for a different percentage of weakly labelled data. (b) Number of times (out of 10) each type of model has given the best answer. Each column stands for a different percentage of weakly labelled data. (c) Evolution, when varying the percentage of fully labelled data, of the number of times a particular type of model has given the best answer. Left: generative, middle: hybrid, right: discriminative. From these plots, we can deduce that the best performance is given by the generative model or a hybrid model for under 50% (=40) fully labelled training images, then the best performance is achieved by the discriminative model.

Figure 3.13: CSB data-set - CC - Choice of model when varying the percentage of weakly labelled data. The training sets contain 80 images. The statistics are computed on 90 test images. (a) Cumulative distribution functions. Each column stands for a different percentage of fully labelled data. (b) Number of times (out of 10) each type of model has given the best answer. Each column stands for a different percentage of fully labelled data. (c) Evolution, when varying the percentage of weakly labelled data, of the number of times a particular type of model has given the best answer. Left: generative, middle: hybrid, right: discriminative. From these plots, we can deduce that the best performance is given by the generative model or a hybrid model for under 50% (=40) fully labelled training images, then the best performance is achieved by the discriminative model.

Wildcats data-set - HF

We run the same experiments on the wildcats data-set. Each category has 100 images, all of them segmented. For each run of a particular test, we use a random split of these images into 75 training images and 25 test images per category. As for the CSB data-set, for the amount of supervision, we use the following percentages: 2%, 15%, 33%, 50%, 67%, 85% and 100% for fully labelled images, and 0%, 15%, 33%, 50%, 67%, 85% or 98% for weakly labelled images, which means there are always at least 2 fully labelled images for initialisation purposes.

Figure 3.14 plots the results of these tests (averaged over 10 runs), not only for the purely generative and discriminative models, but also for the hybrid ones. The top row and the middle row show that most of the best classification performances are found for hybrid models. Indeed, all the cumulative distribution functions follow most of the diagonal. Because the curves go up so smoothly, the plots also show that there is no typical α that would always give the best solution. The middle plots almost all have a strong peak for hybrid models. In the first two rows, the curves are practically on top of one another, which suggests that the pattern is not influenced by the percentage of weakly labelled data. Similarly, the pattern is the same in every column, so the percentage of fully labelled data should not have a strong influence either.

The bottom row confirms the little influence both percentages have. Indeed, the preference for hybrid models is strong, but relatively constant across the various percentages of fully labelled data. Correspondingly, the probabilities of picking the generative and discriminative models are low and constant. Again, since the curves are on top of each other, the percentage of weakly labelled images should not change the pattern much.
Indeed, figure 3.15 shows the same results as figure 3.14 (statistics computed on the wildcats data-set using the HF framework), but each column now represents a different percentage of fully labelled data, and each curve a different percentage of weakly labelled data. As we could foresee, the plots look very similar to the ones in figure 3.14.

Note that results are similar for a mixture of 2 Gaussians. The plots are shown in figures B.2 and B.3 in section B.3.2 of the appendix.

Figure 3.14: Wildcats data-set - HF - Choice of model when varying the percentage of fully labelled data. The training sets contain 75 images. The statistics are computed on 25 test images. (a) Cumulative distribution functions. Each column stands for a different percentage of weakly labelled data. (b) Number of times (out of 10) each type of model has given the best answer. Each column stands for a different percentage of weakly labelled data. (c) Evolution, when varying the percentage of fully labelled data, of the number of times a particular type of model has given the best answer. Left: generative, middle: hybrid, right: discriminative. From these plots, we can deduce that the best performance is consistently achieved by a hybrid model.

Figure 3.15: Wildcats data-set - HF - Choice of model when varying the percentage of weakly labelled data. The training sets contain 75 images. The statistics are computed on 25 test images. (a) Cumulative distribution functions. Each column stands for a different percentage of fully labelled data. (b) Number of times (out of 10) each type of model has given the best answer. Each column stands for a different percentage of fully labelled data. (c) Evolution, when varying the percentage of weakly labelled data, of the number of times a particular type of model has given the best answer. Left: generative, middle: hybrid, right: discriminative. From these plots, we can deduce that the best performance is consistently achieved by a hybrid model.

Wildcats data-set - CC

Figure 3.16 shows exactly the same quantities as figure 3.14 (statistics computed on the wildcats data-set plotted against the percentage of fully labelled data), but this time they were obtained using the CC framework described earlier. The plots look extremely similar to the ones in figure 3.14. Therefore everything we said about figure 3.14 can be repeated here: there is a strong preference for hybrid models, which is fairly constant across the various percentages of fully labelled and weakly labelled images.

It follows that the same can be said about figure 3.17, which plots the same quantities as figure 3.16 (statistics computed on the wildcats data-set using the CC framework), but each column now represents a different percentage of fully labelled data, and each curve a different percentage of weakly labelled data.

Note that results are similar for a mixture of 2 Gaussians. The plots are shown in figures B.4 and B.5 in section B.3.2 of the appendix.

Figure 3.16: Wildcats data-set - CC - Choice of model when varying the percentage of fully labelled data. The training sets contain 75 images. The statistics are computed on 25 test images. (a) Cumulative distribution functions. Each column stands for a different percentage of weakly labelled data. (b) Number of times (out of 10) each type of model has given the best answer. Each column stands for a different percentage of weakly labelled data. (c) Evolution, when varying the percentage of fully labelled data, of the number of times a particular type of model has given the best answer. Left: generative, middle: hybrid, right: discriminative. From these plots, we can deduce that the best performance is consistently achieved by a hybrid model.

Figure 3.17: Wildcats data-set - CC - Choice of model when varying the percentage of weakly labelled data. The training sets contain 75 images. The statistics are computed on 25 test images. (a) Cumulative distribution functions. Each column stands for a different percentage of fully labelled data. (b) Number of times (out of 10) each type of model has given the best answer. Each column stands for a different percentage of fully labelled data. (c) Evolution, when varying the percentage of weakly labelled data, of the number of times a particular type of model has given the best answer. Left: generative, middle: hybrid, right: discriminative. From these plots, we can deduce that the best performance is consistently achieved by a hybrid model.

On the choice between HF and CC

Now that we have seen that hybrid models can help improve the classification performance, we would like to see which of the two methods, HF or CC, performs best. To do so, for each test, we take the hybrid models that gave the best performance (1 model per run, 10 models in total per test), and we average their performance. This gives the average performance of the best hybrid model for each test, which is the quantity we will compare for both frameworks. We will refer to this quantity simply as the best hybrid model's performance.

CSB data-set

Remember that, for the CSB data-set, we observed that HF's hybrid models give a better performance than either extreme model more often than CC's hybrid models do. Considering that the generative and discriminative extremes are the same for either framework, it is safe to assume that HF's hybrid models perform better than CC's hybrid models.

Figure 3.18a shows the classification performances of the best hybrid model generated by HF (blue), the best hybrid model generated by CC (red), the generative model (black), and the discriminative model (green), against the percentage of fully labelled data. Each subplot corresponds to a different percentage of weakly labelled data. Figure 3.18b shows the performances against the percentage of weakly labelled data.

If we look at figure 3.18b, we can see that, for very low percentages of fully labelled data, HF, CC and the generative model perform very similarly, the discriminative model being slightly worse. However, for medium to high percentages of fully labelled data, HF's best hybrid model performs better than the generative model and than CC's best hybrid model, whatever the percentage of weakly labelled data, and the difference increases with the percentage of fully labelled data.
It also performs better than the discriminative model; however, the difference decreases, and eventually the discriminative model catches up with HF's best hybrid model at 100% fully labelled data.

Wildcats data-set

Remember that, for the wildcats data-set, we observed that both HF's and CC's hybrid models consistently give a better performance than either extreme model does.

Figure 3.18: CSB data-set - HF versus CC. The training sets contain 80 images. The statistics are computed on 90 test images. HF's results are plotted in blue, CC's results in red, and the generative and discriminative models' results in black and green respectively. (a) We vary the percentage of fully labelled data. (b) We vary the percentage of weakly labelled data. Note how HF's best hybrid model performs better for medium to high percentages of fully labelled data.

Figure 3.19a shows the classification performances of the best hybrid model generated by HF (blue), the best hybrid model generated by CC (red), the generative model (black), and the discriminative model (green), against the percentage of fully labelled data. Each subplot corresponds to a different percentage of weakly labelled data. Figure 3.19b shows the performances against the percentage of weakly labelled data.

We can see that, whatever the percentages of fully and weakly labelled data, HF and CC perform consistently better than the generative and discriminative models, sometimes much better. However, sometimes HF's best hybrid model overtakes CC's best hybrid model, and sometimes it is the other way round. There does not seem to be any rule; in fact, they give very similar results. The differences are almost irrelevant since their respective curves are almost on top of each other. What can be noticed, though, is that when their curves (blue and red) do separate, HF's best hybrid model tends to perform slightly better.

Note that results are similar for a mixture of 2 Gaussians. The plots are shown in figure B.6 in section B.3.3 of the appendix.
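The quantity compared throughout this section, the best hybrid model's performance, can be sketched as follows. This is only an illustration under the assumption that each run yields a mapping from α to classification performance; the function name `best_hybrid_performance` and the numbers are hypothetical.

```python
def best_hybrid_performance(runs):
    """Average, over runs, of the best performance reached by a hybrid model.

    `runs` is a list with one dict per run, mapping alpha -> performance.
    Hybrid models are those with 0 < alpha < 1.
    """
    best_per_run = []
    for performance_by_alpha in runs:
        hybrid = [p for a, p in performance_by_alpha.items() if 0 < a < 1]
        best_per_run.append(max(hybrid))  # 1 best hybrid model per run
    return sum(best_per_run) / len(best_per_run)


# Two made-up runs (the experiments use 10 runs and 13 values of alpha).
example_runs = [
    {0: 0.70, 0.5: 0.80, 0.9: 0.75, 1: 0.72},
    {0: 0.68, 0.5: 0.74, 0.9: 0.78, 1: 0.73},
]
average_best = best_hybrid_performance(example_runs)
```

Note that the best α is allowed to differ from one run to the next, so this quantity measures what a hybrid model can achieve when α is well chosen rather than the performance at any fixed α.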

Figure 3.19: Wildcats data-set - HF versus CC. The training sets contain 75 images. The statistics are computed on 25 test images. HF's results are plotted in blue, CC's results in red, and the generative and discriminative models' results in black and green respectively. (a) We vary the percentage of fully labelled data. (b) We vary the percentage of weakly labelled data. Both HF and CC perform better than the generative model; however, there does not seem to be any obvious difference between the two hybrid frameworks.

Summary of the experimental results

In the previous section, we have presented many experiments. The goal was to understand the influence of the amount of fully and weakly labelled data on the choice of model. We have used two data-sets, one formed of cows, bikes and sheep, and the other containing lions, tigers and cheetahs.

The first data-set was fairly easy. With both approaches, the results were highly dependent on the percentage of fully labelled data, and not dependent on the percentage of weakly labelled data. However, the results were quite different between our hybrid model and the convex combination framework.

With our approach, the preference went from generative models to hybrid models to discriminative models, as seen in figure 3.10c. The probability that the generative model performs best decreased continuously when adding full labels, up to 67% fully labelled data. The probability that a hybrid model performs best increased until 67% fully labelled data, and then dropped. The probability that the discriminative model performs best was flat and started to increase at 67% fully labelled data.

With the convex combination framework, the preference went from generative / hybrid models to discriminative models, as seen in figure 3.12c. The generative curve (i.e. the probability that the generative model performs best) and the hybrid curve decreased continuously when increasing the percentage of fully labelled data, while the discriminative curve increased continuously.

The second data-set was a lot harder, as the categories were much more similar. With either approach, the results were not dependent on the percentage of fully labelled data or weakly labelled data. In both cases, the best results were almost systematically given by a hybrid model. This suggests that hybrid models are to be preferred when classes are highly ambiguous. However, we did see that neither the generative model nor the discriminative model had reached its best performance. Indeed, figure 3.6 suggests that adding more labelled data, especially more fully labelled data, would lead to better performance from both types of model. It is unclear what would happen to the performance of hybrid models, but we can imagine that it would be overtaken by the discriminative model's performance, as we saw happen with the first data-set.

3.6 Discussion

We have provided a detailed study of the general hybrid objective function introduced in [61] on a toy example. We have parameterised it with α ∈ [0,1], such that α = 0 leads to the generative model, 0 < α < 1 to hybrid models, and α = 1 to the discriminative model. We have also given possible avenues to reach a fully Bayesian version that would allow us to marginalise α out, or to infer α. This framework seems very promising for improving classification performance, not only in a semi-supervised setting, but also in cases where the amount of data is not large enough to saturate the discriminative model's performance.

However, this framework raises a few issues.
First of all, it doubles the number of parameters, so the learning process is more difficult and takes longer. In theory, this is only a problem for intermediate values of α, since the more dependent or the more independent the two sets of parameters become, the easier the optimisation should be. Unfortunately, these intermediate values are the ones we are interested in.

Another problem is the presence of local minima. Indeed, due to the factors $p(x_n \mid \theta)$ and $p(x_n \mid \tilde{\theta})$, both containing marginal densities, we can be sure that our objective function has multiple local minima. This phenomenon gets worse as we increase the dimensions (especially for intermediate values of α), which means that conjugate gradients becomes even more sensitive to initialisation. A local minimum would be acceptable if we were sure to reach a reasonable one every time. However, we do not seem to have this guarantee. Let us have a look at isolated performance curves against α, as shown in figure 3.20. The jerkiness of each plot and the variance across plots indicate that we suffer from local minima.

Figure 3.20: Performance on the different runs. α is the alpha giving the best performance. Note how unstable the results are.

The two issues we have just mentioned call for a better optimisation technique. We will discuss this in the final chapter (Conclusions) of this thesis; however, we can already give a few leads here: Laplace approximation, Monte-Carlo sampling, a combination of expectation-maximisation (EM) and conditional EM [44], and homotopy continuation [2].

If any of these methods gives smoother results, the experiments should be rerun to see if additional patterns can be observed. This would also allow us to tackle the absolute classification performance by applying the framework to much more sophisticated models, which can only be done if we solve the optimisation issue.

Another issue that we have to face lies in the Bayesian version using Laplace approximations, which is the one we ought to use. Let us consider the algorithm presented earlier. If we stop after one iteration, we are too dependent on the initialisation of ξ. However, using more iterations creates the problem of the determinant of the Hessian matrix H(ξ) becoming negative, which makes the Hessian unsuitable as a precision matrix.

Once we have a reliable Bayesian version, the first thing to do would be to experiment with different priors $p(\theta, \tilde{\theta})$. For example, we could set $p(\theta, \tilde{\theta} \mid \Lambda) = p(\tilde{\theta}) \, \mathcal{N}(\theta \mid \tilde{\theta}, \Lambda^{-1})$, where $\Lambda$ is a diagonal matrix representing the precision (instead of our scalar $\sigma^2(\alpha)$). This would give much greater flexibility to the model. It would also make sense, since different variables scale differently and relate differently, so they need a different similarity measure.

The next step would be to break down $p(\theta \mid \tilde{\theta}, \Lambda)$ as we usually break down a prior, i.e. using different distributions depending on which type of variable we are dealing with. For example, a mean would keep a Gaussian prior but a precision would have a Gamma or a Wishart prior. In any case, for any kind of variable $v$, the distribution $p(v \mid \tilde{v}, \Lambda)$ should be parameterised such that the conditional expectation of $v$ is $\tilde{v}$, i.e. such that we have $\mathbb{E}[v \mid \tilde{v}] = \int v \, p(v \mid \tilde{v}, \Lambda) \, dv = \tilde{v}$.

An interesting piece of work would be to actually run the Bayesian version of our experiments, and see the effect that full and weak labels have on the posterior distribution of α. We would expect the mode $\alpha_{MAP}$ to increase with the amount of labels.
Finally, we believe this framework could be used in many different applications. In particular, the speech recognition community understood very quickly the advantage of blending generative and discriminative approaches by using discriminative training of hidden Markov models (HMMs), and our framework provides a rigorous tool to explore new models. Note that Druck et al. [22] have applied our framework successfully to text classification, although they report that their multi-conditional learning method does better
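As a worked illustration of the constraint that the conditional expectation of a variable should equal its generative counterpart (an illustrative choice, not one evaluated in this thesis), consider a precision-type variable $v > 0$ with $\tilde{v} > 0$. A Gamma prior with shape $a > 0$ and rate $a / \tilde{v}$ satisfies the constraint, and the hypothetical parameter $a$ then plays the role of the coupling strength: the larger $a$ is, the more tightly $v$ is tied to $\tilde{v}$.

```latex
p(v \mid \tilde{v}, a)
  = \mathrm{Gamma}\!\left(v \,\middle|\, a, \tfrac{a}{\tilde{v}}\right)
  = \frac{(a/\tilde{v})^{a}}{\Gamma(a)} \, v^{a-1} \exp\!\left(-\frac{a\,v}{\tilde{v}}\right),
\qquad
\mathbb{E}[v \mid \tilde{v}] = \frac{a}{a/\tilde{v}} = \tilde{v},
\qquad
\mathrm{Var}[v \mid \tilde{v}] = \frac{a}{(a/\tilde{v})^{2}} = \frac{\tilde{v}^{2}}{a}.
```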

on harder data-sets. In fact, application fields range from bio-informatics to data-mining, i.e. any field that could benefit from machine learning, and where labelled data can be scarce.


CHAPTER 4 HYBRID BELIEF PROPAGATION

Chapters 2 and 3 have dealt with a hybrid model that combines two different characteristics of generative (i) and discriminative (ii) models: the ability to handle missing data (and in particular to use unlabelled data) for (i), and the power of discrimination for (ii). An obvious application was semi-supervised learning. In this chapter, we will see a new type of interaction between generative and discriminative approaches, used to define an efficient approximation of belief propagation when the state space of the hidden variables is very large.

Belief propagation (BP) is a very popular algorithm for approximate inference in Markov random fields (MRFs), especially in computer vision problems where bottom-up information is often needed to smooth results. While the MRF framework yields an optimisation problem that is NP-hard, graph cuts [12] and BP [3; 76] have proved to be good approximation techniques for tasks such as stereo and image restoration. However, BP is often preferred for its ability to provide an approximation of the posterior distribution over the state space of the hidden variables, rather than a MAP estimate as graph cuts methods do. Unfortunately, BP does not scale well with the cardinality of the state space of the hidden variables and needs further approximations.

In this chapter, we will study an approximation of BP [51] in the context of a global generative framework called the Jigsaw model [46]. This model contains an MRF of high cardinality, and this cardinality increases as we increase the size of the jigsaw. We will explain three different points that enabled the learning of large jigsaws:

- a new sparse belief propagation algorithm for inferring the mapping from an image to

the jigsaw,
- Hybrid BP: a hybrid generative / discriminative method that significantly increases the sparseness of the algorithm, especially when the jigsaw size is large,
- an effective method for learning large jigsaws which exploits the memory and time savings given by the increased sparseness.

We will provide a detailed analysis of how the hybrid inference leads to significant savings in memory and computation time. To demonstrate the success of this method, we will present experimental results applying large jigsaws to an object recognition task.

Section 4.1 will explain the basic belief propagation algorithm as well as a few possible approximations. Then section 4.2 will give a thorough description of the Jigsaw model, and will point out the difficulties of learning an MRF with a state space of high cardinality. Section 4.3 will discuss how a discriminative model can help reduce the search space during inference in each iteration of the jigsaw learning algorithm, thereby making the inference step more efficient (or sometimes even simply possible). Finally, section 4.4 will explain how to integrate this new inference mechanism in the bigger picture of the jigsaw learning algorithm, and section 4.5 will close this chapter with conclusions, some remarks on the limitations of the model, and potential future work.

Throughout this chapter, when considering the i-th pixel of image I, we will refer to its intensity value as $I_i$, and to its location in the image as $t_i$.

4.1 Inference with belief propagation

In this section, we will explain the principle of belief propagation and we will outline a few approximation techniques.

Standard belief propagation

We will now briefly describe the BP algorithm, which is essentially a message-passing algorithm. For a more complete description, please refer to [86]. All the diagrams of this subsection have been borrowed from [86].

To understand BP, we will focus on the case of a pair-wise MRF, as it is the model we will deal with in the application. Note that we still use images as data, so that the most natural MRF is pair-wise but 4-connected (indeed, every pixel is the direct neighbour of 4 others). This is represented in figure 4.1a. The black nodes represent the observed variables (the pixels) and the white nodes represent the hidden random variables whose state we want to infer.

Figure 4.1: (a) A pair-wise 4-connected MRF. Black nodes are the pixels, and the white nodes are the hidden variables whose state we want to infer. (b) Belief of node $l_i$. The posterior belief depends on the incoming messages and the prior belief $\phi_i$.

For now, we will assume that the hidden random variables are the pixel labels. The hidden nodes are 4-connected, i.e. their state depends on the state of their four neighbours. For an image $x$ containing pixels $\{x_1, \dots, x_n\}$, with hidden states $\{l_1, \dots, l_n\}$, $p(l)$ can then be written

\[ p(l) = \frac{1}{Z} \prod_i \phi_i(l_i) \prod_{i,j} \psi_{ij}(l_i, l_j), \]

where the second product runs over pairs of neighbouring nodes. Typically, each hidden variable (each label) has a prior distribution over its states (given by the unary cost function $\phi_i$), and pairs of neighbouring variables share a pair-wise cost function $\psi_{ij}$, which usually acts as a regulariser to keep some smoothness in the labels of neighbouring pixels.

In order to agree with each other on which state they should be in, the $l_i$ need to communicate, using messages. We now introduce $m_{ij}(l_j)$ as the message sent from hidden node $i$ to hidden node $j$ about its belief of what state node $j$ should be in, based on what state $i$ was told it should be in. $m_{ij}(l_j)$ is a vector with one entry per possible state of $l_j$, and each entry is proportional to how likely $i$ thinks it is that $j$ should be in the corresponding state.
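The factorisation of $p(l)$ above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the function name `log_joint_unnormalised`, the nested-list layout, and the use of a single pairwise table shared by all edges are choices made here, not part of the original model description. The code works in the log domain and leaves out the normalising constant $Z$.

```python
def log_joint_unnormalised(labels, log_phi, log_psi):
    """Return log of prod_i phi_i(l_i) * prod_(i,j) psi(l_i, l_j), without the 1/Z term.

    labels  : H x W nested list of integer states l_i
    log_phi : H x W x S nested list of unary log-potentials log phi_i(s)
    log_psi : S x S nested list, one pairwise log-potential shared by all edges
              (sharing a single table is an assumption made for brevity)
    """
    height, width = len(labels), len(labels[0])
    total = 0.0
    for r in range(height):
        for c in range(width):
            total += log_phi[r][c][labels[r][c]]
            # Count each edge of the 4-connected grid once: right and down neighbours.
            if c + 1 < width:
                total += log_psi[labels[r][c]][labels[r][c + 1]]
            if r + 1 < height:
                total += log_psi[labels[r][c]][labels[r + 1][c]]
    return total


# Toy 2 x 2 image with two states, flat unary terms and a smoothness term
# that penalises neighbouring nodes with different labels.
flat_phi = [[[0.0, 0.0] for _ in range(2)] for _ in range(2)]
smoothness = [[0.0, -1.0], [-1.0, 0.0]]
smooth = log_joint_unnormalised([[0, 0], [0, 0]], flat_phi, smoothness)
broken = log_joint_unnormalised([[0, 0], [0, 1]], flat_phi, smoothness)
```

In this toy case the labelling with one dissenting pixel cuts two edges, so its unnormalised log-probability is lower than that of the uniform labelling by two penalty units.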

The belief at node $i$ is proportional to the product of the local evidence at that node, given by $\phi_i(l_i)$, and all the messages coming into node $i$. This is represented in figure 4.1b and can be written

\[ b_i(l_i) = K \, \phi_i(l_i) \prod_{j \in N_i} m_{ji}(l_i), \]

with $K$ such that $b_i$ is normalised and sums to 1. The message sent by node $i$ to node $j$ is then

\[ m_{ij}(l_j) = \sum_{l_i} \frac{b_i(l_i)}{m_{ji}(l_i)} \, \psi_{ij}(l_i, l_j). \]

Using the definitions of $m_{ij}(l_j)$ and $b_i(l_i)$, we can then rewrite the message update rule, up to the constant $K$, as

\[ m_{ij}(l_j) \propto \sum_{l_i} \phi_i(l_i) \, \psi_{ij}(l_i, l_j) \prod_{k \in N_i \setminus \{j\}} m_{ki}(l_i), \]

as represented in figure 4.2a. The algorithm repeats itself until convergence of the posterior beliefs. The global process is shown in figure 4.2b.

Figure 4.2: (a) Message sent from $l_i$ to $l_j$. The message sent to node $j$ depends on the messages coming into $i$ from other nodes, on $i$'s prior belief $\phi_i$, and on the pair-wise relation $\psi_{ij}$ between $i$ and $j$. (b) Message passing across the nodes.

Now that we have briefly explained the algorithm, we can easily define the max-product variant of BP. Instead of summing over all possible states of $l_i$, we just pick the maximum of the distribution, so that the messages become

\[ m_{ij}(l_j) = \max_{l_i} \; \phi_i(l_i) \, \psi_{ij}(l_i, l_j) \prod_{k \in N_i \setminus \{j\}} m_{ki}(l_i). \]

This is the variant that we will consider in the rest of this chapter.
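The max-product update above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: it operates in the log domain (so products become sums), uses a synchronous update schedule, a fixed number of iterations, and a single pairwise table shared by all edges, all of which are assumptions made here. The names `max_product_message` and `run_max_product` are hypothetical.

```python
def max_product_message(i, j, log_phi, log_psi, messages, neighbours):
    """log m_ij(l_j) = max over l_i of
    log phi_i(l_i) + log psi(l_i, l_j) + sum of log m_ki(l_i) for k in N_i, k != j."""
    n_states = len(log_phi[i])
    # Local evidence at i plus every incoming message except the one from j.
    support = list(log_phi[i])
    for k in neighbours[i]:
        if k != j:
            for s in range(n_states):
                support[s] += messages[(k, i)][s]
    # Maximise over l_i for each candidate state l_j.
    message = [max(support[li] + log_psi[li][lj] for li in range(n_states))
               for lj in range(len(log_phi[j]))]
    # Max-product is invariant to message scaling, so shift the largest entry to 0.
    top = max(message)
    return [m - top for m in message]


def run_max_product(log_phi, log_psi, neighbours, n_iters=10):
    """Synchronous max-product BP; returns unnormalised log-beliefs per node."""
    messages = {(i, j): [0.0] * len(log_phi[j])
                for i in neighbours for j in neighbours[i]}
    for _ in range(n_iters):
        messages = {(i, j): max_product_message(i, j, log_phi, log_psi,
                                                messages, neighbours)
                    for (i, j) in messages}
    beliefs = {}
    for i in neighbours:
        belief = list(log_phi[i])
        for k in neighbours[i]:
            for s in range(len(belief)):
                belief[s] += messages[(k, i)][s]
        beliefs[i] = belief
    return beliefs


# Toy chain 0 - 1 - 2 with two states: only node 0 has informative local evidence.
chain = {0: [1], 1: [0, 2], 2: [1]}
log_phi = [[0.0, 2.0], [0.0, 0.0], [0.0, 0.0]]
log_psi = [[0.0, -1.0], [-1.0, 0.0]]
beliefs = run_max_product(log_phi, log_psi, chain)
map_states = [max(range(2), key=lambda s: beliefs[i][s]) for i in sorted(beliefs)]
```

On this chain the preference of node 0 for state 1 is propagated through the smoothness term, so all three nodes end up selecting state 1. Note that the storage cost is one vector per directed edge with one entry per state, which is what becomes problematic when the state space is large.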

4.1.2 Approximations

As we have previously discussed, BP is often preferred to graph cuts algorithms since it gives a distribution over the states, rather than a MAP estimate. However, BP does not scale very well when the state space is very large, and the optimisation can become a challenging problem. Indeed, we need to store the distribution over all the states for every random variable.

A common approach to tackling this problem is to prune the original state space $S_0$ of a variable by disallowing states for which there is little local support [17; 49]. However, this method is vulnerable to pruning out states incorrectly when the local evidence is insufficiently informative for accurate pruning.

A promising alternative involves using a message-passing algorithm with sparsely represented messages, such that the true messages can be well approximated by their sparse counterparts. For example, if $l$ is a label, Pal et al. [65] use a forward-backward algorithm and approximate each message $p(l)$ by a mixture of Kronecker delta functions $q(l)$ chosen to be within a fixed Kullback-Leibler divergence of the true message,

\[ q(l) = \sum_{s \in S} q_s \, \delta(l = s) \qquad (4.1) \]

where $S$ is the set of states whose corresponding peaks are kept. The true distribution would be achieved for $S = S_0$ and $q_s = p(s)$. In [65] it was shown that computing the approximate message with $K$ delta functions $q^K$ that minimises $KL(q \| p)$ simply requires retaining the largest $K$ elements of $p$ and renormalising. In other words, we have

\[ q^K(l) = \sum_{s \in S_K} q_s^K \, \delta(l = s), \quad \text{where} \quad q_s^K = \frac{p(s)}{\sum_{k \in S_K} p(k)} \]

and $S_K = \{k \in S_0 : p(k) \text{ is among the } K \text{ highest peaks}\}$.

A problem with the above approach is that when messages are almost uniform, a very large number of delta functions is required to achieve a sufficiently good approximation, and so efficiency is lost. In [51], this problem is overcome by adding a uniform distribution

to the mixture of delta functions, so that the sparse message has the form

\[ q(l) = q_0 + \sum_{s \in S} q_s \, \delta(l = s) \qquad (4.2) \]

Unfortunately, finding the sparse message with $K$ delta functions $q^K$ that minimises $KL(q \| p)$ no longer has a closed-form solution. Instead we retain the largest $K$ elements of $p$ and, rather than renormalising, evenly distribute the remaining probability mass amongst the remaining states (those not in $S_K$). Figure 4.3 illustrates this process.

Figure 4.3: The principle of sparse belief propagation. We keep the highest $K$ peaks and redistribute the rest of the probability mass.

This means that

\[ q^K(l) = q_0^K + \sum_{s \in S_K} q_s^K \, \delta(l = s), \quad \text{where} \quad q_0^K = \frac{\sum_{k \in S_0 \setminus S_K} p(k)}{|S_0 \setminus S_K|} \quad \text{and} \quad q_s^K = p(s) - q_0^K. \]

The messages are represented in log-form and, since max-product BP is invariant to message normalisation, each message is normalised so that $\log p_0 = 0$.

To test the accuracy of their method, random messages $p$ are generated by sampling vectors of size 1000 from a Dirichlet distribution. The peakiness of the messages is varied by varying the Dirichlet pseudo-count parameters from 0.01 to 100. For $K = 100$, when using the approximation (4.1) the average KL divergence $KL(q \| p)$ was 1.1. However, when using the new approximation (4.2), the average KL divergence drops to 0.3, indicating a much better approximation of the true message.

Both these methods are very promising; however, we cannot flatten too many peaks this

way as we may touch the wrong ones. Indeed, although these approaches are much more robust than pruning, they both suffer from a rigid choice of peaks. In many applications, choosing to keep the highest peaks may be the right choice to approximate a distribution. However, belief propagation is an algorithm that is used when information is exchanged between random variables (here pixels), when compromises need to be made between what the model says and what the neighbouring pixels say. In this case, the highest peaks are not necessarily the ones we should keep.

In the rest of this chapter, we will present an approach based on the information given by the neighbouring pixels, illustrated on a very promising method for low-level vision, the Jigsaw model [46]. We will start by explaining how the jigsaw works, and then we will focus on our hybrid belief propagation algorithm, and how it helped to learn jigsaws.

4.2 The Jigsaw model

In this section, we describe the probabilistic model used to learn a jigsaw from a set of training images [46], and the few changes we have made to it. The aim is to learn a jigsaw image, such that pieces of the jigsaw are similar in appearance to several regions of the training images and are as large as possible for a particular accuracy of reconstruction. These regions are allowed to be of arbitrary shape. In addition, the jigsaw is required to be exhaustive, so that the entirety of each training image can be reconstructed approximately using only pieces from the jigsaw image. Hence, the jigsaw captures repeated structures in the training image set. Figure 4.4 shows results of the Jigsaw model applied to a set of face images. Note how the jigsaw captures both the appearance and the shape of eyes, noses and mouths.

Figure 4.4: Example of the Jigsaw model applied to face images. Taken from [46]. (a) Training images. (b) Examples of resulting clusters (jigsaw pieces).

A jigsaw $J$ is defined as an image such that each pixel $z$ in $J$ has an intensity value $\mu(z)$ and an associated variance $\lambda^{-1}(z)$. A jigsaw piece is a set of spatially grouped pixels in $J$. We can combine many of these pieces to generate images, noting that pixels in the jigsaw can be used in multiple image locations. For each image $I$, we have an associated offset map $L$ of the same size which determines the jigsaw pieces used to make that image. This offset map defines a position in the jigsaw for each pixel in the image, such that more than one image pixel can map to the same jigsaw pixel. This generative process is schematised in figure 4.5.

Figure 4.5: Diagram of the Jigsaw model. Taken from [46]. The jigsaw is an image with an intensity value $\mu$ and an associated variance $\lambda^{-1}$. Jigsaw pieces are combined to generate images, and any subpart of any piece can be shared across images. To store the pixel-to-pixel mapping, each image $I$ has a corresponding offset map $L$.

Each entry in the offset map is a two-dimensional offset $l_i = (l_{ix}, l_{iy})$, which maps the 2D point $t_i = \mathrm{location}(I_i)$ in the image to a 2D point $z_i$ in the jigsaw, using the correspondences

\[ z_i = (t_i - l_i) \bmod J, \quad \text{i.e.} \quad z_{ix} = (t_{ix} - l_{ix}) \bmod J_w, \quad z_{iy} = (t_{iy} - l_{iy}) \bmod J_h, \]

where $J = (J_w, J_h)$ refers to the jigsaw's dimensions. Notice that if two adjacent pixels in the image have the same offset label, then they map to adjacent pixels in the jigsaw. To explain an image using coherent pieces from the jigsaw, a Markov random field (MRF) is defined on the offset map that encourages neighbouring pixels to have the same offsets:

\[ p(L) = \frac{1}{Z} \exp\Big[ -\sum_{(i,j) \in E} \psi(l_i, l_j) \Big] \]

where $E$ is the set of edges in a 4-connected grid. The interaction potential $\psi$ defines a Potts model on the offsets, $\psi(l_i, l_j) = \gamma \, \delta(l_i \neq l_j)$, where $\gamma$ is a parameter which influences the typical size of the learnt jigsaw pieces. The choice of $\gamma$ affects the granularity of segmentation of the image. For our experiments, we set $\gamma$ to 6, unless otherwise specified.

Given the offset map and the jigsaw, the image pixels are now assumed to be independent of each other. Unlike [46], which used a Gaussian appearance model, we assume that the probability distribution for each image pixel is a mixture of a Gaussian and a uniform distribution,

\[ p(I \mid J, L) = \prod_i \Big[ \pi \, \mathcal{N}\big(I_i \mid \mu(t_i - l_i), \lambda(t_i - l_i)^{-1}\big) + (1 - \pi) \, \mathrm{Uniform}(I_i) \Big] \qquad (4.3) \]

where the product is over image pixel positions and both subtractions are modulo $J$. Figure 4.6 illustrates this new distribution.

Figure 4.6: Structure of the likelihood function. Plot of the likelihood $p(I_i \mid J, l_i)$ for a particular offset $l_i$. As the likelihood is a mixture of a Gaussian and a uniform, it is effectively constant for a range of values of $I_i$.

The use of a mixture distribution has the effect of making the model more robust by implicitly defining an outlier model. This idea can also be found in [81]. Our robust distribution also allows for sparse inference methods to be used (see section 4.3.2). We chose $\pi = 0.9$ in our experiments.

Note that, for multi-channel images (e.g. RGB), separate mean and precision parameters are used for each channel, and the single values $(I_i, \mu(t_i - l_i), \lambda(t_i - l_i))$ become vectors

$(\mathbf{I}_i, \boldsymbol{\mu}(t_i - l_i), \boldsymbol{\lambda}(t_i - l_i))$ whose dimension is the number of channels.

As shown in figure 4.7, this likelihood allows us to build, for every pixel $I_i$, a message $p(I_i \mid J, l_i)$ containing one entry per possible value of $l_i$, so that each message has $J_h \times J_w$ entries.

Figure 4.7: Structure of the distribution over the possible states of $l_i$. Plot of $p(l_i \mid I_i, J)$. Since the likelihood is a mixture of a Gaussian and a uniform, it is effectively constant for a range of values of $l_i$.

Each image pixel will carry its own message around and will send it to neighbouring pixels, in order to agree with them on its mapping location in the jigsaw. This agreement is actually a trade-off between what the top-down information from the Gaussian-uniform appearance model dictates, and what the bottom-up information from the MRF coherency model suggests.

We place an independent Normal-Gamma prior on $\mu$ and $\lambda$ for each jigsaw pixel $z$:

\[ p(J) = \prod_z \mathcal{N}\big(\mu(z) \mid \mu_0, (\beta \lambda(z))^{-1}\big) \, \mathcal{G}\big(\lambda(z) \mid a, b\big) \]

This prior ensures that the behaviour of the model is well defined for unused regions. For our experiments, we set the hyper-parameters $\mu_0$ to 0.5, $\beta$ to 1, $b$ to three times the inverse data variance and $a$ to the square of $b$.

The model defines the joint probability distribution on a jigsaw $J$, a set of images $I_1 \dots I_N$,

and their offset maps $L_1 \dots L_N$ to be

\[ p(J, \{I_n, L_n\}) = p(J) \prod_{n=1}^{N} p(I_n \mid J, L_n) \, p(L_n) \qquad (4.4) \]

4.3 Efficient inference

In the previous section, we introduced a variable $L$, associated with an image $I$, that we called the offset map, such that $l_i = (t_i - z_i) \bmod J$, where $z_i$ is the actual 2D jigsaw location that the pixel at location $t_i$ maps to. We described it this way because it was more convenient for explaining the Jigsaw model; for the remainder of this chapter, however, it will be more appropriate to redefine $L$. Indeed, $L$ may also be called a label map, in which case $l_i = z_i$, i.e. it contains the absolute location (label) of each pixel instead of the offset. The two notations are equivalent, but it is important to know which one is used. In the remainder of this chapter, we will talk about $L$ as a label map (unless otherwise specified).

4.3.1 The issues

The joint distribution to be optimised is given in equation (4.4). In [46], an iterative approach is described for maximising this joint probability that requires alternately optimising the label maps $\{L_n\}$ and the latent jigsaw image $J$. The bottleneck in this procedure is precisely the optimisation of the label maps, which used the alpha-expansion graph cuts algorithm of [12]. This method scales roughly linearly with the number of pixels in the jigsaw and hence becomes prohibitively expensive for learning jigsaws of size greater than […] pixels.

Indeed, each label map $L_n$ is the size of its underlying image $I_n$ (say $h_n \times w_n$), which gives $\sum_n h_n w_n$ labels to infer: very many! Especially considering that, if the jigsaw is of size $J_h \times J_w$, each of these labels can take $J_h \times J_w$ values. Let us consider a small example: if we have a […] image ([…] pixels in total), and a […] jigsaw ($10^4$ pixels in total), it means that we have to store the likelihoods of each of the […] label map entries $l_i$ taking any of the $10^4$ possible values, i.e. […] values to store!

In Matlab, using the regular double precision (8 bytes for every entry), that requires 4.8GB of memory. Even using single precision, that requires 2.4GB.
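The arithmetic behind these memory figures can be checked with a short sketch. The only quantities taken from the text are the $10^4$ jigsaw pixels, the 8 and 4 bytes per entry, and the quoted 4.8GB; the number of stored entries and the number of image pixels are worked out backwards from those figures, under the assumption made here that a gigabyte means $10^9$ bytes. The function name is hypothetical.

```python
def dense_message_bytes(n_image_pixels, n_jigsaw_pixels, bytes_per_entry):
    """Bytes needed to store one likelihood per (image pixel, jigsaw location) pair."""
    return n_image_pixels * n_jigsaw_pixels * bytes_per_entry


GB = 10**9                  # assumption: decimal gigabytes
n_jigsaw_pixels = 10**4     # stated in the text

# Work backwards from the quoted 4.8 GB in double precision (8 bytes per entry).
implied_entries = (48 * GB // 10) // 8
implied_image_pixels = implied_entries // n_jigsaw_pixels

double_gb = dense_message_bytes(implied_image_pixels, n_jigsaw_pixels, 8) / GB
single_gb = dense_message_bytes(implied_image_pixels, n_jigsaw_pixels, 4) / GB
```

Under these assumptions the quoted figures correspond to $6 \times 10^8$ stored values, i.e. an image of about $6 \times 10^4$ pixels, and halving the precision halves the footprint without changing the fact that the cost grows with the product of image size and jigsaw size.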

To overcome this bottleneck, we propose using a variant of belief propagation (BP) which exploits the fact that many of the messages required during BP can be sparsely represented.

4.3.2 Sparse belief propagation

Figure 4.8 shows the relationship between the likelihood and the structure of the message. Since the likelihood is a mixture of a Gaussian and a uniform, it is effectively constant for many choices of $l_i$, so that many entries in the message have (almost) the same value.

Figure 4.8: Sparse structure of the messages. As the likelihood is a mixture of a Gaussian and a uniform, the message has the same value in many of its entries. Hence, the message can be accurately represented by a sparse vector.

Formally speaking, our message has the form

\[ p(l_i \mid J, I_i) \propto p(I_i \mid J, l_i) = p_0 + \sum_{s \in S} p_s \, \delta(l_i = s) \]

where $p_0$ is the value of the uniform distribution, $S$ is the set of locations in the jigsaw for which $p(I_i \mid J, s)$ is a peak, and $p_s = p(I_i \mid J, s) - p_0$. We now have the same notation as in section 4.1.2, without even approximating our true distribution. Hence, the message can be accurately represented (to machine precision) by a constant vector with a few peaks, without affecting the nature of the message. The message $p(l_i)$ is then represented as a sorted list of the states in $S$ and a corresponding list of the values $\log p_s$. To optimise the jigsaw label maps, we apply this notation to max-product BP, and we

use sparse messages and beliefs throughout. The messages are represented in log-form and, since max-product BP is invariant to message normalisation, each message is normalised so that $\log p_0 = 0$. Since this process does not affect the original message, we can obtain the same results as belief propagation with full messages (full BP) whilst achieving significant memory and time savings (see section 4.3.4). However, if we desire additional improvements in efficiency, we can incorporate longer range image information to minimise the risk of incorrectly pruning message states. This is achieved by exploiting bottom-up information in a hybrid approach.

4.3.3 Hybrid belief propagation

As we have seen in figure 4.8, the likelihood function $p(I_i \mid J, l_i)$ defines a somewhat sparse message over the jigsaw locations $l_i$ by taking into account the appearance of the single pixel $I_i$. For example, if pixel $I_i$ is purple, jigsaw locations whose colours are dissimilar to purple will have approximately the same likelihood, given by the $p_0$ term in the sparse message. A natural approach to increase the sparseness of the messages is to incorporate longer range information. We can motivate this further using our purple pixel example and figure 4.9.

Figure 4.9: The use of local evidence to favour mappings. The purple pixel in the image (left) could map to all the purple pixels in the jigsaw (right), as shown with the dashed lines. However, in this case, only one mapping is supported by local evidence, represented with a plain line. The same holds for the red pixel.
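The sparse message representation of section 4.3.2 can be sketched as follows. This is one possible reading, not the implementation used in this work: the sketch stores, for each peak, the log of the likelihood divided by $p_0$, so that the background entries are exactly 0 after normalisation, and it assumes intensities in $[0, 1]$ so that the uniform density is 1. The function names, the toy jigsaw means and the precision value are all hypothetical.

```python
import math

MIX = 0.9  # mixing weight pi, as chosen in the experiments


def mixture_likelihood(intensity, mu, lam, mix=MIX):
    """Gaussian-uniform mixture of equation (4.3) for one pixel and one jigsaw location."""
    gauss = math.sqrt(lam / (2.0 * math.pi)) * math.exp(-0.5 * lam * (intensity - mu) ** 2)
    return mix * gauss + (1.0 - mix) * 1.0  # uniform density of 1 on [0, 1] (assumption)


def to_sparse_log_message(likelihoods, p0):
    """Keep only the locations that rise above the uniform floor p0.

    Returns the sorted peak locations and the log of each peak after
    normalising the message so that the floor has log value 0.
    """
    states, log_values = [], []
    for s, value in enumerate(likelihoods):
        if value > p0:
            states.append(s)
            log_values.append(math.log(value / p0))
    return states, log_values


def to_dense_log_message(states, log_values, n_states):
    """Expand a sparse message back to one entry per jigsaw location."""
    dense = [0.0] * n_states
    for s, v in zip(states, log_values):
        dense[s] = v
    return dense


# Toy jigsaw with four locations; the pixel intensity is 0.5.
jigsaw_means = [0.50, 0.90, 0.10, 0.51]
precision = 1e4
p0 = (1.0 - MIX) * 1.0
likelihoods = [mixture_likelihood(0.5, mu, precision) for mu in jigsaw_means]
states, log_values = to_sparse_log_message(likelihoods, p0)
dense = to_dense_log_message(states, log_values, len(jigsaw_means))
```

In this toy case the Gaussian term underflows to zero for the two locations whose mean is far from the pixel intensity, so their likelihood equals the floor to machine precision and only the two similar locations need to be stored.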

The dashed purple lines show that our purple pixel can effectively be mapped to all the purple pixels in the jigsaw. However, looking around the image pixel, we find blue, red and green pixels. Only one purple pixel in the jigsaw has such neighbours (mapping shown with a thick plain line), therefore we want to favour this location.

Similarly, a purple pixel close to a blue pixel and a green pixel in the observed image is very likely to be mapped to a different location from a purple pixel close to a white pixel, as shown in figure 4.10. Therefore, the two purple pixels will each have a different and smaller set of plausible mappings into the jigsaw if the neighbourhood information is taken into account. These savings in the length of the messages should scale up drastically with the size of the jigsaw.

Figure 4.10: The use of local evidence to split similar pixels. Both purple pixels in the image (left) could map to all the purple pixels in the jigsaw (right), as shown with the dashed lines. However, in this case, a different mapping for each is favoured by local evidence, represented with a plain line.

Thus, by taking into account long-range information, we can narrow the search space, and increase the sparseness of the messages that are propagated during belief propagation. Rather than construct a heuristic to achieve this, we use a classifier to learn the relationship between the image patch around a pixel and the jigsaw location that pixel gets mapped to. As we will see, such a classifier is able to use features of the image patch around each pixel to achieve much more efficient inference, with minimal loss in accuracy.

We wish to train a classifier $T$ to approximate the conditional probability $p(l_i \mid I, J)$ in our

generative model. This is done by using a set of training images for which the corresponding label maps $L$ have already been computed. As shown in figure 4.11, the classifier learns to predict the jigsaw location $l_i$ for each pixel $I_i$ of the training images given its surrounding pixels. Hence, it learns a (local) approximation to $p(l_i \mid I, J)$, which we will denote $p(l_i \mid I, T)$.

Figure 4.11: Role of the classifier.

The classifier $T$ needs to be efficient both to train and to apply. Hence, we follow [55] and use a decision tree classifier trained with an entropy loss criterion, and with features involving differences of intensity values of pixels in particular relative locations to the pixel being classified. After learning the structure of the decision tree, we compute $p(l_i \mid I, T)$ by finding the histogram of jigsaw locations for those pixels which were assigned to each leaf node.

We use this bottom-up prediction to sparsify the generative likelihood by removing delta functions at locations with zero counts in $p(l_i \mid I, T)$, as seen in figure 4.12. In other words, the discriminative prediction is used to mask the generative likelihood function. Hence we will often refer to the prediction returned by the classifier as the mask. For the images that are used to train the classifier, this construction method ensures that their pixels $I_i$ keep their real likelihood for their current location $l_i$, so it is reasonable to think that it will give good solutions for these images, and decent solutions for test images (images used to train $J$ but not $T$) that have similar local appearance. The quality of the approximation will depend on the generalisation capabilities of the classifier. The idea of using bottom-up predictions to speed up inference can also be found in [80].

Here, it is interesting to note the interplay between the discriminative and generative models. The bottom-up classifier tries to mimic the generative jigsaw, while inference in the top-down generative jigsaw is guided and made efficient by the classifier.
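The masking step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the sparse message is taken to be a pair of parallel lists (peak locations and their log values), the classifier output for a pixel is taken to be the histogram of jigsaw locations stored at the leaf that pixel falls into, and the name `mask_sparse_message` is hypothetical.

```python
def mask_sparse_message(states, log_values, leaf_counts):
    """Drop the peaks of a sparse generative message that the classifier never observed.

    states      : sorted jigsaw locations holding a peak in the generative likelihood
    log_values  : log value of each peak, in the same order as `states`
    leaf_counts : histogram over jigsaw locations at the decision-tree leaf reached
                  by this pixel; a zero count removes the corresponding delta function
    """
    kept_states, kept_values = [], []
    for s, v in zip(states, log_values):
        if leaf_counts[s] > 0:
            kept_states.append(s)
            kept_values.append(v)
    return kept_states, kept_values


# Toy jigsaw with ten locations: the generative likelihood has peaks at 2, 5 and 7,
# while the leaf histogram only supports locations 5 and 8.
leaf_counts = [0, 0, 0, 0, 0, 12, 0, 0, 3, 0]
hybrid_states, hybrid_values = mask_sparse_message([2, 5, 7], [1.0, 2.0, 3.0], leaf_counts)
```

Only the peak at location 5 survives, and it keeps its generative value unchanged. Location 8 is supported by the classifier but is not a peak of the generative likelihood, so it stays at the background level: the mask can only remove entries from the message, never add any.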

Figure 4.12: Construction of the hybrid likelihood. Top-left: the generative likelihood $p(l_i \mid I_i, J)$ for a particular pixel. Bottom-left: the discriminative probability $p(l_i \mid I, T)$ for the label $l_i$ given the image and the decision tree. This is trained to approximate $p(l_i \mid I, J)$ under the generative model. Right: hybrid likelihood $p(l_i \mid I_i, J, T)$, which is equal to $p(l_i \mid I_i, J)$ masked by the discriminative likelihood.

4.3.4 Efficiency of Sparse BP and Hybrid BP

We are now interested in exploring the efficiency of Sparse BP and Hybrid BP when used to infer the label maps for a pre-learnt jigsaw. For this, we make use of 30 images of scenes containing buildings, taken from the Microsoft Research Cambridge object recognition data-set [73]. We use 20 images for training, and keep 10 for testing. Figure 4.13a shows examples from the test set. We then learn the jigsaw shown in figure 4.14a from the 20 training images, using the learning method of [46]. The jigsaw size is relatively small since it is learnt using existing methods.

Given this jigsaw, we apply Hybrid BP with decision tree sizes from 1 to around 1800 nodes to learn the label maps for the test images. Bear in mind that Hybrid BP is equivalent to Sparse BP when the tree size is 1, since there is no classifier (i.e. no approximation). In figure 4.14b we show the memory requirements of Hybrid BP as a percentage of the memory needed for full BP, for different decision tree sizes. Sparse BP is able to infer the label maps

Figure 4.13: Segmentation on building images. (a) Example building images from the MSR Cambridge data-set. (b) Corresponding inferred segmentations.

[Figure 4.14, four panels. (a) Jigsaw image. (b) Plot of % memory use against tree size (number of nodes), for sparse BP and hybrid BP. (c) Plot of log probability of test images (×10^6) against % memory use, for sparse BP and for hybrid BP with 20, 15, 10, 7, 5, 3, 2 and 1 training images. (d) Class map with legend: unknown, tree, sky, building, grass, road, water.]

Figure 4.14: Various plots. (a) The jigsaw used to test Hybrid BP. (b) Memory usage of Sparse BP and of Hybrid BP for varying sizes of decision tree. Sparse BP can achieve results equivalent to standard BP using only 46% of the memory. Hybrid BP can further reduce the memory requirements at the cost of a reduction in accuracy. (c) Accuracy of inference for Hybrid BP against memory use for different sizes of training set. High accuracy can be achieved on a set of test images with as little as 10-15% of the memory needed for standard BP. (d) Class map showing the most likely object class at each jigsaw location.