
Leveraging over Prior Knowledge for Online Learning of Visual Categories across Robots

MOHSEN KABOLI

Master of Science Thesis
Stockholm, Sweden 2012

Leveraging over Prior Knowledge for Online Learning of Visual Categories across Robots

MOHSEN KABOLI

DD221X, Master's Thesis in Computer Science (30 ECTS credits)
Master Programme in Wireless Systems, 120 cr
Royal Institute of Technology, year 2012
Supervisor at CSC was Danica Kragic
Examiner was Danica Kragic

TRITA-CSC-E 2012:053
ISRN-KTH/CSC/E--12/053--SE
ISSN

Royal Institute of Technology
School of Computer Science and Communication
KTH CSC, Stockholm, Sweden
URL:

Contents

1 Introduction
   1.1 Introduction
   1.2 Related Works
   1.3 Contribution and Organization of this Thesis

2 Background
   2.1 Knowledge Transfer
       Motivation
       Definition of Knowledge Transfer and Some Notation
       Related and Unrelated Domain
       Important Issues in Knowledge Transfer Techniques
       Categorization of Knowledge Transfer
       Inductive Knowledge Transfer
       Transductive Knowledge Transfer
       Unsupervised Knowledge Transfer
       Negative Transfer
   2.2 Overview on Online Learning
       Motivation
       Passive Aggressive Algorithm

3 Batch Transfer Learning and Online Transfer Learning
   3.1 Batch Transfer Learning
       Motivation
       Least Square Support Vector Machine (LS-SVM)
       Learning a new object category from many samples
       Learning a new object category from few samples
       Multi Prior Transfer Learning
   3.2 OTL: A Framework of Online Transfer Learning
       Introduction
       OTL Algorithm
       Proposition
       Theoretical Analysis

4 Proposed Method
   4.1 Hybrid Transfer Learning
       TRansfer initializes Online Learning: TROL Method
       Theorem
       Augmentation trick Method: TROL+ Method
       Modified OTL: M-OTL

5 Experiments and Results
   5.1 Experiments
       Baselines
       Initialization of Experiment
       Single Source or Single Prior Knowledge
       Multiple Source or Multi Prior Knowledge
       Value of Weights
       Full Caltech 256 Dataset
       Discussion

6 Conclusion and Future Work
   6.1 Conclusion
   6.2 Future Work

Bibliography

Chapter 1
Introduction

Abstract

Open-ended learning is a dynamic process based on the continuous processing of new data, guided by past experience. On one side, it is helpful to take advantage of prior knowledge when only little information on a new task is available (transfer learning). On the other, it is important to continuously update an existing model so as to exploit the new incoming data, especially if their informative content is very different from what is already known (online learning). Until today these two aspects of the learning process have been tackled separately. In this thesis we propose an algorithm that takes the best of both worlds: we consider a sequential learning setting, and we exploit the potential of knowledge transfer with a computationally cheap solution. At the same time, by relying on past experience we boost online learning to predict reliably on future problems. A theoretical analysis, coupled with extensive experiments, shows that our approach performs well both in terms of the number of online training mistakes and in terms of performance on separate test sets.

1.1 Introduction

Machine learning algorithms predict on future data with the help of statistical models learned from collected labeled or unlabeled training sets [8], [19], [29]. In semi-supervised classification [2], [16], [17], [32], the labeled data alone are too few to build a good classification model; therefore, by using a large amount of unlabeled data together with a small amount of labeled data it is possible to find good learning models for the classifiers. Much research on classification has been done under the assumption that the distributions of the labeled and unlabeled data are the same. Knowledge transfer, in contrast, allows the domains, tasks, and distributions used in training and testing to be different. Knowledge transfer makes it possible to exploit prior knowledge when learning a new class, which reduces the need for annotated training samples. Many works address the issue of what to transfer, for instance samples of data [1], feature representations [4], or model parameters [12]; some focus on how to transfer [12], e.g. via boosting [30] or SVMs [7]; while others concentrate on how to avoid negative transfer, evaluating when and how much to transfer, or on methods to measure task relatedness [10]. In the real world we observe many examples of transfer learning: for instance, learning to recognize a bicycle might help to recognize a motorbike or a car, rather than a cat or a dog. The fundamental motivation for knowledge transfer in the field of machine learning was discussed in the NIPS-95 workshop on Learning to Learn, which focused on the need for lifelong machine learning methods that retain and reuse previously learned knowledge [22].

The main goal of all research in visual recognition is to enable vision-based artificial systems to operate autonomously in the real world. However, even the best system we can currently engineer is bound to fail whenever the setting is not heavily constrained. This is because the real world is generally too nuanced, too complicated and too unpredictable to be summarized within a limited set of specifications. There will inevitably be novel situations, and the system will always have gaps, conflicts or ambiguities in its own knowledge and capabilities. This calls for algorithms able to support open-ended learning of visual classes.

The open-ended learning issue, i.e. the ability to learn a newly detected class continuously over time, has typically been addressed in a fragmented fashion in the literature. A first component is transfer learning, i.e. the ability to leverage over prior knowledge when learning a new class, especially in the presence of few training data [22]. A second component is the ability to update the learned visual class continuously, as new samples arrive sequentially. The dominant approach in the literature here is online learning: predictions are made on the fly and the model is progressively updated at each step, on the basis of the given true label. An attractive feature of this family of algorithms is that they aim at minimizing the total number of mistakes on the incoming samples (mistake bound).

In this thesis we propose to merge these two components, using the prior knowledge sources to initialize the online learning process on a new target task through transfer learning. This has two main advantages: (1) by using a principled transfer learning process we can study the relation between the old sources and the new target. Within this

framework, few samples might be sufficient to indicate in which part of the original space the correct solution (the best in terms of generalization capacity) should be sought. (2) We show theoretically that a good initialization of the online learning process produces a tighter mistake bound compared to previous work [31], while empirically improving the recognition performance on an unseen test set. Globally, an expensive transfer learning approach is used only at the beginning, thereby limiting the computational burden; then a fast and efficient online approach is applied. We choose the Passive Aggressive online learning algorithm [5], and we show how to initialize it in two different fashions with a state-of-the-art transfer learning method [25]. For each of the two versions of the algorithm we derive the relative mistake bound, which provides us with a deeper understanding of the methods. Experiments on the object categorization domain show the potential of our approach.

1.2 Related Works

To the best of our knowledge, the approach most similar to ours was introduced by Zhao and Hoi in OTL (A Framework of Online Transfer Learning) [31]. OTL is based on ensemble learning: it makes a prediction with an online learning function, trained with the PA-algorithm [5] on the data of the target domain, and combines it with the old prediction function learned on the prior knowledge. The weights for the combination of the prediction functions are adjusted dynamically on the basis of a squared loss that evaluates the difference between the current prediction and the correct label of each new incoming sample. The OTL algorithm cannot exploit multiple sources of prior knowledge, and the authors only provided a theoretical analysis for the single-prior-knowledge model, which demonstrates the existence of a mistake bound for their algorithm. In contrast, our proposed methods are able to transfer from multiple sources. Moreover, we will show that the bound on mistakes when exploiting either single or multiple prior knowledge sources does not behave worse than that of OTL. In addition, to be able to compare our results in the case of multi-source transfer learning, we modified the OTL algorithm so that it can transfer from multiple sources of prior knowledge (4.1.4).

1.3 Contribution and Organization of this Thesis

In this thesis we propose two novel online transfer learning methods that aim to transfer knowledge from some source domains to an online learning task, with different classes in the target and source data. We call our suggested methods TROL and TROL+. In the TROL technique (4.1.1) we propose initializing the online learning algorithm with the help of the Multi-KT method [25], which provides a model for the new target problem on the basis of very few training samples by exploiting a reliable combination of sources. In the second method, TROL+ (4.1.3), we suggest re-weighting the old knowledge during the online process, integrating prior and new knowledge together. We show that it is possible to employ a simple augmentation trick to obtain the same starting condition as TROL together with a progressive update of the old and new knowledge weights over time. One important issue in our research, which raises a challenge for knowledge transfer, is the concept drift problem: the distribution of the new samples to be predicted changes over time during the learning process. Moreover, to compare our techniques with the OTL method, the only closely related work, in the case of multiple sources of prior knowledge, we needed to modify OTL to use multiple sources, as the original algorithm cannot. We call the modified version of the OTL algorithm M-OTL (4.1.4). We also show the mistake bounds of the proposed algorithms, and empirically evaluate our methods against several available baselines.

This thesis is organized as follows: Chapter 1 gives an introduction to our research followed by a description of related works; Chapter 2 introduces definitions of transfer learning and reviews online learning methods; in Chapter 3 we discuss batch transfer learning together with online transfer learning algorithms; Chapter 4 presents our proposed methods; Chapter 5 gives experimental results; and Chapter 6 concludes the thesis.

Chapter 2
Background

2.1 Knowledge Transfer

Motivation

Most traditional machine learning algorithms are built on training and testing samples drawn from the same feature space and the same distribution. On the contrary, knowledge transfer makes a new target learning algorithm able to exploit pre-trained knowledge from previous tasks with a different distribution and feature space. In other words, knowledge transfer allows the use of prior knowledge when learning a new class, which in both supervised and semi-supervised learning reduces the need for labeled training data. Figure 2.1 illustrates the difference between the traditional learning process and knowledge transfer.

Figure 2.1. Different learning processes between traditional learning and transfer learning [22].

Definition of Knowledge Transfer and Some Notation

A domain D has two elements: a feature space X, containing samples x = {x_1, ..., x_n} ∈ X, and a marginal probability distribution P(x). Two domains are different if they differ either in the feature space (for instance, different feature descriptors in an object recognition problem) or in the marginal probability distribution (for example, different classes of objects in an object categorization task). A task T in knowledge transfer consists of a label space Y and an objective predictive function f(·). Two tasks are different if they differ either in the objective predictive function f(·) (i.e. in the conditional probability distribution), for instance when the source and target objects in an object recognition problem are very unbalanced in terms of the user-defined classes, or in the label space Y, for instance when the source domain has binary classes whereas the target domain is multi-class [22].

Related and Unrelated Domain

The source domains and target domains are related when, implicitly or explicitly, there exists some relationship between their feature spaces; otherwise they are unrelated [22].

Important Issues in Knowledge Transfer Techniques

In the field of knowledge transfer, researchers are concerned with finding proper and clear answers to the following questions: what to transfer, how to transfer, and when to transfer [26]. What to transfer means discovering which part of the knowledge improves performance on the target and is therefore worth transferring. Developing the learning algorithm that transfers the knowledge corresponds to how to transfer. When to transfer asks in which situations transferring knowledge helps to improve the performance of learning on the target, and when transferring knowledge hurts performance through negative transfer, due to unrelated source and target domains [22].

Categorization of Knowledge Transfer

Transfer learning is classified into three different settings: (1) inductive transfer learning, (2) transductive transfer learning, and (3) unsupervised transfer learning. Moreover, based on what is transferred, each approach is categorized into four contexts: (1) the instance transfer approach, (2) the feature representation transfer approach, (3) the parameter transfer approach, and (4) the relational knowledge transfer approach. In the following we explain the different approaches to transfer learning [22].

Inductive Knowledge Transfer

Inductive transfer learning tries to improve the learning of the target predictive function f(·) in D_T by exploiting the knowledge in D_S and T_S, where T_S ≠ T_T [22].

Instance Transfer Approach

In this approach it is not possible to use the source domain data directly in the target domain, but there are certain parts of the data that can be reused together with a few annotated target samples [22].

Feature Representation Transfer Approach

The goal of the feature representation transfer approach is to find suitable feature representations that reduce domain divergence and classification model error. The procedure for discovering the appropriate feature representation differs for the various types of source domain data [22].

Parameter Transfer Approach

In this approach the individual models for related tasks share some parameters or prior distributions of hyper-parameters. The majority of the approaches are designed to work under multi-task learning by simply changing the regularization structure [22].

Relational Knowledge Transfer Approach

This technique of knowledge transfer differs from the previous approaches in that the data are not independent and identically distributed (i.i.d.) and can be described by multiple relations; that is, the data in each domain are not required to be i.i.d. [22].

Transductive Knowledge Transfer

Transductive transfer learning attempts to improve the learning of the target predictive function f(·) in D_T by employing the knowledge in D_S and T_S, where T_S = T_T and D_S ≠ D_T. Moreover, some unlabeled target domain data must be available at training time [22].

Instance Transfer Approach

The purpose of this idea is to learn an optimal model for the target domain by reducing the expected risk. Since there are no annotated training samples in the target domain, the model must be learned from the source domain data alone [22].

Feature Representation Transfer Approach

Much of the research in this approach has been done under unsupervised learning frameworks [22].

Unsupervised Knowledge Transfer

Unsupervised transfer learning tries to improve the learning of the target function f(·) in D_T by employing the knowledge in D_S and T_S, where T_S ≠ T_T, and labels of the training samples are not available in either the source or the target domain. In unsupervised transfer learning, the predicted labels are latent variables [22].

Feature Representation Transfer Approach

This perspective of knowledge transfer attempts to cluster a small set of unlabeled data in the target domain by exploiting a large amount of unlabeled data in the source domain. These techniques are studied in [28]. Figure 2.2 summarizes the different settings of transfer learning: inductive transfer learning (labeled data available in the target domain, covering self-taught learning and multi-task learning), transductive transfer learning (labeled data available only in the source domain, covering domain adaptation and sample selection bias / covariate shift), and unsupervised transfer learning (no labeled data in either domain).

Figure 2.2. Overview of the different settings of transfer learning [22].

Negative Transfer

Most approaches to transfer learning assume that transferring knowledge across domains is always beneficial. However, in some cases, when two tasks are too dissimilar, brute-force transfer may even hurt the performance on the target task; this is called negative transfer [21].

2.2 Overview on Online Learning

Motivation

A common trend in machine learning is the use of huge amounts of data to achieve an increase in classification performance. Cognitive systems, instead, learn continuously from experience, updating their models of the environment. This learning strategy is the main reason why cognitive systems achieve a robust, yet adaptable, ability to respond to new stimuli. Moreover, many real-world problems are fundamentally sequential, and the information is not all accessible at the same time; sometimes the concept to be learned itself changes over time. For instance, an autonomous robot needs to learn continuously from its surroundings, to adapt to an ever-changing environment. Such scenarios require learning algorithms capable of updating their internal representation, in contrast to traditional batch learning algorithms, where updating the solution is often possible only via a thorough re-training on a training set composed of both the existing and the new samples. Algorithms developed in the online learning framework are instead designed to be updated after each sample is received. Therefore, online updating is a vital component of learning algorithms used in artificially intelligent systems. From a computational complexity perspective, online learning algorithms also have a much lower cost than batch learning methods. Recently, many strong online learning algorithms have been proposed in the machine learning field [23], with the ability to employ different regularizers to minimize the objective loss function effectively. With the aid of these methods we can devise efficient online learning algorithms that solve complex batch learning problems. In the following we look at the problem from the perspective of the passive-aggressive online learning algorithm and of optimization (2.2.2).

Passive Aggressive Algorithm

The passive-aggressive (PA) algorithm was proposed by Crammer and colleagues in 2006 [5]. The PA-algorithm is a margin-based online learning algorithm, suited to different classification tasks, which observes instances in a sequential manner. Online algorithms are usually simple to implement and their analysis often provides tight bounds on their performance. Although the PA-algorithm uses hypotheses from the set of linear predictors, by using Mercer kernels one can obtain highly non-linear predictors while keeping the formal properties and simplicity of linear prediction. In binary classification each training sample is represented by a vector, and prediction is based on a hyperplane which divides the feature space into two half-spaces. The

update of the algorithm is carried out by solving a constrained optimization problem: the new classifier is kept as close as possible to the current one while achieving at least a unit margin on the most recent instance. The general structure of the PA-algorithm can be viewed as finding a support vector solution based on a single training sample, while replacing the norm regularizer of the SVM with a proximity constraint to the current classifier. This approach has two advantages: (1) it yields a closed-form solution for the next classifier, and (2) it provides a unified analysis of the cumulative loss for different online classification settings.

Problem Setting

Online binary classification proceeds in a sequence of rounds. On each round the algorithm observes one sample and predicts its label with the aid of the hypothesis learned from the previously observed examples. The PA-algorithm makes a correct prediction if the corresponding margin on that round is greater than one; otherwise it suffers an instantaneous loss, defined by the hinge-loss function. In addition, the classifier uses the newly obtained instance-label pair to improve its predictions for the following rounds. The confidence of the prediction on each round is given by w·x. The PA-algorithm learns the weight vector w incrementally, observing a new training sample and updating the weight vector at each time step t. In the PA-algorithm the hinge loss function is used to measure the misclassification [5]:

l(w; (x, y)) = max(0, 1 − y(w·x))    (2.1)

Binary Classification Algorithms

How the weight vector is initialized and updated is very important, both for obtaining a concrete algorithm and for defining the update rule that modifies the weight vector w_t at the end of each round t. In this section we present three different update rules for binary classification, which we call PA, PA-I, and PA-II. In all of them the weight vector w_1 is initialized to (0, ..., 0), and on round t the new weight vector w_{t+1} is the solution of the following constrained optimization problem [5]:

w_{t+1} = argmin_{w ∈ R^d} (1/2)||w − w_t||²    s.t.    l(w; (x_t, y_t)) = 0    (2.2)

Geometrically, w_{t+1} is the projection of w_t onto the half-space of vectors attaining zero hinge loss on the current instance. If the hinge loss is zero then

w_{t+1} = w_t, and the loss is also zero, l_t = 0 (Fig. 2.3). The solution to the optimization problem in Eq. (2.2) has a simple closed form:

w_{t+1} = w_t + τ_t y_t x_t    where    τ_t = l_t / ||x_t||²    (PA)    (2.3)

If l_t = 0 then w_t itself satisfies the constraint in Eq. (2.2) and is the optimal solution of the optimization problem. If l_t > 0, the Lagrangian of Eq. (2.2) is:

L(w, τ) = (1/2)||w − w_t||² + τ(1 − y_t(w·x_t))    (2.4)

where τ ≥ 0 is a Lagrange multiplier. The optimization problem in Eq. (2.2) has a convex objective function and a single affine constraint, which is enough for Slater's condition to hold; the optimum is therefore characterized by the Karush-Kuhn-Tucker conditions (Boyd and Vandenberghe, 2004). Setting the partial derivatives of L with respect to the elements of w to zero, we get [5]:

0 = ∇_w L(w, τ) = w − w_t − τ y_t x_t    ⟹    w = w_t + τ y_t x_t    (2.5)

Substituting this result back into Eq. (2.4) we get:

L(τ) = −(1/2) τ² ||x_t||² + τ(1 − y_t(w_t·x_t))    (2.6)

Taking the derivative of L(τ) with respect to τ and setting it to zero gives:

0 = ∂L(τ)/∂τ = −τ ||x_t||² + (1 − y_t(w_t·x_t))    ⟹    τ = (1 − y_t(w_t·x_t)) / ||x_t||²    (2.7)

Since l_t > 0 we have l_t = 1 − y_t(w_t·x_t); therefore we can write a uniform update covering both the case l_t = 0 and the case l_t > 0 simply by setting τ_t = l_t / ||x_t||².
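To make the closed-form update of Eq. (2.3) concrete, the following is a minimal sketch of a single PA round for a linear classifier; the function name and the toy calls are illustrative and not part of the thesis.

```python
import numpy as np

def pa_round(w, x, y):
    """One round of the basic Passive-Aggressive update (Eqs. 2.1-2.3).

    w : current weight vector, x : instance, y : label in {-1, +1}.
    Returns the updated weight vector w_{t+1}.
    """
    loss = max(0.0, 1.0 - y * np.dot(w, x))   # hinge loss, Eq. (2.1)
    if loss == 0.0:
        return w                              # passive: constraint already satisfied
    tau = loss / np.dot(x, x)                 # closed-form step size, Eq. (2.3)
    return w + tau * y * x                    # aggressive: project onto zero-loss set

# toy usage: two rounds on 2-D data
w = np.zeros(2)
w = pa_round(w, np.array([1.0, 0.5]), +1)
w = pa_round(w, np.array([0.2, -1.0]), -1)
```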

In the PA-algorithm the weight vector is updated by aggressively enforcing the constraint on the current sample, which may have undesirable consequences. In other words, since the PA-algorithm updates the weight vector whenever a new sample is observed, any misclassification due to label noise in the data set may cause a severe change in the weight vector, pushing the hyperplane in the wrong direction and hence causing several prediction mistakes on subsequent rounds. The PA-algorithm introduces two techniques to address this problem. The original idea comes from the soft-margin classifier in support vector machines (Vapnik, 1998): a non-negative slack variable ξ is introduced into the objective function of Eq. (2.2). In the first method the objective function is scaled linearly with ξ:

w_{t+1} = argmin_{w ∈ R^d} (1/2)||w − w_t||² + Cξ    s.t.    l(w; (x_t, y_t)) ≤ ξ  and  ξ ≥ 0    (2.8)

The parameter C in (2.8) is a positive value which governs the influence of the slack term; a larger value of C implies a more aggressive update. This variant is known as PA-I [5]. In the second technique the objective function is scaled quadratically with ξ; since ξ² is always non-negative, there is no need for the constraint ξ ≥ 0. This update method is known as PA-II [5]:

w_{t+1} = argmin_{w ∈ R^d} (1/2)||w − w_t||² + Cξ²    s.t.    l(w; (x_t, y_t)) ≤ ξ    (2.9)

The solutions to the optimization problems in Eq. (2.8) and Eq. (2.9) have a simple unified closed form for all update techniques:

w_{t+1} = w_t + τ_t y_t x_t    (2.10)

where:

τ_t = l_t / ||x_t||²                    (PA)
τ_t = min{ C, l_t / ||x_t||² }          (PA-I)
τ_t = l_t / ( ||x_t||² + 1/(2C) )       (PA-II)    (2.11)

The discussion so far was limited to linear predictors of the form sign(w·x). By using Mercer kernels we can generalize the PA-algorithm to non-linear classification. For all three PA updates we can write [5]:

w_t = Σ_{i=1}^{t-1} τ_i y_i x_i    (2.12)

and therefore:

w_t·x_t = Σ_{i=1}^{t-1} τ_i y_i (x_i·x_t)    (2.13)

The inner product in Eq. (2.13) can be replaced with a general Mercer kernel K(x_i, x_t):

w_t·x_t = Σ_{i=1}^{t-1} τ_i y_i K(x_i, x_t)    (2.14)

Figure 2.3. An illustration of the update: w_{t+1} is found by projecting the current vector w_t onto the set of vectors attaining zero loss. This set is a stripe in the case of regression, a half-space for classification, and a ball for uniclass [18].

Analysis of the PA-Algorithm

In the PA-algorithm the cumulative squared hinge loss upper bounds the number of prediction mistakes. The bounds for the PA-algorithm prove that on any sequence of training samples the algorithm cannot do much worse than the best fixed predictor. Since we use the PA-I update in our work, in the following we report only the mistake bound for PA-I [5].
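Before stating the mistake bound, here is a minimal sketch of how the kernel expansion of Eqs. (2.12)-(2.14) can be used in practice: each update stores the triple (τ_i, y_i, x_i) and predictions are computed through kernel evaluations. The class name, the Gaussian kernel choice and the PA-I step size are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / sigma ** 2)

class KernelPAI:
    """Kernelized PA-I: w_t.x_t = sum_i tau_i y_i K(x_i, x_t), Eq. (2.14)."""

    def __init__(self, C=1.0, kernel=gaussian_kernel):
        self.C, self.kernel = C, kernel
        self.support = []                      # stored triples (tau_i, y_i, x_i)

    def score(self, x):
        return sum(tau * y * self.kernel(xi, x) for tau, y, xi in self.support)

    def learn(self, x, y):
        s = self.score(x)                      # prediction made before the update
        loss = max(0.0, 1.0 - y * s)           # hinge loss on the current round
        if loss > 0.0:
            tau = min(self.C, loss / self.kernel(x, x))   # PA-I step, Eq. (2.11)
            self.support.append((tau, y, x))
        return 1.0 if s >= 0 else -1.0
```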

Theorem: Let (x_1, y_1), ..., (x_T, y_T) be a sequence of samples where x_t ∈ R^n, y_t ∈ {+1, −1} and ||x_t|| ≤ R for all t. Then, for any vector u ∈ R^n, the number of prediction mistakes made by PA-I on this sequence of examples is bounded from above by [5]:

max{R², 1/C} ( ||u||² + 2C Σ_{t=1}^{T} l(u; (x_t, y_t)) )    (2.15)

where C is the aggressiveness parameter provided to PA-I.
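As an informal sanity check of the bound in Eq. (2.15), the short self-contained simulation below runs a linear PA-I learner on a linearly separable toy stream and compares the number of online mistakes with the value of the bound evaluated at a separating vector u; all data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
u = np.array([2.0, -1.0])                      # a perfect separator for the toy data
C, R, T = 1.0, 1.0, 500
w, mistakes, cum_loss_u = np.zeros(2), 0, 0.0

for _ in range(T):
    x = rng.uniform(-1.0, 1.0, size=2)
    x /= max(1.0, np.linalg.norm(x))           # enforce ||x_t|| <= R = 1
    y = 1.0 if np.dot(u, x) >= 0 else -1.0
    if np.sign(np.dot(w, x)) != y:             # count online prediction mistakes
        mistakes += 1
    loss = max(0.0, 1.0 - y * np.dot(w, x))    # PA-I update, Eqs. (2.10)-(2.11)
    if loss > 0.0:
        w += min(C, loss / np.dot(x, x)) * y * x
    cum_loss_u += max(0.0, 1.0 - y * np.dot(u, x))

bound = max(R ** 2, 1.0 / C) * (np.dot(u, u) + 2 * C * cum_loss_u)
print(mistakes, "<=", bound)                   # Eq. (2.15) should hold
```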

Chapter 3
Batch Transfer Learning and Online Transfer Learning

3.1 Batch Transfer Learning

Motivation

Suppose we are given n different visual categories, and the task is to distinguish each class from the background. We therefore determine n different decision functions f_i(x) ∈ {1, −1}, with i = 1, ..., n, such that an object x is assigned to the i-th class if and only if f_i(x) = 1. Now we would like to learn a new function f_{n+1} to classify a new, (n+1)-th class in a situation where only one or a few instances, together with some background samples, are available. To obtain f_{n+1} we can either train the classifier only on the available training data, or take advantage of the pre-trained decision functions or model parameters. In other words, suppose the task is to learn the class car from few examples, having already learned the categories motorbike, truck, horse, and dog. The aim is to improve the results by transferring from motorbike and truck rather than from horse or dog, and the expectation is to do better than transferring equally from all known categories, which might generate negative transfer. Generally speaking, knowledge transfer can give three advantages [9] compared to traditional machine learning; see Figure 3.1. The first is a higher start, i.e. the initial performance is higher (one-shot learning); the second is a higher slope, i.e. the performance grows faster; and the last is a higher asymptote, i.e. the final performance is better. This kind of scenario motivates us to design a knowledge transfer algorithm able to find autonomously the best subset of known models from which to transfer. In the following we briefly review LS-SVM theory (3.1.2) and how it can be used in a model adaptation framework [11]; we then review how this approach can be formulated to derive a knowledge transfer algorithm that exploits source knowledge from only one of the n classes [24]; and finally we extend this method to employ all the appropriate old knowledge [25]. Since these techniques use a batch of data samples to find the best learning models, we refer to them as batch transfer learning.

Figure 3.1. Three ways in which transfer might improve learning [26].

Least Square Support Vector Machine (LS-SVM)

Definition

The Least Square Support Vector Machine (LS-SVM) is a kernel-based learning method. The algorithm is interesting because it can be employed as a strong non-linear classifier using only simple mathematical tools. Suppose we have a binary problem with a set of l training samples {x_i, y_i}_{i=1}^{l}, in which x_i ∈ X ⊆ R^d is an input vector describing the i-th sample and y_i ∈ Y = {−1, 1} is the related label. The goal is to design a linear model in a fixed feature

space which correctly classifies an unseen test sample x [6]:

f(x) = w·φ(x) + b    (3.1)

where φ(x) maps the input samples to a high-dimensional feature space Φ : X → F, induced by a kernel function:

K(x, x') = φ(x)·φ(x')    (3.2)

In LS-SVM the model parameters (w, b) are found by solving the following optimization problem [11]:

min_{w,b} (1/2)||w||² + (C/2) Σ_{i=1}^{l} [y_i − w·φ(x_i) − b]²    (3.3)

where C is a regularization parameter governing the bias-variance trade-off [6]. The accuracy of LS-SVM on unseen test data depends on finding optimal values for the hyper-parameters, which in our case are the regularization parameter C and the kernel parameters; the search for these values is called model selection. It can be shown that the optimal w is expressed as w = Σ_{i=1}^{l} α_i φ(x_i). The corresponding primal Lagrangian of this optimization problem gives the following unconstrained minimization problem [6]:

L = (1/2)||w||² + (C/2) Σ_{i=1}^{l} ξ_i² − Σ_{i=1}^{l} α_i { w·φ(x_i) + b + ξ_i − y_i }    (3.4)

where:

α = (α_1, α_2, ..., α_l) ∈ R^l    (3.5)

is the vector of Lagrange multipliers, which can be found by solving the regularized least-squares problem (3.3) with a computational complexity of O(l³) operations. The optimality conditions for problem (3.4) form a linear system that can be written concisely in matrix form as [3]:

[ K + (1/C)I   1 ] [ α ]   [ y ]
[ 1ᵀ           0 ] [ b ] = [ 0 ]    (3.6)

In Eq. (3.6), K denotes the kernel matrix. Let G denote the matrix on the left-hand side of (3.6); the least-squares optimization problem [3] can then be solved simply by inverting G. As previously mentioned, the accuracy of the model on unseen test data depends substantially on selecting good learning parameters (e.g. the kernel parameters and the regularization parameter C), which can be found beforehand by cross validation based on the leave-one-out (LOO) error. The LOO error is an almost unbiased estimator of the classifier generalization error and can be evaluated using quantities already available as a by-product of training the least-squares support vector machine on the whole data set, with only negligible extra computational cost [3][20]. From Eq. (3.6) it is possible to show that:

[α^(-i), b^(-i)]ᵀ = G_(-i)⁻¹ [y_1, ..., y_{i-1}, y_{i+1}, ..., y_l, 0]ᵀ    (3.7)

which gives the dual parameters of the LS-SVM when the i-th sample is excluded during the leave-one-out cross validation procedure. With the aid of the block matrix inversion lemma, the leave-one-out error r_i^(-i) for the i-th sample can be written in closed form [3][6]:

r_i^(-i) = y_i − ŷ_i = α_i / G⁻¹_{ii}    (3.8)

where G_(-i) is the matrix obtained when the i-th sample is omitted from G. Without

explicitly running cross validation experiments, it is therefore possible to express a criterion that estimates the LS-SVM generalization performance: the best learning parameters are those minimizing the following error:

ERR = Σ_{i=1}^{l} Ψ{ y_i r_i^(-i) − 1 }    with    Ψ(z) = exp(γ z)    (3.9)

Learning a new object category from many samples

We would like to learn a new category from a set of annotated training data {x_i}_{i=1,...,m}, with the benefit of exploiting what has been learned so far, i.e. a set of pre-trained models. To constrain the new model to be close to one of the pre-trained models, [11] proposed a technique that is formulated in the LS-SVM classification framework simply by modifying the classical regularization term, leading to the following optimization problem:

min_{w,b} (1/2)||w − βw'||² + (C/2) Σ_{i=1}^{l} ξ_i²    (3.10)
subject to  y_i = w·φ(x_i) + b + ξ_i,   i ∈ {1, ..., l}

where w' is the parameter vector of the pre-trained model and β is a scaling factor governing the degree to which the new model is kept close to the old, pre-trained model:

w = βw' + Σ_{i=1}^{l} α_i φ(x_i)    (3.11)

The optimal solution of the modified objective function is thus the pre-trained model scaled by the parameter β plus a new model built on the new data points. If β equals 0 in (3.10) we recover the original LS-SVM formulation, without any adaptation to the previous knowledge. To find the optimal β, [24] proposed to take advantage of the possibility in LS-SVM of writing the leave-one-out error in closed form; for the modified cost function (3.10) the closed form of the leave-one-out error is:

r_i^(-i) = (α_i / G⁻¹_{ii}) − β (α'_i / G⁻¹_{ii})    (3.12)

where

α'_i = G_(-i)⁻¹ [ŷ_1, ..., ŷ_{i-1}, ŷ_{i+1}, ..., ŷ_l, 0]ᵀ    (3.13)

Here G_(-i) is the matrix obtained when the i-th sample is excluded from G, and ŷ_i = w'·φ(x_i), i.e. ŷ_i is the prediction of the old model on the i-th sample. With this closed form of the LOO error for the modified cost function it is possible to find the best β for each known model; by comparing all the resulting criterion errors, the lowest one identifies the best prior knowledge model to use for adaptation.

Learning a new object category from few samples

Suppose we are given a training set with 1 positive and 10 negative instances and are asked to assess, with the aid of the leave-one-out error, from where to transfer. In such a situation a wrong prediction on any single sample contributes 1/11 of the total error, independently of the sign of its label. To use the criterion error effectively it is essential to be more tolerant on the negative samples, due to their higher number, and strict on the positive sample, which is alone. To cope with the unbalanced contribution of the training samples, it is promising to re-weight the leave-one-out predictions based on the number of positive and negative samples. This gives the leave-one-out cross validation estimate of the Weighted Error Rate (WERR), obtained by editing the criterion error [6][3]:

WERR = Σ_{i=1}^{l} ζ_i Ψ{ y_i r_i^(-i) − 1 }    (3.14)

where

ζ_i = l / (2 l₊)   if y_i = +1
ζ_i = l / (2 l₋)   if y_i = −1    (3.15)

In Eq. (3.15), l₊ and l₋ are the numbers of positive and negative instances, respectively; ζ is a weighting factor which is asymptotically equivalent to re-sampling the data, and the function Ψ is the same as in (3.9). Considering again a training set with 1 positive and 10 negative examples, by employing the

Weighted Error Rate an error on a negative example contributes 1/20 of the total, while an error on the positive example contributes 1/2. Without explicitly running cross validation experiments, the best learning parameters, maximizing the LS-SVM generalization performance, can thus be found by minimizing the WERR and selecting the configuration with the lowest value. This modified leave-one-out cross validation estimate based on the Weighted Error Rate is named Adapt-W. The WERR criterion only intervenes in the final, model selection part of the transfer learning process: it helps to choose the best prior knowledge to transfer and to define its relevance to the new task, but it does not by itself build the new adapted model. To increase robustness to the unbalanced distribution of the data also during training, the model parameters (w, b) can be found through the minimization of a regularized weighted least-squares loss [6]:

min_{w,b} (1/2)||w||² + (C/2) Σ_{i=1}^{l} ζ_i [y_i − w·φ(x_i) − b]²    (3.16)

The optimality conditions for problem (3.16) again form a linear system, whose solution gives the model parameters (α, b):

[ K + (1/C)W   1 ] [ α ]   [ y ]
[ 1ᵀ           0 ] [ b ] = [ 0 ]    (3.17)

where W = diag{ζ_1⁻¹, ζ_2⁻¹, ..., ζ_l⁻¹} and the ζ_i are defined as in (3.15). Hence the model adaptation technique changes to its weighted formulation [24]:

min_{w,b} (1/2)||w − βw'||² + (C/2) Σ_{i=1}^{l} ζ_i [y_i − w·φ(x_i) − b]²    subject to  0 ≤ β ≤ 1

The weighting factors ζ_i take into account that the proportions of positive and negative examples in the training data are known not to be representative of the operational class frequencies; in particular, they balance the contributions of the sets of positive and negative examples to the data misfit term [24]. This algorithm selects only one prior model at a time, which is not always the best choice when multiple prior models could be exploited; moreover, when the number of training samples increases the method suffers from instability over time. A new solution is therefore needed to cope with these weaknesses of [24]. The work in [25] solves these problems with a new LS-SVM-based method, essentially a modification of [24], which we briefly explain in the following.
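Before moving to the multi-prior formulation, the sketch below illustrates the two ingredients just described: solving the weighted LS-SVM system of Eq. (3.17) and scoring a candidate scaling factor β through the weighted leave-one-out criterion of Eqs. (3.12), (3.14) and (3.15). The smoothing function Ψ is passed in by the caller (only its role is fixed by (3.9)), and all names are illustrative.

```python
import numpy as np

def class_weights(y):
    """Balancing weights zeta_i of Eq. (3.15)."""
    l, l_pos, l_neg = len(y), np.sum(y == 1), np.sum(y == -1)
    return np.where(y == 1, l / (2.0 * l_pos), l / (2.0 * l_neg))

def weighted_lssvm_fit(K, y, C, zeta):
    """Solve the weighted LS-SVM linear system of Eq. (3.17) for (alpha, b)."""
    l = len(y)
    G = np.zeros((l + 1, l + 1))
    G[:l, :l] = K + np.diag(1.0 / zeta) / C     # K + (1/C) W,  W = diag(1/zeta)
    G[:l, l] = 1.0
    G[l, :l] = 1.0
    G_inv = np.linalg.inv(G)
    sol = G_inv @ np.append(y.astype(float), 0.0)
    return sol[:l], sol[l], G_inv               # alpha, b, G^{-1} (reused for LOO)

def werr(y, alpha, alpha_prior, beta, G_inv, psi, zeta):
    """Weighted LOO error of Eq. (3.14), with adapted residuals from Eq. (3.12)."""
    d = np.diag(G_inv)[: len(y)]
    r = alpha / d - beta * alpha_prior / d      # r_i^(-i) for the adapted model
    return np.sum(zeta * psi(y * r - 1.0))
```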

Multi Prior Transfer Learning

The work in [25] proposed to replace the single coefficient β with a vector β in the objective function, containing as many components as the number of prior models k:

min_{w,b} (1/2)||w − Σ_{j=1}^{k} β_j w'_j||² + (C/2) Σ_{i=1}^{l} ζ_i (y_i − w·φ(x_i) − b)²    (3.18)

where β has to be selected in the unit ball, i.e. ||β||₂ ≤ 1. This constraint plays the same role as the regularization term employed in LS-SVM in Eq. (3.3), and it is a natural generalization of the original constraint 0 ≤ β ≤ 1; it is essential to prevent the high-variance problems that occur when the number of known models is large compared to the number of training samples. The new formulation of the optimal solution is:

w = Σ_{j=1}^{k} β_j w'_j + Σ_{i=1}^{l} α_i φ(x_i)    (3.19)

where w is defined as a weighted sum of the pre-trained models, scaled by the parameters β_j, plus a new model built on the incoming training data. To find the optimal β we again exploit the possibility in LS-SVM of writing the LOO error in closed form:

r_i^(-i) = y_i − ỹ_i = (α_i / G⁻¹_{ii}) − Σ_{j=1}^{k} β_j (α'_{i(j)} / G⁻¹_{ii})    (3.20)

where

α'_{i(j)} = G_(-i)⁻¹ [ŷ^j_1, ..., ŷ^j_{i-1}, ŷ^j_{i+1}, ..., ŷ^j_l, 0]ᵀ    (3.21)

and

ŷ^j_i = w'_j·φ(x_i)    (3.22)

and ỹ_i is the LOO estimate. Multiplying by y_i we obtain:

y_i ỹ_i = 1 − y_i ( (α_i / G⁻¹_{ii}) − Σ_{j=1}^{k} β_j (α'_{i(j)} / G⁻¹_{ii}) )    (3.23)

The best values of β_j are those reducing the LOO error, i.e. the values giving a positive y_i ỹ_i for each i. However, directly optimizing the sign of those quantities would result in a non-convex formulation with many local minima. Therefore the following loss function is used:

loss(y_i, ỹ_i) = ζ_i max[ 1 − y_i ỹ_i, 0 ] = ζ_i max[ y_i ( (α_i / G⁻¹_{ii}) − Σ_{j=1}^{k} β_j (α'_{i(j)} / G⁻¹_{ii}) ), 0 ]    (3.24)

This loss function is similar to the hinge loss used in Support Vector Machines. It is a convex upper bound to the LOO misclassification loss and favors solutions in which ỹ_i has a value of at least 1 with the same sign as y_i. Moreover it has a smoothing effect, similar to the function in (3.9). Finally, the objective function is:

J = Σ_{i=1}^{l} loss(y_i, ỹ_i)    s.t.    ||β||₂ ≤ 1    (3.25)

This formulation is equivalent to the optimization problem

min_β  ||β||² + C J    (3.26)

for a proper choice of C [6]. The best values of β_j for weighting the known prior models are found by minimizing J in the transfer learning process, and the scaling factors ζ_i are introduced in the loss function to take care of the unbalance between positive and negative samples in the training set, as in [24]. For the optimization process, [25] uses a simple projected sub-gradient descent algorithm, in which at each iteration β is projected onto the l₂-sphere ||β||₂ ≤ 1.
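A minimal sketch of that projected sub-gradient descent for Eq. (3.25): at each iteration a sub-gradient of the sum of the hinge-type losses in Eq. (3.24) is followed and β is projected back onto the unit l₂-ball. The step size and the number of iterations are illustrative assumptions.

```python
import numpy as np

def multi_kt_beta(y, a, B, zeta, n_iter=200, step=0.05):
    """Projected sub-gradient descent for beta, minimizing Eq. (3.25).

    y    : (l,)  labels in {-1, +1}
    a    : (l,)  terms alpha_i / [G^{-1}]_{ii}          (new-data part of Eq. (3.20))
    B    : (l,k) terms alpha'_{i(j)} / [G^{-1}]_{ii}    (one column per prior model)
    zeta : (l,)  class-balancing weights of Eq. (3.15)
    """
    beta = np.zeros(B.shape[1])
    for _ in range(n_iter):
        margin = y * (a - B @ beta)            # argument of the hinge in Eq. (3.24)
        active = margin > 0.0                  # samples with non-zero loss
        grad = -(zeta[active] * y[active]) @ B[active]   # sub-gradient of the sum
        beta -= step * grad
        norm = np.linalg.norm(beta)
        if norm > 1.0:                         # projection onto ||beta||_2 <= 1
            beta /= norm
    return beta
```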

Properties

The main advantage of this method is that it transfers from multiple models rather than only one; moreover, in [25] the relevance of each model and its interaction with the others, so as to prevent negative knowledge transfer, are assessed at the same time. The optimization of the loss function and of the objective function (3.25) has a unique solution, as both are convex. In addition, the new formulation is stable, in the sense that the behavior of the algorithm does not change much if one instance is excluded or included. We employ this algorithm as the batch transfer learning part of our proposed methods (4.1.1).

3.2 OTL: A Framework of Online Transfer Learning

Introduction

Zhao and Hoi proposed a framework of online transfer learning (OTL) with the goal of improving a supervised online learning algorithm in a new target domain by employing the knowledge learned previously from some source domains. In this method the target variable to be predicted changes during the learning procedure over time, and the target and source domains can also differ in their feature representations and class distributions. In the following we briefly explain the OTL technique and its mistake bound for the situation in which the source and target domains share the same feature space (homogeneous data).

OTL Algorithm

The OTL algorithm proposed in [31] is a two-stage online learning approach which combines a source classifier h(x) with a prediction function f(x) learned online on the target domain. Specifically, f is learned from a sequence of samples (x_t, y_t), with t ∈ {1, ..., T}. At the t-th trial the learner receives an instance x_t and the prediction function f_t is updated to f_{t+1} according to the PA-algorithm rule in Eq. (2.10), with f_t(x_t) = w_t·x_t. The class label is predicted by the following ensemble function [31]:

ŷ_t = sign( α_t Π(h(x_t)) + γ_t Π(f_t(x_t)) − 1/2 )    (3.27)

where Π(x) = max{0, min{1, (x+1)/2}} is a normalization function. The weights are initialized as α_1 = γ_1 = 1/2 and at each step they are adjusted dynamically according to [31]:

α_{t+1} = α_t s_t(h) / ( α_t s_t(h) + γ_t s_t(f_t) ),    γ_{t+1} = γ_t s_t(f_t) / ( α_t s_t(h) + γ_t s_t(f_t) )    (3.28)

where:

s_t(g) = exp{ −η l_s( Π(g(x_t)), Π(y_t) ) }    (3.29)

and l_s(z, y) = (z − y)² is the square loss function. We would now like to analyze the mistake bound of the OTL method, but we first need a proposition that will be employed in deriving the bound.

Proposition

When using the square loss l_s(z, y) = (z − y)² with z ∈ [0, 1] and y ∈ {0, 1}, the above weight update, and η = 1/2, the loss of the ensemble algorithm is bounded by:

Σ_{t=1}^{T} l_s( α_t Π(h(x_t)) + γ_t Π(f_t(x_t)), Π(y_t) ) ≤ 2 ln 2 + min{ Σ_{t=1}^{T} l_s(Π(h(x_t)), Π(y_t)), Σ_{t=1}^{T} l_s(Π(f_t(x_t)), Π(y_t)) }

With the help of this proposition we can derive the mistake bound of the OTL method in the case of homogeneous source and target domains.

Theoretical Analysis

In the particular case of a single source task, the OTL algorithm has theoretical support in the form of an upper bound on the number of mistakes made during the online learning process.

Theorem: Let M denote the number of mistakes made by the OTL algorithm; then M is bounded from above by [31]:

M ≤ 4 min{Σ_h, Σ_f} + 8 ln 2    (3.30)

where:

Σ_h = Σ_{t=1}^{T} l_s(Π(h(x_t)), Π(y_t))    (3.31)

Σ_f = Σ_{t=1}^{T} l_s(Π(f_t(x_t)), Π(y_t))    (3.32)
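A minimal sketch of one OTL round, combining Eqs. (3.27)-(3.29): the source score h(x_t) and the online score f_t(x_t) are mixed through the weights (α_t, γ_t), which are then re-normalized using the exponential of the square loss. The PA update of f_t itself is left outside, and all names are illustrative.

```python
import numpy as np

def pi(v):
    """Normalization Pi(x) = max(0, min(1, (x + 1) / 2)) from Eq. (3.27)."""
    return max(0.0, min(1.0, (v + 1.0) / 2.0))

def otl_step(h_score, f_score, y, alpha, gamma, eta=0.5):
    """One OTL round: ensemble prediction (3.27) and weight update (3.28)-(3.29)."""
    y_hat = np.sign(alpha * pi(h_score) + gamma * pi(f_score) - 0.5)
    sq = lambda z: (z - pi(y)) ** 2                       # square loss l_s
    s_h = np.exp(-eta * sq(pi(h_score)))                  # s_t(h), Eq. (3.29)
    s_f = np.exp(-eta * sq(pi(f_score)))                  # s_t(f_t), Eq. (3.29)
    z = alpha * s_h + gamma * s_f
    return y_hat, alpha * s_h / z, gamma * s_f / z        # Eq. (3.28)
```

In line with the text, the weights would be initialized as α_1 = γ_1 = 1/2 before the first call.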

Note that the first stage of OTL is based on the PA-algorithm, which uses the hinge loss, while the second stage uses the square loss. Hence in [31] the authors observe that, if we denote by M_h and M_f the mistake bounds of the models h and f_t respectively, and we assume that

Σ_t l_s(Π(h(x_t)), Π(y_t)) ≤ (1/4) M_h    (3.33)

and

Σ_t l_s(Π(f_t(x_t)), Π(y_t)) ≤ (1/4) M_f    (3.34)

then

M ≤ min{M_h, M_f} + 8 ln 2    (3.35)

The obtained mistake bound gives a solid theoretical support to the OTL method. However, the algorithm can only exploit a single source of knowledge to transfer; in the case of multiple sources it must be modified to exploit multiple priors. In this thesis we modify the OTL algorithm to exploit multiple sources of prior knowledge (4.1.4), and we then compare the performance of the modified OTL with our proposed methods. We will see that our methods and their corresponding mistake bounds outperform OTL.


Chapter 4
Proposed Method

4.1 Hybrid Transfer Learning

Our research in this thesis is related to two machine learning topics: transfer learning and online learning. Most research in the transfer learning framework has been done in an offline, or batch, learning fashion, where the training samples of the new target domain are provided in advance. Since in reality training examples may arrive in an online manner, typical batch transfer learning is often not an appropriate technique for real-world problems. Online learning algorithms are more suitable for such scenarios than traditional batch methods, since training samples arrive sequentially; moreover, online learning algorithms have lower computational complexity and are easier to implement. In this thesis we propose a novel transfer learning approach based on the combination of batch transfer learning and an online classifier, which in our case is the PA-algorithm. The only closely related algorithm is OTL [31] (3.2), which we need to modify to exploit multiple sources of prior knowledge, since the original cannot. In the following we present three different methods. The first explains how to initialize the online classifier with the batch transfer learning model parameters (4.1.1); the second defines a feature augmentation trick to update the source and target knowledge weights over time (4.1.3); and the last shows how to modify OTL to use multiple sources (4.1.4).

TRansfer initializes Online Learning: TROL Method

The Multi-KT algorithm (3.1.5) provides a model for the new target problem on the basis of very few training samples, exploiting a reliable combination of prior models. Although it is a batch approach, directly meant to minimize the generalization error of the obtained target model and operating in the small-sample scenario, we can use it to define a hybrid batch-online learning approach different from OTL. At the beginning, n target training samples are given as input to Multi-KT, which outputs the corresponding target model; this model is then used to initialize the online learning process. Using the PA-algorithm as the online learning part of the hybrid method, the updated solution at each step stays close to the previous one, which helps to keep the advantage given by Multi-KT together with the proper introduction of new information when necessary.
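In its simplest linear form, the hybrid scheme just described amounts to running PA-I updates starting from the Multi-KT output instead of the zero vector (the batch model is written explicitly as w_batch in Eq. (4.1) of the next subsection). The sketch below assumes the Multi-KT model is available as an explicit weight vector; names and data handling are illustrative.

```python
import numpy as np

def trol(w_batch, stream, C=1.0):
    """TROL: PA-I online learning initialized with the Multi-KT model.

    w_batch : weight vector returned by the batch Multi-KT step
    stream  : iterable of (x, y) pairs arriving sequentially, y in {-1, +1}
    """
    w = w_batch.copy()                          # transfer initializes online learning
    mistakes = 0
    for x, y in stream:
        if np.sign(np.dot(w, x)) != y:          # online prediction before the update
            mistakes += 1
        loss = max(0.0, 1.0 - y * np.dot(w, x))
        if loss > 0.0:
            w += min(C, loss / np.dot(x, x)) * y * x   # PA-I update, Eqs. (2.10)-(2.11)
    return w, mistakes
```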

TROL Algorithm

Training the Multi-KT algorithm on n target samples consists in solving the optimization problem in (3.18). The resulting model can be written as:

w_batch = Σ_{j=1}^{k} β_j ŵ_j + Σ_{i=1}^{n} α_i x_i    (4.1)

We call this model w_batch the batch model; it is then used in (2.10) as the initialization when learning from the (n+1)-th training sample onwards. The Multi-KT algorithm is applied to n (typically n ≤ 10) training samples and k prior knowledge sources, before starting the online learning process. We name our proposed algorithm TROL: TRansfer initializes Online Learning. The final cost of this method is O(T² + n³ + kn²), which for a sufficiently large number of samples T is dominated by the complexity of the PA-algorithm; in other words, the added complexity of running Multi-KT on n samples is negligible.

Theoretical Analysis and Mistake Bound

A good initialization of the online learning part of our method can improve the mistake bound of the PA-algorithm. In the following we first derive the mistake bound of the PA-algorithm and then generalize it. We denote the loss suffered by the PA-algorithm on round t by l_t and the loss suffered by an arbitrary fixed predictor by l*_t:

l_t = l(w_t; (x_t, y_t)),    l*_t = l(u; (x_t, y_t))    (4.2)

where u ∈ R^d is an arbitrary vector. Note that the loss function in the PA-algorithm is the hinge loss. With the aid of a technical lemma we derive the mistake bound for the PA-I algorithm, which we employ in our methods.

Lemma 1: Let (x_1, y_1), ..., (x_T, y_T) be a sequence of training samples in which x_t ∈ R^d and y_t ∈ {+1, −1} for all t, and let τ_t = min{C, l_t / ||x_t||²}. Using the notation in (4.2), we have [5]:

Σ_{t=1}^{T} τ_t ( 2 l_t − τ_t ||x_t||² − 2 l*_t ) ≤ ||u||²    (4.3)

Theorem: Let (x_1, y_1), ..., (x_T, y_T) be a sequence of training samples in which x_t ∈ R^d, y_t ∈ {+1, −1} and ||x_t|| ≤ R for all t. Then, for any vector u ∈ R^d, the number of prediction mistakes made by PA-I is bounded by [5]:

max{R², 1/C} ( ||u||² + 2C Σ_{t=1}^{T} l(u; (x_t, y_t)) )    (4.4)

where C is the aggressiveness parameter of PA-I (2.11).

Proof: On any round t in which PA-I makes a prediction mistake we have l_t ≥ 1. With our assumption that ||x_t||² ≤ R² and the definition τ_t = min{C, l_t / ||x_t||²}, for any round t with a prediction mistake we can write:

min{1/R², C} ≤ τ_t l_t    (4.5)

Let M denote the number of prediction mistakes over the entire training sequence. Since τ_t l_t is always non-negative, we have:

min{1/R², C} M ≤ Σ_{t=1}^{T} τ_t l_t    (4.6)

Plugging τ_t l*_t ≤ C l*_t and τ_t ||x_t||² ≤ l_t into Lemma 1 (4.3), we obtain:

Σ_{t=1}^{T} τ_t l_t ≤ ||u||² + 2C Σ_{t=1}^{T} l(u; (x_t, y_t))    (4.7)

Combining Eq. (4.6) and Eq. (4.7) yields:

min{1/R², C} M ≤ ||u||² + 2C Σ_{t=1}^{T} l(u; (x_t, y_t))    (4.8)

Finally, multiplying both sides of (4.8) by max{R², 1/C} gives the mistake bound of PA-I as stated in (4.4) [5].

Generalization of Mistake Bound

It is easy to generalize the mistake bound in (4.4) to the case of an initialization w_batch = Σ_{j=1}^{k} β_j ŵ_j + Σ_{i=1}^{n} α_i x_i different from the null vector. If we denote the hinge loss used in PA-I by l_H and the number of prediction mistakes over the entire training sequence by M, the mistake bound of the PA-algorithm generalizes to:

M ≤ 2 max{R², 1/C} ( (1/2)||u − w_batch||² + C Σ_{t=1}^{T} l_H(u; (x_t, y_t)) )    (4.9)

From this bound we conclude that it is possible to improve the performance of the PA-algorithm, at least in the worst case, by initializing the algorithm with a classifier that is close to the optimal one.

Augmentation trick Method: TROL+ Method

The learning solution described above integrates old and new knowledge through a proper initialization of the online process; the old knowledge is not directly re-weighted during learning. We show here that it is possible to use a simple feature augmentation trick to obtain the same starting condition as TROL together with a progressive update of the source and target knowledge weights over time. We call this algorithm TROL+.

TROL+ Algorithm

Given the model w_batch, we can evaluate its prediction on each new training sample as w_batch·x_t. Clipping the obtained value between −1 and 1, similarly to OTL, to limit the norm of the added dimension, we use this prediction as the (d+1)-th element of the feature vector describing x_t. So we define:

x'_t = (x_t, ν_t) ∈ R^{d+1}    where    ν_t = max{−1, min{1, w_batch·x_t}}    (4.10)

Samples with this modified representation enter the PA-algorithm, which is now initialized with w'_1 = (0, ..., 0, 1) ∈ R^{d+1}. At time t = 1 the PA-algorithm predicts with sign(w'_1·x'_1) = sign(ν_1), while for any time step t the updating

rule in (2.10) results in:

w'_{t+1} = w'_t + τ_t y_t x'_t    where    τ_t = min{ C, l_H((w'_t·x'_t), y_t) / ||x'_t||² }    (4.11)

and the predictions are:

w'_t·x'_t = ν_t + Σ_{i=1}^{t-1} τ_i y_i ( x_i·x_t + ν_i ν_t )    (4.12)

Hence the hyperplane w'_t can be thought of as composed of two parts, one for the old knowledge and one for the knowledge coming from the new instances. This approach can be generalized to allow the use of k different prior models w^j_batch, j = 1, ..., k. We expand the input vectors with k new dimensions:

x'_t = (x_t, ν_{1,t}, ..., ν_{k,t}) ∈ R^{d+k}    where    ν_{j,t} = max{−1, min{1, w^j_batch·x_t}}    (4.13)

Theoretical Analysis and Mistake Bound

From the bound (4.9), taking into account the increased dimensionality of the instances, we have the following theorem. Let (x'_t, y_t), t = 1, ..., T, be a sequence of transformed instances as in (4.13), with y_t ∈ {+1, −1} and ||x_t|| ≤ 1 for all t. Then, for any vector u ∈ R^{d+k}, the number of prediction mistakes made by TROL+ on this sequence of examples is bounded from above by:

M ≤ 2 max{1 + k, 1/C} ( (1/2)||u − w'_1||² + C Σ_{t=1}^{T} l_H(u; (x'_t, y_t)) )    (4.14)

where C is the aggressiveness parameter provided to TROL+. To compare this bound with the mistake bound of the OTL algorithm (3.2.4), we set C = 1 and use only one prior knowledge source, i.e. k = 1. Given that the bound in (4.14) holds for any u, we can loosen the bound by setting u to be the optimal vector for

the new knowledge alone or for the prior knowledge alone, obtaining

M ≤ 4 min{Σ_h, Σ_f}    (4.15)

where

Σ_h = Σ_{t=1}^{T} l_H(ν_t, y_t) ≤ Σ_{t=1}^{T} l_H(w_batch·x_t, y_t)    (4.16)

and

Σ_f = min_u { (1/2)||u||² + Σ_{t=1}^{T} l_H(u·x_t, y_t) }    (4.17)

Thus the performance of TROL+ is always close to the better of the performance of the prior and the performance of the best batch classifier over the new knowledge. However, in our method we use the hinge loss l_H and not the square loss l_s as in OTL; it is known that the former approximates the real 0/1 loss better than the latter. Moreover, as discussed in (3.2.4), the OTL bound does not directly link the performance to the two stages of the algorithm, while in TROL+ there is only one layer, so we do not have this problem. Another difference with OTL is that TROL+ makes only a finite number of mistakes if there is a hyperplane u that correctly classifies all the samples.

Modified OTL: M-OTL

The OTL algorithm [31] originally assumes the existence of a single source domain. In the case of multiple source tasks, a naive solution is to average all the prior knowledge models and use the mean classifier as h(x). A different solution consists in assigning one weight to each source knowledge. In this case we start from α_1 = Σ_{j=1}^{k} α_{j,1} = γ_1 = 1/2, with α_{j,1} = 1/(2k) for j = 1, ..., k, and then we update the weights with:

α_{j,t+1} = α_{j,t} s_t(h_j) / ( Σ_{j=1}^{k} α_{j,t} s_t(h_j) + γ_t s_t(f_t) ),    γ_{t+1} = γ_t s_t(f_t) / ( Σ_{j=1}^{k} α_{j,t} s_t(h_j) + γ_t s_t(f_t) )    (4.18)

If, as before, we neglect the prior knowledge learning process, the total computational complexity of OTL matches that of the online learning method used, since the cost of (3.28) and (4.18) is O(1) per step. Thus we have O(T²), as for the PA-algorithm [5].
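A minimal sketch of one M-OTL round with k sources, using the weight update of Eq. (4.18); the ensemble prediction is assumed to generalize Eq. (3.27) in the natural way, and Π and s_t(·) are as in Eqs. (3.27)-(3.29). All names are illustrative.

```python
import numpy as np

def pi(v):
    return max(0.0, min(1.0, (v + 1.0) / 2.0))

def m_otl_step(h_scores, f_score, y, alphas, gamma, eta=0.5):
    """One M-OTL round with k sources: prediction and weight update, Eq. (4.18)."""
    y_hat = np.sign(sum(a * pi(s) for a, s in zip(alphas, h_scores))
                    + gamma * pi(f_score) - 0.5)
    sq = lambda z: (z - pi(y)) ** 2
    s_h = np.array([np.exp(-eta * sq(pi(s))) for s in h_scores])
    s_f = np.exp(-eta * sq(pi(f_score)))
    z = np.dot(alphas, s_h) + gamma * s_f
    return y_hat, alphas * s_h / z, gamma * s_f / z

# initialization, as in the text: alphas = np.full(k, 1 / (2 * k)), gamma = 0.5
```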


Chapter 5

Experiments and Results

5.1 Experiments

We assessed our proposed algorithms in the visual category setting and evaluated their performance for online learning of visual categories. Following the setting proposed in [25], we used the Caltech-256 data set [15] for object category detection, considering a binary problem for each object class versus the background defined by the class clutter. By exploiting the taxonomy provided together with the images it is possible to define various transfer problems among differently related classes, with one or multiple sources of prior knowledge. For all the images we used the precomputed SIFT features of [14]. The training set for each class consisted of 60 samples and the test set of 100 samples, each containing an equal number of positive (object class) and negative (background) examples. We considered 10 random orderings of the samples for each class and report the average results over these ten splits, both in terms of the average error rate of the online methods and of the recognition rate of the current training solution on the test set. In the training sequence a positive sample is always followed by a negative one and vice versa; such an alternating sequence can be seen as a simple form of concept drift. For all experiments we used the Gaussian kernel

$K(x, x') = \exp\big(-\frac{1}{\sigma^2}\|x - x'\|^2\big)$   (5.1)

and fixed $\sigma$ to the mean of the pairwise distances among the samples.
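For reference, the kernel in (5.1) together with this bandwidth heuristic can be computed as in the following sketch (using NumPy/SciPy; the function name is ours):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def gaussian_kernel_matrix(X):
    """Gaussian kernel K(x, x') = exp(-||x - x'||^2 / sigma^2) as in (5.1),
    with sigma fixed to the mean of the pairwise distances among the samples."""
    dists = pdist(X, metric="euclidean")              # condensed vector of pairwise distances
    sigma = dists.mean()                              # bandwidth heuristic used in the experiments
    K = np.exp(-squareform(dists) ** 2 / sigma ** 2)  # full (n, n) kernel matrix
    return K, sigma
```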

We benchmarked TROL and TROL+ against the PA-algorithm trained only on the target samples, Multi-KT (3.1.5) and OTL (3.2), where in case of multiple priors we considered the average of all the available models as source classifier. This thesis is based on the work described in the following paper: T. Tommasi, F. Orabona, M. Kaboli, B. Caputo, "Leveraging over prior knowledge for online learning of visual categories", submitted to the BMVC 2012 conference [27].

Baselines

To compare our experimental results we define the following baselines:

NOTR: a batch least squares support vector machine (3.1.4), corresponding to learning only from the target samples and classifying the unseen test samples.

Multi-KT: the original batch transfer learning method presented in (3.1.5), [25].

PA: the passive-aggressive online learning strategy on the available target samples, with no transfer.

OTL: the online transfer learning method presented in (3.2), [31]. In case of multiple sources of prior knowledge we consider the average of all the available models as source classifier.

TR-OTL: OTL using as source knowledge the same Multi-KT output that we use as initialization in TROL.

M-OTL: our modified version of OTL, able to assign a different weight to each prior knowledge source in case of multiple sources, with the update rule defined in (4.18).

Initialization of Experiment

All the online techniques initialized with Multi-KT (3.1.5) use its output model learned over n = 6 training images, corresponding to three positive and three negative samples. All the source models have been learned with LS-SVM (3.1.2). The value of the C parameter is chosen by cross-validation on the sources and the same value is used for the batch methods (Multi-KT and NOTR) applied on the new task. The C value for all the online methods is instead kept fixed.

Single Source or Single Prior Knowledge

We ran a first group of experiments on pairs of related and unrelated classes. The related pairs are chosen inside the macro classes defined by the data set taxonomy, for instance transportation-ground-motorized and food-containers, while the unrelated ones are extracted randomly. For each pair we consider one of the classes as target task and the other as prior knowledge. We repeat the experiments swapping the roles of the two classes and, finally, we report the results averaged over the two configurations.

Related Classes

Figure 5.1 shows the recognition rate on the test set, plotted as a function of the number of training samples, for the experiments on the related classes coffee-mug and beer-mug. In this setting, all the transfer learning methods perform much better than learning from scratch.

Our experiments also show that the LS-SVM classifier (NOTR) achieves a higher recognition rate than the online classifier (PA). Nevertheless, all the online learning methods initialized with Multi-KT perform almost as well as the Multi-KT batch transfer learning method, which is based on the LS-SVM approach, and sometimes TROL+ even outperforms Multi-KT. Among the online transfer learning methods, TROL+ performs better than the others (TROL, TR-OTL and OTL). Moreover, OTL has the lowest performance in comparison with both our proposed algorithms (TROL and TROL+) and TR-OTL. Figure 5.2 depicts the average online learning error as a function of the number of training samples: TROL+ has the best performance in terms of online mistake rate in comparison with OTL and TR-OTL.

Figure 5.1. Related classes, single source: recognition rate

Figure 5.2. Related classes, single source: average online learning error

Intermediate Related Classes

The schoolbus-dog pair in Figure 5.3 represents an intermediate condition, where transfer learning can still be helpful. In this experiment both our proposed methods, TROL and TROL+, present a small advantage over TR-OTL in terms of recognition rate on the test set, while TROL+ and TR-OTL show the best performance in terms of online mistake rate in Figure 5.4. Moreover, OTL outperforms PA in terms of recognition rate on the test set only for fewer than 30 training samples, and has the highest online mistake rate among all the methods in Figure 5.4.

Figure 5.3. Intermediate related classes, single source: recognition rate

Figure 5.4. Intermediate related classes, single source: average online learning error

Unrelated Classes

When the prior knowledge is not informative, forcing the transfer can only worsen the classification performance (negative transfer, see 2.1.9). Figure 5.5 shows the recognition rate on the test set for the unrelated pair fireworks-treadmill. In this experiment both our proposed methods, TROL and TROL+, match the recognition performance of the corresponding no-transfer online learning method PA; in other words, our methods are robust against negative transfer, while OTL and TR-OTL suffer from it. Figure 5.6 shows the average online learning error rate as a function of the number of training samples: our proposed methods TROL and TROL+ have a lower online error rate than the OTL and TR-OTL algorithms.

Figure 5.5. Unrelated classes, single source: recognition rate

Figure 5.6. Unrelated classes, single source: average online learning error

Multiple Sources or Multi Prior Knowledge

We focus here on learning with multiple available prior knowledge models. This setting defines a novel testbed for OTL, which has previously been analyzed only in the case of a single source. Figure 5.7 shows the recognition rate on the test set for a group of four unrelated classes, motorbikes, airplanes, faces and leopards (originally used in [13]). Each of them is considered in turn as target task while the remaining ones define the source knowledge. Despite the difference among the object categories, Multi-KT is able to define a good combination of priors and to exploit them when learning the new target, obtaining extremely good classification results. TROL, TROL+ and TR-OTL, initialized with Multi-KT, match its recognition performance after 10 training samples. Since the original OTL method is not able to exploit multiple sources of knowledge, to have a fair comparison with the other online methods we consider the average of the prior knowledge models. In Figure 5.7 we see that OTL suffers from negative transfer and has the worst performance. Moreover, the corresponding M-OTL version, based on a different weight for each source classifier, does not show any advantage over learning from scratch (PA), but at least its performance is not the worst. Our proposed methods TROL and TROL+, and also TR-OTL (OTL initialized with Multi-KT), have the best performance with respect to all the other baselines in terms of recognition rate on the test set. Moreover, the performance of TROL+ and TROL is almost the same as that of the Multi-KT batch transfer learning method, despite the

fact that LS-SVM batch learning (NOTR) outperforms the corresponding online learning method (PA). From Figure 5.8 we conclude that TROL, TROL+ and TR-OTL have the lowest online mistake rate in comparison with the other online methods. A special remark is necessary here for the method named TROL+ (priors). This refers to the case in which each prior knowledge model is considered as a separate source, so the feature space is augmented with k = 3 new elements. This method outperforms OTL and M-OTL both in terms of mistake rate and of recognition on the test set, roughly matching the batch performance of Multi-KT after 20 training samples.

Figure 5.7. Four classes, unrelated sources: recognition rate

Figure 5.8. Four classes, unrelated sources: average online learning error

Value of Weights

Figures 5.9, 5.10, 5.11 and 5.12 show how the weights given to the source and target knowledge (for the classes motorbikes, airplanes, faces and leopards) change over time in the OTL-related methods (OTL, TR-OTL and M-OTL). The model obtained as output from Multi-KT, used as source in TR-OTL, maintains a high weight over time, which demonstrates its usefulness for the learning process. We also see that for OTL and M-OTL the source knowledge loses its importance over time, or receives a small weight from the start. The figures report the value of the weights given to the source (S) and target (T) knowledge by the OTL-related methods for one split. The line M-OTL (S) corresponds to the sum of all the weights separately given to the sources. The stars indicate the weight given to each of the source classes by Multi-KT and used in the input model to TR-OTL.
