Revisiting Output Coding for Sequential Supervised Learning
Guohua Hao and Alan Fern
School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA

Abstract
Markov models are commonly used for joint inference of label sequences. Unfortunately, inference scales quadratically in the number of labels, which is problematic for training methods where inference is repeatedly performed and is the primary computational bottleneck for large label sets. Recent work has used output coding to address this issue by converting a problem with many labels to a set of problems with binary labels. Models were independently trained for each binary problem, at a much reduced computational cost, and then combined for joint inference over the original labels. Here we revisit this idea and show through experiments on synthetic and benchmark data sets that the approach can perform poorly when it is critical to explicitly capture the Markovian transition structure of the large-label problem. We then describe a simple cascade-training approach and show that it can improve performance on such problems with negligible computational overhead.

1 Introduction
Sequence labeling is the problem of assigning a class label to each element of an input sequence. We study supervised learning for sequence-labeling problems where the number of labels is large. Markov models are widely used for sequence labeling; however, the time complexity of standard inference algorithms, such as Viterbi and forward-backward, scales quadratically with the number of labels. When this number is large, this can be problematic when predictions must be made quickly. Even more problematic is the fact that a number of recent approaches for training Markov models, e.g. [Lafferty et al., 2001; Tsochantaridis et al., 2004; Collins, 2002], repeatedly perform inference during learning. As the number of labels grows, such approaches quickly become computationally impractical.

This problem has led to recent work that considers various approaches to reducing computation while maintaining high accuracy. One approach has been to integrate approximate-inference techniques such as beam search into the learning process [Collins and Roark, 2004; Pal et al., 2006]. A second approach has been to consider learning subclasses of Markov models that facilitate sub-quadratic inference, e.g. using linear embedding models [Felzenszwalb et al., 2003] or bounding the number of distinct transition costs [Siddiqi and Moore, 2005]. Both of these approaches have demonstrated good performance on a number of sequence-labeling problems that satisfy the assumptions of the models. A third recent approach, which is the focus of this paper, is sequential error-correcting output coding (SECOC) [Cohn et al., 2005], motivated by the success of error-correcting output coding (ECOC) for non-sequential problems [Dietterich and Bakiri, 1995]. The idea is to use an output code to reduce a multi-label sequence-learning problem to a set of binary-label problems. SECOC solves each binary problem independently and then combines the learned models to make predictions over the original large label set. The computational cost of learning and inference scales linearly in the number of binary models and is independent of the number of labels. Experiments on several data sets showed that SECOC significantly reduced training and labeling time with little loss in accuracy.
While the initial SECOC results were encouraging, the study did not address SECOC's general applicability and its potential limitations. For non-sequential learning, ECOC has been shown to be a formal reduction from multi-label to binary-label classification [Langford and Beygelzimer, 2005]. One contribution of this paper is to show that this result does not hold for sequential learning. That is, there are sequence-labeling problems such that, for any output code, SECOC performs poorly compared to directly learning on the large label set, even assuming optimal learning for the binary problems. Given the theoretical limitations of SECOC, and the prior empirical success, the main goals of this paper are: 1) to better understand the types of problems for which SECOC is suited, and 2) to suggest a simple approach to overcoming the limitations of SECOC and to evaluate its performance. We present experiments on synthetic and benchmark data sets. The results show that the originally introduced SECOC performs poorly for problems where the Markovian transition structure is important but the transition information is not captured well by the input features. These results suggest that when it is critical to explicitly capture Markovian transition structure, SECOC may not be a good choice. In response we introduce a simple extension to SECOC, called cascaded SECOC, where the predictions of previously learned binary models are used as features for later models. Cascading allows richer transition structure to be encoded in the learned model with little computational overhead. Our results show that cascading can significantly improve on SECOC for problems where capturing the Markovian structure is critical.
2 Conditional Random Fields
We study the sequence-labeling problem, where the goal is to learn a classifier that, when given an observation sequence X = (x_1, x_2, ..., x_T), returns a label sequence Y = (y_1, y_2, ..., y_T) that assigns the correct class label to each observation. In this work, we use conditional random fields (CRFs) for sequence labeling. A (sequential) CRF is a Markov random field [Geman and Geman, 1984] over the label sequence Y, globally conditioned on the observation sequence X. By the Hammersley-Clifford theorem, the conditional distribution P(Y | X) has the form

P(Y | X) = (1 / Z(X)) exp( Σ_t Ψ(y_{t-1}, y_t, X) ),

where Z(X) is a normalization constant and Ψ(y_{t-1}, y_t, X) is the potential function defined on the clique (y_{t-1}, y_t). For simplicity, we only consider the pair-wise cliques (y_{t-1}, y_t), i.e. chain-structured CRFs, and assume that the potential function does not depend on the position t. As in [Lafferty et al., 2001], the potential function is usually represented as a linear combination of binary features:

Ψ(y_{t-1}, y_t, X) = Σ_α λ_α f_α(y_t, X) + Σ_β γ_β g_β(y_t, y_{t-1}, X).

The functions f capture the relationship between the label y_t and the input sequence X; the boolean functions g capture the transitions between y_{t-1} and y_t (conditioned on X). Many algorithms have been proposed for training the parameters λ and γ; however, almost all of these methods involve repeated computation of the partition function Z(X) and/or maximization over label sequences, which is usually done using the forward-backward and Viterbi algorithms. The time complexity of these algorithms is O(T L^{k+1}), where L is the number of class labels, k is the order of the Markov model, and T is the sequence length. So even for first-order CRFs, training and inference scale quadratically in the number of class labels, which becomes computationally demanding for large label sets. Below we describe the use of output coding for combating this computational burden.

3 ECOC for Sequence Labeling
For non-sequential supervised classification, the idea of error-correcting output coding (ECOC) has been successfully used to solve multi-class problems [Dietterich and Bakiri, 1995]. A multi-class learning problem with training examples {(x_i, y_i)} is reduced to a number of binary-class learning problems by assigning each class label j a binary vector, or code word, C_j of length n. A code matrix M is built by taking the code words as rows. Each column b_k of M can be regarded as a binary partition of the original label set, b_k(y) ∈ {0, 1}. We can then learn a binary classifier h_k using training examples {(x_i, b_k(y_i))}. To predict the label of a new instance x, we concatenate the predictions of each binary classifier to get a vector H(x) = (h_1(x), h_2(x), ..., h_n(x)). The predicted label ŷ is then given by ŷ = argmin_j Δ(H(x), C_j), where Δ is some distance measure, such as Hamming distance. In some implementations H(x) stores probabilities rather than binary labels.
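To make the reduction concrete, the following is a minimal sketch in Python with numpy (which the paper does not use; the function names and the label/bit counts are illustrative choices, not the authors' implementation) of building a random code matrix under the column constraints described in Section 5 and decoding with Hamming distance.

```python
import numpy as np

def random_code_matrix(num_labels, n_bits, rng=None):
    """Sample a random code matrix M (one codeword per row / label).

    Rejection-sample until no two columns are identical or complementary and
    no two labels share a codeword (the constraints used in Section 5).
    Illustrative sketch only, not the authors' implementation.
    """
    rng = np.random.default_rng(rng)
    while True:
        M = rng.integers(0, 2, size=(num_labels, n_bits))
        cols = {tuple(M[:, k]) for k in range(n_bits)}
        comp = {tuple(1 - M[:, k]) for k in range(n_bits)}
        rows = {tuple(row) for row in M}
        if len(cols) == n_bits and not (cols & comp) and len(rows) == num_labels:
            return M

def ecoc_decode(bit_predictions, M):
    """Return the label whose codeword is closest (Hamming distance) to the
    vector of binary classifier outputs H(x) = (h_1(x), ..., h_n(x))."""
    return int(np.abs(M - np.asarray(bit_predictions)).sum(axis=1).argmin())

# Toy usage: a codeword decodes back to its own (0-indexed) label.
M = random_code_matrix(num_labels=5, n_bits=6, rng=0)
print(ecoc_decode(M[3], M))   # prints 3
```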
This idea has recently been applied to sequence-labeling problems under the name sequential error-correcting output coding (SECOC) [Cohn et al., 2005], with the motivation of reducing computation time for large label sets. In SECOC, instead of training a multi-class CRF, we train a binary-class CRF h_k for each column b_k of a code matrix M. More specifically, the training data for CRF h_k is given by {(X_i, b_k(Y_i))}, where b_k(Y) = (b_k(y_1), b_k(y_2), ..., b_k(y_T)) is a binary label sequence.

Given a set of binary classifiers for a code matrix and an observation sequence X, we compute the multi-label output sequence as follows. We first use the forward-backward algorithm on each h_k to compute the probability that each sequence position should be labeled 1. For each sequence element x_t we then form a probability vector H(x_t) = (p_1, ..., p_n), where p_k is the probability computed by h_k for x_t. We then let the label for x_t be the class label y_t whose codeword is closest to H(x_t) based on L1 distance. The complexity of this inference process is just O(nT), where n is the codeword length, which is typically much smaller than the number of labels squared. Thus, SECOC can significantly speed inference for large label sets. Prior work [Cohn et al., 2005] demonstrated on several problems that SECOC can significantly speed training without significantly hurting accuracy. However, this work did not address the potential limitations of the SECOC approach.

A key characteristic of SECOC is that each binary CRF is trained completely independently of the others, and each binary CRF only sees a very coarse view of the multi-class label sequences. Intuitively, it appears difficult to represent complex multi-class transition models between y_{t-1} and y_t using such independent chains. This raises the fundamental question of the representational capacity of the SECOC model. The following counter example gives a partial answer to this question, showing that the SECOC model is unable to represent relatively simple multi-label transition models.

SECOC Counter Example. Consider a simple Markov model with three states Y = {1, 2, 3} and deterministic transitions 1 → 2, 2 → 3 and 3 → 1. A 3-by-3 diagonal code matrix is sufficient for capturing all non-trivial codewords for this label set, i.e. all non-trivial binary partitions of Y. Below we show an example label sequence and the corresponding binary code sequences.

         y_1  y_2  y_3  y_4  y_5  y_6  y_7
Y         1    2    3    1    2    3    1
b_1(Y)    1    0    0    1    0    0    1
b_2(Y)    0    1    0    0    1    0    0
b_3(Y)    0    0    1    0    0    1    0

Given a label sequence Y = (1, 2, 3, 1, 2, 3, 1, ...), a first-order Markov model learned for b_1(Y) will converge to P(y_t = 1 | y_{t-1} = 0) = P(y_t = 0 | y_{t-1} = 0) = 0.5. It can be shown that as the sequence length grows such a model will make i.i.d. predictions according to the stationary distribution, which predicts 1 with probability 1/3. The same is true for b_2 and b_3. Since the i.i.d. predictions of the binary CRFs are independent, using these models to make predictions via SECOC will yield a substantial error rate, even though the sequence is perfectly predictable. Independent first-order binary transition models are simply unable to capture the transition structure in this problem. Our experimental results will show that this deficiency is also exhibited in real data.
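As a concrete illustration, here is a sketch (again numpy with illustrative names, not the authors' code) of the per-position decoding step just described, together with the degenerate marginals that arise in the counter example; `bit_marginals[t, k]` stands in for the forward-backward marginal produced by binary CRF h_k at position t.

```python
import numpy as np

def secoc_decode(bit_marginals, M):
    """Per-position SECOC decoding: for each t, pick the label whose codeword
    is closest in L1 distance to H(x_t) = (p_1, ..., p_n).

    bit_marginals: (T, n) array of per-bit marginals from forward-backward.
    M:             (L, n) code matrix. Returns a length-T sequence of
                   0-indexed labels.
    """
    # dists[t, j] = sum_k |p_k(x_t) - M[j, k]|
    dists = np.abs(bit_marginals[:, None, :] - M[None, :, :]).sum(axis=2)
    return dists.argmin(axis=1)

# Counter example: with the 3-state cycle and the diagonal (identity) code,
# each independently trained binary chain converges to i.i.d. predictions with
# marginal 1/3, so every position receives the same uninformative vector and
# decoding degenerates to a constant guess (about 1/3 accuracy on the cycle).
H = np.full((7, 3), 1 / 3)
print(secoc_decode(H, np.eye(3, dtype=int)))   # e.g. [0 0 0 0 0 0 0]
```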
4 Cascaded SECOC Training of CRFs
Each binary CRF in SECOC has very limited knowledge of the Markovian transition structure. One possibility for improving this situation is to provide limited coupling between the binary CRFs. One way to do this is to include observation features in each binary CRF that record the binary predictions of previous CRFs. We call this approach cascaded SECOC (c-SECOC), as opposed to independent SECOC (i-SECOC). Given training examples S = {(X_i, Y_i)}, let Ŷ_i^(j) be the prediction of the binary CRF learned with the j-th binary partition b_j, and ŷ_{i,t}^(j) be the t-th element of Ŷ_i^(j). To train the binary CRF h_k based on binary partition b_k, each training example (X_i, Y_i) is extended to (X'_i, b_k(Y_i)), where each x'_{i,t} is the union of the observation features x_{i,t} and the predictions of the previous h binary CRFs at sequence positions from t-l to t+r (l = r = 1 in our experiments), except position t:

x'_{i,t} = (x_{i,t}, ŷ_{i,t-l}^(k-1), ..., ŷ_{i,t-1}^(k-1), ŷ_{i,t+1}^(k-1), ..., ŷ_{i,t+r}^(k-1), ..., ŷ_{i,t-l}^(k-h), ..., ŷ_{i,t-1}^(k-h), ŷ_{i,t+1}^(k-h), ..., ŷ_{i,t+r}^(k-h)).

We refer to h as the cascade history length. We do not use the previous binary predictions at position t as part of x'_{i,t}, since such features have significant autocorrelation, which can easily lead to overfitting. To predict Y for a given observation sequence X, we make predictions from the first binary CRF to the last, feeding predictions into later binary CRFs as appropriate, and then use the same decoding process as i-SECOC. Through the use of previous binary predictions, c-SECOC has the potential to capture Markovian transition structure that i-SECOC cannot. Our experiments show that this is important for problems where the transition structure is critical to sequence labeling but is not reflected in the observation features. The computational overhead of c-SECOC over i-SECOC is an increase in the number of observation features, which typically has a negligible impact on the overall training time. As the cascade history grows, however, there is the potential for c-SECOC to overfit on the additional features. We will discuss this issue further in the next section.
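Below is a minimal sketch (numpy; the names, the zero padding at sequence boundaries, and the example shapes are choices of this illustration rather than details from the paper) of how the extended feature vector x'_{i,t} can be assembled from the observation features and the predictions of the h previous binary CRFs.

```python
import numpy as np

def cascade_features(X_obs, prev_preds, t, l=1, r=1):
    """Assemble the extended feature vector x'_t for one sequence position.

    X_obs:      (T, d) observation features for one sequence.
    prev_preds: predictions of the h most recently trained binary CRFs on this
                sequence, each a (T,) array of 0/1 labels (most recent first).
    Predictions from positions t-l, ..., t+r are appended, skipping position t
    itself to avoid the autocorrelation/overfitting issue noted above.
    Positions outside the sequence are padded with 0 (an arbitrary choice).
    """
    T = X_obs.shape[0]
    extra = []
    for pred in prev_preds:                       # one block per cascade CRF
        for s in range(t - l, t + r + 1):
            if s == t:
                continue                           # never use position t itself
            extra.append(float(pred[s]) if 0 <= s < T else 0.0)
    return np.concatenate([X_obs[t], np.asarray(extra)])

# Example: l = r = 1 and cascade history h = 2 adds 2 * 2 = 4 extra features.
X_obs = np.random.rand(10, 26)                    # toy observation features
prev = [np.random.randint(0, 2, 10), np.random.randint(0, 2, 10)]
x_prime = cascade_features(X_obs, prev, t=4)      # shape (26 + 4,)
```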
5 Experimental Results
We compare CRFs trained using i-SECOC, c-SECOC, and beam search over the full label set. We consider two existing base CRF algorithms: gradient tree boosting (GTB) [Dietterich et al., 2004] and voted perceptron (VP) [Collins, 2002]. GTB is able to construct complex features from the primitive observations and labels, whereas VP is only able to combine the observations and labels linearly. Thus, GTB has more expressive power, but can require substantially more computational effort. In all cases we use the forward-backward algorithm to make label-sequence predictions and measure accuracy as the fraction of correctly labeled sequence elements. We used random code matrices constrained so that no columns are identical or complementary and no class labels have the same code word.

First, we consider a non-sequential baseline model, denoted IID, which treats all sequence elements as independent examples, effectively using non-sequential ECOC at the sequence-element level. In particular, IID is trained using i-SECOC with zeroth-order binary CRFs, i.e. CRFs with no transition model. This model allows us to assess the accuracy that can be attained by looking at only a window of observation features. Second, we denote by i-SECOC the use of independent SECOC to train first-order binary CRFs. Third, we denote by c-SECOC(h) the use of c-SECOC to train first-order binary CRFs using a cascade history of length h.

Summary of Results. The results presented below justify five main claims: 1) i-SECOC can fail to capture significant transition structure, leading to poor accuracy; such observations were not made in the original evaluation of SECOC [Cohn et al., 2005]. 2) c-SECOC can significantly improve on i-SECOC through the use of cascade features. 3) The performance of c-SECOC can depend strongly on the base CRF algorithm; in particular, it appears critical that the algorithm be able to capture complex (non-linear) interactions among the observation and cascade features. 4) c-SECOC can improve on models trained using beam search when GTB is used as the base CRF algorithm. 5) When using weaker base learning methods such as VP, beam search can outperform c-SECOC.

5.1 Nettalk Data Set
The Nettalk task [Sejnowski and Rosenberg, 1987] is to assign a phoneme and stress symbol to each letter of a word so that the word is pronounced correctly. Here the observations correspond to letters, yielding a total of 26 binary observation features at each sequence position. Labels correspond to phoneme-stress pairs, yielding a total of 134 labels. We use the standard training and test sequences.

Comparing to IID. Figures 1a and 1b show the results for training our various models using GTB with window sizes 1 and 3. For window size 1, we see that i-SECOC is able to significantly improve over IID, which indicates that i-SECOC is able to capture useful transition structure to improve accuracy. However, we see that by increasing the cascade history length, c-SECOC is able to substantially improve over i-SECOC; even with h = 1 the accuracy improves markedly. This indicates that the independent CRF training strategy of i-SECOC is unable to capture important transition structure in this problem. c-SECOC is able to exploit the cascade history features in order to capture this transition information, which is particularly critical in this problem, where apparently just using the information in an observation window of length 1 is not sufficient to make accurate predictions (as indicated by the IID results). Results for window size 3 are similar; however, the improvements of i-SECOC over IID and of c-SECOC over i-SECOC are smaller. This is expected since the larger window size spans multiple sequence positions, allowing the model to capture transition information using the observations alone and making an explicit transition model less important. Nevertheless, both SECOC methods can capture useful transition structure that IID cannot, with c-SECOC benefiting from the use of cascade features. For both window sizes, we see that c-SECOC performs best for a particular cascade history length, and increasing beyond that length decreases accuracy by a small amount. This indicates that c-SECOC can suffer from overfitting as the cascade history grows.

Figures 1c and 1d show corresponding results for VP. We still observe that including cascade features allows c-SECOC to improve upon i-SECOC, though compared to GTB longer cascade histories are required to achieve improvement.
Figure 1: Nettalk data set with window sizes 1 and 3, trained by GTB and VP. Panels: (a) window size 1 with GTB; (b) window size 3 with GTB; (c) window size 1 with VP; (d) window size 3 with VP. Curves compare c-SECOC with various cascade history lengths, including c-SECOC(all).

No overfitting was observed up to the longest cascade history length we tried; however, we were able to observe overfitting by including all possible cascade history bits, denoted c-SECOC(all). All of our overfitting results suggest that in practice a limited validation process should be used to select the cascade history for c-SECOC. For large label sets, trying a small number of cascade history lengths is a much more practical alternative than training on the full label set.

GTB versus VP. c-SECOC using GTB performs significantly better than c-SECOC using VP. We explain this by noting that the critical feature of c-SECOC is that the binary CRFs are able to exploit the cascade features in order to better capture the transition structure. VP considers only linear combinations of these features, whereas GTB is able to capture non-linear relationships by inducing complex combinations of these features and hence capturing richer transition structure. This indicates that when using c-SECOC it can be important to use training methods, such as GTB, that are able to capture rich patterns in the observations and hence in the cascade features.

Comparing to Beam Search. Here we compare the performance of c-SECOC with multi-label CRF models trained using beam search in place of full Viterbi and forward-backward inference, which is a common approach to achieving practical inference with large label sets. For beam search, we tried various beam sizes within reasonable computational limits. Figure 2 shows the accuracy/training-time trade-offs for the best beam-search results and the best SECOC results across the code-word lengths and cascade histories we tried. The graph shows test-set performance versus the amount of training time in CPU seconds. First notice that beam search with GTB is significantly worse than beam search with VP. We believe this is because GTB requires forward-backward inference during training whereas VP does not, and forward-backward is more adversely affected by beam search. In contrast, c-SECOC using GTB performs significantly better than VP with beam search.

5.2 Noun Phrase Chunking
We consider the CoNLL Noun Phrase Chunking (NPC) shared task, which involves labeling the words of sentences. This was one of the problems used in the original evaluation of SECOC. There are 121 class labels, which are combinations of part-of-speech tags and NPC labels, and 21,589 observation features for each word in the sequences. There are 8,936 sequences in the training set and 2,012 sequences in the test set. Due to the large number of observation features, we could not get good results for GTB using our current implementation and only present results for VP.

Comparing to IID. As shown in Figure 3a, for window size 1, i-SECOC outperforms IID, and incorporating cascade features allows c-SECOC to outperform i-SECOC by a small margin. Again we see overfitting for c-SECOC for larger numbers of cascade features. Moving to window size 3 changes the story. Here a large amount of observation information is included from the current and adjacent sentence positions, and as a result IID performs as well as any of the SECOC approaches.
The large amount of observation information at each sequence position appears to capture enough transition information that SECOC gets little benefit from learning an explicit transition model. This suggests that the performance of the SECOC methods in this domain is primarily a reflection of the ability of non-sequential ECOC. This is interesting given the results in [Cohn et al., 2005], where all domains involved a large amount of local observation information; IID results were not reported there.
Figure 2: Nettalk: comparison between SECOC and beam search using GTB. Panels: (a) window size 1; (b) window size 3. Curves compare i-SECOC (GTB), c-SECOC (GTB), and Beam (GTB).

Figure 3: NPC data set with window sizes 1 and 3, trained by VP. Panels: (a) window size 1; (b) window size 3.

Comparing to Beam Search. We compare accuracy versus training time for models trained with SECOC and with beam search, using the experimental setup already described. Figure 4 shows that beam search performs much better than the c-SECOC methods within the same training time when VP is used as the base learning algorithm. We believe the poor performance of c-SECOC compared to beam search is due to VP not allowing rich patterns to be captured in the observations, which include the cascade features. We hypothesize, as observed on Nettalk, that c-SECOC would perform as well as or better than beam search given a base CRF method that can capture complex patterns in the observations.

5.3 Synthetic Data Sets
Our results suggest that i-SECOC does poorly when critical transition structure is not captured by the observation features and that c-SECOC can improve on i-SECOC in such cases. Here we further evaluate these hypotheses. We generated data using a hidden Markov model (HMM) with labels/states {l_1, ..., l_N} and possible observations {o_1, ..., o_N}. To specify the observation distribution, for each label l_i we randomly draw a set O_i of k_o observations not including o_i. If the current state is l_i, then the HMM generates the observation o_i with probability p_o and otherwise generates a randomly selected observation from O_i. The observation model is made more important by increasing p_o and decreasing k_o. The transition model is defined in a similar way. For each label l_i we randomly draw a set L_i of k_l labels not including l_i. If the current state is l_i, then the HMM sets the next state to l_i with probability p_l and otherwise generates a random next state from L_i. Increasing p_l and decreasing k_l increases the importance of the transition model.

Transition Data Set. Here we use p_o = 0.2, k_o = 8, p_l = 0.6, k_l = 2, so that the transition structure is very important and the observation features are quite uninformative. Figure 5a shows the results for GTB training using window size 1. We see that i-SECOC is significantly better than IID, indicating that it is able to exploit transition structure not available to IID. However, including cascade features allows c-SECOC to further improve performance, showing that i-SECOC was unable to fully exploit the information in the transition model. As in our previous experiments, we observe that c-SECOC does overfit for the largest number of cascade features. For window size 3 (graph not shown) the results are similar, except that the relative improvements are smaller since more observation information is available, making the transition model less critical. These results mirror those for the Nettalk data set.

Both Data Set. Here we use p_o = 0.6, k_o = 2, p_l = 0.6, k_l = 2, so that both the transition structure and the observation features are very informative. Figure 5b shows results for GTB with window size 3. The observation features provide a large amount of information, and the performance of IID is similar to that of the SECOC variants. Also, we see that c-SECOC is unable to improve on i-SECOC in this case. This suggests that the observation information captures the bulk of the transition information, and the performance of the SECOC methods is a reflection of non-sequential ECOC rather than of their ability to explicitly capture transition structure.
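For concreteness, here is a sketch of the synthetic generator described above (Python/numpy; the sampling follows the text, while the number of labels, the sequence length, and the seed are arbitrary choices since the paper's exact values are not reproduced here).

```python
import numpy as np

def make_hmm(num_labels, k_o, k_l, rng):
    """Draw the per-label sets O_i and L_i described in Section 5.3."""
    O = [rng.choice(np.delete(np.arange(num_labels), i), size=k_o, replace=False)
         for i in range(num_labels)]
    L = [rng.choice(np.delete(np.arange(num_labels), i), size=k_l, replace=False)
         for i in range(num_labels)]
    return O, L

def sample_sequence(T, num_labels, p_o, p_l, O, L, rng):
    """Sample one (labels, observations) pair from the synthetic HMM.

    Per the text: state l_i emits its own observation o_i with probability p_o,
    otherwise a random member of O_i; the next state is l_i with probability
    p_l, otherwise a random member of L_i.
    """
    y = np.empty(T, dtype=int)
    x = np.empty(T, dtype=int)
    y[0] = rng.integers(num_labels)
    for t in range(T):
        i = y[t]
        x[t] = i if rng.random() < p_o else rng.choice(O[i])
        if t + 1 < T:
            y[t + 1] = i if rng.random() < p_l else rng.choice(L[i])
    return y, x

# "Transition" setting from the text: uninformative observations, strong transitions.
rng = np.random.default_rng(0)
O, L = make_hmm(num_labels=40, k_o=8, k_l=2, rng=rng)   # 40 labels is an assumed count
y, x = sample_sequence(T=200, num_labels=40, p_o=0.2, p_l=0.6, O=O, L=L, rng=rng)
```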
6 Summary
We uncovered empirical and theoretical shortcomings of independently trained SECOC. Independent training of binary CRFs can perform poorly when it is critical to explicitly capture complex transition models.
Figure 4: NPC: comparison between SECOC and beam search using VP. Panels: (a) window size 1; (b) window size 3. Curves compare i-SECOC (VP), c-SECOC (VP), and beam search.

Figure 5: Synthetic data sets trained by GTB. Panels: (a) window size 1 on the Transition data set; (b) window size 3 on the Both data set.

We proposed cascaded SECOC and showed that it can improve accuracy in such situations. We also showed that when using less powerful CRF base learning algorithms, approaches other than SECOC (e.g. beam search) may be preferable. Future directions include efficient validation procedures for selecting cascade history lengths, incremental generation of code words, and a wider comparison of methods for dealing with large label sets.

Acknowledgments
We thank John Langford for discussion of the counter example to independent SECOC and Thomas Dietterich for his support. This work was supported by NSF grant IIS.

References
[Cohn et al., 2005] Trevor Cohn, Andrew Smith, and Miles Osborne. Scaling conditional random fields using error correcting codes. In Annual Meeting of the Association for Computational Linguistics, 2005.
[Collins and Roark, 2004] Michael Collins and Brian Roark. Incremental parsing with the perceptron algorithm. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, 2004.
[Collins, 2002] M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Empirical Methods in NLP, pages 1-8, 2002.
[Dietterich and Bakiri, 1995] Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. JAIR, 2:263-286, 1995.
[Dietterich et al., 2004] Thomas G. Dietterich, Adam Ashenfelter, and Yaroslav Bulatov. Training conditional random fields via gradient tree boosting. In ICML, 2004.
[Felzenszwalb et al., 2003] P. Felzenszwalb, D. Huttenlocher, and J. Kleinberg. Fast algorithms for large-state-space HMMs with applications to web usage analysis. In NIPS 16, 2003.
[Geman and Geman, 1984] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. PAMI, 6:721-741, 1984.
[Lafferty et al., 2001] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282-289, 2001.
[Langford and Beygelzimer, 2005] John Langford and Alina Beygelzimer. Sensitive error correcting output codes. In COLT, 2005.
[Pal et al., 2006] Chris Pal, Charles Sutton, and Andrew McCallum. Sparse forward-backward using minimum divergence beams for fast training of conditional random fields. In International Conference on Acoustics, Speech, and Signal Processing, 2006.
[Sejnowski and Rosenberg, 1987] T. J. Sejnowski and C. R. Rosenberg. Parallel networks that learn to pronounce English text. Complex Systems, 1:145-168, 1987.
[Siddiqi and Moore, 2005] Sajid Siddiqi and Andrew Moore. Fast inference and learning in large-state-space HMMs. In ICML, 2005.
[Tsochantaridis et al., 2004] Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. Support vector machine learning for interdependent and structured output spaces. In ICML, 2004.