Semantic Compositionality through Recursive Matrix-Vector Spaces
Richard Socher, Brody Huval, Christopher D. Manning, Andrew Y. Ng
Computer Science Department, Stanford University

Abstract

Single-word vector space models have been very successful at learning lexical information. However, they cannot capture the compositional meaning of longer phrases, preventing them from a deeper understanding of language. We introduce a recursive neural network (RNN) model that learns compositional vector representations for phrases and sentences of arbitrary syntactic type and length. Our model assigns a vector and a matrix to every node in a parse tree: the vector captures the inherent meaning of the constituent, while the matrix captures how it changes the meaning of neighboring words or phrases. This matrix-vector RNN can learn the meaning of operators in propositional logic and natural language. The model obtains state of the art performance on three different experiments: predicting fine-grained sentiment distributions of adverb-adjective pairs; classifying sentiment labels of movie reviews; and classifying semantic relationships such as cause-effect or topic-message between nouns using the syntactic path between them.

1 Introduction

Semantic word vector spaces are at the core of many useful natural language applications such as search query expansions (Jones et al., 2006), fact extraction for information retrieval (Paşca et al., 2006) and automatic annotation of text with disambiguated Wikipedia links (Ratinov et al., 2011), among many others (Turney and Pantel, 2010). In these models the meaning of a word is encoded as a vector computed from co-occurrence statistics of a word and its neighboring words. Such vectors have been shown to correlate well with human judgments of word similarity (Griffiths et al., 2007).

Figure 1: A recursive neural network which learns semantic vector representations of phrases in a tree structure. Each word and phrase is represented by a vector and a matrix, e.g., very = (a, A). The matrix is applied to neighboring vectors. The same function is repeated to combine the phrase very good with movie.

Despite their success, single word vector models are severely limited since they do not capture compositionality, the important quality of natural language that allows speakers to determine the meaning of a longer expression based on the meanings of its words and the rules used to combine them (Frege, 1892). This prevents them from gaining a deeper understanding of the semantics of longer phrases or sentences. Recently, there has been much progress in capturing compositionality in vector spaces, e.g., (Mitchell and Lapata, 2010; Baroni and Zamparelli, 2010; Zanzotto et al., 2010; Yessenalina and Cardie, 2011; Socher et al., 2011c) (see related work). We extend these approaches with a more general and powerful model of semantic composition.

We present a novel recursive neural network model for semantic compositionality. In our context, compositionality is the ability to learn compositional vector representations for various types of phrases and sentences of arbitrary length. Fig. 1 shows an illustration of the model in which each constituent (a word or longer phrase) has a matrix-vector (MV)
representation. The vector captures the meaning of that constituent. The matrix captures how it modifies the meaning of the other word that it combines with. A representation for a longer phrase is computed bottom-up by recursively combining the words according to the syntactic structure of a parse tree. Since the model uses the MV representation with a neural network as the final merging function, we call our model a matrix-vector recursive neural network (MV-RNN).

We show that the ability to capture semantic compositionality in a syntactically plausible way translates into state of the art performance on various tasks. The first experiment demonstrates that our model can learn fine-grained semantic compositionality. The task is to predict a sentiment distribution over movie reviews of adverb-adjective pairs such as unbelievably sad or really awesome. The MV-RNN is the only model that is able to properly negate sentiment when adjectives are combined with not. The MV-RNN outperforms previous state of the art models on full sentence sentiment prediction of movie reviews. The last experiment shows that the MV-RNN can also be used to find relationships between words using the learned phrase vectors. The relationship between words is recursively constructed and composed by words of arbitrary type in the variable length syntactic path between them. On the associated task of classifying relationships between nouns in arbitrary positions of a sentence, the model outperforms all previous approaches on the SemEval-2010 Task 8 competition (Hendrickx et al., 2010). It outperforms all but one of the previous approaches without using any hand-designed semantic resources such as WordNet or FrameNet. By adding WordNet hypernyms, POS and NER tags our model outperforms the state of the art that uses significantly more resources. The code for our model is available online.

2 MV-RNN: A Recursive Matrix-Vector Model

The dominant approach for building representations of multi-word units from single word vector representations has been to form a linear combination of the single word representations, such as a sum or weighted average. This happens in information retrieval and in various text similarity functions based on lexical similarity. These approaches can work well when the meaning of a text is literally the sum of its parts, but fail when words function as operators that modify the meaning of another word: the meaning of extremely strong cannot be captured as the sum of word representations for extremely and strong.

The model of Socher et al. (2011c) provided a new possibility for moving beyond a linear combination, through use of a matrix W that multiplied the word vectors (a, b), and a nonlinearity function g (such as a sigmoid or tanh). They compute the parent vector p that describes both words as

p = g(W [a; b]) (1)

and apply this function recursively inside a binarized parse tree so that it can compute vectors for multiword sequences. Even though the nonlinearity allows the model to express a wider range of functions, it is almost certainly too much to expect a single fixed W matrix to be able to capture the meaning combination effects of all natural language operators. After all, inside the function g, we have the same linear transformation for all possible pairs of word vectors.
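To make Eq. (1) concrete, here is a minimal numpy sketch of this single-matrix composition. The dimensionality n, the tanh nonlinearity and the random initialization are illustrative assumptions, not the exact settings of the paper.

```python
import numpy as np

n = 4                                          # illustrative vector dimensionality
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(n, 2 * n))     # the single, shared composition matrix

def compose(a, b, g=np.tanh):
    """Eq. (1): p = g(W [a; b]); the same W is used for every word pair."""
    return g(W @ np.concatenate([a, b]))

a, b = rng.normal(size=n), rng.normal(size=n)  # two word vectors
p = compose(a, b)                              # parent vector, same dimensionality as a and b
```

Because p has the same dimensionality as a and b, the same function can be applied again to combine p with the next word, which is what makes the model recursive.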
Recent work has started to capture the behavior of natural language operators inside semantic vector spaces by modeling them as matrices, which would allow a matrix for extremely to appropriately modify vectors for smelly or strong (Baroni and Zamparelli, 2010; Zanzotto et al., 2010). These approaches are along the right lines but so far have been restricted to capturing linear functions of pairs of words, whereas we would like nonlinear functions to compute compositional meaning representations for multi-word phrases or full sentences.

The MV-RNN combines the strengths of both of these ideas by (i) assigning a vector and a matrix to every word and (ii) learning an input-specific, nonlinear, compositional function for computing vector and matrix representations for multi-word sequences of any syntactic type. Assigning vector-matrix representations to all words instead of only to words of one part of speech category allows for greater flexibility which benefits performance. If a word lacks operator semantics, its matrix can be an identity matrix. However, if a word acts mainly as an operator,
such as extremely, its vector can become close to zero, while its matrix gains a clear operator meaning, here magnifying the meaning of the modified word in both positive and negative directions. In this section we describe the initial word representations, the details of combining two words as well as the multi-word extensions. This is followed by an explanation of our training procedure.

2.1 Matrix-Vector Neural Word Representation

We represent a word as both a continuous vector and a matrix of parameters. We initialize all word vectors x ∈ R^n with pre-trained 50-dimensional word vectors from the unsupervised model of Collobert and Weston (2008). Using Wikipedia text, their model learns word vectors by predicting how likely it is for each word to occur in its context. Similar to other local co-occurrence based vector space models, the resulting word vectors capture syntactic and semantic information. Every word is also associated with a matrix X. In all experiments, we initialize matrices as X = I + ε, i.e., the identity plus a small amount of Gaussian noise. If the vectors have dimensionality n, then each word's matrix has dimensionality X ∈ R^{n×n}. While the initialization is random, the vectors and matrices will subsequently be modified to enable a sequence of words to compose a vector that can predict a distribution over semantic labels. Henceforth, we represent any phrase or sentence of length m as an ordered list of vector-matrix pairs ((a, A), ..., (m, M)), where each pair is retrieved based on the word at that position.

2.2 Composition Models for Two Words

We first review composition functions for two words. In order to compute a parent vector p from two consecutive words and their respective vectors a and b, Mitchell and Lapata (2010) give as their most general function: p = f(a, b, R, K), where R is the a-priori known syntactic relation and K is background knowledge. There are many possible functions f. For our models, there is a constraint on p, which is that it has the same dimensionality as each of the input vectors. This way, we can compare p easily with its children and p can be the input to a composition with another word. The latter is a requirement that will become clear in the next section. This excludes tensor products, which were outperformed by simpler weighted addition and multiplication methods in (Mitchell and Lapata, 2010). We will explore methods that do not require any manually designed semantic resources as background knowledge K. No explicit knowledge about the type of relation R is used. Instead we want the model to capture this implicitly via the learned matrices. We propose the following combination function, which is input dependent:

p = f_{A,B}(a, b) = f(Ba, Ab) = g(W [Ba; Ab]), (2)

where A, B are matrices for single words, and the global W ∈ R^{n×2n} is a matrix that maps both transformed words back into the same n-dimensional space. The element-wise function g could be simply the identity function, but we use instead a nonlinearity such as the sigmoid or hyperbolic tangent tanh. Such a nonlinearity will allow us to approximate a wider range of functions beyond purely linear functions. We can also add a bias term before applying g but omit this for clarity. Rewriting the two transformed vectors as one vector z, we get p = g(Wz), which is a single layer neural network. In this model, the word matrices can capture compositional effects specific to each word, whereas W captures a general composition function.
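The following sketch, building on the one above, illustrates the word representation of Sec. 2.1 (X = I plus Gaussian noise) and the input-dependent composition of Eq. (2). Again, the dimensionality, noise scale and initialization are assumptions chosen for illustration.

```python
import numpy as np

n = 4
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(n, 2 * n))     # global composition matrix of Eq. (2)

def init_word(pretrained_vec, eps=0.01):
    """Sec. 2.1: a word is a pair (x, X) with X initialized as the identity plus noise."""
    x = np.asarray(pretrained_vec, dtype=float)
    X = np.eye(n) + eps * rng.normal(size=(n, n))
    return x, X

def mv_compose(a, A, b, B, g=np.tanh):
    """Eq. (2): p = g(W [B a; A b]); each word's matrix first transforms its neighbor's vector."""
    return g(W @ np.concatenate([B @ a, A @ b]))

a, A = init_word(rng.normal(size=n))           # e.g. "extremely"
b, B = init_word(rng.normal(size=n))           # e.g. "strong"
p = mv_compose(a, A, b, B)
```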
This function builds upon and generalizes several recent models in the literature. The most related work is that of (Mitchell and Lapata, 2010; Zanzotto et al., 2010), who introduced and explored the composition function p = Ba + Ab for word pairs. This model is a special case of Eq. 2 when we set W = [I I] (i.e., two concatenated identity matrices) and g(x) = x (the identity function). Baroni and Zamparelli (2010) computed the parent vector of adjective-noun pairs by p = Ab, where A is an adjective matrix and b is a vector for a noun. This cannot capture nouns modifying other nouns, e.g., disk drive. This model too is a special case of the above model, with B set to the n × n zero matrix. Lastly, the models of (Socher et al., 2011b; Socher et al., 2011c; Socher et al., 2011a) as described above are also special cases with both A and B set to the identity matrix. We will compare to these special cases in our experiments.
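A quick numerical check, under the same illustrative setup as above, confirms how the two special cases fall out of Eq. (2) when W and g are fixed as described:

```python
import numpy as np

n = 4
rng = np.random.default_rng(1)
a, b = rng.normal(size=n), rng.normal(size=n)
A, B = rng.normal(size=(n, n)), rng.normal(size=(n, n))

W_add = np.hstack([np.eye(n), np.eye(n)])      # W = [I I]
identity = lambda x: x                         # g(x) = x

# Special case 1: W = [I I], g = identity  =>  p = Ba + Ab (Mitchell & Lapata; Zanzotto et al.)
p1 = identity(W_add @ np.concatenate([B @ a, A @ b]))
assert np.allclose(p1, B @ a + A @ b)

# Special case 2: additionally B = 0 (zero matrix)  =>  p = Ab (Baroni & Zamparelli)
B_zero = np.zeros((n, n))
p2 = identity(W_add @ np.concatenate([B_zero @ a, A @ b]))
assert np.allclose(p2, A @ b)
```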
Figure 2: Example of how the MV-RNN merges a phrase with another word at a nonterminal node of a parse tree.

2.3 Recursive Compositions of Multiple Words and Phrases

This section describes how we extend a word-pair matrix-vector-based compositional model to learn vectors and matrices for longer sequences of words. The main idea is to apply the same function f to pairs of constituents in a parse tree. For this to work, we need to take as input a binary parse tree of a phrase or sentence and also compute matrices at each nonterminal parent node. The function f can be readily used for phrase vectors since it is recursively compatible (p has the same dimensionality as its children). For computing nonterminal phrase matrices, we define the function

P = f_M(A, B) = W_M [A; B], (3)

where W_M ∈ R^{n×2n}, so P ∈ R^{n×n} just like each input matrix. After two words form a constituent in the parse tree, this constituent can now be merged with another one by applying the same functions f and f_M. For instance, to compute the vectors and matrices depicted in Fig. 2, we first merge words a and b and their matrices: p_1 = f(Ba, Ab), P_1 = f_M(A, B). The resulting vector-matrix pair (p_1, P_1) can now be used to compute the full phrase when combining it with word c and computing p_2 = f(Cp_1, P_1 c), P_2 = f_M(P_1, C). The model computes vectors and matrices in a bottom-up fashion, applying the functions f, f_M to its own previous output (i.e. recursively) until it reaches the top node of the tree, which represents the entire sentence. For experiments with longer sequences we will compare to standard RNNs and the special case of the MV-RNN that computes the parent by p = Ab + Ba, which we name the linear Matrix-Vector Recursion model (linear MVR). Previously, this model had not been trained for multi-word sequences. Sec. 6 talks about alternatives for compositionality.

2.4 Objective Functions for Training

One of the advantages of RNN-based models is that each node of a tree has associated with it a distributed vector representation (the parent vector p) which can also be seen as features describing that phrase. We train these representations by adding on top of each parent node a simple softmax classifier to predict a class distribution over, e.g., sentiment or relationship classes: d(p) = softmax(W_label p). If there are K labels, then d ∈ R^K is a K-dimensional multinomial distribution. For the applications below (excluding logic), the corresponding error function E(s, t, θ) that we minimize for a sentence s and its tree t is the sum of cross-entropy errors at all nodes. The only other methods that use this type of objective function are (Socher et al., 2011b; Socher et al., 2011c), who also combine it with either a score or reconstruction error. Hence, for comparisons to other related work, we need to merge variations of computing the parent vector p with this classifier. The main difference is that the MV-RNN has more flexibility since it has an input-specific recursive function f_{A,B} to compute each parent. In the following applications, we will use the softmax classifier to predict both sentiment distributions and noun-noun relationships.
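A minimal sketch of the bottom-up computation of Secs. 2.3-2.4 follows: Eq. (2) for vectors, Eq. (3) for matrices, and a softmax classifier at every node. The tree encoding (nested Python lists), the dimensions and the random parameters are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

n, K = 4, 3                                    # illustrative vector size and number of classes
rng = np.random.default_rng(0)
W       = rng.normal(scale=0.1, size=(n, 2 * n))   # Eq. (2): combines child vectors
W_M     = rng.normal(scale=0.1, size=(n, 2 * n))   # Eq. (3): combines child matrices
W_label = rng.normal(scale=0.1, size=(K, n))       # softmax classifier applied at each node

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def word(vec, eps=0.01):
    """A leaf: vector x plus matrix X = I + noise (Sec. 2.1)."""
    return np.asarray(vec, dtype=float), np.eye(n) + eps * rng.normal(size=(n, n))

def forward(node, predictions):
    """node is either a (vector, matrix) leaf or a [left, right] list (binarized parse tree).
    Returns the node's (p, P) pair and appends softmax(W_label p) for every constituent."""
    if isinstance(node, list):                 # internal node: merge the two children
        a, A = forward(node[0], predictions)
        b, B = forward(node[1], predictions)
        p = np.tanh(W @ np.concatenate([B @ a, A @ b]))   # Eq. (2)
        P = W_M @ np.vstack([A, B])                       # Eq. (3)
    else:
        p, P = node                            # leaf word
    predictions.append(softmax(W_label @ p))   # Sec. 2.4: class distribution at this node
    return p, P

# "very good movie" with the binarized structure ((very good) movie)
very, good, movie = (word(rng.normal(size=n)) for _ in range(3))
preds = []
top_vec, top_mat = forward([[very, good], movie], preds)   # preds holds one distribution per node
```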
2.5 Learning

Let θ = (W, W_M, W_label, L, L_M) be our model parameters and λ a vector with regularization hyperparameters for all model parameters. L and L_M are the sets of all word vectors and word matrices. The gradient of the overall objective function J becomes:

∂J/∂θ = (1/N) Σ_{(x,t)} ∂E(x, t; θ)/∂θ + λθ. (4)

To compute this gradient, we first compute all tree nodes (p_i, P_i) from the bottom up and then take derivatives of the softmax classifiers at each node in the tree from the top down. Derivatives are computed efficiently via backpropagation through structure (Goller and Küchler, 1996).
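As a rough sketch of what E(s, t, θ) and the regularized objective of Eq. (4) amount to for one sentence (the actual gradients are obtained with backpropagation through structure, not shown here; the function names are my own):

```python
import numpy as np

def cross_entropy(gold_dist, predicted_dist, eps=1e-12):
    """Cross-entropy between a gold label distribution and a node's softmax output."""
    return -float(np.sum(gold_dist * np.log(predicted_dist + eps)))

def sentence_error(node_predictions, node_targets, parameters, lam=1e-4):
    """E(s, t, theta): sum of cross-entropy errors over all tree nodes (Sec. 2.4),
    with an L2 penalty standing in for the lambda * theta term of Eq. (4)."""
    ce = sum(cross_entropy(t, p) for p, t in zip(node_predictions, node_targets))
    reg = 0.5 * lam * sum(float(np.sum(w ** 2)) for w in parameters)
    return ce + reg
```

Here node_predictions would be the per-node distributions collected during the bottom-up pass (e.g., the preds list in the sketch above), and the overall J averages this error over all training trees.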
Even though the objective is not convex, we found that L-BFGS run over the complete training data (batch mode) minimizes the objective well in practice and convergence is smooth. For more information see (Socher et al., 2010).

2.6 Low-Rank Matrix Approximations

If every word is represented by an n-dimensional vector and additionally by an n × n matrix, the dimensionality of the whole model may become too large with commonly used vector sizes of n = 100. In order to reduce the number of parameters, we represent word matrices by the following low-rank plus diagonal approximation:

A = UV + diag(a), (5)

where U ∈ R^{n×r}, V ∈ R^{r×n}, a ∈ R^n, and we use the same fixed rank r in all experiments.
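A small sketch of the low-rank plus diagonal parameterization of Eq. (5); the concrete sizes n and r below are illustrative choices, not values taken from the paper.

```python
import numpy as np

n, r = 100, 3                       # illustrative: vector size and small matrix rank
rng = np.random.default_rng(0)

U = rng.normal(scale=0.01, size=(n, r))
V = rng.normal(scale=0.01, size=(r, n))
a = np.ones(n)                      # a diagonal near one keeps A close to the identity initially

A = U @ V + np.diag(a)              # Eq. (5): the word matrix actually used in composition

# Parameters per word matrix drop from n*n to 2*n*r + n
full_params, lowrank_params = n * n, 2 * n * r + n    # 10000 vs. 700 for these sizes
```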
2.7 Discussion: Evaluation and Generality

Evaluation of compositional vector spaces is a complex task. Most related work compares similarity judgments of unsupervised models to those of human judgments and aims at high correlation. These evaluations can give important insights. However, even with good correlation the question remains how these models would perform on downstream NLP tasks such as sentiment detection. We experimented with unsupervised learning of general vector-matrix representations by having the MV-RNN predict words in their correct context. Initializing the models with these general representations did not improve the performance on the tasks we consider. For sentiment analysis, this is not surprising since antonyms often get similar vectors during unsupervised learning from co-occurrences due to high similarity of local syntactic contexts. In our experiments, the high prediction performance came from supervised learning of meaning representations using labeled data. While these representations are task-specific, they could be used across tasks in a multi-task learning setup. However, in order to fairly compare to related work, we use only the supervised data of each task. Before we describe our full-scale experiments, we analyze the model's expressive powers.

3 Model Analysis

This section analyzes the model with two proof-of-concept studies. First, we examine its ability to learn operator semantics for adverb-adjective pairs. If a model cannot correctly capture how an adverb operates on the meaning of adjectives, then there is little chance it can learn operators for more complex relationships. The second study analyzes whether the MV-RNN can learn simple boolean operators of propositional logic such as conjunctives or negation from truth values. Again, if a model did not have this ability, then there is little chance it could learn these frequently occurring phenomena from the noisy language of real texts such as movie reviews.

3.1 Predicting Sentiment Distributions of Adverb-Adjective Pairs

The first study considers the prediction of fine-grained sentiment distributions of adverb-adjective pairs and analyzes different possibilities for computing the parent vector p. The results show that the MV-RNN operators are powerful enough to capture the operational meanings of various types of adverbs. For example, very is an intensifier, pretty is an attenuator, and not can negate or strongly attenuate the positivity of an adjective. For instance not great is still pretty good and not terrible; see Potts (2010) for details. We use a publicly available IMDB dataset of adverb-adjective pairs extracted from movie reviews. The dataset provides the distribution over star ratings: each consecutive word pair appears a certain number of times in reviews that also have an overall rating of the movie associated with them. After normalizing by the total number of occurrences, one gets a multinomial distribution over ratings. Only word pairs that appear at least 5 times are kept. Of the remaining pairs, we use 4211 randomly sampled ones for training and a separate set of 184 for testing. We never give the algorithm sentiment distributions for single words, and, while single words overlap between training and testing, the test set consists of never before seen word pairs. The softmax classifier is trained to minimize the cross-entropy error. Hence, an evaluation in terms of KL-divergence is the most reasonable choice. It is defined as KL(g||p) = Σ_i g_i log(g_i/p_i), where g is the gold distribution and p is the predicted one.
Figure 3: Left: Average KL-divergence for predicting sentiment distributions of unseen adverb-adjective pairs of the test set (see text for descriptions of each p; lower is better). The main difference in the KL-divergence comes from the few negation pairs in the test set. Right: Predicting sentiment distributions (over 1-10 stars on the x-axis) of adverb-adjective pairs. Each row has the same adverb and each column the same adjective. Many predictions are similar between the two models. The RNN and linear MVR are not able to modify the sentiment correctly: not awesome is more positive than fairly awesome and not annoying has a similar shape as unbelievably annoying. Predictions of the linear MVR model are almost identical to the standard RNN for these examples.

We compare to several baselines and ablations of the MV-RNN model. An (adverb, adjective) pair is described by its vectors (a, b) and matrices (A, B):
1. p = 0.5(a + b), vector average
2. p = a ⊙ b, element-wise vector multiplication
3. p = [a; b], vector concatenation
4. p = Ab, similar to (Baroni and Lenci, 2010)
5. p = g(W [a; b]), RNN, similar to Socher et al. (2011c)
6. p = Ab + Ba, linear MVR, similar to (Mitchell and Lapata, 2010; Zanzotto et al., 2010)
7. p = g(W [Ba; Ab]), MV-RNN

The final distribution is always predicted by a softmax classifier whose inputs p vary for each of the models. This objective function (see Sec. 2.4) is different from all previously published work except that of (Socher et al., 2011c). We cross-validated all models over regularization parameters for word vectors, the softmax classifier, the parameter W and the word operators (10^{-4}, 10^{-3}) and word vector sizes (n = 6, 8, 10, 12, 15, 20). All models performed best at vector sizes below 12. Hence, it is the model's power and not the number of parameters that determines the performance. The table in Fig. 3 shows the average KL-divergence on the test set. It shows that the idea of matrix-vector representations for all words and having a nonlinearity are both important. The MV-RNN, which combines these two ideas, is best able to learn the various compositional effects. The main difference in KL-divergence comes from the few negation cases in the test set. Fig. 3 shows examples of predicted distributions. Many of the predictions are accurate and similar between the top models. However, only the MV-RNN has enough expressive power to allow negation to completely shift the sentiment with respect to an adjective. A negated adjective carrying negative sentiment becomes slightly positive, whereas not awesome is correctly attenuated. All three top models correctly capture the U-shape of unbelievably sad. This pair peaks at both the negative and positive spectrum because it is ambiguous. When referring to the performance of actors, it is very negative, but, when talking about the plot, many people enjoy sad and thought-provoking movies. The p = Ab model does not perform well because it cannot model the fact that for an adjective like sad, the operator of unbelievably behaves differently.
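The evaluation metric above is straightforward to compute; a small sketch follows (the example distributions are made up):

```python
import numpy as np

def kl_divergence(gold, predicted, eps=1e-12):
    """KL(g || p) = sum_i g_i log(g_i / p_i) over the star-rating bins."""
    gold, predicted = np.asarray(gold, float), np.asarray(predicted, float)
    mask = gold > 0                          # bins with g_i = 0 contribute nothing
    return float(np.sum(gold[mask] * np.log(gold[mask] / (predicted[mask] + eps))))

# A gold rating distribution over 10 stars vs. the "uniform" baseline prediction
gold = np.array([0.05, 0.05, 0.10, 0.10, 0.10, 0.10, 0.15, 0.15, 0.10, 0.10])
pred = np.full(10, 0.1)
kl = kl_divergence(gold, pred)               # averaged over all test pairs in the paper
```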
Figure 4: Training trees for the MV-RNN to learn propositional operators. The model learns vectors and operators for ∧ (and) and ¬ (negation). The model outputs the exact representations of the corresponding truth values at the top node. Hence, the operators can be combined recursively an arbitrary number of times for more complex logical functions.

3.2 Logic- and Vector-based Compositionality

Another natural question is whether the MV-RNN can, in general, capture some of the simple boolean logic that is sometimes found in language. In other words, can it learn some of the propositional logic operators such as and, or, not in terms of vectors and matrices from a few examples? Answering this question can also be seen as a first step towards bridging the gap between logic-based, formal semantics (Montague, 1974) and vector space models. The logic-based view of language accounts nicely for compositionality by directly mapping syntactic constituents to lambda calculus expressions. At the word level, the focus is on function words, and nouns and adjectives are often defined only in terms of the sets of entities they denote in the world. Most words are treated as atomic symbols with no relation to each other. There have been many attempts at automatically parsing natural language to a logical form using recursive compositional rules. Conversely, vector space models have the attractive property that they can automatically extract knowledge from large corpora without supervision. Unlike logic-based approaches, these models allow us to make fine-grained statements about the semantic similarity of words which correlate well with human judgments (Griffiths et al., 2007). Logic-based approaches are often seen as orthogonal to distributional vector-based approaches. However, Garrette et al. (2011) recently introduced a combination of a vector space model inside a Markov Logic Network. One open question is whether vector-based models can learn some of the simple logic encountered in language such as negation or conjunctives. To this end, we illustrate in a simple example that our MV-RNN model and its learned word matrices (operators) have the ability to learn propositional logic operators such as ∧, ∨, ¬ (and, or, not). This is a necessary (though not sufficient) condition for the ability to pick up these phenomena in real datasets and tasks such as sentiment detection, which we focus on in the subsequent sections.

Our setup is as follows. We train on 6 strictly right-branching trees as in Fig. 4. We consider the 1-dimensional case and fix the representation for true to (t = 1, T = 1) and for false to (f = 0, F = 1). Fixing the operators to the 1 × 1 identity matrix is essentially ignoring them. The objective is then to create a perfect reconstruction of (t, T) or (f, F) (depending on the formula), which we achieve by the least squares error between the top vector's representation and the corresponding truth value, e.g., for a formula that evaluates to true: min ||p_top - t||^2 + ||P_top - T||^2. As our function g (see Eq. 2), we use a linear threshold unit: g(x) = max(min(x, 1), 0). Giving the derivatives computed for the objective function for the examples in Fig. 4 to a standard L-BFGS optimizer quickly yields a training error of 0. Hence, the output of these 6 examples has exactly one of the truth representations, making it recursively compatible with further combinations of operators. Thus, we can combine these operators to construct any propositional logic function of any number of inputs (including xor). Hence, this MV-RNN is complete in terms of propositional logic.
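To see why such a solution exists at all, here is a hand-constructed (not learned) 1-dimensional negation operator under the truth encoding above. The specific parameter values are my own illustrative choice; the paper obtains its operators with L-BFGS rather than by hand.

```python
import numpy as np

def g(x):
    """Linear threshold unit from the logic experiment: clip to [0, 1]."""
    return np.clip(x, 0.0, 1.0)

# 1-d truth encoding from the text: true = (t=1, T=1), false = (f=0, F=1)
TRUE, FALSE = (1.0, 1.0), (0.0, 1.0)

# Hand-picked (not learned) parameters that realize negation in this encoding.
W   = np.array([1.0, -1.0])    # combines [X*n_vec, N*x], the 1-d analogue of Eq. (2)
W_M = np.array([0.0, 1.0])     # combines [N, X], the 1-d analogue of Eq. (3)
NOT = (1.0, 1.0)               # hypothetical (vector, matrix) pair for the word "not"

def apply_not(operand):
    (x, X), (n_vec, N) = operand, NOT
    p = g(W @ np.array([X * n_vec, N * x]))    # parent vector
    P = W_M @ np.array([N, X])                 # parent matrix
    return (float(p), float(P))

assert apply_not(TRUE) == FALSE                # not true  -> false
assert apply_not(FALSE) == TRUE                # not false -> true
assert apply_not(apply_not(TRUE)) == TRUE      # operators compose recursively
```

The key point is that the output pairs are again exactly (t, T) or (f, F), so the operator can be applied to its own output, as the double-negation check shows.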
4 Predicting Movie Review Ratings

In this section, we analyze the model's performance on full length sentences. We compare to previous state of the art methods on a standard benchmark dataset of movie reviews (Pang and Lee, 2005; Nakagawa et al., 2010; Socher et al., 2011c). This dataset consists of 10,000 positive and negative single sentences describing movie sentiment. In this and the next experiment we use binarized trees from the Stanford Parser (Klein and Manning, 2003). We use the exact same setup and parameters (regularization, word vector size, etc.) as the published code of Socher et al. (2011c).
Method  Acc.
Tree-CRF (Nakagawa et al., 2010)  77.3
RAE (Socher et al., 2011c)  77.7
Linear MVR  77.1
MV-RNN  79.0
Table 1: Accuracy of classification on full length movie review polarity (MR).

S.  C.  Review sentence
1   √   The film is bright and flashy in all the right ways.
0   √   Not always too whimsical for its own good this strange hybrid of crime thriller, quirky character study, third-rate romance and female empowerment fantasy never really finds the tonal or thematic glue it needs.
0   √   Doesn't come close to justifying the hype that surrounded its debut at the Sundance film festival two years ago.
0   x   Director Hoffman, his writer and Kline's agent should serve detention.
1   x   A bodice-ripper for intellectuals.
Table 2: Hard movie review examples of positive (1) and negative (0) sentiment (S.) that of all methods only the MV-RNN predicted correctly (C: √) or could not classify as correct either (C: x).

Table 1 shows comparisons to the system of (Nakagawa et al., 2010), a dependency tree based classification method that uses CRFs with hidden variables. The state of the art recursive autoencoder model of Socher et al. (2011c) obtained 77.7% accuracy. Our new MV-RNN gives the highest performance, outperforming also the linear MVR (Sec. 2.2). Table 2 shows several hard examples that only the MV-RNN was able to classify correctly. None of the methods correctly classified the last two examples, which require more world knowledge.

5 Classification of Semantic Relationships

The previous task considered global classification of an entire phrase or sentence. In our last experiment we show that the MV-RNN can also learn how a syntactic context composes an aggregate meaning of the semantic relationships between words. In particular, the task is finding semantic relationships between pairs of nominals. For instance, in the sentence My [apartment]_e1 has a pretty large [kitchen]_e2, we want to predict that the kitchen and apartment are in a component-whole relationship. Predicting such semantic relations is useful for information extraction and thesaurus construction applications.

Figure 5: The MV-RNN learns vectors in the path connecting two words (dotted lines) to determine their semantic relationship. It takes into consideration a variable length sequence of various word types in that path. In the depicted example, the path spanning the [movie] showed [wars] is classified as Message-Topic.

Many approaches use features for all words on the path between the two words of interest. We show that by building a single compositional semantics for the minimal constituent including both terms one can achieve a higher performance. This task requires the ability to deal with sequences of words of arbitrary type and length in between the two nouns in question. Fig. 5 explains our method for classifying nominal relationships. We first find the path in the parse tree between the two words whose relation we want to classify. We then select the highest node of the path and classify the relationship using that node's vector as features. We apply the same type of MV-RNN model as in sentiment to the subtree spanned by the two words. We use the dataset and evaluation framework of SemEval-2010 Task 8 (Hendrickx et al., 2010). There are 9 ordered relationships (with two directions) and an undirected other class, resulting in 19 classes. Among the relationships are: message-topic, cause-effect, instrument-agency (see Table 3 for a list). A pair is counted as correct if the order of the words in the relationship is correct.
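A sketch of the path-selection step described above: the highest node on the path between the two nouns is their lowest common ancestor in the binarized parse tree, and its MV-RNN vector is what gets classified. The dictionary-based tree encoding is a hypothetical stand-in for the parser output.

```python
def lowest_common_ancestor(root, leaf_a, leaf_b):
    """Return the highest node on the path between two leaves of a binary parse tree,
    i.e. the root of the minimal subtree spanning both target words.
    A node is either {'children': [left, right]} or a leaf {'word': str}."""
    def contains(node, word):
        if 'word' in node:
            return node['word'] == word
        return any(contains(c, word) for c in node['children'])

    node = root
    while 'children' in node:
        left, right = node['children']
        if contains(left, leaf_a) and contains(left, leaf_b):
            node = left
        elif contains(right, leaf_a) and contains(right, leaf_b):
            node = right
        else:
            return node        # the two words split here: this is their LCA
    return node

# ((my apartment) (has (a (pretty (large kitchen)))))
leaf = lambda w: {'word': w}
tree = {'children': [
    {'children': [leaf('my'), leaf('apartment')]},
    {'children': [leaf('has'),
                  {'children': [leaf('a'),
                                {'children': [leaf('pretty'),
                                              {'children': [leaf('large'), leaf('kitchen')]}]}]}]},
]}
subtree_root = lowest_common_ancestor(tree, 'apartment', 'kitchen')   # the whole sentence here
# The MV-RNN vector computed at subtree_root is then fed to the relationship classifier.
```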
Table 4 lists results for several competing methods together with the resources and features used by each method. We compare to the systems of the competition, which are described in Hendrickx et al. (2010), as well as the RNN and linear MVR. Most systems used a considerable amount of hand-designed semantic resources. In contrast to these methods, the MV-RNN only needs a parser for the tree structure and learns all semantics from unlabeled corpora and the training data. Only the SemEval training dataset is specific to this task; the remaining inputs and the training setup are the same as in the previous sentiment experiments.
Relationship  Sentence with labeled nouns for which to predict relationships
Cause-Effect(e2,e1)  Avian [influenza]_e1 is an infectious disease caused by type A strains of the influenza [virus]_e2.
Entity-Origin(e1,e2)  The [mother]_e1 left her native [land]_e2 about the same time and they were married in that city.
Message-Topic(e2,e1)  Roadside [attractions]_e1 are frequently advertised with [billboards]_e2 to attract tourists.
Product-Producer(e1,e2)  A child is told a [lie]_e1 for several years by their [parents]_e2 before he/she realizes that...
Entity-Destination(e1,e2)  The accident has spread [oil]_e1 into the [ocean]_e2.
Member-Collection(e2,e1)  The siege started, with a [regiment]_e1 of lightly armored [swordsmen]_e2 ramming down the gate.
Instrument-Agency(e2,e1)  The core of the [analyzer]_e1 identifies the paths using the constraint propagation [method]_e2.
Component-Whole(e2,e1)  The size of a [tree]_e1 [crown]_e2 is strongly correlated with the growth of the tree.
Content-Container(e1,e2)  The hidden [camera]_e1, found by a security guard, was hidden in a business card-sized [leaflet box]_e2 placed at an unmanned ATM in Tokyo's Minato ward in early September.
Table 3: Examples of correct classifications of ordered, semantic relations between nouns by the MV-RNN. Note that the final classifier is a recursive, compositional function of all the words in the syntactic path between the bracketed words. The paths vary in length and the words vary in type.

Classifier  Feature sets  F1
SVM  POS, stemming, syntactic patterns
SVM  word pair, words in between  72.5
SVM  POS, WordNet, stemming, syntactic patterns  74.8
SVM  POS, WordNet, morphological features, thesauri, Google n-grams  77.6
MaxEnt  POS, WordNet, morphological features, noun compound system, thesauri, Google n-grams  77.6
SVM  POS, WordNet, prefixes and other morphological features, dependency parse features, Levin classes, PropBank, FrameNet, NomLex-Plus, Google n-grams, paraphrases, TextRunner  82.2
RNN, Lin.MVR, MV-RNN  no additional features (see text)
RNN  POS, WordNet, NER  77.6
Lin.MVR  POS, WordNet, NER  78.7
MV-RNN  POS, WordNet, NER  82.4
Table 4: Learning methods, their feature sets and F1 results for predicting semantic relations between nouns. The MV-RNN outperforms all but one method without any additional feature sets. By adding three such features, it obtains state of the art performance.

The best method on this dataset (Rink and Harabagiu, 2010) obtains 82.2% F1. In order to see whether our system can improve over this system, we added three features to the MV-RNN vector and trained another softmax classifier. The features and their performance increases were POS tags (+0.9), WordNet hypernyms (+1.3) and named entity tags (NER) of the two words (+0.6). Features were computed using the code of Ciaramita and Altun (2006).³ With these features, the performance improved over the state of the art system. Table 3 shows random correct classification examples.

6 Related work

Distributional approaches have become omnipresent for the recognition of semantic similarity between words, and the treatment of compositionality has seen much progress in recent years. Hence, we cannot do justice to the large amount of literature. Commonly, single words are represented as vectors of distributional characteristics, e.g., their frequencies in specific syntactic relations or their co-occurrences with given context words (Pado and Lapata, 2007; Baroni and Lenci, 2010; Turney and Pantel, 2010).
These representations have proven very effective in sense discrimination and disambiguation (Schütze, 1998), automatic thesaurus extraction (Lin, 1998; Curran, 2004) and selectional preferences. There are several sophisticated ideas for compositionality in vector spaces. Mitchell and Lapata (2010) present an overview of the most important compositional models, from simple vector addition and component-wise multiplication to tensor products and convolution (Metcalfe, 1990). They measured the similarity between word pairs such as compound nouns or verb-object pairs and compared these with human similarity judgments. Simple vector averaging or multiplication performed best, hence our focus on related baselines above.

³ sourceforge.net/projects/supersensetag/
Other important models are tensor products (Clark and Pulman, 2007), quantum logic (Widdows, 2008), holographic reduced representations (Plate, 1995) and the Compositional Matrix Space model (Rudolph and Giesbrecht, 2010). RNNs are related to autoencoder models such as the recursive autoassociative memory (RAAM) (Pollack, 1990) or recurrent neural networks (Elman, 1991). Bottou (2011) and Hinton (1990) discussed related models such as recursive autoencoders for text understanding.

Our model builds upon and generalizes the models of (Mitchell and Lapata, 2010; Baroni and Zamparelli, 2010; Zanzotto et al., 2010; Socher et al., 2011c) (see Sec. 2.2). We compare to them in our experiments. Yessenalina and Cardie (2011) introduce a sentiment analysis model that describes words as matrices and composition as matrix multiplication. Since matrix multiplication is associative, this cannot capture different scopes of negation or syntactic differences. Their model is a special case of our encoding model (when one ignores the vectors, fixes the tree to be strictly branching in one direction and uses P = AB as the matrix composition function). Since our classifiers are trained on the vectors, we cannot compare to this approach directly. Grefenstette and Sadrzadeh (2011) learn matrices for verbs in a categorical model. The trained matrices improve correlation with human judgments on the task of identifying relatedness of subject-verb-object triplets.

7 Conclusion

We introduced a new model towards a complete treatment of compositionality in word vector spaces. Our model builds on a syntactically plausible parse tree and can handle compositional phenomena. The main novelty of our model is the combination of matrix-vector representations with a recursive neural network. It can learn both the meaning vectors of a word and how that word modifies its neighbors (via its matrix). The MV-RNN combines attractive theoretical properties with good performance on large, noisy datasets. It generalizes several models in the literature, can learn propositional logic, accurately predicts sentiment and can be used to classify semantic relationships between nouns in a sentence.

Acknowledgments

We thank for great discussions about the paper: John Platt, Chris Potts, Josh Tenenbaum, Mihai Surdeanu, Quoc Le and Kevin Miller. The authors gratefully acknowledge the support of the Defense Advanced Research Projects Agency (DARPA) Machine Reading Program under Air Force Research Laboratory (AFRL) prime contract no. FA C-181, and the DARPA Deep Learning program under contract number FA865-1-C-72. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of DARPA, AFRL, or the US government.

References

M. Baroni and A. Lenci. 2010. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4).
M. Baroni and R. Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In EMNLP.
L. Bottou. 2011. From machine learning to machine reasoning. CoRR, abs/.
M. Ciaramita and Y. Altun. 2006. Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger. In EMNLP.
S. Clark and S. Pulman. 2007. Combining symbolic and distributional models of meaning. In Proceedings of the AAAI Spring Symposium on Quantum Interaction.
R. Collobert and J. Weston. 2008.
A unified architecture for natural language processing: deep neural networks with multitask learning. In ICML.
J. Curran. 2004. From Distributional to Semantic Similarity. Ph.D. thesis, University of Edinburgh.
J. L. Elman. 1991. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7(2-3).
G. Frege. 1892. Über Sinn und Bedeutung. Zeitschrift für Philosophie und philosophische Kritik, 100.
D. Garrette, K. Erk, and R. Mooney. 2011. Integrating Logical Representations with Probabilistic Information using Markov Logic. In Proceedings of the International Conference on Computational Semantics.
C. Goller and A. Küchler. 1996. Learning task-dependent distributed representations by backpropagation through structure. In Proceedings of the International Conference on Neural Networks (ICNN-96).
E. Grefenstette and M. Sadrzadeh. 2011. Experimental support for a categorical compositional distributional model of meaning. In EMNLP.
T. L. Griffiths, J. B. Tenenbaum, and M. Steyvers. 2007. Topics in semantic representation. Psychological Review, 114.
I. Hendrickx, S. N. Kim, Z. Kozareva, P. Nakov, D. Ó Séaghdha, S. Padó, M. Pennacchiotti, L. Romano, and S. Szpakowicz. 2010. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation.
G. E. Hinton. 1990. Mapping part-whole hierarchies into connectionist networks. Artificial Intelligence, 46(1-2).
R. Jones, B. Rey, O. Madani, and W. Greiner. 2006. Generating query substitutions. In Proceedings of the 15th International Conference on World Wide Web.
D. Klein and C. D. Manning. 2003. Accurate unlexicalized parsing. In ACL.
D. Lin. 1998. Automatic retrieval and clustering of similar words. In Proceedings of COLING-ACL.
E. J. Metcalfe. 1990. A compositive holographic associative recall model. Psychological Review, 88.
J. Mitchell and M. Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8).
R. Montague. 1974. English as a formal language. Linguaggi nella Società e nella Tecnica.
T. Nakagawa, K. Inui, and S. Kurohashi. 2010. Dependency tree-based sentiment classification using CRFs with hidden variables. In NAACL-HLT.
M. Paşca, D. Lin, J. Bigham, A. Lifchits, and A. Jain. 2006. Names and similarities on the web: fact extraction in the fast lane. In ACL.
S. Pado and M. Lapata. 2007. Dependency-based construction of semantic space models. Computational Linguistics, 33(2).
B. Pang and L. Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL.
T. A. Plate. 1995. Holographic reduced representations. IEEE Transactions on Neural Networks, 6(3).
J. B. Pollack. 1990. Recursive distributed representations. Artificial Intelligence, 46, November.
C. Potts. 2010. On the negativity of negation. In David Lutz and Nan Li, editors, Proceedings of Semantics and Linguistic Theory 20. CLC Publications, Ithaca, NY.
L. Ratinov, D. Roth, D. Downey, and M. Anderson. 2011. Local and global algorithms for disambiguation to Wikipedia. In ACL.
B. Rink and S. Harabagiu. 2010. UTD: Classifying semantic relations by combining lexical and semantic resources. In Proceedings of the 5th International Workshop on Semantic Evaluation.
S. Rudolph and E. Giesbrecht. 2010. Compositional matrix-space models of language. In ACL.
H. Schütze. 1998. Automatic word sense discrimination. Computational Linguistics, 24.
R. Socher, C. D. Manning, and A. Y. Ng. 2010. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In Proceedings of the NIPS-2010 Deep Learning and Unsupervised Feature Learning Workshop.
R. Socher, E. H. Huang, J. Pennington, A. Y. Ng, and C. D. Manning. 2011a. Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. In NIPS. MIT Press.
R. Socher, C. Lin, A. Y. Ng, and C. D. Manning. 2011b. Parsing Natural Scenes and Natural Language with Recursive Neural Networks. In ICML.
R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning. 2011c. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. In EMNLP.
P. D. Turney and P. Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37.
D. Widdows. 2008. Semantic vector products: Some initial investigations.
In Proceedings of the Second AAAI Symposium on Quantum Interaction.
A. Yessenalina and C. Cardie. 2011. Compositional matrix-space models for sentiment analysis. In EMNLP.
F. M. Zanzotto, I. Korkontzelos, F. Fallucchi, and S. Manandhar. 2010. Estimating linear models for compositional distributional semantics. In COLING.
Taxonomy learning factoring the structure of a taxonomy into a semantic classification decision
Taxonomy learning factoring the structure of a taxonomy into a semantic classification decision Viktor PEKAR Bashkir State University Ufa, Russia, 450000 [email protected] Steffen STAAB Institute AIFB,
Experiments in Web Page Classification for Semantic Web
Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address
Machine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
Movie Classification Using k-means and Hierarchical Clustering
Movie Classification Using k-means and Hierarchical Clustering An analysis of clustering algorithms on movie scripts Dharak Shah DA-IICT, Gandhinagar Gujarat, India [email protected] Saheb Motiani
Active Learning SVM for Blogs recommendation
Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the
Sentiment Analysis for Movie Reviews
Sentiment Analysis for Movie Reviews Ankit Goyal, [email protected] Amey Parulekar, [email protected] Introduction: Movie reviews are an important way to gauge the performance of a movie. While providing
A Learning Based Method for Super-Resolution of Low Resolution Images
A Learning Based Method for Super-Resolution of Low Resolution Images Emre Ugur June 1, 2004 [email protected] Abstract The main objective of this project is the study of a learning based method
Sentiment analysis on news articles using Natural Language Processing and Machine Learning Approach.
Sentiment analysis on news articles using Natural Language Processing and Machine Learning Approach. Pranali Chilekar 1, Swati Ubale 2, Pragati Sonkambale 3, Reema Panarkar 4, Gopal Upadhye 5 1 2 3 4 5
Comparing Support Vector Machines, Recurrent Networks and Finite State Transducers for Classifying Spoken Utterances
Comparing Support Vector Machines, Recurrent Networks and Finite State Transducers for Classifying Spoken Utterances Sheila Garfield and Stefan Wermter University of Sunderland, School of Computing and
1. Classification problems
Neural and Evolutionary Computing. Lab 1: Classification problems Machine Learning test data repository Weka data mining platform Introduction Scilab 1. Classification problems The main aim of a classification
Linguistic Regularities in Continuous Space Word Representations
Linguistic Regularities in Continuous Space Word Representations Tomas Mikolov, Wen-tau Yih, Geoffrey Zweig Microsoft Research Redmond, WA 98052 Abstract Continuous space language models have recently
Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic
Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic by Sigrún Helgadóttir Abstract This paper gives the results of an experiment concerned with training three different taggers on tagged
Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. ~ Spring~r
Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures ~ Spring~r Table of Contents 1. Introduction.. 1 1.1. What is the World Wide Web? 1 1.2. ABrief History of the Web
Deep learning applications and challenges in big data analytics
Najafabadi et al. Journal of Big Data (2015) 2:1 DOI 10.1186/s40537-014-0007-7 RESEARCH Open Access Deep learning applications and challenges in big data analytics Maryam M Najafabadi 1, Flavio Villanustre
Cell Phone based Activity Detection using Markov Logic Network
Cell Phone based Activity Detection using Markov Logic Network Somdeb Sarkhel [email protected] 1 Introduction Mobile devices are becoming increasingly sophisticated and the latest generation of smart
Data quality in Accounting Information Systems
Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania
Component Ordering in Independent Component Analysis Based on Data Power
Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals
Chapter 12 Discovering New Knowledge Data Mining
Chapter 12 Discovering New Knowledge Data Mining Becerra-Fernandez, et al. -- Knowledge Management 1/e -- 2004 Prentice Hall Additional material 2007 Dekai Wu Chapter Objectives Introduce the student to
Learning to Process Natural Language in Big Data Environment
CCF ADL 2015 Nanchang Oct 11, 2015 Learning to Process Natural Language in Big Data Environment Hang Li Noah s Ark Lab Huawei Technologies Part 1: Deep Learning - Present and Future Talk Outline Overview
Towards better accuracy for Spam predictions
Towards better accuracy for Spam predictions Chengyan Zhao Department of Computer Science University of Toronto Toronto, Ontario, Canada M5S 2E4 [email protected] Abstract Spam identification is crucial
Sentiment Analysis. D. Skrepetos 1. University of Waterloo. NLP Presenation, 06/17/2015
Sentiment Analysis D. Skrepetos 1 1 Department of Computer Science University of Waterloo NLP Presenation, 06/17/2015 D. Skrepetos (University of Waterloo) Sentiment Analysis NLP Presenation, 06/17/2015
Parsing Software Requirements with an Ontology-based Semantic Role Labeler
Parsing Software Requirements with an Ontology-based Semantic Role Labeler Michael Roth University of Edinburgh [email protected] Ewan Klein University of Edinburgh [email protected] Abstract Software
II. RELATED WORK. Sentiment Mining
Sentiment Mining Using Ensemble Classification Models Matthew Whitehead and Larry Yaeger Indiana University School of Informatics 901 E. 10th St. Bloomington, IN 47408 {mewhiteh, larryy}@indiana.edu Abstract
What Is This, Anyway: Automatic Hypernym Discovery
What Is This, Anyway: Automatic Hypernym Discovery Alan Ritter and Stephen Soderland and Oren Etzioni Turing Center Department of Computer Science and Engineering University of Washington Box 352350 Seattle,
An Analysis of Single-Layer Networks in Unsupervised Feature Learning
An Analysis of Single-Layer Networks in Unsupervised Feature Learning Adam Coates 1, Honglak Lee 2, Andrew Y. Ng 1 1 Computer Science Department, Stanford University {acoates,ang}@cs.stanford.edu 2 Computer
A picture is worth five captions
A picture is worth five captions Learning visually grounded word and sentence representations Grzegorz Chrupała (with Ákos Kádár and Afra Alishahi) Learning word (and phrase) meanings Cross-situational
MapReduce Approach to Collective Classification for Networks
MapReduce Approach to Collective Classification for Networks Wojciech Indyk 1, Tomasz Kajdanowicz 1, Przemyslaw Kazienko 1, and Slawomir Plamowski 1 Wroclaw University of Technology, Wroclaw, Poland Faculty
Neural Networks for Machine Learning. Lecture 13a The ups and downs of backpropagation
Neural Networks for Machine Learning Lecture 13a The ups and downs of backpropagation Geoffrey Hinton Nitish Srivastava, Kevin Swersky Tijmen Tieleman Abdel-rahman Mohamed A brief history of backpropagation
Subspace Analysis and Optimization for AAM Based Face Alignment
Subspace Analysis and Optimization for AAM Based Face Alignment Ming Zhao Chun Chen College of Computer Science Zhejiang University Hangzhou, 310027, P.R.China [email protected] Stan Z. Li Microsoft
A Systematic Cross-Comparison of Sequence Classifiers
A Systematic Cross-Comparison of Sequence Classifiers Binyamin Rozenfeld, Ronen Feldman, Moshe Fresko Bar-Ilan University, Computer Science Department, Israel [email protected], [email protected],
Author Gender Identification of English Novels
Author Gender Identification of English Novels Joseph Baena and Catherine Chen December 13, 2013 1 Introduction Machine learning algorithms have long been used in studies of authorship, particularly in
CNFSAT: Predictive Models, Dimensional Reduction, and Phase Transition
CNFSAT: Predictive Models, Dimensional Reduction, and Phase Transition Neil P. Slagle College of Computing Georgia Institute of Technology Atlanta, GA [email protected] Abstract CNFSAT embodies the P
Latent Dirichlet Markov Allocation for Sentiment Analysis
Latent Dirichlet Markov Allocation for Sentiment Analysis Ayoub Bagheri Isfahan University of Technology, Isfahan, Iran Intelligent Database, Data Mining and Bioinformatics Lab, Electrical and Computer
Sentiment Analysis of Movie Reviews and Twitter Statuses. Introduction
Sentiment Analysis of Movie Reviews and Twitter Statuses Introduction Sentiment analysis is the task of identifying whether the opinion expressed in a text is positive or negative in general, or about
Università degli Studi di Bologna
Università degli Studi di Bologna DEIS Biometric System Laboratory Incremental Learning by Message Passing in Hierarchical Temporal Memory Davide Maltoni Biometric System Laboratory DEIS - University of
Personalized Hierarchical Clustering
Personalized Hierarchical Clustering Korinna Bade, Andreas Nürnberger Faculty of Computer Science, Otto-von-Guericke-University Magdeburg, D-39106 Magdeburg, Germany {kbade,nuernb}@iws.cs.uni-magdeburg.de
Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata
Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata Alessandra Giordani and Alessandro Moschitti Department of Computer Science and Engineering University of Trento Via Sommarive
