Neural Word Embedding as Implicit Matrix Factorization

Omer Levy, Department of Computer Science, Bar-Ilan University
Yoav Goldberg, Department of Computer Science, Bar-Ilan University

Abstract

We analyze skip-gram with negative-sampling (SGNS), a word embedding method introduced by Mikolov et al., and show that it is implicitly factorizing a word-context matrix, whose cells are the pointwise mutual information (PMI) of the respective word and context pairs, shifted by a global constant. We find that another embedding method, NCE, is implicitly factorizing a similar matrix, where each cell is the (shifted) log conditional probability of a word given its context. We show that using a sparse Shifted Positive PMI word-context matrix to represent words improves results on two word similarity tasks and one of two analogy tasks. When dense low-dimensional vectors are preferred, exact factorization with SVD can achieve solutions that are at least as good as SGNS's solutions for word similarity tasks. On analogy questions SGNS remains superior to SVD. We conjecture that this stems from the weighted nature of SGNS's factorization.

1 Introduction

Most tasks in natural language processing and understanding involve looking at words, and could benefit from word representations that do not treat individual words as unique symbols, but instead reflect similarities and dissimilarities between them. The common paradigm for deriving such representations is based on the distributional hypothesis of Harris [15], which states that words in similar contexts have similar meanings. This has given rise to many word representation methods in the NLP literature, the vast majority of which can be described in terms of a word-context matrix $M$ in which each row $i$ corresponds to a word, each column $j$ to a context in which the word appeared, and each matrix entry $M_{ij}$ corresponds to some association measure between the word and the context. Words are then represented as rows in $M$ or in a dimensionality-reduced matrix based on $M$.

Recently, there has been a surge of work proposing to represent words as dense vectors, derived using various training methods inspired by neural-network language modeling [3, 9, 23, 21]. These representations, referred to as "neural embeddings" or "word embeddings", have been shown to perform well in a variety of NLP tasks [26, 10, 1]. In particular, a sequence of papers by Mikolov and colleagues [20, 21] culminated in the skip-gram with negative-sampling (SGNS) training method, which is both efficient to train and provides state-of-the-art results on various linguistic tasks. The training method (as implemented in the word2vec software package) is highly popular, but not well understood. While it is clear that the training objective follows the distributional hypothesis, in that it tries to maximize the dot-product between the vectors of frequently occurring word-context pairs and to minimize it for random word-context pairs, very little is known about the quantity being optimized by the algorithm, or the reason it is expected to produce good word representations.

In this work, we aim to broaden the theoretical understanding of neural-inspired word embeddings. Specifically, we cast SGNS's training method as weighted matrix factorization, and show that its objective is implicitly factorizing a shifted PMI matrix: the well-known word-context PMI matrix from the word-similarity literature, shifted by a constant offset. A similar result holds also for the NCE embedding method of Mnih and Kavukcuoglu [24].

While it is impractical to directly use the very high-dimensional and dense shifted PMI matrix, we propose to approximate it with the positive shifted PMI matrix (Shifted PPMI), which is sparse. Shifted PPMI is far better at optimizing SGNS's objective, and performs slightly better than word2vec-derived vectors on several linguistic tasks.

Finally, we suggest a simple spectral algorithm that is based on performing SVD over the Shifted PPMI matrix. The spectral algorithm outperforms both SGNS and the Shifted PPMI matrix on the word similarity tasks, and is scalable to large corpora. However, it lags behind the SGNS-derived representation on word-analogy tasks. We conjecture that this behavior is related to the fact that SGNS performs weighted matrix factorization, giving more influence to frequent pairs, as opposed to SVD, which gives the same weight to all matrix cells. While the weighted and non-weighted objectives share the same optimal solution (perfect reconstruction of the shifted PMI matrix), they result in different generalizations when combined with dimensionality constraints.

2 Background: Skip-Gram with Negative Sampling (SGNS)

Our departure point is SGNS, the skip-gram neural embedding model introduced in [20], trained using the negative-sampling procedure presented in [21]. In what follows, we summarize the SGNS model and introduce our notation. A detailed derivation of the SGNS model is available in [14].

Setting and Notation The skip-gram model assumes a corpus of words $w \in V_W$ and their contexts $c \in V_C$, where $V_W$ and $V_C$ are the word and context vocabularies. In [20, 21] the words come from an unannotated textual corpus of words $w_1, w_2, \ldots, w_n$ (typically $n$ is in the billions) and the contexts for word $w_i$ are the words surrounding it in an $L$-sized window $w_{i-L}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+L}$. Other definitions of contexts are possible [18]. We denote the collection of observed word-context pairs as $D$, and its size as $|D|$. We use $\#(w,c)$ to denote the number of times the pair $(w,c)$ appears in $D$. Similarly, $\#(w) = \sum_{c' \in V_C} \#(w,c')$ and $\#(c) = \sum_{w' \in V_W} \#(w',c)$ are the number of times $w$ and $c$ occurred in $D$, respectively.

Each word $w \in V_W$ is associated with a vector $\vec{w} \in \mathbb{R}^d$ and similarly each context $c \in V_C$ is represented as a vector $\vec{c} \in \mathbb{R}^d$, where $d$ is the embedding's dimensionality. The entries in the vectors are latent, and treated as parameters to be learned. We sometimes refer to the vectors $\vec{w}$ as rows in a $|V_W| \times d$ matrix $W$, and to the vectors $\vec{c}$ as rows in a $|V_C| \times d$ matrix $C$. In such cases, $W_i$ ($C_i$) refers to the vector representation of the $i$th word (context) in the corresponding vocabulary. When referring to embeddings produced by a specific method $x$, we will usually use $W^x$ and $C^x$ explicitly, but may use just $W$ and $C$ when the method is clear from the discussion.
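To make the counting notation concrete, here is a minimal sketch of how $D$, $\#(w,c)$, $\#(w)$, $\#(c)$ and $|D|$ can be collected from a tokenized corpus with an $L$-sized window. This is not the paper's code; the function and variable names are illustrative assumptions.

```python
from collections import Counter

def extract_pairs(tokens, L=2):
    """Collect the multiset D of (word, context) pairs using an L-sized window.

    `tokens` is a list of strings; every word within L positions of w_i
    (excluding w_i itself) is taken as a context of w_i.
    """
    pair_counts = Counter()                 # #(w, c)
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - L), min(len(tokens), i + L + 1)
        for j in range(lo, hi):
            if j != i:
                pair_counts[(w, tokens[j])] += 1
    word_counts = Counter()                 # #(w) = sum over c of #(w, c)
    ctx_counts = Counter()                  # #(c) = sum over w of #(w, c)
    for (w, c), n in pair_counts.items():
        word_counts[w] += n
        ctx_counts[c] += n
    D_size = sum(pair_counts.values())      # |D|
    return pair_counts, word_counts, ctx_counts, D_size

# Example usage (toy corpus):
# pairs, wc, cc, D = extract_pairs("the cat sat on the mat".split(), L=2)
```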
SGNS's Objective Consider a word-context pair $(w,c)$. Did this pair come from the observed data $D$? Let $P(D=1 \mid w,c)$ be the probability that $(w,c)$ came from the data, and $P(D=0 \mid w,c) = 1 - P(D=1 \mid w,c)$ the probability that it did not. The distribution is modeled as:

$$P(D=1 \mid w,c) = \sigma(\vec{w} \cdot \vec{c}) = \frac{1}{1 + e^{-\vec{w} \cdot \vec{c}}}$$

where $\vec{w}$ and $\vec{c}$ (each a $d$-dimensional vector) are the model parameters to be learned. The negative-sampling objective tries to maximize $P(D=1 \mid w,c)$ for observed $(w,c)$ pairs while maximizing $P(D=0 \mid w,c)$ for randomly sampled "negative" examples (hence the name "negative sampling"), under the assumption that randomly selecting a context for a given word is likely to result in an unobserved $(w,c)$ pair. SGNS's objective for a single $(w,c)$ observation is then:

$$\log \sigma(\vec{w} \cdot \vec{c}) + k \cdot \mathbb{E}_{c_N \sim P_D}\left[\log \sigma(-\vec{w} \cdot \vec{c}_N)\right] \qquad (1)$$

where $k$ is the number of negative samples and $c_N$ is the sampled context, drawn according to the empirical unigram distribution $P_D(c) = \frac{\#(c)}{|D|}$.[1]

[1] In the algorithm described in [21], the negative contexts are sampled according to $p_{3/4}(c) = \frac{\#(c)^{3/4}}{Z}$ instead of the unigram distribution $\frac{\#(c)}{|D|}$. Sampling according to $p_{3/4}$ indeed produces somewhat superior results on some of the semantic evaluation tasks. It is straightforward to modify the PMI metric in a similar fashion by replacing the $P(c)$ term with $p_{3/4}(c)$, and doing so shows similar trends in the matrix-based methods as it does in word2vec's stochastic gradient based training method. We do not explore this further in this paper, and report results using the unigram distribution.
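For concreteness, the following is a minimal sketch of the per-observation objective in equation 1, with the expectation approximated by drawing $k$ negative contexts from $P_D$. It is an illustration under assumed names and data layouts, not the word2vec implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_observation_objective(w_vec, c_vec, C, ctx_probs, k=5):
    """Stochastic estimate of equation (1) for one observed (w, c) pair.

    w_vec, c_vec : d-dimensional word / context vectors.
    C            : |V_C| x d matrix of all context vectors.
    ctx_probs    : unigram distribution P_D(c) = #(c)/|D| over context ids.
    k            : number of negative samples.
    """
    positive = np.log(sigmoid(w_vec @ c_vec))
    neg_ids = rng.choice(len(ctx_probs), size=k, p=ctx_probs)
    # Summing the k sampled terms is a Monte-Carlo estimate of k * E[log sigma(-w.c_N)].
    negative = np.sum(np.log(sigmoid(-C[neg_ids] @ w_vec)))
    return positive + negative
```

During training, SGNS ascends the gradient of this quantity for each observed pair; the global objective below simply aggregates these per-pair terms, weighted by their counts.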

The objective is trained in an online fashion using stochastic gradient updates over the observed pairs in the corpus $D$. The global objective then sums over the observed $(w,c)$ pairs in the corpus:

$$\ell = \sum_{w \in V_W} \sum_{c \in V_C} \#(w,c) \left( \log \sigma(\vec{w} \cdot \vec{c}) + k \cdot \mathbb{E}_{c_N \sim P_D}\left[\log \sigma(-\vec{w} \cdot \vec{c}_N)\right] \right) \qquad (2)$$

Optimizing this objective makes observed word-context pairs have similar embeddings, while scattering unobserved pairs. Intuitively, words that appear in similar contexts should have similar embeddings, though we are not familiar with a formal proof that SGNS does indeed maximize the dot-product of similar words.

3 SGNS as Implicit Matrix Factorization

SGNS embeds both words and their contexts into a low-dimensional space $\mathbb{R}^d$, resulting in the word and context matrices $W$ and $C$. The rows of matrix $W$ are typically used in NLP tasks (such as computing word similarities) while $C$ is ignored. It is nonetheless instructive to consider the product $W \cdot C^{\top} = M$. Viewed this way, SGNS can be described as factorizing an implicit matrix $M$ of dimensions $|V_W| \times |V_C|$ into two smaller matrices.

Which matrix is being factorized? A matrix entry $M_{ij}$ corresponds to the dot product $W_i \cdot C_j = \vec{w}_i \cdot \vec{c}_j$. Thus, SGNS is factorizing a matrix in which each row corresponds to a word $w \in V_W$, each column corresponds to a context $c \in V_C$, and each cell contains a quantity $f(w,c)$ reflecting the strength of association between that particular word-context pair. Such word-context association matrices are very common in the NLP and word-similarity literature, see e.g. [29, 2]. That said, the objective of SGNS (equation 2) does not explicitly state what this association metric is. What can we say about the association function $f(w,c)$? In other words, which matrix is SGNS factorizing?

3.1 Characterizing the Implicit Matrix

Consider the global objective (equation 2) above. For sufficiently large dimensionality $d$ (i.e. allowing for a perfect reconstruction of $M$), each product $\vec{w} \cdot \vec{c}$ can assume a value independently of the others. Under these conditions, we can treat the objective $\ell$ as a function of independent $\vec{w} \cdot \vec{c}$ terms, and find the values of these terms that maximize it. We begin by rewriting equation 2:

$$\ell = \sum_{w \in V_W} \sum_{c \in V_C} \#(w,c) \log \sigma(\vec{w} \cdot \vec{c}) + \sum_{w \in V_W} \sum_{c \in V_C} \#(w,c) \left( k \cdot \mathbb{E}_{c_N \sim P_D}\left[\log \sigma(-\vec{w} \cdot \vec{c}_N)\right] \right)$$
$$= \sum_{w \in V_W} \sum_{c \in V_C} \#(w,c) \log \sigma(\vec{w} \cdot \vec{c}) + \sum_{w \in V_W} \#(w) \left( k \cdot \mathbb{E}_{c_N \sim P_D}\left[\log \sigma(-\vec{w} \cdot \vec{c}_N)\right] \right) \qquad (3)$$

and explicitly expressing the expectation term:

$$\mathbb{E}_{c_N \sim P_D}\left[\log \sigma(-\vec{w} \cdot \vec{c}_N)\right] = \sum_{c_N \in V_C} \frac{\#(c_N)}{|D|} \log \sigma(-\vec{w} \cdot \vec{c}_N) = \frac{\#(c)}{|D|} \log \sigma(-\vec{w} \cdot \vec{c}) + \sum_{c_N \in V_C \setminus \{c\}} \frac{\#(c_N)}{|D|} \log \sigma(-\vec{w} \cdot \vec{c}_N) \qquad (4)$$

Combining equations 3 and 4 reveals the local objective for a specific $(w,c)$ pair:

$$\ell(w,c) = \#(w,c) \log \sigma(\vec{w} \cdot \vec{c}) + k \cdot \#(w) \cdot \frac{\#(c)}{|D|} \log \sigma(-\vec{w} \cdot \vec{c}) \qquad (5)$$

To optimize the objective, we define $x = \vec{w} \cdot \vec{c}$ and find its partial derivative with respect to $x$:

$$\frac{\partial \ell}{\partial x} = \#(w,c) \cdot \sigma(-x) - k \cdot \#(w) \cdot \frac{\#(c)}{|D|} \cdot \sigma(x)$$

We compare the derivative to zero, and after some simplification, arrive at:

$$e^{2x} - \left( \frac{\#(w,c)}{k \cdot \#(w) \cdot \frac{\#(c)}{|D|}} - 1 \right) e^{x} - \frac{\#(w,c)}{k \cdot \#(w) \cdot \frac{\#(c)}{|D|}} = 0$$

If we define $y = e^x$, this equation becomes a quadratic equation in $y$, which has two solutions, $y = -1$ (which is invalid given the definition of $y$) and:

$$y = \frac{\#(w,c)}{k \cdot \#(w) \cdot \frac{\#(c)}{|D|}} = \frac{1}{k} \cdot \frac{\#(w,c) \cdot |D|}{\#(w) \cdot \#(c)}$$

Substituting $y$ with $e^x$ and $x$ with $\vec{w} \cdot \vec{c}$ reveals:

$$\vec{w} \cdot \vec{c} = \log \left( \frac{\#(w,c) \cdot |D|}{\#(w) \cdot \#(c)} \cdot \frac{1}{k} \right) = \log \left( \frac{\#(w,c) \cdot |D|}{\#(w) \cdot \#(c)} \right) - \log k \qquad (6)$$

Interestingly, the expression $\log \left( \frac{\#(w,c) \cdot |D|}{\#(w) \cdot \#(c)} \right)$ is the well-known pointwise mutual information (PMI) of $(w,c)$, which we discuss in depth below.

Finally, we can describe the matrix $M$ that SGNS is factorizing:

$$M^{\mathrm{SGNS}}_{ij} = W_i \cdot C_j = \vec{w}_i \cdot \vec{c}_j = \mathrm{PMI}(w_i, c_j) - \log k \qquad (7)$$

For a negative-sampling value of $k = 1$, the SGNS objective is factorizing a word-context matrix in which the association between a word and its context is measured by $f(w,c) = \mathrm{PMI}(w,c)$. We refer to this matrix as the PMI matrix, $M^{\mathrm{PMI}}$. For negative-sampling values $k > 1$, SGNS is factorizing a shifted PMI matrix $M^{\mathrm{PMI}}_k = M^{\mathrm{PMI}} - \log k$.

Other embedding methods can also be cast as factorizing implicit word-context matrices. Using a similar derivation, it can be shown that noise-contrastive estimation (NCE) [24] is factorizing the (shifted) log-conditional-probability matrix:

$$M^{\mathrm{NCE}}_{ij} = \vec{w}_i \cdot \vec{c}_j = \log \left( \frac{\#(w,c)}{\#(c)} \cdot \frac{1}{k} \right) = \log P(w \mid c) - \log k \qquad (8)$$

3.2 Weighted Matrix Factorization

We obtained that SGNS's objective is optimized by setting $\vec{w} \cdot \vec{c} = \mathrm{PMI}(w,c) - \log k$ for every $(w,c)$ pair. However, this assumes that the dimensionality of $\vec{w}$ and $\vec{c}$ is high enough to allow for perfect reconstruction. When perfect reconstruction is not possible, some $\vec{w} \cdot \vec{c}$ products must deviate from their optimal values. Looking at the pair-specific objective (equation 5) reveals that the loss for a pair $(w,c)$ depends on its number of observations ($\#(w,c)$) and expected negative samples ($k \cdot \#(w) \cdot \frac{\#(c)}{|D|}$). SGNS's objective can now be cast as a weighted matrix factorization problem, seeking the optimal $d$-dimensional factorization of the matrix $M^{\mathrm{PMI}} - \log k$ under a metric which pays more for deviations on frequent $(w,c)$ pairs than for deviations on infrequent ones.

3.3 Pointwise Mutual Information

Pointwise mutual information is an information-theoretic association measure between a pair of discrete outcomes $x$ and $y$, defined as:

$$\mathrm{PMI}(x,y) = \log \frac{P(x,y)}{P(x)P(y)} \qquad (9)$$

In our case, $\mathrm{PMI}(w,c)$ measures the association between a word $w$ and a context $c$ by calculating the log of the ratio between their joint probability (the frequency in which they occur together) and their marginal probabilities (the frequency in which they occur independently). PMI can be estimated empirically by considering the actual number of observations in a corpus:

$$\mathrm{PMI}(w,c) = \log \frac{\#(w,c) \cdot |D|}{\#(w) \cdot \#(c)} \qquad (10)$$

The use of PMI as a measure of association in NLP was introduced by Church and Hanks [8] and widely adopted for word similarity tasks [11, 27, 29].
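As a numerical sanity check on the derivation above (illustrative only; the toy counts and helper names are assumptions, and SciPy's scalar optimizer is one possible choice), the following sketch computes the shifted PMI of equation 6 from counts and verifies that it maximizes the local objective of equation 5:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_objective(x, n_wc, n_w, n_c, D, k):
    """Equation (5): l(w, c) as a function of x = w.c for fixed counts."""
    return n_wc * np.log(sigmoid(x)) + k * n_w * (n_c / D) * np.log(sigmoid(-x))

def shifted_pmi(n_wc, n_w, n_c, D, k):
    """Equation (6): PMI(w, c) - log k estimated from corpus counts."""
    return np.log(n_wc * D / (n_w * n_c)) - np.log(k)

# Toy counts, assumed for illustration: #(w,c)=20, #(w)=1000, #(c)=500, |D|=1e6, k=5
n_wc, n_w, n_c, D, k = 20, 1000, 500, 1_000_000, 5
best = minimize_scalar(lambda x: -local_objective(x, n_wc, n_w, n_c, D, k),
                       bounds=(-20, 20), method="bounded")
print(best.x)                              # ~2.079
print(shifted_pmi(n_wc, n_w, n_c, D, k))   # log(8) ~ 2.079, matching the maximizer
```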

Working with the PMI matrix presents some computational challenges. The rows of $M^{\mathrm{PMI}}$ contain many entries of word-context pairs $(w,c)$ that were never observed in the corpus, for which $\mathrm{PMI}(w,c) = \log 0 = -\infty$. Not only is the matrix ill-defined, it is also dense, which is a major practical issue because of its huge dimensions $|V_W| \times |V_C|$. One could smooth the probabilities using, for instance, a Dirichlet prior by adding a small "fake" count to the underlying counts matrix, rendering all word-context pairs observed. While the resulting matrix will not contain any infinite values, it will remain dense.

An alternative approach, commonly used in NLP, is to replace the $M^{\mathrm{PMI}}$ matrix with $M^{\mathrm{PMI}}_0$, in which $\mathrm{PMI}(w,c) = 0$ in cases where $\#(w,c) = 0$, resulting in a sparse matrix. We note that $M^{\mathrm{PMI}}_0$ is inconsistent, in the sense that observed but "bad" (uncorrelated) word-context pairs have a negative matrix entry, while unobserved (hence worse) ones have 0 in their corresponding cell. Consider for example a pair of relatively frequent words (high $P(w)$ and $P(c)$) that occur only once together. There is strong evidence that the words tend not to appear together, resulting in a negative PMI value, and hence a negative matrix entry. On the other hand, a pair of frequent words (same $P(w)$ and $P(c)$) that is never observed occurring together in the corpus will receive a value of 0.

A sparse and consistent alternative from the NLP literature is to use the positive PMI (PPMI) metric, in which all negative values are replaced by 0:

$$\mathrm{PPMI}(w,c) = \max(\mathrm{PMI}(w,c), 0) \qquad (11)$$

When representing words, there is some intuition behind ignoring negative values: humans can easily think of positive associations (e.g. "Canada" and "snow") but find it much harder to invent negative ones ("Canada" and "desert"). This suggests that the perceived similarity of two words is more influenced by the positive context they share than by the negative context they share. It therefore makes some intuitive sense to discard the negatively associated contexts and mark them as "uninformative" (0) instead.[2] Indeed, it was shown that the PPMI metric performs very well on semantic similarity tasks [5].

Both $M^{\mathrm{PMI}}_0$ and $M^{\mathrm{PPMI}}$ are well known to the NLP community. In particular, systematic comparisons of various word-context association metrics show that PMI, and more so PPMI, provide the best results for a wide range of word-similarity tasks [5, 16]. It is thus interesting that the PMI matrix emerges as the optimal solution for SGNS's objective.

4 Alternative Word Representations

As SGNS with $k = 1$ is attempting to implicitly factorize the familiar matrix $M^{\mathrm{PMI}}$, a natural algorithm would be to use the rows of $M^{\mathrm{PPMI}}$ directly when calculating word similarities. Though PPMI is only an approximation of the original PMI matrix, it still brings the objective function very close to its optimum (see Section 5.1). In this section, we propose two alternative word representations that build upon $M^{\mathrm{PPMI}}$.

4.1 Shifted PPMI

While the PMI matrix emerges from SGNS with $k = 1$, it was shown that different values of $k$ can substantially improve the resulting embedding. With $k > 1$, the association metric in the implicitly factorized matrix is $\mathrm{PMI}(w,c) - \log k$. This suggests the use of Shifted PPMI (SPPMI), a novel association metric which, to the best of our knowledge, was not explored in the NLP and word-similarity communities:

$$\mathrm{SPPMI}_k(w,c) = \max(\mathrm{PMI}(w,c) - \log k,\ 0) \qquad (12)$$

As with SGNS, certain values of $k$ can improve the performance of $M^{\mathrm{SPPMI}_k}$ on different tasks.
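A minimal sketch of building the SPPMI matrix of equation 12 from a word-context count matrix. A dense array is used here for clarity; the function name and the toy counts are illustrative assumptions, and a vocabulary-sized matrix would in practice be stored sparsely.

```python
import numpy as np

def sppmi_matrix(counts, k=1):
    """Shifted positive PMI (equation 12) from a |V_W| x |V_C| count matrix.

    counts[i, j] = #(w_i, c_j). Cells with zero counts stay at 0, matching the
    sparse PPMI convention discussed above.
    """
    counts = np.asarray(counts, dtype=float)
    D = counts.sum()                                  # |D|
    w_counts = counts.sum(axis=1, keepdims=True)      # #(w)
    c_counts = counts.sum(axis=0, keepdims=True)      # #(c)
    with np.errstate(divide="ignore"):                # log(0) -> -inf, clipped below
        pmi = np.log(counts * D) - np.log(w_counts * c_counts)
    return np.maximum(pmi - np.log(k), 0.0)

# Example: a tiny 3 x 3 count matrix (toy numbers)
# M = sppmi_matrix([[2, 0, 1], [0, 3, 1], [1, 1, 0]], k=5)
```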
4.2 Spectral Dimensionality Reduction: SVD over Shifted PPMI

While sparse vector representations work well, there are also advantages to working with dense low-dimensional vectors, such as improved computational efficiency and, arguably, better generalization.

[2] A notable exception is the case of syntactic similarity. For example, all verbs share a very strong negative association with being preceded by determiners, and past-tense verbs have a very strong negative association with being preceded by "be" verbs and modals.

An alternative matrix factorization method to SGNS's stochastic gradient training is truncated Singular Value Decomposition (SVD), a basic algorithm from linear algebra which is used to achieve the optimal rank $d$ factorization with respect to $L_2$ loss [12]. SVD factorizes $M$ into the product of three matrices $U \cdot \Sigma \cdot V^{\top}$, where $U$ and $V$ are orthonormal and $\Sigma$ is a diagonal matrix of singular values. Let $\Sigma_d$ be the diagonal matrix formed from the top $d$ singular values, and let $U_d$ and $V_d$ be the matrices produced by selecting the corresponding columns from $U$ and $V$. The matrix $M_d = U_d \cdot \Sigma_d \cdot V_d^{\top}$ is the matrix of rank $d$ that best approximates the original matrix $M$, in the sense that it minimizes the approximation errors. That is, $M_d = \arg\min_{\mathrm{Rank}(M') = d} \|M' - M\|_2$.

When using SVD, the dot-products between the rows of $W = U_d \cdot \Sigma_d$ are equal to the dot-products between the rows of $M_d$. In the context of word-context matrices, the dense, $d$-dimensional rows of $W$ are perfect substitutes for the very high-dimensional rows of $M_d$. Indeed, another common approach in the NLP literature is factorizing the PPMI matrix $M^{\mathrm{PPMI}}$ with SVD, and then taking the rows of $W^{\mathrm{SVD}} = U_d \cdot \Sigma_d$ and $C^{\mathrm{SVD}} = V_d$ as word and context representations, respectively. However, using the rows of $W^{\mathrm{SVD}}$ as word representations consistently under-performs the $W^{\mathrm{SGNS}}$ embeddings derived from SGNS when evaluated on semantic tasks.

Symmetric SVD We note that in the SVD-based factorization, the resulting word and context matrices have very different properties. In particular, the context matrix $C^{\mathrm{SVD}}$ is orthonormal while the word matrix $W^{\mathrm{SVD}}$ is not. On the other hand, the factorization achieved by SGNS's training procedure is much more "symmetric", in the sense that neither $W^{\mathrm{W2V}}$ nor $C^{\mathrm{W2V}}$ is orthonormal, and no particular bias is given to either of the matrices in the training objective. We therefore propose achieving similar symmetry with the following factorization:

$$W^{\mathrm{SVD}_{1/2}} = U_d \cdot \sqrt{\Sigma_d} \qquad C^{\mathrm{SVD}_{1/2}} = V_d \cdot \sqrt{\Sigma_d} \qquad (13)$$

While it is not theoretically clear why the symmetric approach is better for semantic tasks, it does work much better empirically.[3]

SVD versus SGNS The spectral algorithm has two computational advantages over stochastic gradient training. First, it is exact, and does not require learning rates or hyper-parameter tuning. Second, it can be easily trained on count-aggregated data (i.e. $\{(w, c, \#(w,c))\}$ triplets), making it applicable to much larger corpora than SGNS's training procedure, which requires each observation of $(w,c)$ to be presented separately.

On the other hand, the stochastic gradient method has advantages as well: in contrast to SVD, it distinguishes between observed and unobserved events; SVD is known to suffer from unobserved values [17], which are very common in word-context matrices. More importantly, SGNS's objective weighs different $(w,c)$ pairs differently, preferring to assign correct values to frequent $(w,c)$ pairs while allowing more error for infrequent pairs (see Section 3.2). Unfortunately, exact weighted SVD is a hard computational problem [25]. Finally, because SGNS cares only about observed (and sampled) $(w,c)$ pairs, it does not require the underlying matrix to be a sparse one, enabling optimization of dense matrices, such as the exact $\mathrm{PMI} - \log k$ matrix. The same is not feasible when using SVD.
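A minimal sketch of the truncated SVD and the symmetric factorization of equation 13 applied to an SPPMI matrix. This is not the authors' implementation; SciPy's sparse SVD is one assumed choice of solver, and the names follow the text.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def symmetric_svd_embeddings(sppmi, d=100):
    """Return W = U_d * sqrt(Sigma_d) and C = V_d * sqrt(Sigma_d) (equation 13),
    so that W @ C.T approximates the rank-d truncation M_d of the input matrix."""
    M = csr_matrix(sppmi)
    U, S, Vt = svds(M, k=d)           # top-d singular triplets (returned unordered)
    order = np.argsort(-S)            # sort by decreasing singular value
    U, S, Vt = U[:, order], S[order], Vt[order]
    sqrt_S = np.sqrt(S)
    W = U * sqrt_S                    # scales each column i by sqrt(sigma_i)
    C = Vt.T * sqrt_S
    return W, C

# Word similarities are then cosines between rows of W, e.g.
# sim = (W[i] @ W[j]) / (np.linalg.norm(W[i]) * np.linalg.norm(W[j]))
```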
An interesting middle-ground between SGNS and SVD is the use of stochastic matrix factorization (SMF) approaches, common in the collaborative filtering literature [17]. In contrast to SVD, the SMF approaches are not exact, and do require hyper-parameter tuning. On the other hand, they are better than SVD at handling unobserved values, and can integrate importance weighting for examples, much like SGNS's training procedure. However, like SVD and unlike SGNS's procedure, the SMF approaches work over aggregated $(w,c)$ statistics, allowing $(w, c, f(w,c))$ triplets as input, making the optimization objective more direct, and scalable to significantly larger corpora. SMF approaches have additional advantages over both SGNS and SVD, such as regularization, opening the way to a range of possible improvements. We leave the exploration of SMF-based algorithms for word embeddings to future work.

[3] The approach can be generalized to $W^{\mathrm{SVD}_{\alpha}} = U_d \cdot (\Sigma_d)^{\alpha}$, making $\alpha$ a tunable parameter. This observation was previously made by Caron [7] and investigated in [6, 28], showing that different values of $\alpha$ indeed perform better than others for various tasks. In particular, setting $\alpha = 0$ performs well for many tasks. We do not explore tuning the $\alpha$ parameter in this work.
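As a small illustration of footnote 3, the exponent on the singular values can be exposed as a parameter $\alpha$; the sketch below assumes `U` and `S` come from a truncated SVD as in the previous snippet.

```python
import numpy as np

def alpha_weighted_embeddings(U, S, alpha=0.5):
    """W^{SVD_alpha} = U_d * (Sigma_d)^alpha, as in footnote 3.

    alpha = 1   recovers the traditional W = U_d * Sigma_d,
    alpha = 0.5 gives the symmetric variant of equation (13),
    alpha = 0   keeps only the orthonormal basis U_d.
    """
    return U * (S ** alpha)
```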

Method    PMI - log k   SPPMI    SVD d=100   SVD d=500   SVD d=1000   SGNS d=100   SGNS d=500   SGNS d=1000
k = 1     0%            (lost)   26.1%       25.2%       24.2%        31.4%        29.4%        7.40%
k = 5     0%            (lost)   95.8%       95.1%       94.9%        39.3%        36.0%        7.13%
k = 15    0%            (lost)   266%        266%        265%         7.80%        6.37%        5.97%

Table 1: Percentage of deviation from the optimal objective value (lower values are better). See Section 5.1 for details. [The SPPMI column's values were lost in extraction.]

5 Empirical Results

We compare the matrix-based algorithms to SGNS in two aspects. First, we measure how well each algorithm optimizes the objective, and then proceed to evaluate the methods on various linguistic tasks. We find that for some tasks there is a large discrepancy between optimizing the objective and doing well on the linguistic task.

Experimental Setup All models were trained on English Wikipedia, pre-processed by removing non-textual elements, sentence splitting, and tokenization. The corpus contains 77.5 million sentences, spanning 1.5 billion tokens. All models were derived using a window of 2 tokens to each side of the focus word, ignoring words that appeared fewer than 100 times in the corpus, resulting in vocabularies of 189,533 terms for both words and contexts. To train the SGNS models, we used a modified version of word2vec which receives a sequence of pre-extracted word-context pairs [18]. We experimented with three values of $k$ (number of negative samples in SGNS, shift parameter in PMI-based methods): 1, 5, 15. For SVD, we take $W = U_d \cdot \sqrt{\Sigma_d}$ as explained in Section 4.2.

5.1 Optimizing the Objective

Now that we have an analytical solution for the objective, we can measure how well each algorithm optimizes this objective in practice. To do so, we calculated $\ell$, the value of the objective (equation 2) given each word (and context) representation.[5] For sparse matrix representations, we substituted $\vec{w} \cdot \vec{c}$ with the matching cell's value (e.g. for SPPMI, we set $\vec{w} \cdot \vec{c} = \max(\mathrm{PMI}(w,c) - \log k, 0)$). Each algorithm's $\ell$ value was compared to $\ell_{\mathrm{Opt}}$, the objective when setting $\vec{w} \cdot \vec{c} = \mathrm{PMI}(w,c) - \log k$, which was shown to be optimal (Section 3.1). The percentage of deviation from the optimum is defined by $(\ell - \ell_{\mathrm{Opt}}) / \ell_{\mathrm{Opt}}$ and presented in Table 1.

We observe that SPPMI is indeed a near-perfect approximation of the optimal solution, even though it discards a lot of information when considering only positive cells. We also note that for the factorization methods, increasing the dimensionality enables better solutions, as expected. SVD is slightly better than SGNS at optimizing the objective for $d \le 500$ and $k = 1$. However, while SGNS is able to leverage higher dimensions and reduce its error significantly, SVD fails to do so. Furthermore, SVD becomes very erroneous as $k$ increases. We hypothesize that this is a result of the increasing number of zero-cells, which may cause SVD to prefer a factorization that is very close to the zero matrix, since SVD's $L_2$ objective is unweighted, and does not distinguish between observed and unobserved matrix cells.

5.2 Performance of Word Representations on Linguistic Tasks

Linguistic Tasks and Datasets We evaluated the word representations on four datasets, covering word similarity and relational analogy tasks. We used two datasets to evaluate pairwise word similarity: Finkelstein et al.'s WordSim353 [13] and Bruni et al.'s MEN [4]. These datasets contain word pairs together with human-assigned similarity scores. The word vectors are evaluated by ranking the pairs according to their cosine similarities, and measuring the correlation (Spearman's $\rho$) with the human ratings.
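A minimal sketch of the word-similarity evaluation just described: rank pairs by cosine similarity and report Spearman's $\rho$ against the human scores. The data format, variable names, and the use of SciPy's `spearmanr` are illustrative assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def evaluate_wordsim(W, vocab_index, pairs_with_scores):
    """Spearman correlation between model cosines and human ratings.

    W                 : |V_W| x d embedding matrix (rows are word vectors).
    vocab_index       : dict mapping a word to its row index in W.
    pairs_with_scores : iterable of (word1, word2, human_score) tuples,
                        e.g. loaded from WordSim353 or MEN.
    """
    model_scores, human_scores = [], []
    for w1, w2, score in pairs_with_scores:
        if w1 in vocab_index and w2 in vocab_index:      # skip out-of-vocabulary pairs
            model_scores.append(cosine(W[vocab_index[w1]], W[vocab_index[w2]]))
            human_scores.append(score)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```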
The two analogy datasets present questions of the form "$a$ is to $a^*$ as $b$ is to $b^*$", where $b^*$ is hidden and must be guessed from the entire vocabulary. The Syntactic dataset [22] contains 8000 morpho-syntactic analogy questions, such as "good is to best as smart is to smartest".

[5] Since it is computationally expensive to calculate the exact objective, we approximated it. First, instead of enumerating every observed word-context pair in the corpus, we sampled 10 million such pairs, according to their prevalence. Second, instead of calculating the expectation term explicitly (as in equation 4), we sampled a negative example $(w, c_N)$ for each one of the 10 million positive examples, using the contexts' unigram distribution, as done by SGNS's optimization procedure (explained in Section 2).
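A sketch of the sampling approximation described in footnote 5. The pair sample, the single negative per positive, and the factor of $k$ applied to the negative term are assumptions layered on the footnote; the `dot_product` callback stands in for whichever representation (SGNS, SVD, or an SPPMI cell) is being evaluated.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def approximate_objective(dot_product, pair_ids, pair_counts, ctx_probs,
                          k=5, n_samples=1_000_000):
    """Monte-Carlo estimate of the per-pair average of the objective in equation 2.

    dot_product(w, c) : value used for w.c (e.g. W[w] @ C[c], or an SPPMI cell).
    pair_ids          : array of observed (w, c) id pairs, shape (n_pairs, 2).
    pair_counts       : #(w, c) for each pair, used as sampling weights ("prevalence").
    ctx_probs         : unigram distribution P_D(c) over context ids.
    """
    weights = pair_counts / pair_counts.sum()
    idx = rng.choice(len(pair_ids), size=n_samples, p=weights)   # positives by prevalence
    total = 0.0
    for w, c in pair_ids[idx]:
        c_neg = rng.choice(len(ctx_probs), p=ctx_probs)          # one sampled negative
        total += np.log(sigmoid(dot_product(w, c)))
        total += k * np.log(sigmoid(-dot_product(w, c_neg)))     # one-sample estimate of k*E[.]
    return total / n_samples
```

Because the deviation in Table 1 is a ratio of two such estimates ($\ell$ against $\ell_{\mathrm{Opt}}$), the normalization by the number of samples cancels out.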

[Table 2: A comparison of word representations on various linguistic tasks. The different representations were created by three algorithms (SPPMI, SVD, SGNS) with $d = 1000$ and different values of $k$. Each column ranks nine configurations by WordSim353 correlation [13], MEN correlation [4], Mixed-analogies accuracy [20], and Syntactic-analogies accuracy [22], respectively; the numeric scores did not survive extraction. The top-ranked configurations are SVD (k=5) on WS353, SVD (k=1) on MEN, SPPMI (k=1) on Mixed analogies, and SGNS (k=15) on Syntactic analogies.]

The Mixed dataset [20] contains analogy questions, about half of the same kind as in Syntactic, and another half of a more semantic nature, such as capital cities ("Paris is to France as Tokyo is to Japan"). After filtering questions involving out-of-vocabulary words, i.e. words that appeared in English Wikipedia fewer than 100 times, we remain with 7118 instances in Syntactic (the Mixed dataset was filtered in the same way). The analogy questions are answered using Levy and Goldberg's similarity multiplication method [19], which is state-of-the-art in analogy recovery:

$$\arg\max_{b^* \in V_W \setminus \{a^*, b, a\}} \frac{\cos(b^*, a^*) \cdot \cos(b^*, b)}{\cos(b^*, a) + \varepsilon}$$

The evaluation metric for the analogy questions is the percentage of questions for which the argmax result was the correct answer ($b^*$).

Results Table 2 shows the experiments' results. On the word similarity task, SPPMI yields better results than SGNS, and SVD improves even more. However, the difference between the top PMI-based method and the top SGNS configuration in each dataset is small, and it is reasonable to say that they perform on par. It is also evident that different values of $k$ have a significant effect on all methods: SGNS generally works better with higher values of $k$, whereas SPPMI and SVD prefer lower values of $k$. This may be due to the fact that only positive values are retained, and high values of $k$ may cause too much loss of information. A similar observation was made for SGNS and SVD when observing how well they optimized the objective (Section 5.1). Nevertheless, tuning $k$ can significantly increase the performance of SPPMI over the traditional PPMI configuration ($k = 1$).

The analogies task shows different behavior. First, SVD does not perform as well as SGNS and SPPMI. More interestingly, in the syntactic analogies dataset, SGNS significantly outperforms the rest. This trend is even more pronounced when using the additive analogy recovery method [22] (not shown). Linguistically speaking, the syntactic analogies dataset is quite different from the rest, since it relies more on contextual information from common words such as determiners ("the", "each", "many") and auxiliary verbs ("will", "had") to solve correctly. We conjecture that SGNS performs better on this task because its training procedure gives more influence to frequent pairs, as opposed to SVD's objective, which gives the same weight to all matrix cells (see Section 3.2).
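A minimal sketch of the similarity-multiplication (3CosMul) recovery rule used above. The row-normalized embedding matrix, the shift of cosines into [0, 1], and the value of `eps` are implementation assumptions rather than details given in the text.

```python
import numpy as np

def answer_analogy(W_norm, vocab_index, a, a_star, b, eps=0.001):
    """Return the row index b* maximizing cos(b*, a*) * cos(b*, b) / (cos(b*, a) + eps).

    W_norm      : embedding matrix with L2-normalized rows, so W_norm @ v gives
                  cosine similarities against every word at once.
    vocab_index : dict mapping words to row indices.
    Cosines are mapped from [-1, 1] to [0, 1] so the product stays positive.
    """
    ia, ias, ib = vocab_index[a], vocab_index[a_star], vocab_index[b]
    cos_a  = (W_norm @ W_norm[ia]  + 1.0) / 2.0
    cos_as = (W_norm @ W_norm[ias] + 1.0) / 2.0
    cos_b  = (W_norm @ W_norm[ib]  + 1.0) / 2.0
    scores = cos_as * cos_b / (cos_a + eps)
    scores[[ia, ias, ib]] = -np.inf            # exclude the three question words
    return int(np.argmax(scores))
```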
6 Conclusion

We analyzed the SGNS word embedding algorithm, and showed that it is implicitly factorizing the (shifted) word-context PMI matrix $M^{\mathrm{PMI}} - \log k$ using per-observation stochastic gradient updates. We presented SPPMI, a modification of PPMI inspired by our theoretical findings. Indeed, using SPPMI can improve upon the traditional PPMI matrix. Though SPPMI provides a far better solution to SGNS's objective, it does not necessarily perform better than SGNS on linguistic tasks, as evident with syntactic analogies. We suspect that this may be related to SGNS down-weighting rare words, which PMI-based methods are known to exaggerate.

We also experimented with an alternative matrix factorization method, SVD. Although SVD was relatively poor at optimizing SGNS's objective, it performed slightly better than the other methods on word similarity datasets. However, SVD underperforms on the word-analogy task. One of the main differences between SVD and SGNS is that SGNS performs weighted matrix factorization, which may be giving it an edge in the analogy task. As future work we suggest investigating weighted matrix factorizations of word-context matrices with PMI-based association metrics.

Acknowledgements This work was partially supported by the EC-funded project EXCITEMENT (FP7 ICT). We thank Ido Dagan and Peter Turney for their valuable insights.

References

[1] Marco Baroni, Georgiana Dinu, and Germán Kruszewski. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In ACL.
[2] Marco Baroni and Alessandro Lenci. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4).
[3] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3.
[4] Elia Bruni, Gemma Boleda, Marco Baroni, and Nam Khanh Tran. Distributional semantics in technicolor. In ACL.
[5] John A. Bullinaria and Joseph P. Levy. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39(3).
[6] John A. Bullinaria and Joseph P. Levy. Extracting semantic representations from word co-occurrence statistics: Stop-lists, stemming, and SVD. Behavior Research Methods, 44(3).
[7] John Caron. Experiments with LSA scoring: Optimal rank and basis. In Proceedings of the SIAM Computational Information Retrieval Workshop.
[8] Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22-29.
[9] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML.
[10] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. The Journal of Machine Learning Research.
[11] Ido Dagan, Fernando Pereira, and Lillian Lee. Similarity-based estimation of word cooccurrence probabilities. In ACL.
[12] C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1.
[13] Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. Placing search in context: The concept revisited. ACM TOIS.
[14] Yoav Goldberg and Omer Levy. word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint.
[15] Zellig Harris. Distributional structure. Word, 10(2-3).
[16] Douwe Kiela and Stephen Clark. A systematic study of semantic vector space model parameters. In Workshop on Continuous Vector Space Models and their Compositionality.
[17] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer.
[18] Omer Levy and Yoav Goldberg. Dependency-based word embeddings. In ACL.
[19] Omer Levy and Yoav Goldberg. Linguistic regularities in sparse and explicit word representations. In CoNLL.
[20] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR.
[21] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In NIPS.
[22] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In NAACL.
[23] Andriy Mnih and Geoffrey E. Hinton. A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems.
[24] Andriy Mnih and Koray Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In NIPS.
[25] Nathan Srebro and Tommi Jaakkola. Weighted low-rank approximations. In ICML.
[26] Joseph Turian, Lev Ratinov, and Yoshua Bengio. Word representations: A simple and general method for semi-supervised learning. In ACL.
[27] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In ECML.
[28] Peter D. Turney. Domain and function: A dual-space model of semantic relations and compositions. Journal of Artificial Intelligence Research, 44.
[29] Peter D. Turney and Patrick Pantel. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research.


More information

The Data Mining Process

The Data Mining Process Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data

More information

Marketing Mix Modelling and Big Data P. M Cain

Marketing Mix Modelling and Big Data P. M Cain 1) Introduction Marketing Mix Modelling and Big Data P. M Cain Big data is generally defined in terms of the volume and variety of structured and unstructured information. Whereas structured data is stored

More information

P164 Tomographic Velocity Model Building Using Iterative Eigendecomposition

P164 Tomographic Velocity Model Building Using Iterative Eigendecomposition P164 Tomographic Velocity Model Building Using Iterative Eigendecomposition K. Osypov* (WesternGeco), D. Nichols (WesternGeco), M. Woodward (WesternGeco) & C.E. Yarman (WesternGeco) SUMMARY Tomographic

More information

Diagnosis Code Assignment Support Using Random Indexing of Patient Records A Qualitative Feasibility Study

Diagnosis Code Assignment Support Using Random Indexing of Patient Records A Qualitative Feasibility Study Diagnosis Code Assignment Support Using Random Indexing of Patient Records A Qualitative Feasibility Study Aron Henriksson 1, Martin Hassel 1, and Maria Kvist 1,2 1 Department of Computer and System Sciences

More information

Course: Model, Learning, and Inference: Lecture 5

Course: Model, Learning, and Inference: Lecture 5 Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 yuille@stat.ucla.edu Abstract Probability distributions on structured representation.

More information

CS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing

CS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing CS Master Level Courses and Areas The graduate courses offered may change over time, in response to new developments in computer science and the interests of faculty and students; the list of graduate

More information

Large-Scale Similarity and Distance Metric Learning

Large-Scale Similarity and Distance Metric Learning Large-Scale Similarity and Distance Metric Learning Aurélien Bellet Télécom ParisTech Joint work with K. Liu, Y. Shi and F. Sha (USC), S. Clémençon and I. Colin (Télécom ParisTech) Séminaire Criteo March

More information

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #18: Dimensionality Reduc7on

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #18: Dimensionality Reduc7on CS 5614: (Big) Data Management Systems B. Aditya Prakash Lecture #18: Dimensionality Reduc7on Dimensionality Reduc=on Assump=on: Data lies on or near a low d- dimensional subspace Axes of this subspace

More information

Bayesian Predictive Profiles with Applications to Retail Transaction Data

Bayesian Predictive Profiles with Applications to Retail Transaction Data Bayesian Predictive Profiles with Applications to Retail Transaction Data Igor V. Cadez Information and Computer Science University of California Irvine, CA 92697-3425, U.S.A. icadez@ics.uci.edu Padhraic

More information

GLM, insurance pricing & big data: paying attention to convergence issues.

GLM, insurance pricing & big data: paying attention to convergence issues. GLM, insurance pricing & big data: paying attention to convergence issues. Michaël NOACK - michael.noack@addactis.com Senior consultant & Manager of ADDACTIS Pricing Copyright 2014 ADDACTIS Worldwide.

More information

Chapter 6. Orthogonality

Chapter 6. Orthogonality 6.3 Orthogonal Matrices 1 Chapter 6. Orthogonality 6.3 Orthogonal Matrices Definition 6.4. An n n matrix A is orthogonal if A T A = I. Note. We will see that the columns of an orthogonal matrix must be

More information

Generating Valid 4 4 Correlation Matrices

Generating Valid 4 4 Correlation Matrices Applied Mathematics E-Notes, 7(2007), 53-59 c ISSN 1607-2510 Available free at mirror sites of http://www.math.nthu.edu.tw/ amen/ Generating Valid 4 4 Correlation Matrices Mark Budden, Paul Hadavas, Lorrie

More information

Methods and Applications for Distance Based ANN Training

Methods and Applications for Distance Based ANN Training Methods and Applications for Distance Based ANN Training Christoph Lassner, Rainer Lienhart Multimedia Computing and Computer Vision Lab Augsburg University, Universitätsstr. 6a, 86159 Augsburg, Germany

More information

Asymmetry and the Cost of Capital

Asymmetry and the Cost of Capital Asymmetry and the Cost of Capital Javier García Sánchez, IAE Business School Lorenzo Preve, IAE Business School Virginia Sarria Allende, IAE Business School Abstract The expected cost of capital is a crucial

More information

Improving Word Representations via Global Context and Multiple Word Prototypes

Improving Word Representations via Global Context and Multiple Word Prototypes Improving Word Representations via Global Context and Multiple Word Prototypes Eric H. Huang, Richard Socher, Christopher D. Manning, Andrew Y. Ng Computer Science Department, Stanford University, Stanford,

More information

Inflation. Chapter 8. 8.1 Money Supply and Demand

Inflation. Chapter 8. 8.1 Money Supply and Demand Chapter 8 Inflation This chapter examines the causes and consequences of inflation. Sections 8.1 and 8.2 relate inflation to money supply and demand. Although the presentation differs somewhat from that

More information

Query Transformations for Result Merging

Query Transformations for Result Merging Query Transformations for Result Merging Shriphani Palakodety School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213, USA spalakod@cs.cmu.edu Jamie Callan School of Computer Science

More information

Distributed Representations of Sentences and Documents

Distributed Representations of Sentences and Documents Quoc Le Tomas Mikolov Google Inc, 1600 Amphitheatre Parkway, Mountain View, CA 94043 QVL@GOOGLE.COM TMIKOLOV@GOOGLE.COM Abstract Many machine learning algorithms require the input to be represented as

More information

CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA

CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA We Can Early Learning Curriculum PreK Grades 8 12 INSIDE ALGEBRA, GRADES 8 12 CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA April 2016 www.voyagersopris.com Mathematical

More information

A Learning Based Method for Super-Resolution of Low Resolution Images

A Learning Based Method for Super-Resolution of Low Resolution Images A Learning Based Method for Super-Resolution of Low Resolution Images Emre Ugur June 1, 2004 emre.ugur@ceng.metu.edu.tr Abstract The main objective of this project is the study of a learning based method

More information

The Role of Size Normalization on the Recognition Rate of Handwritten Numerals

The Role of Size Normalization on the Recognition Rate of Handwritten Numerals The Role of Size Normalization on the Recognition Rate of Handwritten Numerals Chun Lei He, Ping Zhang, Jianxiong Dong, Ching Y. Suen, Tien D. Bui Centre for Pattern Recognition and Machine Intelligence,

More information

Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata

Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata Alessandra Giordani and Alessandro Moschitti Department of Computer Science and Engineering University of Trento Via Sommarive

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION Exploration is a process of discovery. In the database exploration process, an analyst executes a sequence of transformations over a collection of data structures to discover useful

More information

Factor analysis. Angela Montanari

Factor analysis. Angela Montanari Factor analysis Angela Montanari 1 Introduction Factor analysis is a statistical model that allows to explain the correlations between a large number of observed correlated variables through a small number

More information

Taxonomy learning factoring the structure of a taxonomy into a semantic classification decision

Taxonomy learning factoring the structure of a taxonomy into a semantic classification decision Taxonomy learning factoring the structure of a taxonomy into a semantic classification decision Viktor PEKAR Bashkir State University Ufa, Russia, 450000 vpekar@ufanet.ru Steffen STAAB Institute AIFB,

More information

Tail-Dependence an Essential Factor for Correctly Measuring the Benefits of Diversification

Tail-Dependence an Essential Factor for Correctly Measuring the Benefits of Diversification Tail-Dependence an Essential Factor for Correctly Measuring the Benefits of Diversification Presented by Work done with Roland Bürgi and Roger Iles New Views on Extreme Events: Coupled Networks, Dragon

More information

Cover Page. "Assessing the Agreement of Cognitive Space with Information Space" A Research Seed Grant Proposal to the UNC-CH Cognitive Science Program

Cover Page. Assessing the Agreement of Cognitive Space with Information Space A Research Seed Grant Proposal to the UNC-CH Cognitive Science Program Cover Page "Assessing the Agreement of Cognitive Space with Information Space" A Research Seed Grant Proposal to the UNC-CH Cognitive Science Program Submitted by: Dr. Gregory B. Newby Assistant Professor

More information

How To Find Local Affinity Patterns In Big Data

How To Find Local Affinity Patterns In Big Data Detection of local affinity patterns in big data Andrea Marinoni, Paolo Gamba Department of Electronics, University of Pavia, Italy Abstract Mining information in Big Data requires to design a new class

More information

We shall turn our attention to solving linear systems of equations. Ax = b

We shall turn our attention to solving linear systems of equations. Ax = b 59 Linear Algebra We shall turn our attention to solving linear systems of equations Ax = b where A R m n, x R n, and b R m. We already saw examples of methods that required the solution of a linear system

More information