Linear Algebra Methods for Data Mining
Linear Algebra Methods for Data Mining. Saara Hyvönen, Spring 2007, University of Helsinki. Text Mining & Information Retrieval.
What is text mining? Methods for extracting useful information from large (and often unstructured) collections of texts; information retrieval. For example, searching a database of abstracts of scientific papers: given a query with a set of keywords, find all documents that are relevant. Many tools are available (SAS Text Miner, the Statistics Text-mining toolbox, ...).
Collections of texts. There are several standard test collections, e.g. Medline (1033 documents), and somewhat larger ones, e.g. Reuters (21578 documents). Real data sets may be much larger: the number of terms can be on the order of 10^4, and the number of documents larger still.
Example. Searching in helka.linneanet.fi:
Search phrase "computer science engineering": nothing found.
Search phrase "computing science engineering": nothing found.
Search phrase "computing in science": finds "Computing in Science and Engineering".
Straightforward word matching is not good enough! Latent Semantic Indexing (LSI): more than literal matching, concept-based modelling.
Term-document matrix: the vector space IR model. (The slide shows a small term-by-document matrix with rows Term 1-3 and columns Doc 1-4 plus a query column.) The documents and the query are each represented by a vector in R^n (here n = 3). In applications n is large!
From the term-document matrix, find the document vector(s) close to the query. What is "close"? Use some distance measure in R^n. Use the SVD for data compression and retrieval enhancement, and to find topics or concepts.
Latent Semantic Indexing (LSI):
1. Document file preparation / preprocessing: indexing (collecting terms), a stop list (eliminating meaningless words), stemming.
2. Construction of the term-by-document matrix; sparse matrix storage.
3. Query matching: distance measures.
4. Data compression by low-rank approximation: the SVD.
5. Ranking and relevance feedback.
Stop list. Eliminate words that occur in all documents. Example abstract: "We consider the computation of an eigenvalue and corresponding eigenvector of a Hermitian positive definite matrix, assuming that good approximations of the wanted eigenpair are already available, as may be the case in applications such as structural mechanics. We analyze efficient implementations of inexact Rayleigh quotient-type methods, which involve the approximate solution of a linear system at each iteration by means of the Conjugate Residuals method." A standard stop list: ftp://ftp.cs.cornell.edu/pub/smart/english.stop
Beginning of the Cornell list: a, a's, able, about, above, according, accordingly, across, actually, after, afterwards, again, against, ain't, all, allow, allows, almost, alone, along, already, also, although, always, am, among, amongst, an, and, another, any, anybody, anyhow, anyone, anything, anyway, anyways, anywhere, apart, appear, appreciate, appropriate, are, aren't, around, as, aside, ask, asking, associated, at, available, away, awfully, b, be, became, because, become, becomes, becoming, been, before, beforehand, behind, being, believe, below, beside, besides, best, better, between, beyond, both, brief, but, by, c, c'mon, c's, came, can, ...
Stemming: computable, computation, computing, computational → comput.
Example. Original text: "changes of the nucleid acid and phospholipid levels of the livers in the course of fetal and postnatal developement. we have followed the evolution of dna, rna and pl in the livers of rat foeti removed between the fifteenth and the twenty-first day of gestation and of rats newly-born or at weaning. we can observe the following facts." Stemmed (using the Porter stemmer): "chang of the nucleid acid and phospholipid level of the liver in the cours of fetal and postnat develop. we have follow the evolut of dna, rna and pl in the liver of rat foeti remov between the fifteenth and the twenti-first dai of gestat and of rats newli-born or at wean. we can observ the follow fact."
Preprocessed text (stemmed and with stop words removed): "nucleid acid phospholipid level liver fetal postnat develop evolut dna rna pl liver rat foeti remov fifteenth twenti dai gestat rat newli born wean observ fact". Note: sometimes words that appear only once in the whole data set are also removed.
Queries. Queries are preprocessed in the same way. The natural language query "I want to know how to compute singular values of data matrices, especially such that are large and sparse" becomes "comput singular value data matri large sparse". A minimal preprocessing sketch follows.
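To make the preprocessing step concrete, here is a small Python sketch (not from the lecture) that removes stop words and applies Porter stemming. It assumes the nltk package is available, and the tiny stop set below is only a stand-in for the full Cornell list.

```python
# Illustrative preprocessing: stop-word removal and Porter stemming.
# Assumes nltk is installed; STOP_WORDS is a small stand-in for the Cornell list.
from nltk.stem import PorterStemmer

STOP_WORDS = {"a", "and", "are", "at", "how", "i", "in", "know", "of", "or",
              "such", "that", "the", "to", "want", "we"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lower-case the text, strip punctuation, drop stop words, stem the rest."""
    stemmer = PorterStemmer()
    tokens = (w.strip(".,;:!?()&") for w in text.lower().split())
    return [stemmer.stem(w) for w in tokens if w and w not in stop_words]

query = ("I want to know how to compute singular values of data matrices, "
         "especially such that are large and sparse")
print(preprocess(query))
# e.g. ['comput', 'singular', 'valu', 'data', 'matric', 'especi', 'larg', 'spars']
```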
Inverted file structures.
Document file: each document has a number, and all of its terms are identified.
Dictionary: a sorted list of all unique terms.
Inversion list: for each term, pointers to the documents that contain it (the column indices of the non-zeros in the corresponding row of the matrix).
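A small illustrative sketch of an inverted file in Python; the document numbers and terms below are made up for illustration.

```python
# Illustrative inverted file: a sorted dictionary of unique terms and, for each
# term, the list of documents (column indices) in which it occurs.
from collections import defaultdict

docs = {
    1: ["infant", "toddler"],          # hypothetical preprocessed documents
    2: ["baby", "child", "home"],
    3: ["child", "safety", "home"],
}

inversion_list = defaultdict(list)
for doc_id, terms in docs.items():
    for term in sorted(set(terms)):
        inversion_list[term].append(doc_id)

dictionary = sorted(inversion_list)    # the sorted term dictionary
print(dictionary)                      # ['baby', 'child', 'home', ...]
print(inversion_list["child"])         # -> [2, 3]
```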
Example. Terms: T1 baby, T2 child, T3 guide, T4 health, T5 home, T6 infant, T7 proofing, T8 safety, T9 toddler. Documents:
D1: Infant & Toddler First Aid
D2: Babies and Children's Room (for your Home)
D3: Child Safety at Home
D4: Your Baby's Health and Safety: From Infant to Toddler
D5: Baby Proofing Basics
D6: Your Guide to Easy Rust Proofing
D7: Beanie Babies Collector's Guide
Term-by-document matrix. The element Â_ij is the weighted frequency of term i in document j; here Â_ij = 1 if term i occurs in the title of document j and 0 otherwise, giving the 9×7 matrix
Â =
[ 0 1 0 1 1 0 1 ]   (baby)
[ 0 1 1 0 0 0 0 ]   (child)
[ 0 0 0 0 0 1 1 ]   (guide)
[ 0 0 0 1 0 0 0 ]   (health)
[ 0 1 1 0 0 0 0 ]   (home)
[ 1 0 0 1 0 0 0 ]   (infant)
[ 0 0 0 0 1 1 0 ]   (proofing)
[ 0 0 1 1 0 0 0 ]   (safety)
[ 1 0 0 1 0 0 0 ]   (toddler)
Normalization. A is obtained from Â by scaling each column to Euclidean length 1: a_j = â_j / ||â_j||_2.
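As an illustration, the following Python/NumPy sketch builds the 9×7 incidence matrix from the titles above and normalizes its columns; the variable names are mine, not the lecture's.

```python
# Illustrative construction of the term-by-document incidence matrix from the
# titles above, followed by column normalization.
import numpy as np

terms = ["baby", "child", "guide", "health", "home",
         "infant", "proofing", "safety", "toddler"]
doc_terms = [
    {"infant", "toddler"},                              # D1
    {"baby", "child", "home"},                          # D2
    {"child", "safety", "home"},                        # D3
    {"baby", "health", "safety", "infant", "toddler"},  # D4
    {"baby", "proofing"},                               # D5
    {"guide", "proofing"},                              # D6
    {"baby", "guide"},                                  # D7
]

A_hat = np.array([[1.0 if t in d else 0.0 for d in doc_terms] for t in terms])
A = A_hat / np.linalg.norm(A_hat, axis=0)   # every column now has 2-norm 1
```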
Simple query matching. Given a query vector q, find the columns a_j of A for which dist(q, a_j) ≤ tol. Common distance measures:
dist(x, y) = ||x − y||_2 (Euclidean distance),
dist(x, y) = 1 − cos(θ(x, y)) = 1 − x^T y / (||x||_2 ||y||_2),
dist(x, y) = θ(x, y) = arccos( x^T y / (||x||_2 ||y||_2) ), the angle itself.
Query: child home safety. With the term numbering above, q̂ = (0, 1, 0, 0, 1, 0, 0, 1, 0)^T, and q = q̂ / ||q̂||_2 is the normalized query vector.
Cosine similarity: return document j if cos(θ(q, a_j)) = q^T a_j / (||q||_2 ||a_j||_2) ≥ tol. (Note: here ||q||_2 = ||a_j||_2 = 1, so the cosines are simply the entries of q^T A.) In the example, the cosine similarity with each column of A is q^T A ≈ (0, 0.67, 1.00, 0.26, 0, 0, 0). With a cosine threshold of tol = 0.5, documents 2 and 3 would be returned.
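Continuing the earlier sketch (reusing the normalized matrix A), query matching by cosine similarity might look as follows; this is illustrative only, not code from the course.

```python
# Illustrative query matching: cosine of the query against every column of A.
import numpy as np

def cosine_scores(A, q):
    """cos(theta) between q and each column of A."""
    return (q @ A) / (np.linalg.norm(q) * np.linalg.norm(A, axis=0))

q_hat = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0], dtype=float)  # child, home, safety
scores = cosine_scores(A, q_hat)
returned = np.where(scores >= 0.5)[0] + 1   # document numbers exceeding tol = 0.5
print(scores.round(2), returned)            # documents 2 and 3 are returned
```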
Weighting schemes. Don't just count occurrences of terms in documents; apply a term weighting scheme, i.e. weight the elements of the term-by-document matrix A depending on characteristics of the document collection. A common scheme is TF-IDF: TF = term frequency, IDF = inverse document frequency.
TF-IDF (TF = term frequency, IDF = inverse document frequency). Define A_ij = f_ij · log(n / n_i), where f_ij is the frequency of term i in document j, n is the number of documents in the collection, and n_i is the number of documents in which term i appears. Here TF = f_ij and IDF = log(n / n_i); other choices are possible.
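A small NumPy sketch of this TF-IDF weighting, applied to a made-up raw frequency matrix F (an illustration, not part of the lecture example).

```python
# Illustrative TF-IDF weighting A_ij = f_ij * log(n / n_i) applied to a
# raw term-frequency matrix F (rows = terms, columns = documents).
import numpy as np

def tfidf(F):
    n = F.shape[1]                        # number of documents
    n_i = np.count_nonzero(F, axis=1)     # documents containing each term
    idf = np.log(n / np.maximum(n_i, 1))  # avoid division by zero for unused terms
    return F * idf[:, None]

F = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 0.0],
              [1.0, 1.0, 1.0]])
W = tfidf(F)   # the last term occurs in every document, so its row becomes 0
```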
Sparse matrix storage. Store only the nonzero entries of A. Compressed row storage (CRS) uses three arrays: value (the nonzero entries, row by row), col-ind (the column index of each stored value) and row-ptr (the position in value where each row starts). Compressed column storage is analogous.
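For illustration, scipy.sparse exposes exactly these three arrays for a CSR matrix; the small matrix below is made up.

```python
# Illustrative compressed row storage via scipy.sparse; the attributes .data,
# .indices and .indptr correspond to the value, col-ind and row-ptr arrays.
import numpy as np
from scipy.sparse import csr_matrix

M = np.array([[0.0, 2.0, 0.0, 1.0],
              [3.0, 0.0, 0.0, 0.0],
              [0.0, 4.0, 5.0, 0.0]])
S = csr_matrix(M)
print(S.data)     # value:   [2. 1. 3. 4. 5.]
print(S.indices)  # col-ind: [1 3 0 1 2]
print(S.indptr)   # row-ptr: [0 2 3 5]
```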
Performance evaluation. Return all documents for which the cosine measure x^T y / (||x||_2 ||y||_2) ≥ tol. A low tolerance means many documents are returned: more relevant documents, but also more irrelevant ones. A high tolerance means fewer irrelevant documents, but relevant ones may be missed.
Example. Back to our example case: recall that the cosine similarity with each column of A was q^T A ≈ (0, 0.67, 1.00, 0.26, 0, 0, 0). With tol = 0.8, one document is returned; with tol = 0.5, two documents; with tol = 0.2, three documents.
Example: MEDLINE (the homework data set): 1033 documents, around 5000 terms, 30 queries. Query 21: "language developement in infancy and pre-school age". With tol = 0.4, two documents are returned; with tol = 0.3, 6 documents.
Precision and recall. Precision: P = D_r / D_t. Recall: R = D_r / N_r. Here D_r is the number of relevant documents retrieved, D_t is the total number of documents retrieved, and N_r is the total number of relevant documents in the database. What we want: high precision and high recall.
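A minimal sketch of computing precision and recall for a single query, assuming the retrieved set and the set of documents known to be relevant are given (the document numbers are hypothetical).

```python
# Illustrative precision/recall computation for one query.
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    d_r = len(retrieved & relevant)                     # relevant and retrieved
    p = d_r / len(retrieved) if retrieved else 0.0      # precision
    r = d_r / len(relevant) if relevant else 0.0        # recall
    return p, r

print(precision_recall(retrieved={2, 3, 4}, relevant={2, 3, 6}))  # approx (0.67, 0.67)
```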
Return all documents for which the cosine measure x^T y / (||x||_2 ||y||_2) ≥ tol. High tolerance: high precision, low recall. Low tolerance: low precision, high recall.
(Figure: precision plotted against recall.)
The Singular Value Decomposition. Let A ∈ R^{m×n}, m ≥ n. Then A = U [Σ; 0] V^T with U ∈ R^{m×m}, Σ ∈ R^{n×n}, V ∈ R^{n×n}, where U and V are orthogonal and Σ is diagonal: Σ = diag(σ_1, σ_2, ..., σ_n), σ_1 ≥ σ_2 ≥ ... ≥ σ_r > σ_{r+1} = ... = σ_n = 0, and rank(A) = r. (Analogous for m < n; we assume m ≥ n for simplicity.)
The approximation problem min_{rank(B) ≤ k} ||A − B||_2 has the solution B = A_k = U_k Σ_k V_k^T, where U_k, V_k and Σ_k contain the first k singular vectors and singular values of A.
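A short NumPy sketch of forming the best rank-k approximation and its relative error, continuing with the normalized matrix A from the earlier sketch (illustrative only).

```python
# Illustrative best rank-k approximation via the SVD (Eckart-Young).
import numpy as np

def rank_k_approx(A, k):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A_k = rank_k_approx(A, k=2)
rel_err = np.linalg.norm(A - A_k, 2) / np.linalg.norm(A, 2)
# ||A - A_k||_2 equals sigma_{k+1}, the (k+1)-st singular value of A.
```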
Latent Semantic Indexing. Assumption: there is some underlying latent semantic structure in the data; e.g. "car" and "automobile" occur in similar documents, as do "cows" and "sheep". This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower-dimensional space using the SVD.
Normalize the columns to length 1 (otherwise long documents dominate the first singular values and vectors). Compute the SVD of the term-by-document matrix, A = UΣV^T, and approximate A by A ≈ U_k(Σ_k V_k^T) = U_k D_k. Here U_k is the orthonormal basis we use to approximate all the documents, and column j of D_k = Σ_k V_k^T holds the coordinates of document j in the new basis; U_k D_k is the projection of A onto the subspace spanned by the columns of U_k.
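An illustrative sketch of computing U_k, the document coordinates D_k, and the term coordinates T_k (used a few slides below) with NumPy, again reusing the matrix A from the earlier sketch.

```python
# Illustrative LSI projection: U_k is the basis, D_k = Sigma_k V_k^T holds the
# document coordinates, and T_k = U_k Sigma_k the term coordinates.
import numpy as np

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]                        # 9 x k orthonormal basis
D_k = np.diag(s[:k]) @ Vt[:k, :]      # k x 7 document coordinates
T_k = U_k @ np.diag(s[:k])            # 9 x k term coordinates
```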
Data compression: A ≈ A_k = U_k(Σ_k V_k^T) = U_k D_k. D_k holds the coordinates of all documents in terms of the first k left singular vectors u_1, ..., u_k. In the "child home safety" example with k = 2, D_2 is a 2×7 matrix giving each document a pair of coordinates.
(Figure: the seven documents plotted in the coordinates given by D_2, i.e. in the basis of the first two left singular vectors; documents 1-4 fall in one region of the plane and documents 5-7 in another.)
Similarly, A ≈ A_k = (U_k Σ_k)V_k^T = T_k V_k^T, where T_k = U_k Σ_k holds the coordinates of all terms in terms of the first k right singular vectors v_1, ..., v_k; for k = 2, T_2 is a 9×2 matrix giving each term a pair of coordinates.
(Figure: the nine terms plotted in the coordinates given by T_2; terms 3 and 7 (guide, proofing) lie apart from the remaining terms.)
Error in matrix approximation. Relative error: E = ||A − A_k||_2 / ||A||_2. The example tabulates this error for several values of k. (Note: often the Frobenius norm is used instead of the 2-norm!)
Query matching after data compression. Represent the term-document matrix by A_k = U_k D_k, so that q^T A_k = q^T U_k D_k = (U_k^T q)^T D_k. Project the query: q_k = U_k^T q. Cosines: cos θ_j = q_k^T (D_k e_j) / (||q_k||_2 ||D_k e_j||_2). Query matching is thus performed in the k-dimensional space.
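Continuing the earlier sketches (reusing U_k, D_k and the query vector q_hat), query matching in the compressed representation might be coded as follows; this is an illustrative sketch, not the course's code.

```python
# Illustrative query matching in the compressed k-dimensional representation.
import numpy as np

q_k = U_k.T @ q_hat                                        # projected query
cosines = (q_k @ D_k) / (np.linalg.norm(q_k) * np.linalg.norm(D_k, axis=0))
returned = np.where(cosines >= 0.5)[0] + 1                 # documents with cos >= tol
```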
Example, continued. The cosines for the query and the original data were q^T A ≈ (0, 0.67, 1.00, 0.26, 0, 0, 0); with a cosine threshold of tol = 0.5, documents 2 and 3 are returned. After projection onto the two-dimensional space, the cosines for documents 1-4 all exceed the threshold, so documents 1, 2, 3 and 4 are now considered relevant!
LSI. LSI (SVD compression) enhances retrieval quality, even though the approximation error is still quite high. LSI is able to deal with synonyms (toddler, child, infant) and with polysemy (the same word having different meanings). This holds also for larger document collections, and for surprisingly low rank and high matrix approximation error. But not for all queries: for most, and it is average performance that counts.
How to choose k? A typical term-document matrix has no clear gap in its singular values, so the rank must be chosen by retrieval experiments. Retrieval improves even when the matrix approximation error is high; the 2-norm may not be the right way to measure information content in the term-document matrix.
Updating. When a new document d arrives, compute its coordinates in terms of U_k: d_k = U_k^T d. Note: we then no longer have an SVD of (A d), and recomputing the SVD is expensive!
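A sketch of this folding-in step, appending a hypothetical new document to the running example without recomputing the SVD (reusing terms, U_k and D_k from the sketches above).

```python
# Illustrative folding-in of a hypothetical new document: d_k = U_k^T d.
import numpy as np

d = np.zeros(len(terms))
d[[terms.index("baby"), terms.index("safety")]] = 1.0   # hypothetical new title
d = d / np.linalg.norm(d)                               # normalize like the columns of A
d_k = U_k.T @ d                                         # coordinates in the U_k basis
D_k_new = np.column_stack([D_k, d_k])                   # append to document coordinates
```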
References
[1] Lars Eldén, Numerical Linear Algebra and Data Mining Applications, draft.
[2] D. Hand, H. Mannila and P. Smyth, Principles of Data Mining, The MIT Press.
[3] Soumen Chakrabarti, Mining the Web, Morgan Kaufmann Publishers.
[4]
[5] M. W. Berry and M. Browne, Understanding Search Engines: Mathematical Modeling and Text Retrieval, SIAM, Philadelphia.