Linear Algebra Methods for Data Mining




Linear Algebra Methods for Data Mining. Saara Hyvönen, Saara.Hyvonen@cs.helsinki.fi, Spring 2007. Text Mining & Information Retrieval.

What is text mining? Methods for extracting useful information from large (and often unstructured) collections of texts. Information retrieval, e.g. searching a database of abstracts of scientific papers: given a query with a set of keywords, find all documents that are relevant. Lots of tools are available (SAS Text Miner, Statistics Text-mining toolbox, ...).

Collections of texts. Several test collections exist, e.g. Medline (1033 documents), http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/ and somewhat larger text collections, e.g. Reuters (21578 documents), http://kdd.ics.uci.edu/ Real data sets might be much larger: number of terms ~10^4, number of documents ~10^6.

Example: search in helka.linneanet.fi.

Search phrase                    Result
computer science engineering    Nothing found
computing science engineering   Nothing found
computing in science            Computing in Science and Engineering

Straightforward word matching is not good enough! Latent Semantic Indexing (LSI): more than literal matching - concept-based modelling.

Term-document matrix: vector space IR model.

         Doc1  Doc2  Doc3  Doc4  Query
Term1      1     0     1     0      1
Term2      0     0     1     1      1
Term3      0     1     1     0      0

The documents and the query are each represented by a vector in R^n (here n = 3). In applications n is large!

From the term-document matrix, find the document vector(s) close to the query. What is close? Use some distance measure in R^n! Use the SVD for data compression and retrieval enhancement, and to find topics or concepts.

Latent Semantic Indexing (LSI): 1. Document file preparation/preprocessing: indexing (collecting terms), use of a stop list (eliminate meaningless words), stemming. 2. Construction of the term-by-document matrix; sparse matrix storage. 3. Query matching: distance measures. 4. Data compression by low-rank approximation: SVD. 5. Ranking and relevance feedback.

Stop list: eliminate words that occur in all documents. Example text: "We consider the computation of an eigenvalue and corresponding eigenvector of a Hermitian positive definite matrix assuming that good approximations of the wanted eigenpair are already available, as may be the case in applications such as structural mechanics. We analyze efficient implementations of inexact Rayleigh quotient-type methods, which involve the approximate solution of a linear system at each iteration by means of the Conjugate Residuals method." A standard stop list: ftp://ftp.cs.cornell.edu/pub/smart/english.stop

Beginning of the Cornell list: a a's able about above according accordingly across actually after afterwards again against ain't all allow allows almost alone along already also although always am among amongst an and another any anybody anyhow anyone anything anyway anyways anywhere apart appear appreciate appropriate are aren't around as aside ask asking associated at available away awfully b be became because become becomes becoming been before beforehand behind being believe below beside besides best better between beyond both brief but by c c'mon c's came can...
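
In practice, removing stop words is just a filter against this list. A minimal sketch in Python, assuming the Cornell list above has been downloaded to a local file named english.stop:

    # Minimal sketch: filter tokens against a stop list (assumes the file
    # "english.stop" from the ftp address above is available locally).
    with open("english.stop") as f:
        stop_words = {line.strip() for line in f if line.strip()}

    text = "we consider the computation of an eigenvalue and corresponding eigenvector"
    kept = [w for w in text.split() if w not in stop_words]
    print(kept)   # only the content-bearing words remain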

Stemming: computable, computation, computing, computational -> comput. Stemmer resources: http://gatekeeper.dec.com/pub/net/infosys/wais/ir-book-resources/stemmer http://www.tartarus.org/~martin/porterstemmer/
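
A minimal sketch of stemming in Python, assuming the NLTK package (which ships a Porter stemmer implementation) is installed; all four example words should reduce to the same stem:

    # Minimal sketch: Porter stemming via NLTK (assumes nltk is installed).
    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()
    words = ["computable", "computation", "computing", "computational"]
    print([stemmer.stem(w) for w in words])   # expected: ['comput', 'comput', 'comput', 'comput']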

Example: "changes of the nucleid acid and phospholipid levels of the livers in the course of fetal and postnatal developement. we have followed the evolution of dna, rna and pl in the livers of rat foeti removed between the fifteenth and the twenty-first day of gestation and of rats newly-born or at weaning. we can observe the following facts." Stemmed (using the Porter stemmer): "chang of the nucleid acid and phospholipid level of the liver in the cours of fetal and postnat develop. we have follow the evolut of dna, rna and pl in the liver of rat foeti remov between the fifteenth and the twenti-first dai of gestat and of rats newli-born or at wean. we can observ the follow fact."

Preprocessed text (stemmed and with stop words removed): "nucleid acid phospholipid level liver fetal postnat develop evolut dna rna pl liver rat foeti remov fifteenth twenti dai gestat rat newli born wean observ fact" Note: sometimes words that appear only once in the whole data set are also removed.

Queries. Queries are also preprocessed. The natural language query "I want to know how to compute singular values of data matrices, especially such that are large and sparse" becomes "comput singular value data matri large sparse".

Inverted file structures. Document file: each document has a number and all terms in it are identified. Dictionary: sorted list of all unique terms. Inversion list: pointers from a term to the documents that contain that term (the column indices of the non-zeros in a row of the matrix).
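
A minimal sketch of the idea in Python, using a dictionary from each term to the set of documents containing it (the three toy documents below are hypothetical, not the slides' data):

    # Minimal sketch: dictionary + inversion lists for a tiny collection.
    from collections import defaultdict

    docs = {
        1: "comput singular value",
        2: "singular value decomposit",
        3: "text mine comput",
    }

    inverted = defaultdict(set)       # term -> set of document numbers
    for doc_id, text in docs.items():
        for term in text.split():     # terms are assumed already stemmed
            inverted[term].add(doc_id)

    print(sorted(inverted["comput"])) # documents containing "comput": [1, 3]
    print(sorted(inverted))           # the sorted dictionary of unique terms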

Example. Terms: T1 Baby, T2 Child, T3 Guide, T4 Health, T5 Home, T6 Infant, T7 Proofing, T8 Safety, T9 Toddler. Documents: D1: Infant & Toddler First Aid. D2: Babies and Children's Room (for your Home). D3: Child Safety at Home. D4: Your Baby's Health and Safety: From Infant to Toddler. D5: Baby Proofing Basics. D6: Your Guide to Easy Rust Proofing. D7: Beanie Babies Collector's Guide.

Term-by-Document Matrix

Â =   D1  D2  D3  D4  D5  D6  D7
T1     0   1   0   1   1   0   1
T2     0   1   1   0   0   0   0
T3     0   0   0   0   0   1   1
T4     0   0   0   1   0   0   0
T5     0   1   1   0   0   0   0
T6     1   0   0   1   0   0   0
T7     0   0   0   0   1   1   0
T8     0   0   1   1   0   0   0
T9     1   0   0   1   0   0   0

The element Â_ij is the (weighted) frequency of term i in document j.

Normalization:

A =   0      0.577  0      0.447  0.707  0      0.707
      0      0.577  0.577  0      0      0      0
      0      0      0      0      0      0.707  0.707
      0      0      0      0.447  0      0      0
      0      0.577  0.577  0      0      0      0
      0.707  0      0      0.447  0      0      0
      0      0      0      0      0.707  0.707  0
      0      0      0.577  0.447  0      0      0
      0.707  0      0      0.447  0      0      0

Each column has Euclidean length 1.

Simple query matching. Given a query vector q, find the columns a_j of A for which dist(q, a_j) <= tol. Common distance measures: the Euclidean distance dist(x, y) = ||x - y||_2; the cosine distance dist(x, y) = 1 - cos(θ(x, y)) = 1 - x^T y / (||x||_2 ||y||_2); and the angle itself, θ(x, y) = arccos( x^T y / (||x||_2 ||y||_2) ).

Query: child home safety. q̂ = (0, 1, 0, 0, 1, 0, 0, 1, 0)^T, q = (0, 0.577, 0, 0, 0.577, 0, 0, 0.577, 0)^T.

Cosine similarity: cos(θ(x, y)) = x^T y / (||x||_2 ||y||_2) >= tol. (Note: here ||x||_2 = ||y||_2 = 1.) In the example, the cosine similarity with each column of A is q^T A = (0, 0.667, 1.000, 0.258, 0, 0, 0). With a cosine threshold of tol = 0.5, documents 2 and 3 would be returned.
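
A minimal sketch of this computation in Python/NumPy (assuming NumPy is available): build the 9x7 example matrix, normalize its columns, and compute the cosines q^T A for the query child home safety:

    # Minimal sketch: column normalization and cosine query matching for the
    # 9x7 example term-by-document matrix.
    import numpy as np

    A_hat = np.array([
        [0,1,0,1,1,0,1],   # T1 Baby
        [0,1,1,0,0,0,0],   # T2 Child
        [0,0,0,0,0,1,1],   # T3 Guide
        [0,0,0,1,0,0,0],   # T4 Health
        [0,1,1,0,0,0,0],   # T5 Home
        [1,0,0,1,0,0,0],   # T6 Infant
        [0,0,0,0,1,1,0],   # T7 Proofing
        [0,0,1,1,0,0,0],   # T8 Safety
        [1,0,0,1,0,0,0],   # T9 Toddler
    ], dtype=float)
    A = A_hat / np.linalg.norm(A_hat, axis=0)           # unit-length columns

    q_hat = np.array([0,1,0,0,1,0,0,1,0], dtype=float)  # child, home, safety
    q = q_hat / np.linalg.norm(q_hat)

    cosines = q @ A                                      # columns of A have length 1
    print(np.round(cosines, 3))                          # [0. 0.667 1. 0.258 0. 0. 0.]
    print(np.where(cosines >= 0.5)[0] + 1)               # documents 2 and 3 (1-based)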

Terms: T1 Baby, T2 Child, T3 Guide, T4 Health, T5 Home, T6 Infant, T7 Proofing, T8 Safety, T9 Toddler. Documents: D1: Infant & Toddler First Aid. D2: Babies and Children's Room (for your Home). D3: Child Safety at Home. D4: Your Baby's Health and Safety: From Infant to Toddler. D5: Baby Proofing Basics. D6: Your Guide to Easy Rust Proofing. D7: Beanie Babies Collector's Guide. Query: child home safety.

Weighting schemes. Don't just count occurrences of terms in documents but apply a term weighting scheme: the elements of the term-by-document matrix A are weighted depending on the characteristics of the document collection. A common such scheme is TFIDF: TF = Term Frequency, IDF = Inverse Document Frequency.

TFIDF (TF = Term Frequency, IDF = Inverse Document Frequency). Define A_ij = f_ij log(n/n_i), where f_ij is the frequency of term i in document j, n is the number of documents in the collection, and n_i is the number of documents in which term i appears. Here TF = f_ij and IDF = log(n/n_i). Other choices are possible.
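
A minimal sketch of this weighting in NumPy; the raw frequency matrix F below is made up for illustration, not taken from the example collection:

    # Minimal sketch: TFIDF weighting A_ij = f_ij * log(n / n_i).
    import numpy as np

    F = np.array([
        [2.0, 0.0, 1.0],      # f_ij: raw counts of term i in document j
        [1.0, 1.0, 1.0],      # a term occurring in every document gets weight 0
        [0.0, 3.0, 0.0],
    ])

    n = F.shape[1]                          # number of documents
    n_i = np.count_nonzero(F, axis=1)       # number of documents containing term i
    A = F * np.log(n / n_i)[:, np.newaxis]  # broadcast the IDF factor over each row
    print(np.round(A, 3))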

Sparse matrix storage. Example:

A =   0.67  0     0     0.29
      0     0.71  0.41  0.29
      0.33  0     0.41  0.29
      0.67  0     0     0

Compressed row storage:
value    0.67 0.29 0.71 0.41 0.29 0.33 0.41 0.29 0.67
col-ind  1    4    2    3    4    1    3    4    1
row-ptr  1    3    6    9

Compressed column storage: analogous.
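
For comparison, a minimal sketch storing the same matrix in compressed sparse row form with SciPy (assuming SciPy is installed); note that SciPy's arrays are 0-based, whereas the col-ind and row-ptr above are written 1-based:

    # Minimal sketch: the 4x4 example matrix in SciPy's CSR format.
    import numpy as np
    from scipy.sparse import csr_matrix

    A = np.array([
        [0.67, 0.0,  0.0,  0.29],
        [0.0,  0.71, 0.41, 0.29],
        [0.33, 0.0,  0.41, 0.29],
        [0.67, 0.0,  0.0,  0.0 ],
    ])

    S = csr_matrix(A)
    print(S.data)      # value:   [0.67 0.29 0.71 0.41 0.29 0.33 0.41 0.29 0.67]
    print(S.indices)   # col-ind (0-based)
    print(S.indptr)    # row-ptr (0-based), including a final end pointer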

Performance evaluation. Return all documents for which the cosine measure x^T y / (||x||_2 ||y||_2) >= tol. A low tolerance means lots of documents are returned: more relevant documents, but also more irrelevant ones. A high tolerance means fewer irrelevant documents, but relevant ones may be missed.

Example. Back to our example case: recall that the cosine similarity with each column of A was q^T A = (0, 0.667, 1.000, 0.258, 0, 0, 0). tol = 0.8: one document. tol = 0.5: two documents. tol = 0.2: three documents.

Example: MEDLINE (homework data set): 1033 documents, around 5000 terms, 30 queries. Query 21: "language developement in infancy and pre-school age". tol = 0.4: two documents. tol = 0.3: six documents.

Precision and recall. Precision: P = D_r / D_t. Recall: R = D_r / N_r. Here D_r is the number of relevant documents retrieved, D_t is the total number of documents retrieved, and N_r is the total number of relevant documents in the database. What we want: high precision and high recall.
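
A minimal sketch of these two quantities for a single query (the document numbers are hypothetical):

    # Minimal sketch: precision P = D_r / D_t and recall R = D_r / N_r,
    # given the set of retrieved documents and the known set of relevant ones.
    def precision_recall(retrieved, relevant):
        d_r = len(retrieved & relevant)              # relevant documents retrieved
        return d_r / len(retrieved), d_r / len(relevant)

    print(precision_recall({2, 3, 4}, {1, 2, 3}))    # roughly (0.667, 0.667)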

Return all documents for which the cosine measure x^T y / (||x||_2 ||y||_2) >= tol. High tolerance: high precision, low recall. Low tolerance: low precision, high recall.

[Figure: precision vs. recall curve, with precision on the vertical axis and recall on the horizontal axis.]

The Singular Value Decomposition. Let A ∈ R^(m×n), m >= n. Then A = U [Σ; 0] V^T, where [Σ; 0] stacks Σ on top of an (m-n)×n zero block, and U ∈ R^(m×m), Σ ∈ R^(n×n), V ∈ R^(n×n). U and V are orthogonal, and Σ is diagonal: Σ = diag(σ_1, σ_2, ..., σ_n), σ_1 >= σ_2 >= ... >= σ_r > σ_{r+1} = ... = σ_n = 0. Rank(A) = r. (Analogous for m < n; we assume m >= n for simplicity.)

The approximation problem min_{rank(B) <= k} ||A - B||_2 has the solution B = A_k = U_k Σ_k V_k^T, where U_k, V_k, and Σ_k are the matrices containing the first k singular vectors and singular values of A.
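
A minimal numerical sketch of this property (on a random test matrix, not the course example): the 2-norm error of the rank-k truncation equals the (k+1)-th singular value.

    # Minimal sketch: best rank-k approximation via the SVD.
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((9, 7))              # hypothetical test matrix
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    k = 2
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k truncation
    print(np.linalg.norm(A - A_k, 2))            # equals the (k+1)-th singular value
    print(s[k])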

Latent Semantic Indexing. Assumption: there is some underlying latent semantic structure in the data. E.g. car and automobile occur in similar documents, as do cows and sheep. This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower-dimensional space using the SVD.

Normalize the columns to have length 1 (otherwise long documents dominate the first singular values and vectors). Compute the SVD of the term-by-document matrix, A = UΣV^T, and approximate A by A ≈ U_k (Σ_k V_k^T) = U_k D_k. U_k: orthogonal basis that we use to approximate all the documents. D_k: column j holds the coordinates of document j in the new basis. D_k gives the projection of A onto the subspace spanned by U_k, expressed in terms of the basis U_k.
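
In code, both factors fall out of a truncated SVD. A minimal sketch (the helper name lsi_factors is illustrative; A is assumed to be the normalized term-by-document matrix built earlier):

    # Minimal sketch: U_k (basis) and D_k = Sigma_k V_k^T (document coordinates).
    import numpy as np

    def lsi_factors(A, k):
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        U_k = U[:, :k]                     # orthogonal basis used for all documents
        D_k = np.diag(s[:k]) @ Vt[:k, :]   # column j: coordinates of document j
        return U_k, D_k

    # U_2, D_2 = lsi_factors(A, k=2)       # for the example, D_2 is 2 x 7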

Data compression: A ≈ A_k = U_k (Σ_k V_k^T) = U_k D_k. D_k holds the coordinates of all documents in terms of the first k (left) singular vectors u_1, ..., u_k. Take a look at our child home safety example:

D_2 =   0.27  0.71  0.42  0.62  0.74  0.50  0.74
        0.53  0.29  0.54  0.51  0.38  0.64  0.38

D_k holds the coordinates of all documents in terms of the first k singular vectors, here k = 2. Documents plotted in this basis: [Figure: the seven documents plotted in the two-dimensional basis (u_1, u_2); documents 1-4 lie close together, while documents 5-7 form a separate group.]

A ≈ A_k = (U_k Σ_k) V_k^T = T_k V_k^T. T_k holds the coordinates of all terms in terms of the first k (right) singular vectors v_1, ..., v_k:

T_2 =   1.10  0.12
        0.41  0.38
        0.56  0.57
        0.18  0.18
        0.41  0.38
        0.30  0.47
        0.56  0.57
        0.33  0.42
        0.30  0.47

[Figure: the nine terms plotted in the two-dimensional basis; terms 2, 4, 5, 6, 8, 9 lie in the upper half of the plot, terms 3 and 7 lie together in the lower half, and term 1 sits near the horizontal axis at the far left.]

Error in matrix approximation. Relative error: E = ||A - A_k||_2 / ||A||_2. Example:

k      1     2     3     4     5     6     7
error  0.80  0.75  0.50  0.45  0.36  0.12  0

(Note: often the Frobenius norm is used instead of the 2-norm!)

Query matching after data compression. Represent the term-document matrix by A_k = U_k D_k and compute q^T A_k = q^T U_k D_k = (U_k^T q)^T D_k. Project the query: q_k = U_k^T q. Cosines: cos θ_j = q_k^T (D_k e_j) / (||q_k||_2 ||D_k e_j||_2). Query matching is performed in the k-dimensional space!
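
A minimal sketch of the same computation, continuing the lsi_factors sketch above (q is the normalized query vector; the helper name lsi_cosines is illustrative):

    # Minimal sketch: project the query and compute cosines in k dimensions.
    import numpy as np

    def lsi_cosines(U_k, D_k, q):
        q_k = U_k.T @ q                                  # projected query
        return (q_k @ D_k) / (np.linalg.norm(q_k) * np.linalg.norm(D_k, axis=0))

    # For the example, lsi_cosines(U_2, D_2, q) should give roughly
    # [0.979 0.872 1.000 0.976 0.192 0.233 0.192].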

Example, continued. Cosines for the query and the original data: q^T A = (0, 0.667, 1.000, 0.258, 0, 0, 0); with a cosine threshold of tol = 0.5, documents 2 and 3 are returned. Cosines after projection to the two-dimensional space: (0.979, 0.872, 1.000, 0.976, 0.192, 0.233, 0.192). Now documents 1, 2, 3 and 4 are considered relevant!

Terms: T1 Baby, T2 Child, T3 Guide, T4 Health, T5 Home, T6 Infant, T7 Proofing, T8 Safety, T9 Toddler. Documents: D1: Infant & Toddler First Aid. D2: Babies and Children's Room (for your Home). D3: Child Safety at Home. D4: Your Baby's Health and Safety: From Infant to Toddler. D5: Baby Proofing Basics. D6: Your Guide to Easy Rust Proofing. D7: Beanie Babies Collector's Guide. Query: child home safety.

LSI. LSI (SVD compression) enhances retrieval quality, even when the approximation error is still quite high! LSI is able to deal with synonyms (toddler, child, infant) and with polysemy (the same word having different meanings). This holds also for larger document collections, and for surprisingly low rank and high matrix approximation error. But: not for all queries - for most, and average performance is what counts.

How to choose k? A typical term-document matrix has no clear gaps in its singular values, so the rank must be chosen by retrieval experiments. Retrieval improves even when the matrix approximation error is high: the 2-norm may not be the right way to measure information content in the term-document matrix.

Updating. A new document d arrives: compute its coordinates in terms of U_k: d_k = U_k^T d. Note: we then no longer have an SVD of (A d). Recomputing the SVD is expensive!
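
A minimal sketch of this folding-in step (the helper name fold_in is illustrative; d is assumed to be a new document column preprocessed and normalized like the columns of A):

    # Minimal sketch: fold a new document into the existing k-dimensional space.
    # The SVD itself is NOT updated, so the factorization slowly degrades as
    # more documents are folded in this way.
    import numpy as np

    def fold_in(U_k, D_k, d):
        d_k = U_k.T @ d                        # coordinates of d in the U_k basis
        return np.column_stack([D_k, d_k])     # append as a new column of D_k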

References
[1] Lars Eldén, Numerical Linear Algebra and Data Mining Applications, draft.
[2] D. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, The MIT Press, 2001.
[3] Soumen Chakrabarti, Mining the Web, Morgan Kaufmann Publishers, 2003.
[4] http://www.cs.utk.edu/~lsi/
[5] M. W. Berry and M. Browne, Understanding Search Engines: Mathematical Modeling and Text Retrieval, SIAM, Philadelphia, 1999.