Linear Algebra Methods for Data Mining


1 Linear Algebra Methods for Data Mining
Saara Hyvönen, Spring 2007
Text mining & Information Retrieval

2 What is text mining?
Methods for extracting useful information from large (and often unstructured) collections of texts.
Information retrieval, e.g. search in a database of abstracts of scientific papers: given a query with a set of keywords, find all documents that are relevant.
Many tools are available (SAS Text Miner, the Statistics Text-mining toolbox, ...).

3 Collections of texts
Several test collections exist, e.g. Medline (1033 documents).
Somewhat larger text collections: e.g. Reuters (21578 documents).
Real data sets may be much larger still: the number of terms can be of the order 10^4, and the number of documents far larger.

4 Example
Search in helka.linneanet.fi:

Search phrase                    Result
computer science engineering    Nothing found
computing science engineering   Nothing found
computing in science            Computing in Science and Engineering

Straightforward word matching is not good enough! Latent Semantic Indexing (LSI): more than literal matching; concept-based modelling.

5 Term-document matrix: vector space IR model
The documents and the query are each represented by a vector in R^n, with one row per term (here n = 3: Term 1, Term 2, Term 3) and one column per document (Doc 1, ..., Doc 4) plus a column for the query. In applications n is large!

6 From the term-document matrix:
- Find document vector(s) close to the query. What is close? Use some distance measure in R^n!
- Use the SVD for data compression and retrieval enhancement.
- Find topics or concepts.

7 Latent Semantic Indexing (LSI)
1. Document file preparation/preprocessing: indexing (collecting terms), applying a stop list (eliminating meaningless words), stemming.
2. Construction of the term-by-document matrix; sparse matrix storage.
3. Query matching: distance measures.
4. Data compression by low-rank approximation: SVD.
5. Ranking and relevance feedback.

8 Stop list
Eliminate words that occur in all documents:

We consider the computation of an eigenvalue and corresponding eigenvector of a Hermitian positive definite matrix assuming that good approximations of the wanted eigenpair are already available, as may be the case in applications such as structural mechanics. We analyze efficient implementations of inexact Rayleigh quotient-type methods, which involve the approximate solution of a linear system at each iteration by means of the Conjugate Residuals method.

ftp://ftp.cs.cornell.edu/pub/smart/english.stop

9 Beginning of the Cornell list:
a a's able about above according accordingly across actually after afterwards again against ain't all allow allows almost alone along already also although always am among amongst an and another any anybody anyhow anyone anything anyway anyways anywhere apart appear appreciate appropriate are aren't around as aside ask asking associated at available away awfully b be became because become becomes becoming been before beforehand behind being believe below beside besides best better between beyond both brief but by c c'mon c's came can ...

10 Stemming
computable, computation, computing, computational → comput

11 Example
Original text:

changes of the nucleid acid and phospholipid levels of the livers in the course of fetal and postnatal developement. we have followed the evolution of dna, rna and pl in the livers of rat foeti removed between the fifteenth and the twenty-first day of gestation and of rats newly-born or at weaning. we can observe the following facts.

Stemmed (using the Porter stemmer):

chang of the nucleid acid and phospholipid level of the liver in the cours of fetal and postnat develop. we have follow the evolut of dna, rna and pl in the liver of rat foeti remov between the fifteenth and the twenti-first dai of gestat and of rats newli-born or at wean. we can observ the follow fact.

12 Preprocessed text (stemmed and with stop words removed):

nucleid acid phospholipid level liver fetal postnat develop evolut dna rna pl liver rat foeti remov fifteenth twenti dai gestat rat newli born wean observ fact

Note: sometimes words that appear only once in the whole data set are also removed.

13 Queries
Queries are preprocessed too. The natural language query

I want to know how to compute singular values of data matrices, especially such that are large and sparse

becomes

comput singular value data matri large sparse
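This preprocessing step is easy to sketch in code. The following is a minimal illustration assuming NLTK is installed (for its Porter stemmer); the tiny inline stop list is only a stand-in for the full Cornell SMART list referenced above.

```python
# Minimal preprocessing sketch: lowercase, tokenize, drop stop words, stem.
# The stop list here is a small illustrative stand-in for english.stop.
import re
from nltk.stem import PorterStemmer

STOP = {"i", "want", "to", "know", "how", "of", "especially",
        "such", "that", "are", "and", "the"}

def preprocess(text):
    stemmer = PorterStemmer()
    tokens = re.findall(r"[a-z]+", text.lower())   # crude tokenizer
    return [stemmer.stem(t) for t in tokens if t not in STOP]

print(" ".join(preprocess(
    "I want to know how to compute singular values of data matrices, "
    "especially such that are large and sparse")))
# prints roughly: comput singular valu data matric larg spars
# (the exact stems depend on the stemmer variant; cf. the slide above)
```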

14 Inverted file structures
Document file: each document has a number, and all terms are identified.
Dictionary: sorted list of all unique terms.
Inversion list: pointers from a term to the documents that contain that term (the column indices of the nonzeros in a row of the matrix).
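As a sketch, an inversion list can be built with a dictionary mapping terms to document numbers; the document contents below are hypothetical, chosen in the spirit of the example on the next slide.

```python
from collections import defaultdict

# Hypothetical preprocessed documents (document number -> term list).
docs = {
    1: ["infant", "toddler", "aid"],
    2: ["baby", "child", "room", "home"],
    3: ["child", "safety", "home"],
}

# Dictionary + inversion list: each unique term points to the documents
# containing it (= column indices of nonzeros in that term's matrix row).
inverted = defaultdict(set)
for doc_id, terms in docs.items():
    for term in terms:
        inverted[term].add(doc_id)

print(sorted(inverted))            # the dictionary (sorted unique terms)
print(sorted(inverted["home"]))    # inversion list for "home": [2, 3]
```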

15 Example
Terms:
T1 Baby
T2 Child
T3 Guide
T4 Health
T5 Home
T6 Infant
T7 Proofing
T8 Safety
T9 Toddler

Documents:
D1: Infant & Toddler First Aid
D2: Babies and Children's Room (for your Home)
D3: Child Safety at Home
D4: Your Baby's Health and Safety: From Infant to Toddler
D5: Baby Proofing Basics
D6: Your Guide to Easy Rust Proofing
D7: Beanie Babies Collector's Guide

16 Term-by-Document Matrix
The element Â_ij is the weighted frequency of term i in document j. Here each term occurs at most once per document title, so

     D1 D2 D3 D4 D5 D6 D7
Â = ( 0  1  0  1  1  0  1 )   T1 Baby
    ( 0  1  1  0  0  0  0 )   T2 Child
    ( 0  0  0  0  0  1  1 )   T3 Guide
    ( 0  0  0  1  0  0  0 )   T4 Health
    ( 0  1  1  0  0  0  0 )   T5 Home
    ( 1  0  0  1  0  0  0 )   T6 Infant
    ( 0  0  0  0  1  1  0 )   T7 Proofing
    ( 0  0  1  1  0  0  0 )   T8 Safety
    ( 1  0  0  1  0  0  0 )   T9 Toddler

17 Normalization
A is obtained from Â by scaling each column to Euclidean length 1, i.e. dividing column j by its norm (here √2, √3, √3, √5, √2, √2, √2 for D1, ..., D7). Each column of A has Euclidean length 1.
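A minimal NumPy sketch, building Â from the example above and normalizing its columns (this matrix A is reused in later sketches):

```python
# Build the term-by-document matrix of the example and normalize each
# column (document vector) to Euclidean length 1, so that long documents
# do not dominate the similarity scores.
import numpy as np

A_hat = np.array([
    [0, 1, 0, 1, 1, 0, 1],   # T1 baby
    [0, 1, 1, 0, 0, 0, 0],   # T2 child
    [0, 0, 0, 0, 0, 1, 1],   # T3 guide
    [0, 0, 0, 1, 0, 0, 0],   # T4 health
    [0, 1, 1, 0, 0, 0, 0],   # T5 home
    [1, 0, 0, 1, 0, 0, 0],   # T6 infant
    [0, 0, 0, 0, 1, 1, 0],   # T7 proofing
    [0, 0, 1, 1, 0, 0, 0],   # T8 safety
    [1, 0, 0, 1, 0, 0, 0],   # T9 toddler
], dtype=float)

A = A_hat / np.linalg.norm(A_hat, axis=0)   # every column now has length 1
```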

18 Simple Query Matching
Given a query vector q, find the columns a_j of A which have dist(q, a_j) ≤ tol. Common distance measures:

dist(x, y) = ||x − y||_2                                  (Euclidean distance)
dist(x, y) = 1 − cos(θ(x, y)) = 1 − x^T y / (||x||_2 ||y||_2)   (cosine distance)
dist(x, y) = θ(x, y) = arccos( x^T y / (||x||_2 ||y||_2) )      (angle)

19 Query: child home safety

q̂ = (0, 1, 0, 0, 1, 0, 0, 1, 0)^T,   q = q̂ / ||q̂||_2 = q̂ / √3.

20 Cosine similarity:

cos(θ(x, y)) = x^T y / (||x||_2 ||y||_2) ≥ tol.   (Note: here ||x||_2 = ||y||_2 = 1.)

In the example, the cosine similarity with each column of A is

q^T A = (0, 0.6667, 1.0000, 0.2582, 0, 0, 0).

With a cosine threshold of tol = 0.5, documents 2 and 3 would be returned.
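Continuing the NumPy sketch (with A from the normalization snippet above), the matching reduces to a single matrix-vector product, since all columns of A, and q itself, already have unit length:

```python
# Cosine query matching for "child home safety" (terms T2, T5, T8),
# reusing the normalized matrix A from the sketch above.
import numpy as np

q = np.zeros(9)
q[[1, 4, 7]] = 1.0            # child, home, safety
q /= np.linalg.norm(q)

cosines = q @ A               # = cos(theta), as all vectors have length 1
print(cosines.round(4))       # [0. 0.6667 1. 0.2582 0. 0. 0.]
tol = 0.5
print("returned:", np.where(cosines >= tol)[0] + 1)   # documents 2 and 3
```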

21 Example (continued)
Terms T1–T9 and documents D1–D7 as on slide 15; query: child home safety.

22 Weighting schemes
Don't just count occurrences of terms in documents; apply a term weighting scheme: the elements of the term-by-document matrix A are weighted depending on characteristics of the document collection. A common such scheme is TFIDF: TF = term frequency, IDF = inverse document frequency.

23 TFIDF
TF = term frequency, IDF = inverse document frequency. Define

A_ij = f_ij log(n / n_i),

where f_ij is the frequency of term i in document j, n is the number of documents in the collection, and n_i is the number of documents in which term i appears. Here TF = f_ij and IDF = log(n/n_i); other choices are possible.
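A minimal sketch of this weighting in NumPy, with a small illustrative term-frequency matrix f (not part of the lecture's example):

```python
# TF-IDF weighting as defined above: A[i, j] = f_ij * log(n / n_i).
import numpy as np

f = np.array([[2, 0, 1],      # term 0
              [1, 1, 1],      # term 1 (occurs in every document)
              [0, 3, 0]])     # term 2
n = f.shape[1]                          # number of documents
n_i = np.count_nonzero(f, axis=1)       # documents containing each term
A = f * np.log(n / n_i)[:, None]
print(A.round(3))
# term 1 gets weight 0 everywhere: it occurs in all documents,
# so log(n / n_i) = log(1) = 0.
```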

24 Sparse matrix storage
Compressed row storage keeps three arrays: value (the nonzero entries, row by row), col-ind (the column index of each nonzero) and row-ptr (the position in value where each row starts). Compressed column storage is analogous.
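SciPy's CSR format implements exactly this scheme; its data, indices and indptr attributes correspond to the value, col-ind and row-ptr arrays:

```python
# Compressed sparse row (CSR) storage with SciPy, on a small example.
import numpy as np
from scipy.sparse import csr_matrix

A = np.array([[1., 0., 2.],
              [0., 0., 3.],
              [4., 5., 0.]])
S = csr_matrix(A)
print(S.data)     # value:   [1. 2. 3. 4. 5.]
print(S.indices)  # col-ind: [0 2 2 0 1]
print(S.indptr)   # row-ptr: [0 2 3 5]
```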

25 Performance evaluation
Return all documents for which the cosine measure satisfies

x^T y / (||x||_2 ||y||_2) ≥ tol.

Low tolerance → lots of documents returned: more relevant documents, but also more irrelevant ones. High tolerance → fewer irrelevant documents, but relevant ones may be missed.

26 Example
Back to our example case. Recall that the cosine similarity with each column of A was

q^T A = (0, 0.6667, 1.0000, 0.2582, 0, 0, 0).

tol = 0.8: one document. tol = 0.5: two documents. tol = 0.2: three documents.

27 Example: MEDLINE
Homework data set: 1033 documents, around 5000 terms, 30 queries.
Query 21: "language developement in infancy and pre-school age."
tol = 0.4: two documents. tol = 0.3: six documents.

28 Precision and Recall
Precision: P = D_r / D_t.   Recall: R = D_r / N_r.
D_r is the number of relevant documents retrieved, D_t is the total number of documents retrieved, and N_r is the total number of relevant documents in the database. What we want: high precision, high recall.
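A short sketch of these two measures for one query, assuming the set of truly relevant documents is known (as it is for the MEDLINE test collection); the document sets below are illustrative:

```python
# Precision P = D_r / D_t and recall R = D_r / N_r for one query.
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    d_r = len(retrieved & relevant)       # relevant documents retrieved
    precision = d_r / len(retrieved)      # D_r / D_t
    recall = d_r / len(relevant)          # D_r / N_r
    return precision, recall

print(precision_recall(retrieved={2, 3, 5}, relevant={2, 3, 4, 7}))
# ≈ (0.667, 0.5): 2 of 3 retrieved are relevant; 2 of 4 relevant are found
```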

29 Return all documents for which the cosine measure satisfies x^T y / (||x||_2 ||y||_2) ≥ tol.
High tolerance: high precision, low recall. Low tolerance: low precision, high recall.

30 Precision vs recall
[Figure: precision plotted against recall; precision falls as recall rises when the tolerance is lowered.]

31 The Singular Value Decomposition
Let A ∈ R^{m×n}, m ≥ n. Then

A = U ( Σ ) V^T,    U ∈ R^{m×m}, Σ ∈ R^{n×n}, V ∈ R^{n×n},
      ( 0 )

where U and V are orthogonal and Σ is diagonal:

Σ = diag(σ_1, σ_2, ..., σ_n),   σ_1 ≥ σ_2 ≥ ... ≥ σ_r > σ_{r+1} = ... = σ_n = 0,

and rank(A) = r. (The case m < n is analogous; we assume m ≥ n for simplicity.)

32 The approximation problem

min_{rank(B) ≤ k} ||A − B||_2

has the solution B = A_k = U_k Σ_k V_k^T, where U_k, V_k and Σ_k are the matrices with the first k singular vectors and values of A.

33 Latent Semantic Indexing
Assumption: there is some underlying latent semantic structure in the data; e.g. car and automobile occur in similar documents, as do cows and sheep. This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower-dimensional space using the SVD.

34
- Normalize the columns to have length 1 (otherwise long documents dominate the first singular values and vectors).
- Compute the SVD of the term-by-document matrix, A = UΣV^T, and approximate A by A ≈ U_k(Σ_k V_k^T) = U_k D_k.
- U_k: an orthogonal basis that we use to approximate all the documents.
- D_k: column j holds the coordinates of document j in the new basis. D_k is the projection of A onto the subspace spanned by U_k.
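Continuing the NumPy sketch with the normalized A from earlier, a minimal computation of U_k and D_k (together with the relative error discussed a few slides below):

```python
# Rank-k approximation of the term-by-document matrix via the SVD (k = 2),
# reusing the normalized matrix A from the earlier sketch.
import numpy as np

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k = U[:, :k]
D_k = np.diag(s[:k]) @ Vt[:k, :]     # coordinates of documents in basis U_k
A_k = U_k @ D_k                      # rank-k approximation of A

rel_err = np.linalg.norm(A - A_k, 2) / np.linalg.norm(A, 2)
print(D_k.round(4))                  # 2 x 7: one column per document
print("relative error:", round(rel_err, 4))
```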

35 Data compression
A ≈ A_k = U_k(Σ_k V_k^T) = U_k D_k.
D_k holds the coordinates of all documents in terms of the k first (left) singular vectors u_1, ..., u_k. For our child home safety example, D_2 is the 2 × 7 matrix whose column j holds the two coordinates of document j.

36 D_k holds the coordinates of all documents in terms of the k first singular vectors, k = 2. Documents plotted in this basis:
[Figure: the seven documents plotted in the coordinates given by D_2; documents 1–4 lie apart from documents 5–7.]

37 A ≈ A_k = (U_k Σ_k) V_k^T = T_k V_k^T.
T_k holds the coordinates of all terms in terms of the k first (right) singular vectors v_1, ..., v_k; T_2 is the corresponding 9 × 2 coordinate matrix of the nine terms.

38 [Figure: the nine terms plotted in the coordinates given by T_2.]

39 Error in matrix approximation
Relative error:

E = ||A − A_k||_2 / ||A||_2.

In the example the relative error remains fairly large even for moderate k. (Note: often the Frobenius norm is used instead of the 2-norm!)

40 Query matching after data compression
Represent the term-document matrix by A_k = U_k D_k. Compute

q^T A_k = q^T U_k D_k = (U_k^T q)^T D_k.

Project the query: q_k = U_k^T q. Cosines:

cos θ_j = q_k^T (D_k e_j) / (||q_k||_2 ||D_k e_j||_2).

Query matching is performed in k-dimensional space!
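Continuing the sketch (with U_k and D_k from the SVD snippet and q from the query-matching snippet), matching in the compressed space looks like this:

```python
# Query matching in the compressed k-dimensional space: project q onto
# the basis U_k and compare it with the document coordinates in D_k.
import numpy as np

q_k = U_k.T @ q                                   # project the query
cosines_k = (q_k @ D_k) / (np.linalg.norm(q_k) *
                           np.linalg.norm(D_k, axis=0))
print(cosines_k.round(4))
print("returned at tol 0.5:", np.where(cosines_k >= 0.5)[0] + 1)
# the slides report that with k = 2 documents 1-4 now pass the threshold
```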

41 Example, continued
Cosines for the query and the original data:

q^T A = (0, 0.6667, 1.0000, 0.2582, 0, 0, 0).

With a cosine threshold of tol = 0.5, documents 2 and 3 are returned. After projection onto the two-dimensional space, documents 1, 2, 3 and 4 are all considered relevant!

42 Example (continued)
Terms T1–T9 and documents D1–D7 as on slide 15; query: child home safety.

43 LSI
LSI (SVD compression) enhances retrieval quality, even when the approximation error is still quite high! LSI is able to deal with synonyms (toddler, child, infant)... and with polysemy (the same word having different meanings). This holds also for larger document collections... and for surprisingly low rank and high matrix approximation error. But not for all queries: only for most, and it is average performance that counts.

44 How to choose k?
A typical term-document matrix has no clear gaps in its n singular values, so the rank must be chosen by retrieval experiments. Retrieval improves even when the matrix approximation error is high: the 2-norm may not be the right way to measure information content in the term-document matrix.

45 Updating
A new document d arrives: compute its coordinates in terms of U_k, d_k = U_k^T d.
Note: we no longer have an SVD of (A d), and recomputing the SVD is expensive!
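A sketch of this folding-in step, continuing with U_k and D_k from the SVD snippet; the new document below is hypothetical:

```python
# Fold a new document into the existing k-dimensional space: d_k = U_k^T d.
import numpy as np

d = np.zeros(9)
d[[0, 3, 5]] = 1.0                 # hypothetical new title: baby, health, infant
d /= np.linalg.norm(d)

d_k = U_k.T @ d                    # coordinates of d in the basis U_k
D_k_new = np.column_stack([D_k, d_k])   # append as a new document column
print(d_k.round(4))
# Note: U_k @ D_k_new is no longer the truncated SVD of (A d);
# the SVD must eventually be recomputed.
```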

46 References
[1] Lars Eldén, Numerical Linear Algebra and Data Mining Applications, draft.
[2] D. Hand, H. Mannila and P. Smyth, Principles of Data Mining, The MIT Press.
[3] Soumen Chakrabarti, Mining the Web, Morgan Kaufmann Publishers.
[4]
[5] M. W. Berry and M. Browne, Understanding Search Engines: Mathematical Modeling and Text Retrieval, SIAM, Philadelphia.
