Text Analytics (Text Mining)

Size: px

Start display at page:

Download "Text Analytics (Text Mining)"

Buck Newman
10 years ago
Views:

1 CSE 6242 / CX 4242 Apr 3, 2014 Text Analytics (Text Mining) LSI (uses SVD), Visualization Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Le Song

Tech Some lectures are partly based on materials by Professors

2 Singular Value Decomposition (SVD): Motivation Problem #1: " " Text - LSI uses SVD find concepts " " Problem #2: " " Compression / dimensionality reduction

3 SVD - Motivation Problem #1: text - LSI: find concepts

4 SVD - Motivation Customer-product, for recommendation system: bread lettuce tomatos beef chicken vegetarians meat eaters

5 SVD - Motivation problem #2: compress / reduce dimensionality

6 Problem - Specification ~10^6 rows; ~10^3 columns; no updates;" Random access to any cell(s); small error: OK

7 SVD - Motivation

8 SVD - Motivation

9 SVD - Definition (reminder: matrix multiplication) x = 3 x 2 2 x 1

10 SVD - Definition (reminder: matrix multiplication) x = 3 x 2 2 x 1 3 x 1

11 SVD - Definition (reminder: matrix multiplication) x = 3 x 2 2 x 1 3 x 1

12 SVD - Definition (reminder: matrix multiplication) x = 3 x 2 2 x 1 3 x 1

13 SVD - Definition (reminder: matrix multiplication) x =

14 SVD - Definition A [n x m] = U [n x r] Λ [ r x r] (V [m x r] ) T" " A: n x m matrix e.g., n documents, m terms" U: n x r matrix e.g., n documents, r concepts" Λ: r x r diagonal matrix r : rank of the matrix; strength of each concept " V: m x r matrix " e.g., m terms, r concepts

15 SVD - Definition A [n x m] = U [n x r] Λ [ r x r] (V [m x r] ) T m r n r = n x r x r m diagonal entries: concept strengths m terms r concepts n documents m terms n documents r concepts"

16 SVD - Properties THEOREM [Press+92]: always possible to decompose matrix A into A = U Λ V T " U, Λ, V: unique, most of the time" U, V: column orthonormal " i.e., columns are unit vectors, orthogonal to each other" U T U = I" V T V = I (I: identity matrix) Λ: diagonal matrix with non-negative diagonal entires, sorted in decreasing order

17 SVD - Example A = U Λ V T - example: data infṛetrieval brain lung CS = x x MD

18 SVD - Example A = U Λ V T - example: data infṛetrieval brain lung CS-concept MD-concept CS = x x MD

19 SVD - Example A = U Λ V T - example: data infṛetrieval brain lung doc-to-concept similarity matrix CS-concept MD-concept CS = x x MD

20 SVD - Example A = U Λ V T - example: data infṛetrieval brain lung strength of CS-concept CS = x x MD

21 SVD - Example CS A = U Λ V T - example: data infṛetrieval brain lung = CS-concept x term-to-concept similarity matrix x MD

22 SVD - Example CS A = U Λ V T - example: data infṛetrieval brain lung = CS-concept x term-to-concept similarity matrix x MD

23 SVD - Interpretation #1 documents, terms and concepts :! U: document-to-concept similarity matrix! V: term-to-concept sim. matrix! Λ: its diagonal elements: strength of each concept

24 SVD Interpretation #1 documents, terms and concepts :! Q: if A is the document-to-term matrix, what is A T A?! A:! Q: A A T?! A:

25 SVD Interpretation #1 documents, terms and concepts :! Q: if A is the document-to-term matrix, what is A T A?! A: term-to-term ([m x m]) similarity matrix! Q: A A T?! A: document-to-document ([n x n]) similarity matrix

26 SVD properties V are the eigenvectors of the covariance matrix A T A!! U are the eigenvectors of the Gram (inner-product) matrix AA T Thus, SVD is closely related to PCA, and can be numerically more stable. For more info, see: Ian T. Jolliffe, Principal Component Analysis (2 nd ed), Springer, Gilbert Strang, Linear Algebra and Its Applications (4 th ed), Brooks Cole, 2005.

27 SVD - Interpretation #2 best axis to project on! ( best = min sum of squares of projection errors) v1 First Singular Vector min RMS error

28 SVD - Interpretation #2 A = U Λ V T - example: variance ( spread ) on the v1 axis = x x v1

29 SVD - Interpretation #2 A = U Λ V T - example:! U Λ gives the coordinates of the points in the projection axis = x x

30 SVD - Interpretation #2 More details! Q: how exactly is dim. reduction done? = x x

31 SVD - Interpretation #2 More details! Q: how exactly is dim. reduction done?! A: set the smallest singular values to zero: = x x

32 SVD - Interpretation #2 ~ x x

33 SVD - Interpretation #2 ~ x x

34 SVD - Interpretation #2 ~ x x

35 SVD - Interpretation #2 ~

36 SVD - Interpretation #3 finds non-zero blobs in a data matrix = x x

37 SVD - Interpretation #3 finds non-zero blobs in a data matrix = x x

38 SVD - Interpretation #3 finds non-zero blobs in a data matrix =! communities (bi-partite cores, here) Row 1 Row 4 Row 5 Row 7 Col 1 Col 3 Col 4

39 SVD algorithm Numerical Recipes in C (free)

40 SVD - Interpretation #3 Drill: find the SVD, by inspection!! Q: rank =?? =?? x?? x??

41 SVD - Interpretation #3 A: rank = 2 (2 linearly independent rows/ cols) =???? x x????

42 SVD - Interpretation #3 A: rank = 2 (2 linearly independent rows/ cols) = x x orthogonal??

43 SVD - Interpretation #3 column vectors: are orthogonal - but not unit vectors: 1/sqrt(3) 0 = 1/sqrt(3) 0 1/sqrt(3) 0 x x 0 1/sqrt(2) 0 1/sqrt(2)

44 SVD - Interpretation #3 and the singular values are: 1/sqrt(3) 0 = 1/sqrt(3) 0 1/sqrt(3) 0 x x 0 1/sqrt(2) 0 1/sqrt(2)

45 SVD - Interpretation #3 Q: How to check we are correct? 1/sqrt(3) 0 = 1/sqrt(3) 0 1/sqrt(3) 0 x x 0 1/sqrt(2) 0 1/sqrt(2)

46 SVD - Interpretation #3 A: SVD properties:! matrix product should give back matrix A matrix U should be column-orthonormal, i.e., columns should be unit vectors, orthogonal to each other! ditto for matrix V matrix Λ should be diagonal, with non-negative values

47 SVD - Complexity O(n*m*m) or O(n*n*m) (whichever is less)!! Faster version, if just want singular values! or if we want first k singular vectors! or if the matrix is sparse [Berry]!! No need to write your own! Available in most linear algebra packages (LINPACK, matlab, Splus/R, mathematica...)

48 References Berry, Michael: Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition, Academic Press.! Press, W. H., S. A. Teukolsky, et al. (1992). Numerical Recipes in C, Cambridge University Press.

49 Case study - LSI Q1: How to do queries with LSI?! Q2: multi-lingual IR (english query, on spanish text?)

50 Case study - LSI Q1: How to do queries with LSI?! Problem: Eg., find documents with data data inf. retrieval brain lung CS = x x MD

51 Case study - LSI Q1: How to do queries with LSI?! A: map query vectors into concept space how? data inf. retrieval brain lung CS = x x MD

52 Case study - LSI Q1: How to do queries with LSI?! A: map query vectors into concept space how? q= data inf. retrieval brain lung term2 v2 v1 q term1

53 Case study - LSI Q1: How to do queries with LSI?! A: map query vectors into concept space how? q= data inf. retrieval brain lung term2 v2 q A: inner product (cosine similarity) with each concept vector v i v1 term1

54 Case study - LSI Q1: How to do queries with LSI?! A: map query vectors into concept space how? q= data inf. retrieval brain lung term2 v2 A: inner product (cosine similarity) with each concept vector v i v1 q q o v1 term1

55 Case study - LSI compactly, we have:! Eg: q= data inf. retrieval brain lung q V= q concept! CS-concept = term-to-concept similarities

56 Case study - LSI Drill: how would the document ( information, retrieval ) be handled by LSI?

57 Case study - LSI Drill: how would the document ( information, retrieval ) be handled by LSI? A: SAME:! d concept = d V Eg: d= data inf. retrieval brain lung CS-concept = term-to-concept similarities

58 Case study - LSI Observation: document ( information, retrieval ) will be retrieved by query ( data ), although it does not contain data!! d= q= data inf. retrieval brain lung CS-concept

59 Case study - LSI Q1: How to do queries with LSI?! Q2: multi-lingual IR (english query, on spanish text?)

60 Case study - LSI Problem:! given many documents, translated to both languages (eg., English and Spanish)! answer queries across languages

61 Solution: ~ LSI data inf. retrieval brain lung Case study - LSI informacion datos CS MD

62 Switch Gear to Text Visualization What comes up to your mind?! What visualization have you seen before? 62

63 Word/Tag Cloud (still popular?) 63

64 Word Counts (words as bubbles) 64

65 Word Tree 65

66 Phrase Net Visualize pairs of words that satisfy a particular pattern, e.g., X and Y 66

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #18: Dimensionality Reduc7on

CS 5614: (Big) Data Management Systems B. Aditya Prakash Lecture #18: Dimensionality Reduc7on Dimensionality Reduc=on Assump=on: Data lies on or near a low d- dimensional subspace Axes of this subspace