Text Analytics. Text Clustering. Ulf Leser

1 Text Analytics Text Clustering Ulf Leser

2 Content of this Lecture: (Text) clustering; Cluster quality; Clustering algorithms; Application. Ulf Leser: Text Analytics, Winter Semester 2010/2011 2

3 Clustering. Clustering groups objects into (usually disjoint) sets. Intuitively, each set should contain objects that are similar to each other and dissimilar to objects in any other set. We need a similarity or distance function; that is the only text-specific bit, the rest is just clustering. Clustering is often called unsupervised learning: we don't know how many sets/classes there are (if there are any), and we don't know what those sets should look like.

4 Nice

5 Nice Not Nice In text clustering, we typically have > dimensions

6 Text Clustering Applications. Explorative data analysis: learn about the structure within your document collection. Corpus preprocessing: clustering provides a semantic index to the corpus; group docs into clusters to ease navigation; retrieval speed: index only one representative per cluster. Processing of search results: cluster all hits into groups of similar hits (in particular: duplicates). Improving search recall: return a doc and all members of its cluster; this is similar to automatic relevance feedback using the top-k docs. Word sense disambiguation: the different senses of a word should appear as clusters.

7 Processing Search Results. "The research breakthrough was labeling the clusters, i.e., grouping search results into folder topics" [Clusty.com blog]

8 Similarity between Documents. Clustering requires a distance function, which must be a metric: $d(x,x)=0$, $d(x,y)=d(y,x)$, $d(x,y) \le d(x,z)+d(z,y)$. In contrast to search, we now compare two docs, not a document and a query. Nevertheless, the same methods are often used: vector space, TF*IDF, cosine distance: $sim(d_1,d_2) = \cos(d_1,d_2) = \frac{d_1 \circ d_2}{|d_1| \cdot |d_2|} = \frac{\sum_i d_1[i] \cdot d_2[i]}{\sqrt{\sum_i d_1[i]^2} \cdot \sqrt{\sum_i d_2[i]^2}}$
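
As an illustration of the cosine measure named on this slide, here is a minimal Python sketch. It assumes documents are already available as sparse term-weight dictionaries (for example TF*IDF weights); the function names, the example documents, and the convention of returning 0 for an empty document are assumptions made for this sketch, not part of the lecture.

```python
import math

def cosine_similarity(d1, d2):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(w * d2[t] for t, w in d1.items() if t in d2)
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm1 * norm2)

def cosine_distance(d1, d2):
    """Turn the similarity into a dissimilarity in [0, 1] for non-negative weights."""
    return 1.0 - cosine_similarity(d1, d2)

# Hypothetical term weights for two short documents.
doc_a = {"cluster": 0.5, "text": 1.2}
doc_b = {"cluster": 0.5, "gene": 0.8}
print(cosine_similarity(doc_a, doc_b))
```

Storing only the non-null values keeps the cost of one comparison proportional to the number of terms actually present in the documents.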

9 Clustering Speed. In information retrieval, we compare a vector with ~3 non-null values (the query) with one with ~500 non-null values; inverted files are used to pick docs that overlap with the query (~3 log-scale searches in main memory). In clustering, we compare a vector with ~500 non-null values with another one with ~500 non-null values, and we need to compare all docs with all docs (for some clustering methods; in any case many docs with many docs); inverted files offer much less speed-up compared to scanning. To increase speed, feature selection is essential: e.g., use only the most descriptive terms, or perform LSI on the docs before clustering.

10 Cluster Labels. To support user interaction, clusters need to have a name. Names should capture the topic (semantics) of the cluster. Many possibilities: Choose the term with the highest TF*IDF value in the cluster, e.g., with TF computed as an average or by considering all docs in the cluster as one. Choose the term with the highest TF*IDF value in the cluster centre. Apply a statistical test to find terms whose TF*IDF distribution deviates the most between clusters, for instance by testing each cluster against all others; this requires comparing each cluster with each other cluster for each term and is only possible after strict pre-filtering. Report the top-k terms (by whatever method). Extend to bi-grams.
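
The first labeling option (highest TF*IDF in the cluster, with all docs of a cluster treated as one) could be sketched as follows. This is only one possible reading: the IDF formula `log(n / df)`, the alphabetical tie-breaking, and the toy documents are assumptions of this sketch.

```python
import math
from collections import Counter

def label_clusters(clusters, top_k=1):
    """clusters: list of clusters, each a list of tokenized docs (lists of terms).
    Returns, per cluster, the top_k terms by TF*IDF."""
    all_docs = [doc for cluster in clusters for doc in cluster]
    n = len(all_docs)
    # Document frequency over the whole corpus.
    df = Counter(term for doc in all_docs for term in set(doc))
    labels = []
    for cluster in clusters:
        # TF: treat all docs of the cluster as one big document.
        tf = Counter(term for doc in cluster for term in doc)
        score = {t: f * math.log(n / df[t]) for t, f in tf.items()}
        # Highest score first; ties are broken alphabetically (an arbitrary choice).
        ranked = sorted(score, key=lambda t: (-score[t], t))
        labels.append(ranked[:top_k])
    return labels

# Hypothetical toy corpus with two clusters of two docs each.
clusters = [
    [["gene", "gene", "protein", "cell"], ["gene", "dna"]],
    [["stock", "stock", "market", "cell"], ["stock", "price"]],
]
print(label_clusters(clusters))
```

Passing `top_k` greater than 1 corresponds to reporting the top-k terms instead of a single name.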

11 Content of this Lecture: (Text) clustering; Cluster quality; Clustering algorithms; Application

12 How many Clusters?

13 Maybe 2?

15 Maybe 4 and One Outlier?

17 Maybe 4 and 2 at Different Levels?

18 Which Distance Measure did you Use?

19 Quality of a Clustering. There is no "true" number of clusters. We need a method to define which number of clusters we find reasonable (for the data) and which not. Finding the clusters is a second, actually somewhat simpler (meaning: more clearly defined) problem. In real applications, you cannot find the number by looking at the data, because clustering should help you in looking at the data; explorative usage is of course possible.

20 Distance to a Cluster. We frequently have to compute the distance $d(o,c)$ between a point $o$ and a cluster $c$ (and sometimes distances between clusters, see hierarchical clustering). Various methods: let $c_m$ be the numerical center of a cluster, let $c_{med}$ be the central point of a cluster, and let $d(o_1,o_2)$ be the distance between two points. Then $d(o,c) = d(o,c_m)$, or $d(o,c) = d(o,c_{med})$, or $d(o,c) = \frac{1}{|c|} \sum_{p \in c} d(o,p)$.
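
The three point-to-cluster distances can be sketched in a few lines of Python. Euclidean distance on small tuples stands in for the document distance, and the "central point" is read here as the medoid, i.e., the member with the smallest total distance to all other members; both choices are assumptions of this sketch.

```python
import math

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(cluster):
    """Numerical center: component-wise mean of all members."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))

def medoid(cluster, dist=euclid):
    """Central point: the member with the smallest summed distance to all members."""
    return min(cluster, key=lambda p: sum(dist(p, q) for q in cluster))

def dist_to_centroid(o, cluster, dist=euclid):
    return dist(o, centroid(cluster))

def dist_to_medoid(o, cluster, dist=euclid):
    return dist(o, medoid(cluster, dist))

def avg_dist_to_members(o, cluster, dist=euclid):
    return sum(dist(o, p) for p in cluster) / len(cluster)

cluster = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)]
o = (2.0, 3.0)
print(dist_to_centroid(o, cluster), dist_to_medoid(o, cluster), avg_dist_to_members(o, cluster))
```

Only the centroid variant needs to construct a new point, which is why the other two also work when just a pairwise distance function is available.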

21 Quality of a Clustering: First Approach. We could measure cluster quality by the average distance of objects within a cluster. Definition: Let $f$ be a clustering of a set of objects $O$ into a set of classes $C$ with $|C|=k$. The k-score $q_k$ of $f$ is $q_k(f) = \sum_{c \in C} \sum_{o: f(o)=c} d(o,c)$. Remark: We could also define the k-score as the average distance across all object pairs within a cluster; then no centers are necessary.
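
A small Python sketch of the k-score, assuming Euclidean distance and the centroid as the cluster representative in $d(o,c)$ (both are assumptions of this sketch; the toy points are hypothetical).

```python
import math

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def k_score(clusters):
    """Sum, over all clusters, of the distances of the members to their centroid."""
    total = 0.0
    for members in clusters:
        c = centroid(members)
        total += sum(euclid(o, c) for o in members)
    return total

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one = [points]                                   # k = 1
two = [points[:2], points[2:]]                   # k = 2
singletons = [[p] for p in points]               # k = |O|
print(k_score(one), k_score(two), k_score(singletons))
```

On this toy data the score drops from k=1 to k=2 and reaches 0 when every point forms its own cluster, so a smaller score alone does not identify a reasonable k.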

22 6-Score. Certainly better than the 2/4/5-score we have seen. Thus: Choose the k with the best k-score?

23 Disadvantage. The k-score always has a trivially optimal solution: $k=|O|$. The score can still be useful to compare different clusterings for the same k.

24 Silhouette. Considering only the inner-cluster distances is not enough: points in a cluster should be close to each other but also far away from points in other clusters. One way: extend the k-score, for instance to $q_k(f) = \sum_{c \in C} \sum_{o: f(o)=c} \left( d(o,c) - \frac{1}{|C|-1} \sum_{i \in C, i \ne c} d(o,i) \right)$. But: this is sensitive to outliers, i.e., clusters far away from the others. Alternative: the silhouette of a clustering. It punishes points that are not uniquely assigned to one cluster and captures how clearly points are part of their cluster.

25 More Formal Definition. Let $f: O \to C$ with $|C|$ arbitrary. We define the inner score $a(o) = d(o,f(o))$ and the outer score $b(o) = \min_i d(o,c_i)$ with $c_i \ne f(o)$. The silhouette of $o$ is defined as $s(o) = \frac{b(o) - a(o)}{\max(a(o), b(o))}$. The silhouette $s(f)$ of $f$ is defined as $s(f) = \sum_{o \in O} s(o)$. Note: It holds that $-1 \le s(o) \le 1$. $s(o) \approx 0$: point right between two clusters. $s(o) \approx 1$: point very close to only one (its own) cluster. $s(o) \approx -1$: point far away from its own cluster.
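
The definition can be turned into a short Python sketch. It assumes Euclidean distance, uses the distance to the cluster centroid as $d(o,c)$, needs at least two clusters, and returns 0 for the degenerate case $a(o)=b(o)=0$; these are choices made for this sketch only.

```python
import math

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def silhouette_of_point(o, own, others):
    """s(o) = (b(o) - a(o)) / max(a(o), b(o)); 'others' must not be empty."""
    a = euclid(o, centroid(own))                        # inner score
    b = min(euclid(o, centroid(c)) for c in others)     # outer score
    if max(a, b) == 0.0:
        return 0.0  # assumption for the degenerate case
    return (b - a) / max(a, b)

def silhouette(clusters):
    """s(f): sum of s(o) over all objects; divide by the number of objects for a mean."""
    total = 0.0
    for i, own in enumerate(clusters):
        others = clusters[:i] + clusters[i + 1:]
        total += sum(silhouette_of_point(o, own, others) for o in own)
    return total

good = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
print(silhouette(good))
# A point that sits next to the other cluster gets a negative value.
print(silhouette_of_point((9.0, 1.0), [(0.0, 0.0), (0.0, 2.0), (9.0, 1.0)], [good[1]]))
```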

26 Behavior. Silhouette is not always better / worse for more clusters. [Figure labels: s(o) probably higher; s(o) probably lower]

27 Not the End. Intuitive clusters need not be hyper-spheres. Clusters need not have convex shapes. The cluster centre need not be part of the cluster. This requires completely different quality metrics; the definition must fit the data/application. Not used in text clustering, to my knowledge. Source: [FPPS96]

28 Content of this Lecture: Text clustering; Cluster quality; Clustering algorithms (Hierarchical clustering; K-means; Soft clustering: EM algorithm); Application

29 Classes of Cluster Algorithms. Hierarchical clustering: iteratively creates a hierarchy of clusters. Bottom-up: start from $|D|$ clusters and merge until only 1 remains. Top-down: start from one cluster and split. In both cases, one may instead stop when some stop criterion is met. Partitioning: heuristically partition all objects into k clusters; guess a first partitioning and improve it iteratively; k is a parameter of the method, not a result. Other: graph-theoretic methods such as Max-Cut (partitioning), density-based clustering, etc.

30 Hierarchical Clustering. Also called UPGMA (Unweighted Pair-Group Method with Arithmetic mean). We only discuss the bottom-up approach. Computes a binary tree (dendrogram). Simple algorithm: compute the distance matrix M; choose the pair $d_1, d_2$ with the smallest distance; compute $x = avg(d_1,d_2)$ (the centre point); remove $d_1, d_2$ from M; insert x into M, where the distance between x and any d in M is the average of the distance between $d_1$ and d and the distance between $d_2$ and d; loop until only one element remains in M.
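
The bottom-up loop can be sketched as follows. The sketch follows the update rule exactly as worded on the slide (plain average of the two old distances); UPGMA in the strict sense weights that average by the sizes of the two merged clusters. The dictionary-based "matrix", the labels, and the toy distances are assumptions of this sketch.

```python
from itertools import combinations

def agglomerate(labels, dist):
    """labels: list of object names; dist: {(a, b): distance} for every pair.
    Returns the merges as (left, right, distance), in the order they happen."""
    d = {frozenset(pair): value for pair, value in dist.items()}
    active = list(labels)
    merges = []
    while len(active) > 1:
        # Choose the pair with the smallest distance.
        a, b = min(combinations(active, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)  # inner node of the dendrogram
        merges.append((a, b, d[frozenset((a, b))]))
        active.remove(a)
        active.remove(b)
        # Distance of the new node to every remaining node: average of the old ones.
        for other in active:
            d[frozenset((merged, other))] = (
                d[frozenset((a, other))] + d[frozenset((b, other))]
            ) / 2
        active.append(merged)
    return merges

# Hypothetical distances between three docs.
merges = agglomerate(["A", "B", "C"], {("A", "B"): 1.0, ("A", "C"): 4.0, ("B", "C"): 6.0})
print(merges)
```

The nested tuples in the result encode the binary tree; the merge distances can serve as branch lengths.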

31 Example. Objects A to G are merged step by step, and the distance matrix shrinks after each merge: (B,D) → a; (E,F) → b; (A,b) → c; (C,G) → d; (d,c) → e; (a,e) → f. [Figure: distance matrices and the growing dendrogram over A B C D E F G after each merge]

32 Visual

33 Properties. Advantages: simple and intuitive; the number of clusters is not an input of the method; usually good-quality clusters (which clusters?). Disadvantages: does not really generate clusters; very expensive. Let $n=|O|$, $m=|K|$: computing M requires $O(n^2)$ space and $O(m n^2)$ time; clustering requires $O(m \cdot n^2 \cdot \log(n))$ (why?). Not applicable as such to large doc sets.

34 Intuition. Hierarchical clustering organizes a doc collection. Ideally, hierarchical clustering directly creates a hierarchical and intuitive directory of the corpus. This is not easy: there are many, many ways to group objects, and hierarchical clustering will choose just one; there is no guarantee that the clusters make sense semantically.

35 Branch Length. Use branch length to symbolize distance. Outlier detection. [Figure label: Outlier]

36 Variations. We used the distance between the centers of two clusters to decide about the distance between clusters. Other alternatives (incurring different complexities): Single link: distance of the two closest docs in both clusters. Complete link: distance of the two furthest docs. Average link: average distance between pairs of docs from both clusters. Centroid: distance between centre points.
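
The four cluster-to-cluster distances differ only in how they aggregate the pairwise distances, as this Python sketch shows. Euclidean distance and the toy clusters are assumptions of this sketch.

```python
import math
from itertools import product

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def single_link(c1, c2):
    """Distance of the two closest docs."""
    return min(euclid(p, q) for p, q in product(c1, c2))

def complete_link(c1, c2):
    """Distance of the two furthest docs."""
    return max(euclid(p, q) for p, q in product(c1, c2))

def average_link(c1, c2):
    """Average distance over all pairs with one doc from each cluster."""
    return sum(euclid(p, q) for p, q in product(c1, c2)) / (len(c1) * len(c2))

def centroid_link(c1, c2):
    """Distance between the centre points."""
    return euclid(centroid(c1), centroid(c2))

c1 = [(0.0, 0.0), (1.0, 0.0)]
c2 = [(3.0, 0.0), (5.0, 0.0)]
print(single_link(c1, c2), complete_link(c1, c2), average_link(c1, c2), centroid_link(c1, c2))
```

Single, complete, and average link need all pairwise distances between the two clusters, while the centroid variant needs a vector space in which a centre can be computed.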

40 Single-link versus Complete-link

41 More Properties. Single-link: optimizes a local criterion (only looks at the closest pair); similar to computing a minimal spanning tree, with cuts at the most expensive branches when going down the hierarchy; creates elongated clusters (chaining effect). Complete-link: optimizes a global criterion (looks at the worst pair); creates more compact, more convex, spherical clusters.

43 K-Means. Probably the best-known clustering algorithm. Partitioning method. Requires the number k of clusters to be predefined. Algorithm: fix k; guess k cluster centers at random (can use k docs or k random points in doc-space); loop forever: assign all docs to their closest cluster center; if no doc has changed its assignment (or if sufficiently few docs have changed their assignment), stop; otherwise, compute new cluster centres.
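
The loop above can be sketched in Python as follows. Euclidean distance, the optional `init` parameter for fixed start centers, the `max_iter` safety bound, and the rule of keeping the old center when a cluster becomes empty are assumptions of this sketch.

```python
import math
import random

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Returns (centers, assignment), assignment[i] being the cluster index of points[i]."""
    rng = random.Random(seed)
    # Guess k start centers: here k of the docs themselves, unless given.
    centers = list(init) if init is not None else rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign every doc to its closest center.
        new_assignment = [
            min(range(k), key=lambda j: euclid(p, centers[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # no doc changed its cluster
        assignment = new_assignment
        # Compute the new centers as the mean of the assigned docs.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = centroid(members)
    return centers, assignment

docs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, assignment = kmeans(docs, 2, init=[(0.0, 0.0), (10.0, 10.0)])
print(centers, assignment)
```

Without `init`, different seeds may lead to different local optima, which is why restarting with several start configurations is a common remedy.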

44 Example 1. k=3. Choose random start points. Source: Stanford, CS 262 Computational Genomics

45 Example 2. Assign docs to closest cluster centre

46 Example 3. Compute new cluster centre

47 Example 4

48 Example 5

49 Example 6. Converged

50 Properties. Usually, k-means converges quite fast. Let l be the number of iterations. Complexity: O(l*k*n): the assignment takes n*k distance computations, and the new centers require summing up n vectors k times. Choosing the right start points is important: k-means is a greedy heuristic and only finds local optima. Option 1: start several times with different start points. Option 2: compute a hierarchical clustering on a random sample and choose its cluster centers as start points (Buckshot algorithm). How to choose k? Try different k and use quality score(s).

51 k-means and Outlier. Try for k=3

52 Help: K-Medoid. Choose the doc in the middle of a cluster as its representative (PAM: Partitioning Around Medoids). Advantages: less sensitive to outliers; also works for non-metric spaces, as no new center point needs to be computed. Disadvantage: considerably more expensive; finding the median doc requires computing all pair-wise distances in each cluster in each round; the overall complexity is $O(n^3)$; re-computations can be saved at the expense of more space.
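
A minimal sketch of the medoid idea in Python. It alternates assignment and medoid update in the style of k-means and is not the full PAM swap procedure; the explicit start medoids, Euclidean distance as default, and the toy data are assumptions of this sketch.

```python
import math

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def medoid(cluster, dist=euclid):
    """The member with the smallest summed distance to all members."""
    return min(cluster, key=lambda p: sum(dist(p, q) for q in cluster))

def k_medoids(points, start_medoids, dist=euclid, max_iter=100):
    """Only a pairwise distance function is needed; no new points are computed."""
    medoids = list(start_medoids)
    for _ in range(max_iter):
        clusters = [[] for _ in medoids]
        for p in points:
            j = min(range(len(medoids)), key=lambda i: dist(p, medoids[i]))
            clusters[j].append(p)
        # An empty cluster keeps its old medoid (a choice made for this sketch).
        new_medoids = [medoid(c, dist) if c else m for c, m in zip(clusters, medoids)]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, clusters

# One outlier pulls the mean far away but leaves the medoid in the dense region.
with_outlier = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (100.0, 0.0)]
mean_x = sum(p[0] for p in with_outlier) / len(with_outlier)
print(mean_x, medoid(with_outlier))

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (20.0, 0.0), (21.0, 0.0), (22.0, 0.0)]
medoids, clusters = k_medoids(points, [(0.0, 0.0), (20.0, 0.0)])
print(medoids)
```

The repeated all-pairs sums inside `medoid` are the source of the extra cost mentioned on the slide.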

53 k-medoid and Outlier

55 Soft Clustering. So far, we always assumed that docs are assigned to exactly one cluster. Probabilistic interpretation: all docs pertain to all clusters with a certain probability. Generative model: assume we have k doc-producing devices (such as authors or topics); each device produces docs that are normally distributed in vector space with a device-specific mean and variance; assume that the k devices have produced $|D|$ documents. Clustering can then be interpreted as recovering the mean and variance of each distribution / device. Solution: Expectation Maximization algorithm (EM).

56 Expectation Maximization (rough sketch, no math). EM optimizes the set of parameters Θ of a multivariate normal distribution (mean and variance of k clusters) given sample data. Iterative process with two phases: guess an initial Θ; Expectation: assign each doc its most likely generator based on Θ; Maximization: compute a new optimal Θ based on the assignment, using MLE or other estimation techniques; iterate through both steps until convergence. Finds a local optimum; convergence is guaranteed. K-Means is a special case of EM: k normal distributions with different means but equal variance.
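
To make the two phases concrete, here is a small EM sketch for a one-dimensional mixture of normal distributions. Unlike the hard assignment in the rough sketch above, the expectation step here computes soft membership probabilities, in line with the probabilistic interpretation of soft clustering. The one-dimensional data, the fixed number of iterations, the variance floor, and all names are assumptions of this sketch; real implementations work in log space to avoid underflow.

```python
import math

def gauss(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_1d(xs, means, variances, weights, iterations=50, min_var=1e-3):
    k = len(means)
    means, variances, weights = list(means), list(variances), list(weights)
    resp = []
    for _ in range(iterations):
        # Expectation: probability of each generator for each data point.
        resp = []
        for x in xs:
            p = [weights[j] * gauss(x, means[j], variances[j]) for j in range(k)]
            total = sum(p)
            if total == 0.0:
                resp.append([1.0 / k] * k)  # numerical underflow: fall back to uniform
            else:
                resp.append([pj / total for pj in p])
        # Maximization: re-estimate the parameters from the weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, min_var)  # floor avoids a collapsing variance
            weights[j] = nj / len(xs)
    return means, variances, weights, resp

xs = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
means, variances, weights, resp = em_1d(xs, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
print(means, weights)
```

Each row of `resp` sums to 1 and plays the role of the cluster membership probabilities of one data point.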

57 Content of this Lecture: Text clustering; Cluster quality; Clustering algorithms; Application (Clustering Phenotypes)

58 Mining Phenotypes for Function Prediction

59 Or Source:

60 Mining Phenotypes: General Idea. [Diagram: Gene A and Gene B, each linked to a genotype and a phenotype] Established: Genes with similar genotypes likely have similar phenotypes. Question: If genes generate very similar phenotypes, do they have the same genotype?

61 Approach. [Diagram with the labels: GO Annotation, Gene A, Phenotype Description; GO Annotation, Gene B, Phenotype Description; Similarity; Inference; Prediction]

62 Phenodocs. 411,102 phenotypes. Short: <250 words. Remove all phenotypes associated with more than one gene (~500). Pipeline: PhenomicDB → remove small phenotypes → remove multi-gene phenotypes → remove stop words → stemming → phenodocs. Result: 39,610 phenodocs for 15,426 genes.

63 K-Means Clustering. Hierarchical clustering would require ~ * = comparisons. K-Means: simple, iterative algorithm; the number of clusters must be predefined. We experimented with clusters

64 Properties: Phenodoc Similarity of Genes. [Figure: pair-wise similarity scores of phenodocs of genes in the same cluster, sorted by score; genes in phenoclusters versus control (random selection)] Result: Phenodocs of genes in phenoclusters are highly similar to each other.

65 PPI: Inter-Connectedness. Interacting proteins often share function. PPI from the BioGRID database (not at all a complete dataset). In >200 clusters, >30% of genes interact with each other; control (random groups): 3 clusters. Result: Genes in phenoclusters interact with each other much more often than expected by chance. [Figure: proteins and interactions from BioGRID; red proteins have no phenotypes in PhenomicDB]

66 Coherence of Functional Annotation. Comparison of the GO annotation of genes in phenoclusters; data from Entrez Gene. Similarity of two GO terms: normalized number of shared ancestors. Similarity of two genes: average of the top-k GO pairs. >200 clusters with score >0.4; control: 2 clusters. Result: Genes in phenoclusters have a much higher coherence in functional annotation than expected by chance. [Figure: excerpt of the Gene Ontology with the terms Molecular Function, Biological Process, Physiological Process, Catalytic Activity, Cellular Process, Binding, Metabolism, Transferase Activity, Nucleotide Binding, Protein Metabolism, Kinase Activity, Cell Communication, Signal Transduction, Protein Modification]

67 Function Prediction. Can the increased functional coherence of clusters be exploited for function prediction? General approach: compute phenoclusters; for each cluster, compute the set of associated genes (gene cluster); in each gene cluster, predict the common GO terms for all genes, where common means annotated to >50% of the genes in the cluster. Filtering clusters: the idea is to find, a priori, clusters which give hope for very good results. Filter 1: only clusters with >2 members and at least one common GO term. Filter 2: only clusters with GO coherence >0.4. Filter 3: only clusters with PPI-connectedness >33%.
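
The prediction step for a single gene cluster might look like this in Python. The gene names, the example annotations, and the decision to count genes without any annotation in the denominator are assumptions of this sketch.

```python
from collections import Counter

def predict_common_terms(cluster_genes, annotations, threshold=0.5):
    """Returns {gene: set of newly predicted terms} for one gene cluster.
    A term is 'common' if it is annotated to more than `threshold` of the genes."""
    counts = Counter(
        term for gene in cluster_genes for term in annotations.get(gene, set())
    )
    common = {t for t, c in counts.items() if c / len(cluster_genes) > threshold}
    # Each gene receives the common terms it does not carry yet.
    return {g: common - annotations.get(g, set()) for g in cluster_genes}

# Hypothetical cluster of four genes; g4 has no annotation so far.
annotations = {
    "g1": {"Kinase Activity", "Binding"},
    "g2": {"Kinase Activity"},
    "g3": {"Kinase Activity", "Signal Transduction"},
}
predictions = predict_common_terms(["g1", "g2", "g3", "g4"], annotations)
print(predictions)
```

In this toy cluster only "Kinase Activity" passes the 50% threshold (3 of 4 genes), so it is the only term predicted, and only for g4.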

68 Evaluation. How can we know how good we are? Cross-validation: separate genes into training (90%) and test (10%); remove the annotation from genes in the test set; build clusters and predict functions on the training set; compare predicted with removed annotations (precision and recall); repeat and average the results (macro-average). Note: This punishes new and potentially valid annotations.
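
Comparing predicted with removed annotations can be sketched as below. The slide does not state over which units the macro-average is taken; this sketch assumes one precision/recall pair per test gene and averages those pairs, and it assumes a value of 0 when a gene has no predictions or no removed terms.

```python
def precision_recall(predicted, removed):
    """predicted, removed: sets of terms for one test gene."""
    true_positives = len(predicted & removed)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(removed) if removed else 0.0
    return precision, recall

def macro_average(pairs):
    """Unweighted mean of (precision, recall) pairs."""
    pairs = list(pairs)
    precision = sum(p for p, _ in pairs) / len(pairs)
    recall = sum(r for _, r in pairs) / len(pairs)
    return precision, recall

# Hypothetical test genes: (predicted terms, removed terms).
test_genes = [
    ({"a", "b"}, {"a"}),
    ({"c"}, {"c", "d"}),
]
scores = [precision_recall(pred, gold) for pred, gold in test_genes]
print(macro_average(scores))
```

A predicted term that is correct but not yet annotated counts as a false positive here, which is the punishment of new and potentially valid annotations noted above.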

69 Results for Different Filters.

               (Filter 1)   (Filter 1 & Filter 2)   (Filter 1 & Filter 3)
# of groups
# of terms
# of genes
Precision      67.91%       62.52%                  60.52%
Recall         22.98%       26.16%                  19.78%

What if we consider predicted terms to be correct that are a little more general than the removed terms (Filter 1)? One step more general: 75.6% precision, 28.7% recall. Two steps: 76.3% precision, 30.7% recall. The less stringent the GO equality, the better the results. This is a common trick in studies using GO.

70 Results for Different Cluster Sizes. With increasing k: clusters are smaller; the number of predicted terms increases; clusters are more homogeneous; the number of genes which receive annotations stays roughly the same; precision decreases slowly, recall increases (an effect of the rapid increase in the number of predictions).