EFFICIENT CLUSTERING OF VERY LARGE DOCUMENT COLLECTIONS

Chapter 1

Inderjit S. Dhillon, James Fan and Yuqiang Guan

Abstract

An invaluable portion of scientific data occurs naturally in text form. Given a large unlabeled document collection, it is often helpful to organize this collection into clusters of related documents. By using a vector space model, text data can be treated as high-dimensional but sparse numerical data vectors. It is a contemporary challenge to efficiently preprocess and cluster very large document collections. In this paper we present a time and memory efficient technique for the entire clustering process, including the creation of the vector space model. This efficiency is obtained by (i) a memory-efficient multi-threaded preprocessing scheme, and (ii) a fast clustering algorithm that fully exploits the sparsity of the data set. We show that this entire process takes time that is linear in the size of the document collection. Detailed experimental results are presented; a highlight of our results is that we are able to effectively cluster a collection of 113,716 NSF award abstracts in 23 minutes (including disk I/O costs) on a single workstation with modest memory consumption.

Keywords: clustering, large document collections, hash tables, spherical k-means, vector space model

1. Introduction

Large collections of documents are becoming increasingly common. The public internet currently has more than 1.5 billion web pages, while private intranets also contain an abundance of text data. A vast amount of important scientific data appears as technical abstracts and papers. Given such large document collections it is important to organize them into structured ontologies. This organization facilitates navigation and search, and at the same time provides a framework for continual maintenance as document repositories grow in size.

Manual construction of structured ontologies is one possible solution and has been adopted to organize the internet and to structure library content. However, this process has the obvious disadvantage of being too labor intensive, and is viable only in large corporations. Thus it is desirable to seek automatic methods for organizing unlabeled document collections.

Given a collection of unlabeled data points, clustering refers to the problem of automatically assigning class labels to the data and has been widely studied in statistical pattern recognition and machine learning [DH73, Mit97]. A starting point for applying clustering algorithms to unstructured text data is to create a vector space model, alternatively known as a bag-of-words model [SM83]. The basic idea is (a) to extract unique content-bearing words from the set of documents, treating these words as features, and (b) to then represent each document as a vector of certain weighted word frequencies in this feature space. Observe that we may regard the vector space model of a text data set as a word-by-document matrix whose rows are words and columns are document vectors. Typically, even a moderately sized set of documents contains a large number of words; a few thousand words or more are common. Thus for large document collections, both the row and column dimensions of the matrix are quite large. However, as we will discuss later in greater detail, this matrix is typically very sparse with almost 99% of the matrix entries being zero.

Using the vector space model, various classical clustering algorithms such as the k-means algorithm and its variants, hierarchical agglomerative clustering, and graph-theoretic methods have been explored in the text mining literature; for detailed reviews, see [Ras92, Wil88]. Recently, there has been a flurry of activity in this area, see [BGG+98, CKPT92, SS97, ZE98]. A substantial amount of this work has concentrated on clustering web search results where the document collections to be clustered are not very large.
In this paper, our main concern is in obtaining a highly efficient process for clustering very large document collections. Our main motivation is that we want to cluster collections in excess of 100,000 documents in a reasonable amount of time on a single processor. Thus our main emphasis is on high speed and scalability with modest main memory consumption. The clustering process involves reading the text documents from disk, and preprocessing them to form the vector space model before using a particular clustering algorithm. It turns out that main memory consumption and disk I/O costs in this process can be prohibitive. In order to alleviate these problems, we employ a memory-efficient multi-threaded approach to reading and preprocessing the documents. For creating the vector space model, we use efficient and scalable data structures such as local and global hash tables. Finally, we use the highly efficient and effective spherical k-means algorithm that fully exploits the sparsity of the data [DM01]. The above steps lead to a highly efficient clustering process; as a result we are able to preprocess and cluster a collection of 113,716 NSF award abstracts in 23 minutes on a Sun workstation with modest memory consumption.

This paper is organized as follows. In Section 2 we highlight the challenges in obtaining an efficient and effective process for clustering large document collections. Section 3 describes our multi-threaded algorithm for creating the vector space model. In Section 4 we present the spherical k-means algorithm which is ideally suited for clustering high-dimensional sparse text data. Section 5 presents speed and memory consumption results on some document collections; in particular we examine the results on a collection of 113,716 NSF award abstracts. Finally, Section 6 concludes with a short summary and discussion of future work.

A quick word about notation. Lower case bold letters such as $\mathbf{x}$, $\mathbf{c}$ will denote vectors while $\|\mathbf{x}\|$ will denote the 2-norm of the corresponding vector. Thus when we say that $\|\mathbf{x}\| = 1$ we mean that $\mathbf{x}^T\mathbf{x} = 1$.

2. Challenges

As mentioned above, the preprocessing phase leads to the creation of the vector space model. Given a large document collection, the following steps are involved in this phase: (a) read all input documents from disk, (b) parse into word tokens using efficient regular expression pattern matching, (c) look up words rapidly in hash tables to track the number of occurrences of each word, and finally, (d) output the vector space model.

Efficient parsing using regular expression pattern matching is a well-studied problem. Existing software tools such as FLEX provide a quick and easy way to construct the required parser [Pax96]. Also, efficient hash table indexing and access is provided in public-domain software [MS96]. Thus effective solutions to steps (b) and (c) above are easy to obtain. However, I/O costs in steps (a) and (d) can be substantial, especially if the documents need to be accessed from a Network File System (NFS) [Cal99]. In such a case, the time taken by these steps can be quite large and often unpredictable, depending on other traffic on the NFS. Our approach to solving this problem is to use multiple threads [NBF96] so that input documents can be read in a parallel fashion.
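The idea of overlapping slow reads by using several threads can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `list_documents`, `read_document` and `read_all`, the thread count, and the choice of the standard `concurrent.futures` thread pool are all assumptions made for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def list_documents(root):
    """Recursively collect the paths of all files under root."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)


def read_document(path):
    """Read one document; undecodable bytes are replaced rather than fatal."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return handle.read()


def read_all(root, num_threads=4):
    """Read every document under root using a pool of worker threads."""
    paths = list_documents(root)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map returns results in the same order as the input paths.
        texts = list(pool.map(read_document, paths))
    return dict(zip(paths, texts))
```

While one thread waits on a slow (for example, networked) read, the others can continue, which is the effect the multi-threaded approach relies on. This sketch keeps all texts in memory, which the approach described in Section 3 deliberately avoids.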

Another challenge is main memory consumption. After all the documents have been read and processed there is a final phase which involves a sweep and modification of the entire vector space model. One simple approach to facilitate this phase is to retain the partially formed vector space model in main memory while the rest of the documents are being processed. We implemented such an approach and on our test collection of 113,716 documents, this required nearly 500 MBytes of main memory. Extrapolating this memory requirement to 1 million documents, we can see that this main-memory based implementation will require several gigabytes of memory. This cost is unacceptable since our goal is to preprocess and cluster a million documents on current workstations consuming less than 256 MBytes of main memory. In Section 3 we introduce such a memory efficient algorithm that allows us to preprocess a large number of documents without sacrificing much speed.

In our test collection of 113,716 documents there are a total of more than 150,000 unique words. After a pruning step, we retain 26,000 unique words in the vector space model. Thus, the document vectors are very high-dimensional. However, typically, most documents contain only a small subset of the total number of words. Hence, the document vectors are very sparse; a sparsity of 99% is common. The major challenge is to find a clustering algorithm that can yield good solutions for very high-dimensional data and, at the same time, exploit the sparsity of the vector space model. Since we are interested in clustering very large document collections we seek algorithms that consume time and memory that is linear in the size of the document collection.

3. Efficient Preprocessing

In this section, we describe our preprocessing algorithm for creating the vector space model. Before we describe the details we give a high level description of the task at hand.

3.1 Vector Space Model

The basic idea is to represent each document as a vector of certain weighted word frequencies.
In order to do so, the following parsing and extraction steps are needed.

1. Ignoring case, extract all unique words from the entire set of documents.
2. Eliminate non-content-bearing stopwords such as "a", "and", "the", etc. For sample lists of stopwords, see [FBY92, Chapter 7].
3. For each document, count the number of occurrences of each word.
4. Using heuristic or information-theoretic criteria, eliminate non-content-bearing high-frequency and low-frequency words [SM83].
5. After the above elimination, suppose $w$ unique words remain. Assign a unique identifier between 1 and $w$ to each remaining word, and a unique identifier between 1 and $d$ to each document.

The above steps outline a simple preprocessing scheme. In addition, one may extract word phrases such as "New York", and one may reduce each word to its root or stem, thus eliminating plurals, tenses, prefixes, and suffixes [FBY92, Chapter 8].

The above preprocessing yields the number of occurrences of word $j$ in document $i$, say, $f_{ji}$, and the number of documents which contain the word $j$, say, $d_j$. Using these counts, we can represent the $i$-th document as a $w$-dimensional vector $\mathbf{x}_i$ as follows. For $1 \le j \le w$, set the $j$-th component of $\mathbf{x}_i$ to be the product of three terms

$$x_{ji} = t_{ji} \, g_j \, s_i,$$

where $t_{ji}$ is the term weighting component and depends only on $f_{ji}$, while $g_j$ is the global weighting component and depends on $d_j$, and $s_i$ is the normalization component for $\mathbf{x}_i$. Intuitively, $t_{ji}$ captures the relative importance of a word in a document, while $g_j$ captures the overall importance of a word in the entire set of documents. The objective of such weighting schemes is to enhance discrimination between various document vectors for better retrieval effectiveness [SB88].

There are many schemes for selecting the term, global, and normalization components; see [Kol97] for various possibilities. In this paper we use the popular tfn scheme known as normalized term frequency-inverse document frequency. This scheme uses

$$t_{ji} = f_{ji}, \qquad g_j = \log(d/d_j), \qquad s_i = \Big( \sum_{j=1}^{w} (t_{ji} \, g_j)^2 \Big)^{-1/2}.$$

Note that this normalization implies that $\|\mathbf{x}_i\| = 1$, i.e., each document vector lies on the surface of the unit sphere in $\mathbb{R}^w$. Intuitively, the effect of normalization is to retain only the proportion of words occurring in a document.
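As an illustration of the tfn weighting just defined, here is a small Python sketch. It assumes the documents have already been tokenized and pruned; the function name `tfn_vectors` and the sparse dictionary representation are assumptions for this example, not part of the original implementation.

```python
import math
from collections import Counter


def tfn_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, sparse unit vectors)."""
    d = len(docs)
    counts = [Counter(tokens) for tokens in docs]     # f_ji for each document i
    doc_freq = Counter()                              # d_j: documents containing word j
    for c in counts:
        doc_freq.update(c.keys())
    vocabulary = sorted(doc_freq)
    vectors = []
    for c in counts:
        # t_ji * g_j with t_ji = f_ji and g_j = log(d / d_j)
        weighted = {w: f * math.log(d / doc_freq[w]) for w, f in c.items()}
        norm = math.sqrt(sum(v * v for v in weighted.values()))
        if norm > 0.0:
            # s_i = 1 / norm; zero-weight entries are dropped to keep the vector sparse
            weighted = {w: v / norm for w, v in weighted.items() if v != 0.0}
        else:
            weighted = {}                             # every word occurs in all documents
        vectors.append(weighted)
    return vocabulary, vectors
```

A word that occurs in every document receives global weight $\log(1) = 0$ and so disappears from all vectors, which is one reason very common words carry no discriminating power under this scheme.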
Normalization ensures that documents dealing with the same subject matter (that is, using similar words), but differing in length, lead to similar document vectors.

3.2 Preprocessing Algorithm

The input to the preprocessing step is a data directory that contains all the documents to be processed. The documents may also be contained within subdirectories of the input directory. The output is the vector space model described in the above section, which can be represented as a highly sparse word-by-document matrix. We store this sparse matrix by using the Compressed Column Storage (CCS) format [DGL89]. In this format, we record the value of each non-zero element, along with its row and column index. The column indices represent the input documents, the row indices represent ids of distinct words present in the document collection, and the non-zero entries in the matrix represent the frequencies of words in documents.

    1. Initialize the global vocabulary hash table.
    2. Recursively walk through the input directory to obtain the list of
       documents to be processed.
    3. Process a document. Output each word id and its frequency of
       occurrence in the document to a temporary file. (This step is
       carried out by several threads at once.)
    4. Remove words that are too common or too rare from the global
       vocabulary hash table, and reassign word ids.
    5. Output the final vocabulary along with their ids and total
       frequencies.
    6. Reprocess the temporary files to assign the correct word ids, and
       output the results as a sparse matrix in CCS format.

    Figure 1.1. Outline of the preprocessing algorithm

Figure 1.1 gives an outline of the preprocessing algorithm. The algorithm first initializes a global hash table. To resolve whether a word has been encountered previously, a local and a global hash table are used. Both these hash tables use words as keys and store the corresponding row indices and frequencies as values. As the names suggest, the global hash table keeps track of words and their occurrences in the entire document collection while the local hash table does so for just one document. After initializing the global hash table, the algorithm recursively walks through the input directory to obtain the list of documents to be processed. The preprocessing algorithm then creates several threads of computation. The purpose of each thread is to process a set of documents independently and output its results into temporary files.
Details of the per-thread processing are given in Figure 1.2, which we discuss in the next paragraph. After all the threads have finished, the global hash table is examined, and words that are too common or too rare are removed from it. Unique word ids are assigned to the words that still remain in the global hash table. The temporary files are then reloaded, the word ids are resolved, and then the final vocabulary and word-by-document matrix are output.

    1. Initialize the local hash table (this hash table uses words as
       keys, and stores row indices and frequencies as values).
    2. Get a token from the document, and convert the token to lowercase.
    3. Discard the token if it is too long or too short or is a
       non-content stopword such as "a", "and", "the", etc.
    4. If the word already exists in the local hash table, increment the
       frequency of that entry, otherwise insert the word into the local
       hash table with row index 1 and frequency 1.
    5. After the whole document has been processed, set the row indices
       in the local hash table to the corresponding ones in the global
       hash table. If a word in the local hash table does not exist in
       the global hash table, assign a new row index to the word and add
       it to the global hash table (note that this requires locking the
       global hash table to prevent simultaneous modification by another
       thread).
    6. Output the contents of the local hash table (row indices and
       frequencies) to a temporary file, and discard the local hash
       table.

    Figure 1.2. Details on the processing of a document by each thread
    (step 3 of Figure 1.1)

Figure 1.2 describes the various steps performed by each preprocessing thread. Two decisions warrant further explanation: the use of temporary files for storing the partial vector space model, and the way in which the local and global hash tables are accessed. As mentioned in the last section, storing the partial vector space model in main memory would require a few gigabytes of main memory and is thus prohibitive for modern workstations. Hence, to reduce main memory consumption, we store the contents of the local hash table in temporary files. Since this only leads to local disk access, the resulting overhead is not substantial. The global hash table is accessed and modified by all processing threads and hence is a shared resource.
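The per-document steps of Figure 1.2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the class `GlobalVocabulary`, the function `process_document`, the letter-run tokenizer, the three-word stopword list and the length thresholds are all hypothetical, and the sketch returns the (row index, frequency) pairs instead of writing them to a temporary file.

```python
import re
import threading

TOKEN = re.compile(r"[a-z]+")          # assumed tokenizer: runs of letters
STOPWORDS = {"a", "and", "the"}        # tiny illustrative stopword list


class GlobalVocabulary:
    """Shared table mapping word -> [row index, total frequency]."""

    def __init__(self):
        self.lock = threading.Lock()
        self.entries = {}

    def resolve(self, local_counts):
        """Resolve one document's words in a single locked block access."""
        resolved = []
        with self.lock:
            for word, freq in local_counts.items():
                entry = self.entries.get(word)
                if entry is None:
                    entry = [len(self.entries) + 1, 0]   # assign a new row index
                    self.entries[word] = entry
                entry[1] += freq
                resolved.append((entry[0], freq))
        return sorted(resolved)


def process_document(text, vocabulary, min_len=2, max_len=20):
    """Count words locally, then resolve row indices against the global table."""
    local_counts = {}                                    # the local hash table
    for token in TOKEN.findall(text.lower()):
        if not (min_len <= len(token) <= max_len) or token in STOPWORDS:
            continue
        local_counts[token] = local_counts.get(token, 0) + 1
    return vocabulary.resolve(local_counts)
```

Each call takes the lock once per document rather than once per token, so threads spend most of their time on lock-free local counting.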
In order to achieve maximum parallelism, we need to minimize the number of times the global hash table is locked and modified by each processing thread. We achieve this by using a local hash table to process the entire document first, and then making a block access to the global hash table. This access involves resolving the word ids and possible modification of the global hash table, at which time this data structure needs to be locked. See Figure 1.2 for details, especially step 5.

The major main memory consumption in our preprocessing algorithm is due to the global vocabulary table. The partial vector space model is stored in temporary files instead of main memory. Thus the overall main memory requirement is $O(W)$, where $W$ is the number of distinct words that appear in the document collection. It is well known that $W$ grows slower than linearly with the size of the document collection: Heaps' Law states that the number of unique words in a text of size $d$ grows as $O(d^{\beta})$, where $\beta$ is a positive number less than one [Hea78]. Thus the main memory consumption grows slower than linearly with the size of the document collection. The computation time for the preprocessing step is approximately linear with respect to the input data size since each word in a document is processed in $O(1)$ amortized time. Performance results on large document collections that validate these claims are given in Section 5.

4. Efficient Clustering of High-Dimensional Text Data

Given the vector space model, the document vectors may be represented by $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_d$, where each $\mathbf{x}_i \in \mathbb{R}^w$. Recall that $w$ stands for the number of unique words in the vector space model and $d$ is the total number of documents. A clustering of the document collection is its partitioning into the disjoint subsets $\pi_1, \pi_2, \ldots, \pi_k$, i.e.,

$$\bigcup_{j=1}^{k} \pi_j = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_d\} \quad \text{and} \quad \pi_j \cap \pi_\ell = \phi, \ j \neq \ell.$$

The most important and challenging characteristics of the vector space models that arise from text data are high dimensionality and sparsity. Typically, $w$ is in the thousands and a sparsity of 99% is common. For purposes of efficiency, it is important that the clustering algorithm exploit the sparsity of the data while giving meaningful results at the same time.
The spherical k-means algorithm satisfies both these properties and hence is our algorithm of choice. We now briefly formalize this algorithm, highlighting its salient features. More details may be found in [DM01].

Any text clustering algorithm needs an objective notion of similarity between documents. A widely used measure of similarity is the cosine of the angle between two document vectors [FBY92, SM83]. Cosine similarity is easy to interpret and simple to compute for sparse vectors and has been used in other information retrieval applications, such as query retrieval. Based on cosine similarity, we can define the goodness or coherence of cluster $\pi_j$ as

$$\sum_{\mathbf{x}_i \in \pi_j} \mathbf{x}_i^T \mathbf{c}_j, \qquad (4.1)$$

where each $\mathbf{x}_i$ is assumed to be normalized such that $\|\mathbf{x}_i\| = 1$ and $\mathbf{c}_j$ is the normalized centroid of cluster $\pi_j$,

$$\mathbf{c}_j = \frac{\sum_{\mathbf{x}_i \in \pi_j} \mathbf{x}_i}{\big\| \sum_{\mathbf{x}_i \in \pi_j} \mathbf{x}_i \big\|}.$$

By the Cauchy-Schwarz inequality,

$$\sum_{\mathbf{x}_i \in \pi_j} \mathbf{x}_i^T \mathbf{z} \ \le \ \sum_{\mathbf{x}_i \in \pi_j} \mathbf{x}_i^T \mathbf{c}_j \quad \text{for every unit vector } \mathbf{z} \in \mathbb{R}^w,$$

and thus the normalized centroid is the vector that is closest in cosine similarity (in an average sense) to all the document vectors in the cluster $\pi_j$. As a result, we also call the vectors $\mathbf{c}_j$ concept vectors. Aggregating (4.1) over all clusters, we can measure the goodness of any given partitioning $\{\pi_j\}_{j=1}^{k}$ using the following objective function:

$$Q\big(\{\pi_j\}_{j=1}^{k}\big) = \sum_{j=1}^{k} \sum_{\mathbf{x}_i \in \pi_j} \mathbf{x}_i^T \mathbf{c}_j. \qquad (4.2)$$

Intuitively, the objective function measures the combined coherence of all the $k$ clusters. Having posed the objective function, we now present an algorithm that attempts to maximize its value.

4.1 Spherical k-means Algorithm

It is well known that finding the partitioning that maximizes (4.2) is NP-complete [KPR98, Theorem 3.1]; also, see [GJW82]. Thus we seek a heuristic algorithm that can quickly find a good local maximum. For this purpose, we use a variant of the classical k-means algorithm [DH73] that uses the cosine similarity measure. Since both the document and concept vectors lie on the surface of a high-dimensional sphere, we call this variant the spherical k-means algorithm. This algorithm proceeds as follows.

1. Initialize clustering. Start with some initial partitioning of the document vectors, namely $\{\pi_j^{(0)}\}_{j=1}^{k}$. Let $\{\mathbf{c}_j^{(0)}\}_{j=1}^{k}$ be the concept vectors of the associated partitioning. Set the iteration count $t$ to 0.

2. Re-assign document vectors. For each document vector $\mathbf{x}_i$, $1 \le i \le d$, do the following:
   a. Compute $\mathbf{x}_i^T \mathbf{c}_\ell^{(t)}$ for all $\ell = 1, 2, \ldots, k$.
   b. From among all $\mathbf{x}_i^T \mathbf{c}_\ell^{(t)}$ computed above, find $j^* = \arg\max_\ell \mathbf{x}_i^T \mathbf{c}_\ell^{(t)}$ (break ties arbitrarily if $\mathbf{x}_i$ has largest cosine similarity with more than one concept vector).
   This induces the new partitioning
   $$\pi_j^{(t+1)} = \big\{ \mathbf{x}_i : j = \arg\max_\ell \mathbf{x}_i^T \mathbf{c}_\ell^{(t)} \big\}, \quad 1 \le j \le k.$$

3. Update concept vectors. Compute the concept vectors corresponding to the new partitioning:
   $$\mathbf{s}_j = \sum_{\mathbf{x}_i \in \pi_j^{(t+1)}} \mathbf{x}_i, \qquad \mathbf{c}_j^{(t+1)} = \frac{\mathbf{s}_j}{\|\mathbf{s}_j\|}, \quad 1 \le j \le k.$$

4. Check the stopping criterion. If the stopping criterion is satisfied, then exit. Otherwise increase $t$ by 1 and go to step 2 above. In our implementation, the stopping criterion is
   $$\frac{\Big| Q\big(\{\pi_j^{(t)}\}_{j=1}^{k}\big) - Q\big(\{\pi_j^{(t+1)}\}_{j=1}^{k}\big) \Big|}{Q\big(\{\pi_j^{(t)}\}_{j=1}^{k}\big)} \ \le \ \epsilon$$
   for a small threshold $\epsilon$.

It can be shown that the above algorithm yields a gradient-ascent scheme. In particular, we can show that the objective function value in (4.2) does not decrease from one iteration to the next, i.e.,

$$Q\big(\{\pi_j^{(t)}\}_{j=1}^{k}\big) \ \le \ Q\big(\{\pi_j^{(t+1)}\}_{j=1}^{k}\big).$$

Like any other gradient-ascent scheme, the spherical k-means algorithm is prone to local maxima. A careful selection of initial partitions $\{\pi_j^{(0)}\}_{j=1}^{k}$ is important. One can either (a) randomly assign each document to one of the $k$ clusters, (b) first compute the concept vector for the entire document collection and randomly perturb this vector to get $k$ starting concept vectors, or (c) try several initial clusterings and select the best in terms of the largest objective function. We use strategy (b) in our implementation.

We now examine the time and memory complexity of the above algorithm. We assume that the number of non-zero entries in the sparse matrix is $nz$, and that the above algorithm iterates $\tau$ times before stopping. Using our initialization strategy, step 1 of the algorithm costs $O(nz + k \cdot w)$ operations. For each iteration, step 2a costs $O(nz \cdot k)$ operations while 2b costs $O(k \cdot d)$ comparisons. Step 3 updates the concept vectors and costs $O(nz + k \cdot w)$ operations. Thus the total time complexity for $\tau$ iterations is $O((nz \cdot k + k \cdot d + k \cdot w) \cdot \tau)$. Typically $nz \ge \max(d, w)$, hence it is clear that step 2a is the most computationally expensive step in the algorithm and the overall complexity of the algorithm is $O(nz \cdot k \cdot \tau)$.

The main memory consumed by the algorithm is for storing the document vectors and for old and new copies of the concept vectors. Storing the document vectors in the CCS sparse matrix form requires $4(2nz + d + 4)$ bytes while the concept vectors require $8kw$ bytes; hence the memory consumption is modest.

4.2 Similarity Estimation

The computational bottleneck in the spherical k-means algorithm is the dot product computation between all the document vectors and concept vectors (see step 2a). During the course of the algorithm it turns out that the first few iterations lead to a lot of movement of documents between clusters. However, after just a few iterations the clusters become more stable; see the solid line in the left plot of Figure 1.3. Consequently, the overall objective function value also settles down after 4-5 iterations, as seen in the right plot of Figure 1.3.
When the clusters become stable, there are potential savings if we can somehow avoid computing unnecessary dot products between document vectors and far away concept vectors. We now introduce a technique for estimating cosine similarities which allows us to avoid such dot product computations.
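For reference, the basic iteration of Section 4.1 (before any similarity estimation is added) might be sketched in Python as below. This is a sketch under assumptions, not the authors' code: documents are sparse dictionaries mapping word index to weight, concept vectors are dense lists, and the initial partitioning is a simple round-robin assignment rather than the perturbation strategy (b) used in the paper. The names `spherical_kmeans`, `tol` and `max_iter` are hypothetical.

```python
import math


def dot(sparse_x, dense_c):
    """Dot product touching only the non-zero entries of the document."""
    return sum(v * dense_c[j] for j, v in sparse_x.items())


def concept_vectors(docs, labels, k, w):
    """Step 3: normalized centroid of each cluster."""
    sums = [[0.0] * w for _ in range(k)]
    for x, j in zip(docs, labels):
        for idx, v in x.items():
            sums[j][idx] += v
    concepts = []
    for s in sums:
        norm = math.sqrt(sum(v * v for v in s))
        concepts.append([v / norm for v in s] if norm > 0.0 else s)
    return concepts


def objective(docs, labels, concepts):
    """Objective (4.2): total cosine similarity of documents to their concept vector."""
    return sum(dot(x, concepts[j]) for x, j in zip(docs, labels))


def spherical_kmeans(docs, k, w, tol=1e-6, max_iter=100):
    labels = [i % k for i in range(len(docs))]      # assumed round-robin start
    concepts = concept_vectors(docs, labels, k, w)
    q_old = q_new = objective(docs, labels, concepts)
    for _ in range(max_iter):
        # Step 2: move each document to its most similar concept vector.
        labels = [max(range(k), key=lambda j: dot(x, concepts[j])) for x in docs]
        concepts = concept_vectors(docs, labels, k, w)
        q_new = objective(docs, labels, concepts)
        if abs(q_new - q_old) <= tol * abs(q_old):  # relative-change stopping rule
            break
        q_old = q_new
    return labels, concepts, q_new
```

Because `dot` iterates only over a document's non-zero entries, one pass of step 2 costs time proportional to $nz \cdot k$, in line with the complexity analysis above.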

[Figure 1.3. Clusters stabilize after just a few iterations. Left plot: potential vs. actual number of document assignment changes against iteration count (t). Right plot: objective function value against iteration count (t).]

Suppose $\mathbf{c}_\ell^{(t)}$ is the concept vector of cluster $\ell$ at iteration $t$ and $\mathbf{x}$ is a document vector. Then,

$$\big| \mathbf{x}^T \mathbf{c}_\ell^{(t)} - \mathbf{x}^T \mathbf{c}_\ell^{(t+1)} \big| \ \le \ \|\mathbf{x}\| \, \big\| \mathbf{c}_\ell^{(t)} - \mathbf{c}_\ell^{(t+1)} \big\| = \big\| \mathbf{c}_\ell^{(t)} - \mathbf{c}_\ell^{(t+1)} \big\|,$$

and therefore

$$\mathbf{x}^T \mathbf{c}_\ell^{(t)} - \big\| \mathbf{c}_\ell^{(t)} - \mathbf{c}_\ell^{(t+1)} \big\| \ \le \ \mathbf{x}^T \mathbf{c}_\ell^{(t+1)} \ \le \ \mathbf{x}^T \mathbf{c}_\ell^{(t)} + \big\| \mathbf{c}_\ell^{(t)} - \mathbf{c}_\ell^{(t+1)} \big\|. \qquad (4.3)$$

The right side of inequality (4.3) gives a similarity upper bound which can be used profitably in the $(t+1)$-st iteration to avoid computing $\mathbf{x}^T \mathbf{c}_\ell^{(t+1)}$. The idea is to store in memory the quantities $\| \mathbf{c}_\ell^{(t)} - \mathbf{c}_\ell^{(t+1)} \|$ and upper bounds for $\mathbf{x}^T \mathbf{c}_\ell^{(t)}$. Suppose $\mathbf{x}$ belongs to the cluster $j$ at the $t$-th iteration. Then, at iteration $t+1$, we can update in $O(1)$ time the similarity upper bound for $\mathbf{x}^T \mathbf{c}_\ell^{(t+1)}$, $\ell \neq j$. If this similarity upper bound is smaller than $\mathbf{x}^T \mathbf{c}_j^{(t+1)}$ (which is computed exactly) there is no need to explicitly compute $\mathbf{x}^T \mathbf{c}_\ell^{(t+1)}$; otherwise the exact value needs to be computed. As the clusters become more and more stable, this similarity estimation can dramatically cut down on computation (see Figure 1.4). In our implementation, we store the similarity upper bounds in the $d \times k$ matrix $U$. We obtain the new algorithm by replacing step 2 of the spherical k-means algorithm by the following step (here we assume that similarity estimation is started after $t_{min}$ iterations).

[Figure 1.4. Computational savings due to similarity estimation: number of dot products computed against iteration count (t), with similarity estimation and for the original algorithm.]

2. Re-assign document vectors. For each document vector $\mathbf{x}_i$, $1 \le i \le d$, do the following:
   a. If $t < t_{min}$, do step 2a as in Section 4.1;
      else if $t = t_{min}$, set $U(i, \ell) = \mathbf{x}_i^T \mathbf{c}_\ell^{(t)}$ for all $\ell = 1, 2, \ldots, k$;
      else if $t > t_{min}$, do the following steps for all $\ell = 1, 2, \ldots, k$:
      a1. Set $U(i, \ell) = U(i, \ell) + \| \mathbf{c}_\ell^{(t)} - \mathbf{c}_\ell^{(t-1)} \|$.
      a2. Compute $\mathbf{x}_i^T \mathbf{c}_j^{(t)}$, where $\mathbf{x}_i$ belongs to cluster $j$ at iteration $t-1$. If $U(i, \ell) > \mathbf{x}_i^T \mathbf{c}_j^{(t)}$, compute $\mathbf{x}_i^T \mathbf{c}_\ell^{(t)}$ and set $U(i, \ell) = \mathbf{x}_i^T \mathbf{c}_\ell^{(t)}$.
   b. From among all $\mathbf{x}_i^T \mathbf{c}_\ell^{(t)}$ computed above, find $j^* = \arg\max_\ell \mathbf{x}_i^T \mathbf{c}_\ell^{(t)}$.

Figure 1.4 shows the considerable savings in the number of dot product computations as the iteration count increases in a typical run of the algorithm. To obtain these savings, an extra $4dk$ bytes are needed to store the matrix $U$. Additionally, an extra $O(dk)$ operations are required to update the matrix $U$ (see step a1 above).
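The pruning in steps a1 and a2 can be illustrated for a single document with the following Python sketch. The function `reassign_with_bounds` and its argument layout are assumptions for illustration; in particular, `upper` stands for one row of the matrix $U$ and `drift` for the list of norms $\| \mathbf{c}_\ell^{(t)} - \mathbf{c}_\ell^{(t-1)} \|$.

```python
import math


def dot(sparse_x, dense_c):
    """Dot product touching only the non-zero entries of the document."""
    return sum(v * dense_c[j] for j, v in sparse_x.items())


def norm_diff(a, b):
    """Euclidean norm of the difference of two dense vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def reassign_with_bounds(x, own, upper, new_concepts, drift):
    """Re-assign one document using similarity upper bounds.

    x            sparse document vector (dict: word index -> weight)
    own          cluster index of x at the previous iteration
    upper        upper bounds on x's similarity to each cluster (updated in place)
    new_concepts concept vectors of the current iteration
    drift        drift[l] = norm of (current minus previous) concept vector l
    Returns (best cluster, number of exact dot products computed).
    """
    k = len(new_concepts)
    for l in range(k):
        upper[l] += drift[l]                      # step a1: loosen every bound
    own_sim = dot(x, new_concepts[own])           # always computed exactly
    upper[own] = own_sim
    best, best_sim, computed = own, own_sim, 1
    for l in range(k):
        if l == own:
            continue
        if upper[l] > own_sim:                    # step a2: bound cannot rule out l
            upper[l] = dot(x, new_concepts[l])
            computed += 1
            if upper[l] > best_sim:
                best, best_sim = l, upper[l]
    return best, computed
```

A cluster whose bound does not exceed the exactly computed similarity to the document's own cluster cannot win the arg max, so its dot product is skipped.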

These extra costs are small compared to the resulting savings in computation time; see Section 5 for an example.

4.3 Clustering for Dimensionality Reduction

We now highlight another use of our clustering algorithm. Using the vector space model, each document may be represented as a high-dimensional vector of words. Clearly the occurrence of one word in a document is not independent of other words. Thus there is a great deal of redundancy in this collection of vectors. Dimensionality reduction is a technique that tries to represent each document as a vector with fewer dimensions that are more independent. It turns out that the above clustering algorithm also yields a fast and effective technique for dimensionality reduction.

Let $\{\mathbf{c}_j\}_{j=1}^{k}$ denote the concept vectors corresponding to a clustering of the document vectors. The concept matrix $C_k$ is defined to be a $w \times k$ matrix such that, for $1 \le j \le k$, the $j$-th column of the matrix is the concept vector $\mathbf{c}_j$, i.e., $C_k = [\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_k]$. An approximation $X_k$ to the word-document matrix $X$ may be obtained by taking the least squares projection of $X$ onto the column space of $C_k$, i.e., the linear subspace spanned by the concept vectors. This may be expressed as $X_k = C_k Z_k^*$, where $Z_k^*$ is a $k \times d$ matrix that is to be determined by solving the following least-squares problem:

$$Z_k^* = \arg\min_{Z} \| X - C_k Z \|_F^2.$$

It is well known that a closed form solution exists for this least-squares problem, namely,

$$Z_k^* = (C_k^T C_k)^{-1} C_k^T X. \qquad (4.4)$$

The $i$-th column of $Z_k^*$ gives the reduced $k$-dimensional representation of the $i$-th document vector. Typically the original dimensionality $w$ is in the thousands while $k$ is much smaller. In previous work, we have shown that empirically, the quality of the reduced dimensional representation given by (4.4) is comparable to the best possible, namely the $k$-truncated SVD; see [DM01] for details. Thus our clustering process followed by dimensionality reduction can be used in various applications, such as query retrieval, text classification, etc.
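To make formula (4.4) concrete, the sketch below solves the normal equations $(C_k^T C_k) Z = C_k^T X$ for small dense matrices stored as lists of rows. It is an illustration only: the helper names are hypothetical, it ignores sparsity, and for real data a numerical linear algebra library would be the more sensible choice.

```python
def matmul_t(a, b):
    """Return a^T b for matrices stored as lists of rows."""
    rows, k, d = len(a), len(a[0]), len(b[0])
    return [[sum(a[r][i] * b[r][j] for r in range(rows)) for j in range(d)]
            for i in range(k)]


def solve(g, b):
    """Solve g z = b by Gauss-Jordan elimination with partial pivoting."""
    k = len(g)
    aug = [list(g[i]) + list(b[i]) for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("concept vectors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [[aug[i][k + j] / aug[i][i] for j in range(len(b[0]))]
            for i in range(k)]


def concept_decomposition(x, c):
    """x: w-by-d matrix, c: w-by-k concept matrix. Returns the k-by-d matrix Z*."""
    gram = matmul_t(c, c)     # C^T C, a small k-by-k matrix
    rhs = matmul_t(c, x)      # C^T X, a k-by-d matrix
    return solve(gram, rhs)
```

Only a $k \times k$ system is solved, so the cost is dominated by forming $C_k^T X$.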

5. Experimental Results

In this section, we give experimental results for the entire clustering process; this includes time and memory consumed by the preprocessing phase as well as by the clustering algorithm. In addition, we need to evaluate the goodness of the clustering produced. Evaluating clustering results is a tricky business. However, in situations where documents are already categorized (labeled), we can compare the clusters with the true class labels. For this comparison we use the measures of purity and entropy as defined below; see also [SGM00].

Suppose we are given $c$ categories (true class labels) while the clustering algorithm produces $k$ clusters. The purity of cluster $\pi_\ell$ can be defined as

$$P(\pi_\ell) = \frac{1}{n_\ell} \max_h \big( n_\ell^{(h)} \big),$$

where $n_\ell = |\pi_\ell|$ and $n_\ell^{(h)}$ is the number of documents in $\pi_\ell$ that belong to class $h$, $h = 1, \ldots, c$. Note that each cluster may contain samples from different classes. Purity gives the ratio of the dominant class size in the cluster to the cluster size itself. A high purity value implies that the cluster is a pure subset of the dominant class. Additionally, we also use entropy as a quality measure, which is defined as follows:

$$H(\pi_\ell) = - \frac{1}{\log c} \sum_{h=1}^{c} \frac{n_\ell^{(h)}}{n_\ell} \log \left( \frac{n_\ell^{(h)}}{n_\ell} \right).$$

Entropy is a more comprehensive measure than purity. It considers the distribution of classes in a cluster. Note that we have normalized entropy to take values between 0 and 1. An entropy value of 0 means the cluster is comprised entirely of one class, while an entropy value near 1 is bad since it implies that the cluster contains a uniform mixture of classes.

We now examine these quality measures on a sample collection. We formed a collection of 3893 documents by merging the popular MEDLINE, CISI, and CRANFIELD sets. MEDLINE consists of 1033 abstracts from medical journals, CISI consists of 1460 abstracts from information retrieval papers, while CRANFIELD consists of 1400 abstracts from aeronautical systems papers (ftp://ftp.cs.cornell.edu/pub/smart).
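The purity and entropy measures defined in this section can be computed for a single cluster as in the short Python sketch below. The function names and the representation of a cluster as a list of true class labels are assumptions for illustration.

```python
import math
from collections import Counter


def purity(class_labels):
    """Fraction of the cluster's documents that belong to its dominant class."""
    counts = Counter(class_labels)
    return max(counts.values()) / len(class_labels)


def entropy(class_labels, num_classes):
    """Class entropy of the cluster, normalized by log(num_classes) to lie in [0, 1]."""
    n = len(class_labels)
    counts = Counter(class_labels)
    # Classes absent from the cluster contribute nothing to the sum.
    h = sum((m / n) * math.log(n / m) for m in counts.values())
    return h / math.log(num_classes)
```

Here `class_labels` holds the true class of every document in one cluster and `num_classes` is $c$, which must be at least 2 for the normalization to be defined.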
We preprocessed this collection by proceeding as in Section 3. After removing common stopwords, the collection contained unique words from which we eliminated low-frequency words appearing in fewer than 8 documents (roughly 0.2% of the documents), and 7 high-frequency words appearing in more than 585 documents (roughly 15% of the documents). We were finally left with 4303 words, from which we created 3893 document vectors using the tfn scheme. Each document vector has dimension 4303; however, on average, each document vector is about 99% sparse.

For this contrived collection, we used our clustering algorithm to form 3 clusters. The following table shows the confusion matrix for this clustering, from which we can see that the clusters can be easily identified with the three classes MEDLINE, CISI and CRANFIELD.

                   MEDLINE   CRANFIELD   CISI
patients (1023)
boundary (1388)
library (1481)

In the above table the rows denote clusters. Note that we have denoted the clusters by the most frequently occurring word in the cluster from among the words in the vector space model: patients, boundary and library. The above table says that the cluster patients has 1023 documents of which 1020 belong to the MEDLINE class, the boundary cluster has 1388 documents while the library cluster has 1481 documents. Notice that the confusion matrix is almost diagonal, which shows that the clustering algorithm nearly recovered the three classes. This fact is also reflected by the high purity and low entropy values of the three clusters shown below.

Cluster#   Purity   Entropy

Case Study on a Large Scientific Collection

To show the efficiency and effectiveness of our algorithms, we now present results on a large real-life collection of NSF award abstracts. We obtained the NSF data set by downloading abstracts of grants awarded by the National Science Foundation between Jan 1958 and August 1999 from For our experiments we extracted titles and abstracts (ignoring fields such as Type, NSF Org, Date, etc.) of awards made in the 8 NSF organizations: Directorate for Biological Sciences (BIO), Directorate for Computer and Information Science and Engineering (CISE), Directorate for Education and Human Resources (EHR), Directorate for Engineering (ENG), Directorate for Geosciences (GEO), Directorate for Mathematical and Physical Sciences (MPS), Office of Polar Programs (OPP) and Directorate for Social, Behavioral, and Economic Sciences (SBE). This results in a total of 113,716 abstracts. The number of documents per NSF organization is as follows:

Class label:   ENG   CSE   MPS   BIO   GEO   EHR   SBE   OPP
# documents:

The most populous class is MPS with over awards, with ENG being second. On the other hand, OPP contains the fewest documents (2767). Both the mean and median class size are about

We preprocessed this NSF collection by proceeding as in Section 3. After removing common stopwords, the collection contained unique words from which we eliminated low-frequency and high-frequency words. We were finally left with 26424 words, from which we created document vectors using the tfn scheme. Each document vector has dimension 26424; however, on average, each document vector contains only about 72 nonzero components and thus is more than 99% sparse.

In terms of the words contained in these documents, Table 1.1 shows the top ten most common words in the NSF data set. Note that we refer to the most frequently occurring word as having rank 1, the second most frequent word as having rank 2, and so on. As is common in most document collections, the majority of the words occur in very few documents. Figure 1.5 shows the distribution of all the word frequencies versus their rank on a log-log scale. This distribution approximately fits the so-called Zipf's law [Zip49].

Rank   Word         Frequency
1      abstract
2      research
3      title
4      project
5      study
6      data
7      system
8      university
9      program
10     science

Table 1.1. Top 10 most common words in the NSF data set

As mentioned above, there are a total of unique words in the NSF collection. In general, the number of unique words W grows slower than linearly with the number of documents d, i.e., $W = O(d^{\beta})$ where $0 < \beta < 1$. This behaviour, known as Heaps' law [Hea78], is seen in Figure 1.6.

[Figure 1.5. Distribution of word frequencies on a log-log scale (Zipf's Law). Axes: word frequency vs. word rank. Inset: top 10 most common words (abstract, research, title, project, study, data, systems, university, program, science).]

[Figure 1.6. Vocabulary size vs. number of documents (Heaps' Law). Axes: number of distinct words vs. number of documents.]

Preprocessing Results. We now show the results of our clustering process on this large NSF collection. The preprocessing phase takes time that grows linearly with the size of the data set, while the main memory consumed also grows linearly with the number of unique words. Experiments on subsets of the NSF collection validate these claims; see Figures 1.7 and 1.8. Figure 1.7 shows that we consume less than 18 MBytes of main memory to preprocess the entire NSF collection of documents. Extrapolating this number to a collection of one million documents, we see that 160 MBytes of main memory would be sufficient, which implies that our algorithms can carry out this large task on a single workstation with just 256 MBytes of main memory. From Figure 1.8 we see that the entire NSF data set is preprocessed in less than 20 minutes. Note that this includes disk I/O time. Our implementation is multi-threaded and hence the time taken is not very sensitive to the traffic over the Network File Server.

[Figure 1.7. Main memory consumed vs. size of data. Axes: main memory consumed (MBytes) vs. size of data (MBytes).]

[Figure 1.8. Computation time vs. size of data. Axes: computation time (seconds) vs. size of data (MBytes).]

Clustering Results. In this section, we examine the speed and quality of the clustering algorithm. As discussed in Section 4.1, the spherical k-means algorithm forms k clusters in $O(nz \cdot k)$ time. Figure 1.9 gives the time taken to cluster the NSF data set ( documents, words, non-zero entries) into a varied number of clusters. Times for both the original clustering algorithm and its modification that uses similarity estimation are shown (see Section 4.2). The similarity estimation technique yields considerable savings in computation time when the number of clusters is large. The running time of the algorithm is also seen to increase linearly with the number of clusters k. To form 100 clusters, 1400 seconds of computational time are needed. On the other hand, to form 12 clusters only 200 seconds are needed. Thus, combined with the preprocessing time of about 1190 seconds, the entire clustering process for 12 clusters takes about 1390 seconds, i.e., only about 23 minutes.

We now present a sample clustering obtained by the spherical k-means algorithm. We clustered the entire set of documents into 12 clusters. In the following table, we have listed the 10 dominant words and their weights in each of the concept vectors that correspond to the 12 clusters. Note that we have denoted each of the 12 clusters by the dominant word, e.g., theory, chemistry, physics, etc. By examining the top 10 words it is easy to see the concept contained in each cluster. For example, the conference cluster containing 6652 documents appears to be about NSF awards that support conferences, meetings, workshops and symposiums for international scientists and researchers.
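The concept vectors and dominant words mentioned above can be illustrated with a small sketch of spherical k-means on sparse vectors. This is not the authors' implementation: it assumes the standard formulation (unit-length document vectors, assignment by largest dot product, concept vector equal to the normalized sum of a cluster's documents), seeds the concept vectors with the first k documents purely for simplicity, and uses hypothetical names and toy data throughout. Because each dot product touches only the non-zeros of a document, one assignment pass costs roughly the number of non-zeros times k.

```python
import math


def normalize(vec):
    """Scale a sparse vector (dict: word -> weight) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else dict(vec)


def dot(doc, concept):
    """Dot product that iterates only over the document's non-zero entries."""
    return sum(w * concept.get(t, 0.0) for t, w in doc.items())


def spherical_kmeans(docs, k, iterations=20):
    docs = [normalize(d) for d in docs]
    # Assumption: seed the concept vectors with the first k documents.
    concepts = [dict(d) for d in docs[:k]]
    assign = [-1] * len(docs)
    for _ in range(iterations):
        # Assign every document to the concept vector with the largest cosine.
        new_assign = [max(range(k), key=lambda j: dot(d, concepts[j]))
                      for d in docs]
        if new_assign == assign:
            break
        assign = new_assign
        # New concept vector: normalized sum of the documents in the cluster.
        sums = [{} for _ in range(k)]
        for d, j in zip(docs, assign):
            for t, w in d.items():
                sums[j][t] = sums[j].get(t, 0.0) + w
        # An empty cluster keeps its previous concept vector.
        concepts = [normalize(s) if s else c for s, c in zip(sums, concepts)]
    return assign, concepts


def top_words(concept, n=3):
    """Words with the largest weights in a concept vector."""
    return sorted(concept, key=concept.get, reverse=True)[:n]


# Hypothetical toy collection of four tiny documents (word -> count).
docs = [
    {"protein": 2, "gene": 1},
    {"ocean": 2, "ice": 1},
    {"gene": 2, "protein": 1, "cell": 1},
    {"ice": 2, "ocean": 1, "climate": 1},
]
assign, concepts = spherical_kmeans(docs, k=2)
for j, concept in enumerate(concepts):
    print(j, top_words(concept))
```

Naming each cluster by `top_words(concept, 1)[0]` mirrors the convention used in the text, where clusters are denoted by their dominant word.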

[Figure 1.9. Time required to cluster the NSF collection. Curves: with similarity estimation, original algorithm. Axes: computation time (seconds) vs. number of clusters.]

We now see the extent to which our clustering algorithm is able to recover the 8 NSF classes: BIO, CSE, EHR, ENG, GEO, MPS, OPP and SBE. The following table gives the confusion matrix between the 12 clusters and 8 classes. Note that the largest number in each row of the table is boldfaced. We have ordered the clusters and classes so that the confusion matrix has large numbers near the diagonal. The fact that the numbers near the diagonal are larger than those away from the diagonal implies that many of the clusters can be identified with NSF organizations. For example, the theory and chemistry clusters can be identified with MPS, the social cluster with SBE, the protein and species clusters with BIO, and the ocean cluster can be identified with GEO.

Interestingly, some of the clusters indicate the multidisciplinary nature of some NSF organizations. The conference cluster is seen to have many awards from MPS, ENG, SBE and BIO, but no particular organization is dominant. Also, organizations such as MPS are seen to have subcategories: mathematical theory, chemistry reactions and quantum physics. The numbers in the confusion matrix also indicate the inter-relationships between various NSF organizations. For example, the materials cluster indicates closeness between MPS and ENG, while the design cluster shows the closeness between ENG and CSE. The quality measures of purity and entropy for this clustering are given in the following table.
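A confusion matrix of the kind discussed above can be tabulated directly from the cluster assignments and the true class labels. The sketch below is illustrative only: the function names are hypothetical and the seven labeled documents are invented, not taken from the NSF collection.

```python
from collections import Counter, defaultdict


def confusion_matrix(cluster_ids, class_labels):
    """Map each cluster to a Counter of the true classes of its documents."""
    table = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        table[cluster][label] += 1
    return table


def dominant_class(row):
    """Most frequent class in one row, with its share of the row (the purity)."""
    label, count = row.most_common(1)[0]
    return label, count / sum(row.values())


# Invented example: cluster name and true class for nine documents.
cluster_ids = ["theory"] * 4 + ["design"] * 5
class_labels = ["MPS", "MPS", "MPS", "CSE", "ENG", "ENG", "ENG", "CSE", "CSE"]

table = confusion_matrix(cluster_ids, class_labels)
for cluster, row in table.items():
    print(cluster, dict(row), dominant_class(row))
```

A row whose counts are split between two classes, like the invented design row here, is the pattern the text reads as closeness between two organizations.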

theory (7442): theory(.339), mathematical(.295), equations(.25), geometry(.192), problems(.186), sciences(.177), differential(.168), algebraic(.165), manifolds(.126), spaces(.114)
chemistry (7420): chemistry(.448), reactions(.217), organic(.21), nmr(.194), molecules(.182), metal(.169), chemical(.151), compounds(.142), spectroscopy(.122), reaction(.12)
physics (9835): physics(.189), quantum(.166), magnetic(.166), particle(.144), electron(.135), stars(.13), solar(.129), energy(.123), laser(.115), theoretical(.112)
materials (11589): materials(.292), phase(.192), properties(.156), films(.14), polymer(.135), surface(.125), thin(.116), optical(.114), temperature(.11), liquid(.107)
design (13255): design(.243), algorithms(.216), control(.18), parallel(.175), performance(.149), problems(.144), software(.118), models(.115), computer(.112), optimization(.1)
conference (6652): conference(.47), workshop(.399), international(.23), held(.207), meeting(.181), symposium(.178), travel(.135), participants(.112), scientists(.107), researchers(.096)
mathematics (7374): mathematics(.342), teachers(.333), school(.259), education(.177), college(.151), year(.150), teacher(.132), summer(.125), schools(.12)
laboratory (11302): laboratory(.303), equipment(.241), undergraduate(.225), computer(.186), engineering(.185), courses(.175), student(.132), biology(.126), faculty(.124), projects(.124)
social (8267): social(.27), economic(.180), political(.167), policy(.143), dissertation(.13), language(.126), public(.104), cultural(.103), decision(.1), market(.087)
protein (9212): protein(.273), cell(.258), cells(.235), proteins(.22), gene(.196), genes(.179), dna(.153), expression(.138), molecular(.131), regulation(.119)
species (8822): species(.326), plant(.167), populations(.151), plants(.133), population(.125), evolutionary(.125), evolution(.101), variation(.098), ecological(.096)
ocean (12546): ocean(.232), ice(.194), climate(.153), mantle(.147), sea(.134), water(.118), seismic(.106), circulation(.097), pacific(.096), isotopic(.092)

Table 1.2. Top 10 words associated with each cluster (the numbers in the right column give the word's weight in the concept vector)

Despite the above quality measures, it is sometimes best to have an informal way of examining a clustering. For this purpose, we have created a sequence of web pages to browse through the clusters, and the documents and keywords contained in them. The clustering presented above is available at

Table 1.3. Confusion matrix between the 8 true classes and 12 clusters for the entire NSF collection
Columns (classes): MPS, ENG, CSE, EHR, SBE, OPP, BIO, GEO
Rows (clusters): theory(7442), chemistry(7420), physics(9835), materials(11589), design(13255), conference(6652), mathematics(7374), laboratory(11302), social(8267), protein(9212), species(8822), ocean(12546)

Table 1.4. Purity and entropy results for the computed clusters of the entire NSF collection
Columns: Cluster, Purity, Entropy
Rows (clusters): theory, chemistry, physics, materials, design, conference, mathematics, laboratory, social, protein, species, ocean

6. Conclusions and Future Work

In this paper, we have presented a time and memory efficient scheme for clustering very large document collections. We employ a multi-threaded approach to reading and preprocessing the documents to mitigate disk I/O costs, and use the effective spherical k-means algorithm to cluster the documents. Our experimental results show that we are able to preprocess and cluster 113,716 NSF award abstracts in 23 minutes on a Sun workstation with modest memory consumption. We have also demonstrated that the quality of the clustering is good. Based on the experiments we have done in this paper, we predict that we can use our scheme to cluster a million documents into 12 clusters in less than 4 hours on a Sun workstation.

In future work, we will continue our focus on improving the efficiency and scalability of our scheme. The dot product computation between all the document vectors and concept vectors is the computational bottleneck in the spherical k-means algorithm. To cut down on this computation, techniques like truncation, which project each document vector onto a small subspace of the total word space, will be investigated; see [SS97]. Meanwhile, more sophisticated handling of I/O threads will be studied in order to cut down the I/O cost, which is the bottleneck for preprocessing. Parallelizing the whole process can be one of the approaches. Hierarchical clustering will also be studied in future work to discover the inherent hierarchical structure and the correct number of clusters in the data set.

References

[BGG+98] D. Boley, M. Gini, R. Gross, E.-H. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. Document categorization and query generation on the World Wide Web using WebACE. AI Review, 1998.
[Ca99] Brent Callaghan. NFS Illustrated. Addison-Wesley, 1999.
[CKPT92] D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey. Scatter/gather: A cluster-based approach to browsing large document collections. In ACM SIGIR, 1992.
[DGL89] I. Duff, R. Grimes, and J. Lewis. Sparse matrix test problems. ACM Trans Math Soft, pages 1-14, 1989.
[DH73] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.
[DM01] I. S. Dhillon and D. S. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1), January 2001. Also appears as IBM Research Report RJ 10147, July.
[FBY92] W. B. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice Hall, Englewood Cliffs, New Jersey, 1992.
[GJW82] M. R. Garey, D. S. Johnson, and H. S. Witsenhausen. The complexity of the generalized Lloyd-Max problem. IEEE Trans. Inform. Theory, 28(2), 1982.

[Hea78] J. Heaps. Information Retrieval - Computational and Theoretical Aspects. Academic Press, 1978.
[Ko97] T. G. Kolda. Limited-Memory Matrix Methods with Applications. PhD thesis, The Applied Mathematics Program, University of Maryland, College Park, Maryland, 1997.
[KPR98] Jon Kleinberg, C. H. Papadimitriou, and P. Raghavan. A microeconomic view of data mining. Data Mining and Knowledge Discovery, 2(4), December 1998.
[Mit97] Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[MS96] D. Musser and A. Saini. STL Tutorial and Reference Guide. Addison-Wesley, 1996.
[NBF96] Bradford Nichols, Bick Buttlar, and Jackie Proulx Farrell. Pthreads Programming. O'Reilly & Associates, Inc., 981 Chestnut Street, Newton, MA 02164, USA, 1996.
[Pax96] Vern Paxson. Flex user manual, November 1996.
[Ras92] E. Rasmussen. Clustering algorithms. In William B. Frakes and Ricardo Baeza-Yates, editors, Information Retrieval: Data Structures and Algorithms. Prentice Hall, Englewood Cliffs, New Jersey, 1992.
[SB88] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 4(5), 1988.
[SGM00] A. Strehl, J. Ghosh, and R. Mooney. Impact of similarity measures on web-page clustering. In Proceedings of the AAAI2000 Workshop on Artificial Intelligence for Web Search, pages 58-64, Austin, Texas, July 2000. AAAI/MIT Press.
[SM83] G. Salton and M. J. McGill. Introduction to Modern Retrieval. McGraw-Hill Book Company, 1983.
[SS97] H. Schütze and C. Silverstein. Projections for efficient document clustering. In ACM SIGIR, 1997.
[Wi88] P. Willett. Recent trends in hierarchic document clustering: a critical review. Information Processing & Management, 24(5), 1988.
[ZE98] O. Zamir and O. Etzioni. Web document clustering: A feasibility demonstration. In ACM SIGIR, 1998.
[Zip49] G. K. Zipf. Human Behavior and the Principle of Least Effort. Addison Wesley, Reading, MA, 1949.