IX- On Some Clustering Techniques for Information Retrieval. J. D. Broffitt, H. L. Morgan, and J. V. Soden

IX-1 IX- On Sme Clustering Techniques fr Infrmatin Retrieval J. D. Brffitt, H. L. Mrgan, and J. V. Sden Abstract Dcument clustering methds which have been prpsed by R. E. Bnner and J. J. Rcchi are cmpared. Bnner's methd is fund t give higher precisin than Rcchi T s methd, while the recall fr the tw methds is abut the same. Bnner's methd necessitates abut twice as many cmparisns against a query vectr as Rcchi's methd; this is t be expected since Rcchi cntrls the cluster size in rder t maximize search efficiency. Manual relevance judgments are used as well as relevance judgments determined by query dcument csines. The results are fund t be invariant under the tw measures. 1. Intrductin The rganizatin f infrmatin int hmgeneus grups plays a majr rle in many fields f research. Sme areas f applicatin are infrmatin retrieval, bilgical taxnmy, islatin f disease syndrmes in medicine, anthrplgy (categrizatin f tribes), and business applicatins such as categrizing TV audiences, sales ffices, etc. Indeed, the applicatins f infrmatin rganizatin are numerus, and the particular applicatin being studied dictates the type f classificatin needed. Needham[l] has divided classificatin prblems int three types: (1) the assignment f given bjects t given classes, (2) the extractin f class characteristics frm given classes and their bjects, and

rx-2 (3) the setting up f apprpriate classes, clusters, clumps, r grups given a set f bjects and same infrmatin abut them. Even mre specifically, this last prblem may be viewed as a prblem f either structuring the bjects int a hierarchy r tree arrangement, r cllecting the bjects int cherent grups withut regard fr hierarchical relatins amng the bjects r grups. Bth f these prblems find their place within the realm f an autmatic infrmatin retrieval system, where dcuments are identified by vectrs f weighted measures f ccurrences f cncepts. Thesaurus cnstructin, fr example, while being cncerned with the gruping f cccurring cncepts fr synnym references, has as a majr bjective the determinatin f hierarchies f infrmatin amng the cncepts. On the ther hand, dcument clustering deals with the cllectin f similar dcuments int grups based nly n similarities amng the dcuments. The applicatin f infrmatin rganizatin t dcument clustering is the cncern f this study. In rder effectively t perate an autmatic infrmatin retrieval system based n vectr matching between queries and dcuments, an efficient dcument clustering prcedure is necessary. The number f dcuments which must be cmpared with the query in rder reasnably t satisfy the demands f an infrmatin request is t large t permit individual cmparisns with each dcument in the cllectin. Once the dcuments have been clustered int hmgeneus grups, a tw-level search prcedure greatly reduces the number f cmparisns needed t answer a request. This methd first cmpares the query vectr with the classificatin vectrs characterizing the grups. The secnd level cmpares the query vectr with all dcuments belnging t

n-3 the grups with whse classificatin vectr the query vectr was fund t crrelate higily n the first level. While increasing the search efficiency in terms f the number f cmparisns made with a query vectr, the clustering prcess inherently decreases the infrmatin which categrizes the individual dcuments. Thus the prblem f dcument clustering is clear: classifying the dcuments int hmgeneus grups in rder t increase search efficiency withut seriusly sacrificing the ability t retrieve the dcuments. schemes have been suggested t this end. Varius These prcedures gather dcuments int grups based n sme type f crrelatin functin which assciates similar dcuments. Since there exists n purely analytical methd fr cmparing the relative efficiency f these varius methds, and since this "efficiency" may differ with the use t which the grups are put, it is necessary t cmpare the clustering prcedures in the cntext f an autmatic infrmatin retrieval system, and t evaluate the merits f the clustering prcedures by evaluating the verall perfrmance f such a system, with nly a change in the clustering methd frm experiment t experiment. The evaluatin f the verall perfrmance f such a system clearly invlves the envirnment in which the system is used, and the users themselves. Specifically, the present study cnsists f a cmparisn f tw clustering prcedures: a methd f R. E. Bnner [2], and the methd prpsed by J. J. Rcchi [3]. These techniques can reasnably be studied by a cmparisn f the dcuments retrieved by an autmatic retrieval system using a tw level search based n the clusters prduced by the methds with thse retrieved by a full search f the dcument cllectin.

IX-4 2. Similarity Measures In the present system, dcuments and queries are represented as vectrs in n-dimensinal Euclidean space, where n is the number f allwable cncepts r index terms in the system. Dcuments are retrieved n the basis f their clseness t the query vectr, with "clseness" meaning small Euclidean distance between the vectrs. Since the dcument vectrs are f varying length, hwever, perpendicular distance at same fixed distance frm the rigin may nt always be a gd measure. Nrmalizatin f the dcument vectrs s that their endpints lie n the unit hypersphere, and use f the arc-length alng the hypersphere as a distance measure, remves this prblem. The measures used in the present study are functins f this arc length thrugh the csine f the angle between tw dcument vectrs. The measures used are defined in the fllwing ways: (1) Csine measure S d d - 12 (2) Tanimt's measure [k] d l- d 2 ivl^l ^ where d, and d- are dcument vectrs and S-. is the similarity f 1 2 aid2 dcument ne with dcument tw. Bnner uses the cefficient (2) t frm his dcument-dcument similarity matrices, while Rcchi uses the csine measure t cmpare dcuments.

EC-5 Rcchi states that since the csine is used as the matching functin t retrieve dcuments, clustering with it shuld give better results. T test this, Bnner's methd is being tried using bth cefficients. 3. Rcchifs Prcedure The stated bjective f Rcchi1s prcedure is that f jintly maximizing search efficiency and minimizing lss f relevant dcuments retrieved, in the search. It is a heuristically derived algrithm which is meant t be used in cnjunctin with a tw-level search. The input parameters t the algrithm are: (1) the number f categries desired; (2) lwer and upper bunds n the number f elements t be allwed in a categry; (3) a lwer bund n the crrelatin between a dcument and a classificatin vectr, belw which a dcument will nt be placed in a categry. All dcuments are first cnsidered unclustered, and pass frm, this state int ne f tw ther pssible states, clustered r lse. The algrithm prceeds as fllws. An unclustered dcument is selected-as a pssible cluster center. All f the ther unclustered and lse dcuments are crrelated with it and the selected dcument is subjected t a regin density test t see. if a categry shuld be frmed arund it. This test specifies that mre than N, dcuments shuld be crrelated higher than p. with the candidate, and that mre than N~ dcuments shuld be crrelated higher than p p with the candidate. This ensures that dcuments n the edge f large grups d nt becme centers f grups. Fr example, in This sectin is a summary f sectin k.5 f reference [3], t which the reader is referred fr a mre detailed discussin and prgram flwcharts.

K-6 Figure 1 belw, dcument A wuld pass the test while dcument B wuld nt (the dcuments are here represented by their endpints n the unit hypersphere) If the dcument passes the \ ': # # Example f the density test. Figure 1 regin density test, the cutff n categry size is used t find the lwest crrelated dcument that wuld be included in the grup. This figure is used t setup a crrelatin p.. If a dcument crrelates belw p. with a classificatin vectr, it will nt be included in the cluster. By using this cutff, dcuments that are in an area between tw classificatin vectrs are likely t be included nly in the cluster t which they are crrelated mre highly. This means that the bundaries between grups f dcuments which lie near each ther will be sharpened, althugh sme dcuments may still be included in bth grups. A classificatin vectr is then frmed by taking the centrid f all f the dcument vectrs belnging t the cluster at this time. This centrid is then matched against the entire cllectin, and the cutff parameters n categry size are used t create the cluster. At this pint, sme dcuments may be in mre than ne cluster. Als, sme dcuments which were in a cluster when the centrid was frmed may n lnger be in the cluster. These dcuments, as well as thse which fail the regin density test, are then marked lse, and thse in the cluster are marked clustered.

DC-7 This entire prcedure is repeated with all unclustered dcuments. When this is cmpleted, there is n guarantee that the required minimum number f clusters have been frmed. Hence, sme dcuments which failed the regin density test in the first pass are chsen as cluster centers. If the number f categries frmed in the first pass was t.high, the density test may be made stricter and the first pass repeated. There may still be sme relatively islated dcuments at the end f the clustering prcess. These crrelate very prly with any f the classificatin vectrs. In rder t prperly test the prcedures with the limited cllectins available, these dcuments are included in the cluster with which they crrelate highest. This prcedure has been prgrammed fr the CDC l66k FORTRAN 63 and CODAP by V. Lesser. cmputer in The utput f this prgram cnsists f a deck f punched cards which specify the dcuments belnging t each categry and the classificatin vectr fr each categry frmed. These cards are then used as input t a tw-level search prcedure prgram, written by the authrs in FORTRAN 63, which cmpares the queries t the dcuments in the clusters crrelating highest with the query vectr. b. Bnner's Prcedure This algrithm is based upn Clustering Prgrams I and II and the Cluster Adjustment Prgram presented by Bnner in reference [2], and has been prgrammed by the authrs fr the CDC 160^ cmputer as FORTRAN 63 subrutines DOCDOC, SIMSIM, CLUSTER, and ADJUSTCL. Subrutine DOCDOC accepts as input a binary dcument-term matrix and calculates a dcument-dcument similarity matrix S, using either f the

n-8 similarity measures discussed in sectin 2. S is then transfrmed int a binary matrix whse elements are zer (0) if the calculated similarity cefficient is belw an empirically determined threshld, r ne (l) therwise. Subrutine SIMSIM then calculates the similarity matrix f the dcument-dcument matrix in a manner cmpletely analagus t DOCDOC. This is bviusly equivalent t calculating the similarity matrix f anther similarity matrix and is repeated as many times as necessary t better define the clusters which result. Next, subrutine CLUSTER uses the algrithm as utlined in Figure 2 t define a "cluster" as that set f dcuments f maximum size where each dcument is similar (in the final dcument-dcument matrix f SIMSIM) t all ther members f the set. In additin, n cluster can be a subset f anther cluster* After CLUSTER has fund the "tight" clusters r, in graph theretic terms, the maximal cmplete sublps f the dcument cllectin, subrutine ADJUSTCL attempts t redefine the clusters in a manner mre cnducive t search ptimizatin. This is accmplished by transferring a member f a cluster with membership less than a predetermined cnstant int the large cluster t which it is mst similar. This similarity is the percentage f dcuments in the large cluster t which the transferring dcument is linked in the first dcument-dcument similarity matrix. Befre the transfer is made, hwever, this percentage is checked against a predetermined value (SIMTHEES), belw which the transfer will nt be made. After all shifting is cmpleted, the classificatin vectr is calculated. This is merely the centrid vectr f all f the dcument vectrs

EC-9 s,_ - (j s k F W 1 in "last" similarity matrix} U d'c D, ial,...,m} «list f clusters fund = subset f D frm which clusters are t be drawn = pssible cluster currently being frmed «list f dcuments currently being cnsidered fr inclusin in C j «n. f clusters fund m «n. f d. D C P n «n. f dcuments currently in C k - C(n) z p - & n p n k n-u C» C + P n (l) n n + 1 C «C - C(n) n n - 1 k = C(n) ' J CONTINUE i LIST THE j CLUSTERS, Tj, IN F w-i-yi Bnner's Cluster Building Algrithm H Figure 2

IX-10 in the cluster. This set f classificatin vectrs is used in the tw-level search prcedure, as described at the end f sectin 3. 5. The Experiment The experimental envirnment cnsisted f 82 dcuments and 20 queries frm an American Dcumentatin Institute (ADI) cllectin. These dcuments were autmatically indexed by the SMART autmatic dcument retrieval system. [5] This system prvided 82 dcument vectrs and 20 query vectrs in 601-dimensinal Euclidean space. These vectrs are then nrmalized t length ne and used as input t bth Bnner1 s and Rcchi's clustering prcedures. Since each f these prcedures depends n several parameters, many runs were planned with each methd in rder empirically t determine desirable values fr these parameters. The results presented belw are based mainly n six cmputer runs, fur using Bnner's clustering methd and tw using Rcchi's clustering methd. Each run cnsists f a clustering prcess fr the 82 dcuments, resulting in the generatin f the classificatin vectrs, fllwed by a tw-level search fr each f the 20 queries. In additin, a run was made t match each query with the entire dcument cllectin using the csine matching functin, s as t btain an rdering f the dcuments by this crrelatin fr each f the 20 queries. At the end f the first level f the search, nly the tw clusters which crrelate highest with a query are retained. This is dne because f the small size f the cllectin. If mre than tw clusters are retained, the ttal number f cmparisns with a given query vectr becmes

K-ll t large t btain any savings ver a search f the entire cllectin. In larger cllectins, ne wuld prbably wish t lk nt nly at a certain number f clusters, but t take int accunt their crrelatins with a query as well, when determining hw many clusters t search. 6. Evaluatin The results f the experiment are evaluated by using the recall and precisin measures defined in reference [5]*, and the mean number f cmparisns per query. In calculating the recall and precisin measures, the questin f which dcuments are relevant t a given query arises. Fr the,82 dcument ADI cllectin, judgments have previusly been made by searching the entire cllectin manually t identify all dcuments relevant t a given query. It is nt clear that recall and precisin calculated by using these manual relevance judgments prvide a gd measuring device fr cmparing the clustering prcedures. Hence, an additinal set f relevance judgments has been made based upn the results f the search f the entire cllectin described in the previus sectin. Fr each query, a.n dcuments crrelating abve.30 with a query (.30 was chsen t give the same average number f relevant dcuments per query as the manual judgments gave) are cnsidered relevant. In the results presented belw, the recall and precisin figures fr bth the manual and the "autmatic" relevance judgments are given. The authrs feel that bth sets f measures shuld be cnsidered in evaluating the perfrmance f a clustering prcedure as part f an verall autmatic infrmatin retrieval system. * Recall = number f relevant dcuments retrieved number f relevant dcuments in the cllectin number f dcuments retrieved Precisin = number f relevant dcuments retrieved

EC-12 7. Results and Cnclusins The results f the study are summarized in Table 1* One striking characteristic f Bnner's methd is the large number f clusters it prduces. This is t be expected since Bnner refuses t assciate dissimilar dcuments, whereas Rcchi allws dissimilar dcuments t be assciated in rder t build clusters f a size cnducive t search efficiency. Thus, ne wuld expect that the mean number f matches made with a query vectr using Rcchifs methd wuld be less than the mean number f matches made using Bnner's methd. The results supprt this hypthesis since apprximately twice as many matches are required using Bnner's methd as when using Rcchi's methd. Next, restricting ur attentin t the recall and precisin results btained using the manual relevance judgments, it is apparent that Bnner's methd exhibits higher precisin than, and nearly equivalent recall t Rcchi's methd. One wuld expect the higher precisin since there are far fewer members in each cluster, and hence, when a cluster is retrieved, it is mre likely t cntain a high percentage f relevant dcuments. Als, there is a higher similarity between members f the same cluster with Bnner's methd than with Rcchifs methd. Hence, if the cluster is similar, mre f the dcuments in it are likely t be relevant since they are all very similar. The nearly equivalent recall between the tw methds is smewhat surprising, as ne wuld expect the large number f dcuments retrieved by using Rcchi's methd t include mre f the relevant nes. This is the case if the tw highest ranking clusters are used, but is nt true if nly the highest ranking cluster is used, althugh using nly the

EC-13 w c 0 a c g-s 0) R VO ft cn ft fc H CVJ _ H 3 S DC en VO cn ON ^ IA g rh c: a} 't O -H «8 2 s» H 00 cs CS ^ S 0) CO -* Cn * & a H -3- -4- IA 1 O c *d d -P c cy CD Q P5 35 CO v CV) in IA CO m 1 O s -5f 3 00 (9 cn H CO cvi! 1 IA I CVJ I 00 IA \ cn\ I CO <u -p CO IA IA ON 00 a a CO CO 8 CD I 1 1 Bnner (2) Csine [H IA 1 ' O if d «EH 5 5 1 CVJ I S, 3 2 PQ EH L CVJ CVJ 1 jl ffl En -^ H H 42 O O!2 1 H 45 O v H H x: «t- H ll 00 1 cvn w ^-*

highest ranking clusters makes the number f dcuments retrieved clser t the number retrieved by Bnner's methd. The superirity f the csine similarity measure is evidenced by entry number ne in Table 1. In this run, the csine cefficient was used t btain the similarity matrices fr Bnner's prcedure. This entry clearly dminates the next three entries fr any reasnable measure f "gdness". Further research using Bnner's methd shuld therefre fcus attentin upn this similarity measure. If we cmpare the results fr all f the entries using the manual relevance Judgments with thse using the "autmatic" relevance judgments, it is clear that the relative ranking f the entries is nt changed. These results are, fr all practical purpses, invariant under either set f relevance judgments. The abve results shuld be viewed in their prper light. The dcument cllectin is very small and the number f cmputer runs btained s far t few t make any strng statements. It is clear that a mre thrugh analysis f the effects f the parameters f these clustering prcedures shuld be made befre any slid cnclusins are drawn regarding the relative efficiency f the tw methds tested.

n-15 References [l] R. M» Needham, Applicatins f the Thery f Clumps, Mechanical Translatin, Vl. 8, Ns. 3 and k, June-Octber 1965. [2] R. E. Bnner, On Sme Clustering Techniques, IBM Jurnal f Research and Develpment, Vl. 8, N. 1, January 1964. [3] J«J. Rcchi, Harvard University Dctral Thesis, Reprt N. ISR-10 t the Natinal Science Fundatin, April 1966. [k] D. Rgers and T. Tanimt, A Cmputer Prgram fr Classifying Plaints, Science, Vl. 132, pp. 1115-1118, Octber i960. [5] G. Saltn, et. al. Infrmatin Strage and Retrieval, Reprt N. ISR-9 t Natinal Science Fundatin, Harvard Cmputatin Labratry, August 1965* [6] J. Rubin, Optimal Classificatin int Grups: An Apprach fr Slving The Taxnmy Prblem, (unpublished IBM reprt), March 1966. [7] W. D. Fisher, Tw-Way Cluster Analysis, (preliminary reprt) Kansas State University, 1966.