A New Evaluation Measure for Information Retrieval Systems

Martin Mehlitz, Christian Bauckhage
Deutsche Telekom Laboratories
Jérôme Kunegis, Şahin Albayrak

Abstract

Some of the established approaches to evaluating text clustering algorithms for information retrieval show theoretical flaws. In this paper, we analyze these flaws and introduce a new evaluation measure to overcome them. Based on a simple yet rigorous mathematical analysis of the effect of certain parameters in cluster-based retrieval, we show that certain conclusions drawn in the recent literature must be taken with a grain of salt. Our new measure, in contrast, accounts for statistical biases that have to be expected according to our analysis. A series of experiments and a comparison with recently reported results underline that this measure is a more suitable performance indicator that allows for more meaningful interpretations.

I. INTRODUCTION

Information retrieval is the discipline within computer science that deals with the automated storage and retrieval of documents and data [1], [2], [3], [4]. As the amount of data available to modern information societies continues to grow rapidly, it is not surprising to find that information retrieval is a very active area of research. In this paper, we focus on the problems of text retrieval and text clustering, where the latter has been proposed as a technique for improving text retrieval [5].

In general, systems that apply text clustering retrieve documents either by automatically ranking clusters and then selecting documents from the highest-ranking clusters [6] or by showing the cluster structure to the user, who has to select a cluster whose documents are then displayed [7]. The output of a system that automatically ranks clusters is a flat list of documents similar to that of a regular information retrieval system. Therefore, the evaluation of these systems is similar to the evaluation of regular information retrieval systems.
Evaluating a system that shows a cluster structure to a user, though, requires other measures. A methodology that is often applied for the evaluation of these systems is the optimal cluster search, which selects the cluster with the optimal value with regard to the employed measure [8]. In this paper, we analyze the pitfalls of this evaluation methodology. Based on an appropriate mathematical model of what is going on in cluster-based text retrieval, we reconsider results reported in a recent paper that evaluates several cluster algorithms for query-specific clustering [9]. Our main contribution is a new measure for evaluating information retrieval systems, which circumvents the shortcomings that we identify in our analysis.

The remainder of this paper is structured as follows: In section II, we briefly review concepts applied in information retrieval. Section III then presents a mathematical analysis of how the optimal cluster search methodology affects precision and recall measures. Based on the results we derive for random clusterings, section IV introduces a new evaluation indicator which, in the experiments in section V, is compared to the classic evaluation measures. A summary and an outlook on future work conclude this contribution.

II. BACKGROUND AND RELATED WORK

A. Clustering Documents for Information Retrieval

Clustering is the process of grouping items in such a way that items in a group are similar to each other and dissimilar to the items in other groups [10]. In information retrieval, the difficulty of retrieving the desired documents given a short query often lies in the ambiguity of the query. Terms in a query can occur in various documents and contexts, and usually it is very difficult to identify the documents where a search term is used in the intended sense. Therefore, many search results in a result list of a classic information retrieval system are irrelevant.
Clustering results for information retrieval is a way to circumvent this problem by structuring the (probably) very long list of search results. Of course, this will not avoid any irrelevant documents, but if the cluster labels are intuitive enough, the user can directly choose the cluster which contains the most relevant documents for his request.

Figure 1 illustrates the process of document clustering. Starting with a user querying an information retrieval system with the phrase "jaguar", a list of relevant documents is found by matching strings in the search phrase with documents. While most search engines stop after the second step (the actual retrieval of documents), text clustering proceeds by analyzing the documents in the result list in order to group similar documents. The user who queried for "jaguar" can then choose among the clusters and access the corresponding documents.

Fig. 1. Didactic example of clustering documents for information retrieval. Query: Jaguar. Retrieved documents: Doc1 (join the Jaguar club car), Doc2 (car club jaguar), Doc3 (this weekend's race jaguar driver), Doc4 (the Jaguar driver club), Doc5 (saw animal big cat jaguar habitat), Doc6 (Jaguar car sale), Doc7 (animals jaguar in its natural habitat), Doc8 (as I saw this car, the Jaguar), Doc9 (Jaguar animal join nature club), Doc10 (Jaguar on a Mac OS computer sale), Doc11 (on my computer jaguar software). Clusters: Car (Doc1, Doc2, Doc3, Doc4, Doc6, Doc8), Animal (Doc5, Doc7, Doc9), Computer (Doc10, Doc11).

B. Evaluation of Information Retrieval Systems

The measures typically employed to evaluate an information retrieval system are precision and recall. Precision ("How many of the retrieved results are relevant?") and recall ("How many of the relevant results are retrieved?") are defined as:

$$P(g_{ret}, r) = \frac{g_{ret}}{r} \tag{1}$$

and

$$R(g_{ret}, g) = \frac{g_{ret}}{g}, \tag{2}$$

where $r$ denotes the number of documents retrieved, $g$ corresponds to the number of relevant documents in the collection, and $g_{ret}$ denotes the number of relevant documents among the returned documents.

The effectiveness measure $E$ is proposed in [8] as a measure for the evaluation of information retrieval systems that are based on text clustering. $E$ is defined as:

$$E(P, R) = 1 - \frac{(\beta^2 + 1)\,P R}{\beta^2 P + R}, \tag{3}$$

where $\beta$ is a factor that determines the relative importance of precision and recall, and where the precision and recall values are based on the documents in the cluster.

C. Evaluation of Document Cluster Information Retrieval Systems

Based on the hypothesis that "closely associated documents tend to be relevant to the same request" [4], some information retrieval systems employ document clustering in order to improve the retrieval of relevant documents.
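For reference, the measures (1) to (3) of Section II-B translate directly into a few lines of code. The following is only a minimal sketch: the function names are hypothetical, the example counts are made up for illustration, and returning 1.0 when both precision and recall are zero is an assumption, since (3) is undefined in that case.

```python
def precision(g_ret: float, r: float) -> float:
    """Eq. (1): share of the r retrieved documents that are relevant."""
    if r <= 0:
        raise ValueError("r must be positive")
    return g_ret / r


def recall(g_ret: float, g: float) -> float:
    """Eq. (2): share of the g relevant documents that were retrieved."""
    if g <= 0:
        raise ValueError("g must be positive")
    return g_ret / g


def effectiveness(p: float, r: float, beta: float = 1.0) -> float:
    """Eq. (3): E = 1 - (beta^2 + 1) P R / (beta^2 P + R); smaller is better."""
    denom = beta ** 2 * p + r
    if denom == 0:
        # Assumption: treat the undefined case P = R = 0 as the worst value.
        return 1.0
    return 1.0 - (beta ** 2 + 1) * p * r / denom


if __name__ == "__main__":
    # Hypothetical cluster: 6 documents, 4 of them relevant,
    # with 5 relevant documents in the whole result list.
    p = precision(4, 6)
    r = recall(4, 5)
    for beta in (0.5, 1.0, 2.0):
        print(f"beta={beta}: E={effectiveness(p, r, beta):.4f}")
```

With $\beta = 1$, $E$ is one minus the harmonic mean of precision and recall, so the hypothetical cluster above obtains $E = 1 - 8/11 = 3/11$.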
Hierarchical cluster methods in particular are very popular for text clustering, since hierarchies of clusters are supposed to be relatively intuitive to browse. Research on using document clustering techniques for information retrieval resulted in algorithms like Suffix Tree Clustering [7] and the DisCover Algorithm [11] as well as in the design of the Scatter/Gather System [12] and the CREDO System [13].

Usually, novel document clustering methods are compared to existing methods or to an inverted file search, based on recall and precision measures or some variants of these. A common evaluation methodology for such a comparison is the Optimal Cluster Search ([7], [8], [9], [12], [14]). This methodology evaluates every single cluster and then selects the cluster with the optimal value as a representative for the cluster hierarchy.

Text clustering for information retrieval can be done either statically on the whole collection of documents [6], [8] or in a query-specific manner for a small number of top-ranking documents returned by an inverted file search [7], [13]. In [9], Tombros, Villa and van Rijsbergen evaluate several cluster algorithms on different data sets with the goal of identifying the extent to which the number $n$ of top-ranking documents that are clustered influences the retrieval of relevant documents.

When creating document clusters, documents can either be limited to occur in only one leaf cluster of a hierarchy (which means they still appear in their parent clusters all the way to the root) or they are allowed to occur in multiple clusters. Note that forcing documents to appear in only one cluster is considered to be artificial [7], [12], since documents can contain more than one topic and a good cluster structure should reflect this.

III. COMPUTING EXPECTED VALUES FOR RANDOM CLUSTERING

When evaluating a text clustering algorithm for information retrieval with the optimal cluster evaluation methodology, one has to consider the impact of this methodology on the achieved results.
If an algorithm produces a large number of clusters and the best one is selected, then the critical question is how likely it would be to achieve such a result if the clusters were generated randomly. In this section, we review the mathematical background for computing the expectation of random clusterings. This expectation value can indicate whether the outcome of an experiment was likely to happen by chance or not. First, we explain how to compute the value for a random pick of pages from a set of documents. Then we address how a growing number of random picks influences this value.

A. One Random Pick

The probability distribution for the random pick scenario is a hypergeometric distribution: Given a finite set of items (documents) where some are good (relevant) and some are bad (irrelevant), we draw without replacement. Suppose we have a set of items consisting of $g$ good items and $b$ bad items. If we draw exactly $d$ items, the probability of drawing a number $i$ of good items is given by:

$$P(x = i) = \frac{\binom{g}{i}\binom{b}{d-i}}{\binom{g+b}{d}}. \tag{4}$$

The corresponding expected value amounts to:

$$\mu(d) = \sum_{i=0}^{d} i\,\frac{\binom{g}{i}\binom{b}{d-i}}{\binom{g+b}{d}} = \frac{d\,g}{g+b}. \tag{5}$$

Writing this expectation as a function of $d$, the total number of items drawn from the set of available items, emphasizes that, if $g$ and $b$ are fixed, it only depends on how many items are drawn.

B. Multiple Random Picks

The next step is to analyze to which extent the methodology of first generating multiple clusters and then selecting the best one changes the expected value of good items. Therefore, we need to express the likelihood of an outcome of a certain number of occurrences of relevant pages with regard to the number and size of clusters. Suppose we create $c$ clusters where each cluster contains $d$ items. Since the optimal cluster methodology selects the best cluster from a set of clusters, we need to describe the probability that a number $i$ of relevant pages is the maximum number of relevant pages in the set of $c$ clusters and that $i$ really occurs. It is given by:

$$P(x = i, c, d) = \frac{B(x = i, d, c)}{A(d, c)}, \tag{6}$$

where $A$ is the total number of different outcomes of choosing $c$ times the number $d$ of items and where $B$ is the number of different ways of choosing these items with $i$ being the maximum number of good items in the set of $c$ clusters and with $i$ really occurring. $A$ is given by:

$$A(d, c) = \binom{g+b}{d}^{c}, \tag{7}$$

while $B$ is given by:

$$B(x = i, d, c) = \sum_{j=1}^{c} \binom{c}{j}\, C(i, d)^{j}\, D(i, d)^{c-j}, \tag{8}$$

where $j$ denotes the number of clusters that actually contain $i$ good items. The first factor in the sum counts the ways of choosing which $j$ of the $c$ clusters contain $i$ good items. $C$ is the number of ways of drawing a cluster with exactly $i$ good items, while $D$ counts the number of ways of drawing a cluster with fewer than $i$ good items.
$C$ and $D$ are given by:

$$C(i, d) = \binom{g}{i}\binom{b}{d-i} \tag{9}$$

and

$$D(i, d) = \sum_{k=0}^{i-1} \binom{g}{k}\binom{b}{d-k}, \tag{10}$$

respectively. The expectation for multiple random draws is given by:

$$\mu_{all}(d, c) = \sum_{x=0}^{d} x\,P(x, c, d), \tag{11}$$

which, for fixed $g$ and $b$, only depends on the number of clusters and the number of pages in each cluster.

C. A Random Cluster Example

Given the result in (11), let us consider a simple example of how the optimal cluster evaluation method can distort the evaluation results for an information retrieval system. Suppose there are 100 documents, 10 of which are relevant. From these documents we always draw ten documents. Fig. 2 shows the expected number of relevant documents that the best draw (optimal cluster) contains in relation to the number of times that we draw. Note that a small number of clusters already yields quite a large number of relevant documents: Selecting the best cluster out of five random clusters yields an expected number of relevant documents that is more than twice as high as that of a single draw. Even this simple experiment underlines how important it is either to report detailed information on the experimental statistics (which unfortunately is not common practice) or to use an evaluation measure which incorporates them.

Fig. 2. Expectation values obtained from applying optimal search to randomly generated document clusters. (Horizontal axis: $c$, the number of clusters; vertical axis: expected number of good documents in the best cluster.)

IV. A NEW MEASURE FOR EVALUATION OF INFORMATION RETRIEVAL SYSTEMS

Recall and precision have been widely employed by the information retrieval community in order to evaluate information retrieval systems. We propose new measures called absolute precision and absolute recall that incorporate the knowledge of the expected value of a random retrieval with similar settings for a given scenario and experiment. We define them as:

$$P_{abs}(g_{ret}, g_{exp}, r) = \frac{g_{ret} - g_{exp}}{r} \tag{12}$$

$$R_{abs}(g_{ret}, g_{exp}, g) = \frac{g_{ret} - g_{exp}}{g} \tag{13}$$
Here, $g_{exp}$ denotes the number of relevant documents that can be expected to be retrieved by a random retrieval. Therefore, if the information retrieval system does not employ document clustering, the value of $g_{exp}$ can be computed by (5). For a system that does employ document clustering methods, (11) can be used to compute this value if the optimal cluster search is used for evaluation. These new values for precision and recall can then be used to compute the absolute effectiveness by slightly adjusting (3):

$$E_{abs} = E(P_{abs}, R_{abs}) \tag{14}$$

Other measures based on recall and precision can be derived in a similar fashion. Note that $P_{abs}$ and $R_{abs}$ are negative if a random retrieval can be expected to perform better than the actual experiment. In these cases, the values of $E_{abs}$ will be larger than 1.

V. EXPERIMENTS

In this section, we employ the theoretical model for computing expected values for clustering documents and compare them to the results reported by Tombros et al. [9] by using the new evaluation measure introduced above.

A. Experimental Settings

There are several document collections and cluster algorithms evaluated by Tombros et al., but for simplicity we shall consider only one collection and one algorithm. Since statistics such as average cluster size are not reported for every combination of algorithm and document collection, we focus on the WSJ text corpus¹ and the group average cluster algorithm, for which these statistics are most complete. The group average cluster algorithm is also the one which performed best in [9].

For computing the expected values of an optimal cluster evaluation, we need to know the number of documents $n$ that are clustered and the number of relevant documents $g$ among these documents. Furthermore, we need to know the average cluster size $d$ and the average number of generated clusters $c$. Unfortunately, Tombros et al. only report the first three of these values, while the average number of generated clusters is missing.
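To make the dependence on these four quantities concrete, the following minimal sketch evaluates (5) and (6) to (11) directly. It is an illustration under stated assumptions, not the code used for the experiments: the function names are hypothetical, $b$ is taken to be $n - g$, all clusters have the same integer size $d$, and clusters are drawn independently of each other, as implied by (7).

```python
from math import comb


def expected_single(g: int, b: int, d: int) -> float:
    """Eq. (5): expected number of good items in one random draw of d items."""
    return d * g / (g + b)


def prob_best(i: int, g: int, b: int, d: int, c: int) -> float:
    """Eq. (6): probability that the best of c random clusters of size d
    contains exactly i good items."""
    ways_exact = comb(g, i) * comb(b, d - i)                        # Eq. (9)
    ways_fewer = sum(comb(g, k) * comb(b, d - k) for k in range(i))  # Eq. (10)
    favourable = sum(                                                # Eq. (8)
        comb(c, j) * ways_exact ** j * ways_fewer ** (c - j)
        for j in range(1, c + 1)
    )
    total = comb(g + b, d) ** c                                      # Eq. (7)
    return favourable / total


def expected_best(g: int, b: int, d: int, c: int) -> float:
    """Eq. (11): expected number of good items in the best of c clusters."""
    return sum(i * prob_best(i, g, b, d, c) for i in range(d + 1))


if __name__ == "__main__":
    # Setting of Section III-C: 100 documents, 10 relevant, clusters of size 10.
    g, b, d = 10, 90, 10
    print("single draw:", expected_single(g, b, d))
    for c in (1, 2, 5, 10):
        print(f"best of {c} clusters:", expected_best(g, b, d, c))
```

For $c = 1$ the result coincides with (5), which serves as a quick sanity check. In the setting of Section III-C, the value for $c = 5$ comes out slightly above two, in line with the statement made there. The sums use exact integer arithmetic and convert to floating point only in the final division, which keeps the sketch usable for larger values of $c$.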
Only for their experiments with the full LISA collection² (6004 documents) do they report the number of clusters (6003). In order to be able to compute the expected values for the experiments, we assume a relatively low number of generated clusters for each value of $n$. Since we need integer values for $g$ and $d$, we round both values down to the next smaller integer. Since the expectation in (11) grows monotonically in these values, this will result in values that are slightly smaller than the real expected values for the experiments, favoring the results in [9]. The actual values for the settings in each experiment are shown in Table I.

¹ The Wall Street Journal (WSJ) text collection is part of the Tipster corpus, which can be obtained from …
² A collection of abstracts of the Library and Information Science Abstracts from 1982.

TABLE I
EXPERIMENTAL SETTINGS
Columns: scenario number | number retrieved | number relevant | generated clusters | average size

B. Experimental Results

With the above settings, we compute the expected number of relevant documents in the optimal cluster for a random experiment. These values are used to compute the absolute effectiveness. For each scenario, we give the effectiveness values (the smaller the better) reported by Tombros et al. as well as the values for the absolute effectiveness. Table II compares the values reported in their experiments to the values that can be expected with randomly generated clusters. Note that for small values of $n$, even though the reported effectiveness is relatively low, the absolute effectiveness is not. In fact, the absolute effectiveness for $n = 100$ top-ranking documents is higher than one, indicating that randomly generating a similar number of clusters with a similar size yields better results than the results reported by Tombros et al.

One of the goals of Tombros et al. was to determine the importance of the parameter $n$ for the retrieval. They concluded that this number has very little impact on the achieved effectiveness.
However, according to our experiments, it appears that one should avoid making any statement about the results achieved with small values of $n$, since the employed cluster algorithms did not manage to produce a significantly good cluster structure.

There are several issues concerning the question of whether this comparison is fair or not. First, a source of error in the computation of random clusters is that we take the average cluster size to compute expected values for relevant documents. Therefore, for settings of the parameter $\beta \neq 1$ in the expression used for computing effectiveness (see (3)), the effectiveness reported in [9] will always show a positive bias, since the size of optimal clusters changes with this parameter (as acknowledged by the authors).

Furthermore, we assume a hypergeometric distribution of documents in each cluster. This basically applies to cluster algorithms that allow documents to be included in more than one cluster at the same level of the hierarchy. The algorithms evaluated by Tombros et al., though, are agglomerative hierarchical cluster algorithms; therefore, a document is only included in one leaf node. However, when Tombros et al. evaluate cluster hierarchies, they do not limit their search to leaf nodes but also consider intermediate nodes; therefore, documents actually do occur in more than one node in the evaluated hierarchy.
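The absolute measures (12) to (14) used throughout this section amount to a small correction of the conventional ones. The sketch below illustrates this with made-up counts; it does not reproduce any value from [9]. The function names are hypothetical, $g_{exp}$ is passed in as a precomputed number, for instance obtained from (5) or (11), and returning 1.0 when the denominator of (3) vanishes is an assumption.

```python
def effectiveness(p: float, r: float, beta: float = 1.0) -> float:
    """Eq. (3); also accepts the possibly negative absolute measures."""
    denom = beta ** 2 * p + r
    if denom == 0:
        # Assumption: (3) is undefined here; report "no better than chance".
        return 1.0
    return 1.0 - (beta ** 2 + 1) * p * r / denom


def absolute_precision(g_ret: float, g_exp: float, r: float) -> float:
    """Eq. (12): precision corrected by the randomly expected hits g_exp."""
    return (g_ret - g_exp) / r


def absolute_recall(g_ret: float, g_exp: float, g: float) -> float:
    """Eq. (13): recall corrected by the randomly expected hits g_exp."""
    return (g_ret - g_exp) / g


def absolute_effectiveness(g_ret: float, g_exp: float, r: float, g: float,
                           beta: float = 1.0) -> float:
    """Eq. (14): E evaluated on absolute precision and absolute recall."""
    return effectiveness(absolute_precision(g_ret, g_exp, r),
                         absolute_recall(g_ret, g_exp, g), beta)


if __name__ == "__main__":
    # Hypothetical optimal cluster: 10 documents, 10 relevant documents
    # overall, and 2 relevant documents expected from random clusters.
    r, g, g_exp = 10, 10, 2.0
    for g_ret in (4, 2, 1):
        e = effectiveness(g_ret / r, g_ret / g)
        e_abs = absolute_effectiveness(g_ret, g_exp, r, g)
        print(f"g_ret={g_ret}: E={e:.2f}, E_abs={e_abs:.2f}")
```

In the last hypothetical case, the cluster contains fewer relevant documents than expected by chance, so both absolute measures are negative and $E_{abs}$ exceeds 1, which is the situation described above for small values of $n$.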
TABLE II
EXPERIMENTAL RESULTS
Columns: scenario number | expected number of relevant documents | reported effectiveness | absolute effectiveness (sub-columns β = 0.5, β = 1.0, β = 2.0, β = …)

VI. CONCLUSIONS AND FUTURE WORK

In this paper, we have shown that the commonly used methodology for evaluating the optimal cluster search approach to information retrieval may result in misleading interpretations. Still, we consider evaluating every single cluster and selecting the one with the optimal score to be a valid tool for examining clustering approaches to retrieval, provided that one takes into account that certain observations are basically due to the evaluation methodology.

For instance, in a recent survey on the effectiveness of clustering-based retrieval algorithms [9], the authors report that they did not find a statistically significant variation in query-specific cluster effectiveness for different numbers of top-ranked documents. From the rigorous theoretical analysis presented in this paper, however, we can conclude that the algorithms they evaluated had to perform poorly given the experimental setting and parameterization that was considered.

With examples like this in mind, our main contribution in this paper is therefore the definition of a new measure for evaluating information retrieval systems that accounts for statistically expectable results by normalizing the performance measure with respect to these. In a series of experiments that analyze the impact of the number of clustered documents on the cluster effectiveness, we have shown that the use of this new measure yields meaningful results. Compared to the figures obtained from conventional precision and recall measures, the absolute precision and absolute recall indicate, for instance, that no serious statement can be made for small numbers of clustered documents.
Since the influence of the number of top-ranking documents that are clustered on the effectiveness of the clustering was not sufficiently analyzed by [9], we will further investigate this influence based on the more appropriate performance measures which we introduced in this paper. Machine learning techniques that are employed for learning the parameters of a text cluster algorithm depend on an error function that indicates the goodness of a certain set of parameters. We will compare the training results achieved by error functions based on conventional performance indicators to those achieved with error functions based on evaluation measures derived from our model of expectation values for random clustering.

REFERENCES

[1] G. Kowalski, Information Retrieval Systems: Theory and Implementation, Kluwer, Norwell, MA.
[2] M.T. Maybury, Ed., Intelligent Multimedia Information Retrieval, MIT Press, Cambridge, MA.
[3] G. Salton, Introduction to Modern Information Retrieval, McGraw Hill, New York, NY.
[4] C.J. van Rijsbergen, Information Retrieval, Butterworth-Heinemann, Newton, MA.
[5] G. Salton, Automatic Information Organization and Retrieval, McGraw Hill, New York, US.
[6] W.B. Croft and C.J. van Rijsbergen, Document clustering: An evaluation of some experiments with the Cranfield 1400 collection, Information Storage and Retrieval, vol. 11.
[7] O. Zamir and O. Etzioni, Web document clustering: A feasibility demonstration, in Proc. SIGIR, 1998, ACM.
[8] N. Jardine and C.J. van Rijsbergen, The use of hierarchic clustering in information retrieval, Information Storage and Retrieval, vol. 7.
[9] A. Tombros, R. Villa, and C.J. van Rijsbergen, The effectiveness of query-specific hierarchic clustering in information retrieval, Inf. Process. Manage., vol. 38, no. 4.
[10] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data, Prentice Hall, Upper Saddle River, NJ.
[11] K. Kummamuru, R. Lotlikar, S. Roy, K. Singal, and R. Krishnapuram, A hierarchical monothetic document clustering algorithm for summarization and browsing search results, in Proc. Int. Conf. on World Wide Web, 2004, ACM.
[12] M.A. Hearst and J.O. Pedersen, Reexamining the cluster hypothesis: Scatter/gather on retrieval results, in Proc. SIGIR, ACM.
[13] C. Carpineto and G. Romano, Exploiting the potential of concept lattices for information retrieval with CREDO, J. UCS, vol. 10, no. 8.
[14] H. Schütze and C. Silverstein, Projections for efficient document clustering, in Proc. SIGIR '97, 1997, Classification Methods.