An Algorithm For Fast Median Search

Transcription

1 An Algorithm For Fast Median Search A. Juan and E. Vidal Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Valencia, Spain. ajuan@dsic.upv.es Abstract VII National Symposium on Pattern Recognition and Image Analysis, proc. pp , 1997 Searching for a median of a set of patterns is a well-known technique to model the set or to aid searching for more accurate models such as a generalized median or a k-median. While medians and generalized medians are (relatively) easy to compute in the case of Euclidean representation spaces, this is unfortunately no longer true when more complex distance measures are to be used, which often happens in practical situations. In these cases, a direct method to perform median search is not feasible for large sets of patterns. In order to cope with this computational problem, an algorithm for median search is proposed which is faster than the direct method in terms of distance computations. Key Words: Median, Fast Search, Distance, Metric Spaces, Clustering, Nearest-Neighbour techniques. 1 Introduction Assume that we are given a set of patterns in an arbitrary (non-vector) representation space and a distance function to measure the dissimilarity between any pair of them. A well-known technique for modelling the given set consists of finding a pattern of the set whose sum of distances to all patterns is minimum. Such a pattern is called a median. The concept of median is related to a more general concept known as generalized median. This generalization arises when the search is not constrained to the given set, but extended to the whole pattern representation space. Although a generalized median constitutes a more accurate model than a median, finding such a model is often a difficult task that requires complex and specific (approximate) algorithms [2, 1]. Obviously, here we are excluding trivial cases like that of computing the generalized median of a set of patterns in a Euclidean representation space. In these cases where such computation is not trivial, a median can be used as an approximate generalized median or as a starting point for finding better ones [6]. Another generalization of the concept of median is known as k-median. As its name suggests, searching for a -median consists of finding a subset of patterns (models) such that the sum of distances between all the patterns and their corresponding nearest models is minimum. This problem, well-known as a prototype location problem, is NP-Hard [8]. On the other hand, it can be seen as a clustering problem since each possible subset of patterns induces a partition of the set into clusters and a measure of its quality. Good partitions are generally found using a simple k-means-like algorithm which starts on a given initial partition and then loops searching for a median of each cluster and reclassifying patterns according to their nearest medians [5]. Thus, the concept of median also turns out to be useful in finding more accurate models. Searching for a median of a set of Ò patterns is a relatively easy task since it suffices to compute ½ ¾ Ò Ò ½µ pairwise distances between patterns. However, this naive approach is no longer applicable when large sets of patterns are compared through complex distance functions like, for instance, those provided by algorithms of Sequence Comparison [9]. In order to cope with this computational problem, algorithms for fast median search are required but, unfortunately, available techniques are only useful in very particular cases [3, 7].

2 In [4] we proposed an algorithm for fast median search whose complexity is smaller than that of the direct approach both, measured in terms of distance computations, and measured in terms of time not alloted to distance computing (overhead). This algorithm was tested on artificial data and also embedded in a -means-like algorithm which was applied to a corpus of speech data [4]. Although it was found effective in reducing the number of computed distances, greater savings are desirable in practice. Here we propose a faster algorithm in terms of distance computations. A simulation experiment is reported supporting this claim. 2 The Algorithm Let Á ½ Ò be the set of indices associated with a set of Ò different patterns Ü ½ Ü ¾ Ü Ò in a (given) representation space. A metric or distance function is assumed to be given; that is, a function Á Á Ê that satisfies the following conditions µ ¼ (1) µ ¼ if (2) µ µ (3) µ µ µ (4) for all ¾ Á. The sum of distances between a pattern and all the patterns of a given set Ë Á is denoted as Ë µ ¼ ¾Ë if Ë µ if Ë (5) This value will be called the sum on Ë of or simply the sum of if Ë Á. In order to introduce the key idea which enables fast median search, let us suppose that the sums of some patterns have already been calculated. These patterns will be referred as the set Í of used data. As a by-product of this calculation, it is also assumed that the distances between used and unused patterns have been saved in a matrix, and that a pattern ¾ Í whose sum is minimum among those calculated so far has been chosen. Applying the triangle inequality to all ¾ Á we have µ µ µ µ µ µ µ µ µ µ Then, a lower bound of the sum of any unused pattern, Í µ, can be computed as Í µ Í µ ¾Á Í Í µ (6) where ¼ Í µ ÑÜ µ µ ¾Í if Í if Í (7) Notice that all distances involved in the calculation of (6) are available (have already been computed in previous steps). Obviously, we can avoid the calculation of the sum of any unused pattern whose corresponding lower bound is greater than or equals to.

3 Three issues have to be addressed before developing an algorithm based on (6). First, it is plain that we have to distinguish between the unused patterns whose lower bounds are smaller than and the rest of unused patterns. Only the former must be taken into account for sum calculation, so we call them the set of alive data. The rest of unused patterns will be referred as the set of eliminated data. An interesting question is which technique should we follow so as to calculate (6) without extra distance computations and as fast as possible in terms of overhead. It has been assumed that this calculation is carried out with the aid of a matrix where all distances between used and unused patterns are stored. Unfortunately, this technique is very expensive since the size of the required matrix is Í µ ¾ Ç Ò ¾ µ and each calculation of (6) takes Í µí ¾ Ç Ò ¾ µ steps. Let us see an alternative technique which dims the computational burden devoted to this calculation. Assume that we have an array ¾ R Ò where the sums on Í of all alive patterns are stored, and a matrix À ¾ R Ò Ò which contains the estimates provided by (7) between each alive and unused patterns. If a new pattern is selected for sum calculation, the distances involved in the computation of this sum can be used to calculate (6) for each alive pattern as Í µ µ ¾ ÑÜ À µ µµ Although the use of À keeps the space complexity at Ç Ò ¾ µ, this technique is more efficient since each calculation of (6) now takes ¾ Ç Òµ steps. The third issue we are concerned with is about choosing a criterion to select an alive pattern for sum calculation. Since we are interested in quickly choosing a median, it is clear that a natural criterion consists of selecting an alive pattern whose associated lower bound is minimum. The algorithm described in figure 1 searches for a median on the basis of the ideas just discussed. Its space complexity is Ò ¾ µ due to the use of the matrix À. Notice that this matrix can be implemented as a triangular array of size ½ Ò Ò ½µ since (7) fulfils conditions (1) and (3). ¾ Obviously, the time complexity of the algorithm depends on the nature of the distance function adopted. A worst case arises when the discrete metric is used, µ ¼ ½ if if In this case, the estimates given by (7) are the most pessimistic (zero), (6) reduces to sums on Í and no alive patterns are eliminated. As a result, the process ends computing as many distances as the direct method ( ½ ¾ Ò Ò ½µ) and the overhead becomes Ç Ò µ. Suppose now that patterns are represented as Ò different real numbers which are compared through the usual distance, µ Ü Ü where Ü Ü ¾ R ½ Ò. If the first pattern selected for sum calculation is the index of either the minimum or the maximum of these numbers, exact estimates will be given by (7) and thereby (6). Next, a median will be selected making alive patterns into eliminated and, consequently, directly leading to the end of the algorithm. This computational behaviour is optimal since it can be shown that the algorithm can not stop after the calculation of a single sum if Ò ¾. Therefore, although its overhead is Å Ò ¾ µ, it is worth noting that the algorithm can find a median by computing as little as ¾Ò distances. 3 A Simulation Experiment A simulation experiment was carried out on a ¼ (½¼¼ MHz clocked) PC so as to empirically test the computational behaviour of the algorithm. The experiment was based on the generation of sets of Ò

4 Algorithm for Fast Median Search Input A data set Á ½ Ò and a metric Á Á Ê. Output A median ¾ Á and its associated sum ¾ Ê. Variables Í Á; ¾ Ê Ò ; À ¾ Ê Ò Ò ; and ¾ Á. Method Initialize ½, Í, Á, ¼ and À ¼. while Î ¼ do Select Ö ÑÒ ¾ µ and transfer it from to Í. for all ¾ do Compute µ and update and. if then Update and. end if for all ¾ do Let. for all ¾ do Update À ÑÜ À µ and À. if then Transfer from to. end if end while Figure 1: Algorithm for Fast Median Search. artificial patterns (Ò ½ ¾ ½¼¾) randomly drawn from a uniform distribution in the unit - dimensional hypercube ( ¾ ½¼), and compared through the Euclidean distance. Although the algorithm is not intended to deal with patterns in a Euclidean space, we chose such kind of patterns since they are easy to generate. On the other hand, the algorithm never took advantage of their coordinates, but only the distances were used. For each value of Ò and, the algorithm was tested on ½¼¼ independently generated sets of patterns and the resulting numbers of distance computations were averaged so as to get an estimate of the mean number of distances computed by the algorithm. A total of estimates were obtained for different values of Ò and. From these estimates, the percentages with respect to ½ Ò Ò ½µ (number of distances ¾ computed by the direct method) are plotted in figure 2 together with their corresponding ± confidence intervals. In terms of distance computations, these results are better than those (not reported here) of our previous fast algorithm [4] tested on the very same conditions. Although for large dimensions only modest savings are obtained, it should be pointed out that these savings tend to increase as Ò increases and they can be greater than ¼± for low dimensions. On the other hand, the overhead of the new algorithm is significantly higher than that of the previous algorithm [4], which makes the new version only appropriate for computationally expensive distance functions.

5 100 Æ± ½¼ ¾ Ò Figure 2: Average number of distances computed by the new algorithm (together with its corresponding ± confidence interval), expressed as a percentage of the number of distances computed by the direct method (Æ±). Each value was obtained by running the new algorithm on ½¼¼ independent sets of Ò artificial patterns randomly drawn from a uniform distribution in the unit -dimensional hypercube, and compared through the Euclidean distance. 4 Conclusions An algorithm for median search has been proposed which is faster than a previously proposed algorithm [4] in terms of distance computations. Its main application is to be used as a tool for modelling a set of patterns which are compared through a computationally expensive distance function. References [1] F. Casacuberta and M. D. de Antonio. A greedy algorithm for computing approximate Median Strings. In revision, [2] F. Casacuberta and C. de la Higuera. The topology of strings: two NP-Complete problems. In revision, [3] E. Horowitz and S. Sahni. Fundamentals of Computers Algorithms. Computer Science, [4] A. Juan and E. Vidal. An optimized Ã-means-like algorithm. In 4th Portuguese Conference on Pattern Recognition, pages 11 16, Coimbra, Portugal, March [5] A. Juan and E. Vidal. Fast -Means-like Clustering in Metric Spaces. Pattern Recognition Letters, 15(1):19 25, [6] T. Kohonen. Median Strings. Pattern Recognition Letters, 3: , [7] P. B. Mirchandani. The Ô-Median Problem and Generalizations. In Mirchandani and Francis [8], pages [8] P. B. Mirchandani and R. L. Francis, editors. Discrete Location Theory. Wiley, [9] D. Sankoff and J. B. Kruskal, editors. Time Warps, String Edits, and Macromolecules: the Theory and Practice of Sequence Comparison. Addison-Wesley, 1983.