A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment

Panagiotis D. Michailidis and Konstantinos G. Margaritis
Parallel and Distributed Processing Laboratory
Department of Applied Informatics, University of Macedonia
156 Egnatia str., P.O. Box 1591, 54006 Thessaloniki, Greece
{panosm,kmarg}@uom.gr
http://macedonia.uom.gr/~{panosm,kmarg}

Abstract. In this paper, we present three parallel approximate string matching methods on a parallel architecture with heterogeneous workstations to gain supercomputer power at low cost. The first method is the static master-worker with uniform distribution strategy, the second one is the dynamic master-worker with allocation of subtexts and the third one is the dynamic master-worker with allocation of text pointers. Further, we propose a hybrid parallel method that combines the advantages of the static and dynamic parallel methods in order to reduce the load imbalance and communication overhead. This hybrid method is based on the following optimal distribution strategy: the text collection is distributed in proportion to each workstation's speed. We evaluated the performance of the four methods on clusters of 1, 2, 4, 6 and 8 heterogeneous workstations. The experimental results demonstrate that the dynamic allocation of text pointers and the hybrid method achieve better performance than the two original ones.

1 Introduction

Approximate string matching is one of the main problems in classical string algorithms, with applications to information and multimedia retrieval, computational biology, pattern recognition, Web search engines and text mining. It is defined as follows: given a large text collection t = t_1 t_2 ... t_n of length n, a short pattern p = p_1 p_2 ... p_m of length m and a maximal number of errors allowed k, we want to find all text positions where the pattern matches the text with up to k errors. An error is the substitution, deletion or insertion of a character.
In the on-line version of the problem, it is possible to preprocess the pattern but not the text collection. The classical solution involves dynamic programming and needs O(mn) time [14]. Recently, a number of sequential algorithms have improved on the classical time-consuming one; see for instance the surveys [7,11]. Some of them are sublinear in the sense that they do not inspect all the characters of the text collection.

D. Kranzlmüller et al. (Eds.): Euro PVM/MPI 2002, LNCS 2474, pp. 432-440, 2002. © Springer-Verlag Berlin Heidelberg 2002
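For concreteness, the classical O(mn) dynamic programming solution [14] can be sketched in C as follows. It keeps a single column of edit distances and sweeps it across the text; the function name `count_matches` and the m < 256 bound are choices made for this sketch, not details from the paper:

```c
#include <assert.h>
#include <string.h>

/* Count the text positions where the pattern matches with at most k
 * errors, via the classical O(mn) dynamic programming recurrence [14].
 * C[i] holds the minimal edit distance between p[0..i-1] and some
 * substring of t ending at the current text position. */
int count_matches(const char *t, const char *p, int k) {
    int n = (int)strlen(t), m = (int)strlen(p);
    int C[256];                      /* sketch assumes m < 256 */
    int count = 0;
    for (int i = 0; i <= m; i++) C[i] = i;
    for (int j = 0; j < n; j++) {
        int prev = C[0];             /* old value of the cell below */
        C[0] = 0;                    /* a match may start at any text position */
        for (int i = 1; i <= m; i++) {
            int tmp = C[i];
            int best = prev + (p[i-1] != t[j]);        /* match/substitution */
            if (C[i-1] + 1 < best) best = C[i-1] + 1;  /* insertion */
            if (C[i]   + 1 < best) best = C[i]   + 1;  /* deletion  */
            C[i] = best;
            prev = tmp;
        }
        if (C[m] <= k) count++;      /* an occurrence ends at position j */
    }
    return count;
}
```

A position j is reported when the last cell drops to k or below, i.e. when some substring of t ending at j is within k edits of p.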
We are particularly interested in information retrieval, where current free text collections are normally so large that even the fastest on-line sequential algorithms are not practical, and therefore parallel and distributed processing becomes necessary. There are two basic methods to improve the performance of approximate string matching on large text collections: one is based on fine-grain parallelization of the approximate string matching algorithm [2,12,13,6,4,5] and the other is based on distributing the computation of character comparisons over supercomputers or networks of workstations. As far as the second method is concerned, distributed implementations of approximate string matching algorithms are not available in the literature. However, we are aware of a few attempts at implementing other similar problems on a cluster of workstations. In [3] an exact string matching implementation has been proposed and results are reported on a transputer-based architecture. In [9,10] an exact string matching algorithm was parallelized and modeled on a homogeneous platform, giving positive experimental results. Finally, [5,16] presented parallelizations of a biological sequence analysis algorithm on a homogeneous cluster of workstations and on an Intel iPSC/860 parallel computer respectively. General efficient algorithms for the master-worker paradigm on heterogeneous clusters, however, have been developed in [1]. The main contribution of this work is three low-cost parallel approximate string matching approaches that can search very large free textbases on an inexpensive cluster of heterogeneous PCs or workstations running the Linux operating system. These approaches are based on the master-worker model using static and dynamic allocation of the text collection.
Further, we propose a hybrid parallel approach that combines the advantages of the three previous parallel approaches in order to reduce the load imbalance and communication overhead. This hybrid approach is based on the following optimal distribution strategy: the text collection is distributed in proportion to each workstation's speed. The four approaches are implemented using the MPI library [15] over a cluster of heterogeneous workstations. To the best of our knowledge, this is the first attempt at implementing an approximate string matching application using static and dynamic load balancing strategies on a network of heterogeneous workstations.

2 MPI Master-Worker Implementations of Approximate String Matching

We follow the master-worker programming model to develop our parallel and distributed approximate string matching implementations under the MPI library [15].

2.1 Static Master-Worker Implementation

In order to present the static master-worker implementation we make the following assumptions: First, the workstations are numbered from 0 to p-1; second, the documents of our text collection are distributed among the various workstations and stored on their local disks; and finally, the pattern and the number
of errors k are stored in the main memory of all workstations. The partitioning strategy of this approach is to partition the entire text collection into a number of subtext collections according to the number of workstations allocated. The size of each subtext collection should be equal to the size of the text collection divided by the number of allocated workstations. The static master-worker implementation, which is called P1, is composed of four phases. In the first phase, the master broadcasts the pattern string and the number of errors k to all workers. In the second phase, each worker reads its subtext collection from the local disk into main memory. In the third phase, each worker performs character comparisons using a local sequential approximate string matching algorithm to generate the number of occurrences. In the fourth phase, the master collects the number of occurrences from each worker. The advantage of this simple approach is low communication overhead: the search computation is partitioned a priori, so each worker searches its own subtext independently without having to communicate with the other workers or the master. However, the main disadvantage is possible load imbalance because of the poor partitioning technique. In other words, there is significant idle time for the faster or more lightly loaded workstations in a heterogeneous environment.

2.2 Dynamic Master-Worker Implementations

In this subsection, we implement two versions of the dynamic master-worker model. The first version is based on the dynamic allocation of subtexts and the second one is based on the dynamic allocation of text pointers.

Dynamic Allocation of Subtexts The dynamic master-worker strategy that we adopted is a well-known parallelization strategy, also known as a workstation farm.
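As a concrete baseline, the uniform split behind P1 can be sketched in C. The paper does not say how occurrences straddling a subtext boundary are handled; the (m-1)-character overlap below is an assumption of this sketch, one common way to ensure that a match crossing a cut is still seen in full by one worker:

```c
#include <assert.h>

/* Uniform static partitioning in the spirit of P1 (a sketch).
 * Worker w of p receives offset[w] and len[w] over an n-character text.
 * Each subtext except the last is extended by m-1 characters (our
 * assumption, not stated in the paper) so matches crossing a boundary
 * are not lost. */
void uniform_partition(long n, int p, int m, long *offset, long *len) {
    long base = n / p;
    for (int w = 0; w < p; w++) {
        offset[w] = (long)w * base;
        if (w == p - 1)
            len[w] = n - offset[w];      /* last worker takes the remainder */
        else
            len[w] = base + (m - 1);     /* overlap into the next subtext */
        if (offset[w] + len[w] > n)
            len[w] = n - offset[w];      /* clamp at the end of the text */
    }
}
```

With n = 100, p = 4 and m = 10, every worker receives 25 characters plus a 9-character overlap, except the last, which runs to the end of the text.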
Before we present the dynamic implementation, we make the following assumption: the entire text collection is stored on the local disk of the master workstation. The dynamic master-worker implementation, which is called P2, is composed of six phases. In the first phase, the master broadcasts the pattern string and the number of errors k to all workers. In the second phase, the master reads several chunks of the text collection from the local disk. The size of each chunk (sb) is an important parameter which can affect the overall performance. More specifically, this parameter is directly related to the I/O and communication factors. We tried several chunk sizes in order to find the best performance, as presented in our experiments [8]. In the third phase, the master sends the first chunks of the text collection to the corresponding worker workstations. In the fourth phase, each worker workstation performs a sequential approximate string matching algorithm between the corresponding chunk of text and the pattern in order to generate the number of occurrences. In the fifth phase, each worker sends the number of occurrences back to the master workstation. In the sixth phase, if there are still chunks of the text collection left, the master reads and distributes the next chunks of the text collection to the workers and loops back to the fourth phase.
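The load-balancing effect of this farm can be illustrated with a toy makespan simulation in C: under a uniform static split the slowest workstation dictates the finishing time, while handing out chunks of sb characters on demand keeps every workstation busy. The speeds used here are illustrative values for the sketch, not the paper's measurements, and communication cost is ignored:

```c
#include <assert.h>

/* Makespan of a uniform static split: each of p workers searches n/p
 * characters at its own speed (chars per time unit). */
double static_makespan(long n, const double *speed, int p) {
    double worst = 0.0;
    for (int i = 0; i < p; i++) {
        double t = ((double)n / p) / speed[i];
        if (t > worst) worst = t;
    }
    return worst;
}

/* Makespan of a chunk farm: each chunk of sb characters goes to the
 * worker that becomes idle earliest (sketch assumes p <= 16 and
 * ignores communication cost). */
double dynamic_makespan(long n, long sb, const double *speed, int p) {
    double busy[16] = {0.0};          /* time at which worker i becomes idle */
    for (long done = 0; done < n; done += sb) {
        long chunk = (n - done < sb) ? n - done : sb;
        int next = 0;
        for (int i = 1; i < p; i++)
            if (busy[i] < busy[next]) next = i;
        busy[next] += (double)chunk / speed[next];
    }
    double worst = 0.0;
    for (int i = 0; i < p; i++)
        if (busy[i] > worst) worst = busy[i];
    return worst;
}
```

With two workers of speeds 2.0 and 1.0 chars per unit and n = 1000, the static split finishes at time 500 (the slow worker's half), while the farm with sb = 100 finishes earlier because the fast worker absorbs more chunks.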
The advantage of this dynamic approach is low load imbalance, while the disadvantage is higher inter-workstation communication overhead.

Dynamic Allocation of Text Pointers Before we present the dynamic implementation with text pointers, we make the following assumptions: First, the complete text collection is stored on the local disks of all workstations; and second, the master workstation holds a text pointer that marks the current position in the text collection. The dynamic allocation of text pointers, which is called P3, is composed of six phases. In the first phase, the master broadcasts the pattern string and the number of errors k to all workers. In the second phase, the master sends the first text pointers to the corresponding workers. In the third phase, each worker reads from its local disk the sb characters of text starting from the pointer it receives. In the fourth phase, each worker performs a sequential approximate string matching procedure between the corresponding chunk of text and the pattern in order to generate the number of occurrences. In the fifth phase, each worker sends the result back to the master. In the sixth phase, if the text pointer has not reached the end of the text, the master advances the text pointers to the positions of the next chunks of text, sends the pointers to the workers and loops back to the third phase. The advantage of this implementation is that it reduces the inter-workstation communication overhead, since each workstation in this scheme holds an entire copy of the text collection on its local disk. This scheme requires more local disk space, but the local disks in parallel and distributed architectures are usually large enough.

2.3 Hybrid Master-Worker Implementation

Here, we develop a hybrid master-worker implementation that combines the advantages of the static and dynamic approaches in order to reduce the load imbalance and communication overhead.
This implementation is based on an optimal distribution strategy for the text collection that is performed statically. In the following subsection, we describe the optimal text distribution strategy and its implementation.

Text Distribution and Load Balancing To prevent the slowest workstations from determining the parallel string matching time, the load should be distributed in proportion to the capacity of each workstation. The goal is to assign the same amount of time to each workstation, which may not correspond to the same amount of the text collection. A balanced distribution is achieved by a static load distribution made prior to the execution of the parallel operation. To achieve a well-balanced distribution among heterogeneous workstations, the fraction of the text distributed to each workstation should be proportional to its processing capacity relative to the entire network:

l_i = S_i / Σ_{j=0}^{p-1} S_j    (1)
where S_j is the speed of workstation j. Therefore, the amount of the text collection that is distributed to each workstation M_i (1 ≤ i ≤ p) is l_i · n, where n is the length of the complete text collection. The hybrid implementation, which is called P4, is the same as the P1 implementation except that we use the optimal distribution method instead of the uniform one. All four parallel implementations are constructed so that alternative sequential approximate string matching algorithms can be substituted quite easily [7,11]. In this paper, we use the classical SEL dynamic programming algorithm [14].

3 Experimental Results

In this section, we discuss the experimental results for the performance of the four parallel and distributed algorithms. These algorithms are implemented in the C programming language using the MPI library [15].

3.1 Experimental Environment

The target platform for our experimental study is a cluster of heterogeneous workstations connected by a 100 Mb/s Fast Ethernet network. More specifically, the cluster consists of 4 Pentium MMX 166 MHz machines with 32 MB RAM and 6 Pentium 100 MHz machines with 64 MB RAM. A Pentium MMX is used as the master workstation. The average speeds of the two types of workstations, Pentium MMX and Pentium, for the four implementations are listed in Table 1. The MPI implementation used on the network is MPICH version 1.2. During all experiments, the cluster of workstations was dedicated. Finally, to get reliable performance results, 10 executions were made for each experiment and the reported values are the averages. The text collection we used was composed of documents which were portions of various Web pages.

3.2 Experimental Results

In this subsection, we present the experimental results drawn from two sets of experiments. For the first experimental setup, we study the performance of the four master-worker implementations P1, P2, P3 and P4.
For the second experimental setup, we examine the scalability of our implementations by doubling the text collection.

Table 1. Average speeds (in chars per sec) of the two types of workstations

Application   Pentium MMX   Pentium
P1, P4        369367.74     2200079.448
P2            373988.176    2194043.93
P3            37370.        2197613.22
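Given per-application workstation speeds such as those in Table 1, the proportional shares of Eq. (1) translate directly into character counts per workstation. A C sketch follows; handing the truncation remainder to the last workstation is an implementation detail assumed here, and the speeds in the example are illustrative:

```c
#include <assert.h>

/* Optimal static distribution of Eq. (1): workstation i receives
 * roughly l_i * n characters, with l_i = S_i / sum_j S_j. Characters
 * lost to integer truncation go to the last workstation so the shares
 * cover the whole text (our assumption). */
void proportional_shares(long n, const double *S, int p, long *share) {
    double total = 0.0;
    long assigned = 0;
    for (int i = 0; i < p; i++) total += S[i];
    for (int i = 0; i < p - 1; i++) {
        share[i] = (long)((double)n * (S[i] / total));
        assigned += share[i];
    }
    share[p - 1] = n - assigned;       /* remainder keeps the sum exact */
}
```

A workstation three times as fast as its partner thus receives three quarters of the text, so both are expected to finish at about the same time.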
Comparing the Four Types of Approximate String Matching Implementations Before presenting the results for the four methods, we determined from an extensive experimental study [8] that a block size near sb=100,000 characters produces optimal performance for the two dynamic master-worker methods P2 and P3; the later experiments are all performed using this optimal value for P2 and P3. Further, from [8] we observed that the worst performance is obtained for very small and very large values of the block size. This is because small block sizes increase the inter-workstation communication, while large block sizes produce a poorly balanced load. Figures 1 and 2 show the execution times and the speedup factors with respect to the number of workstations, respectively. It is important to note that the execution times and speedups plotted in Figures 1 and 2 are averages over five pattern lengths (m=5, 10, 20, 30 and 60) and four values of the number of errors (k=1, 3, 6 and 9). The speedup of a heterogeneous computation is defined as the ratio of the sequential execution time on the fastest workstation to the parallel execution time across the heterogeneous cluster. To have a fair comparison in terms of speedup, one defines the system computing power, which considers the computing power available instead of the number of workstations. The system computing power is defined as Σ_{i=0}^{p-1} S_i/S_0 for p workstations used, where S_0 is the speed of the master workstation. As we expected, the performance results show that the P1 implementation using the static load balancing strategy is less effective than the other three implementations in the case of a heterogeneous network. This is due to the waiting time associated with communication: the slowest workstation is always the last one to finish its part of the string matching computation.
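The system computing power defined above is straightforward to compute; the speeds in this example are illustrative values, not the measured speeds of Table 1:

```c
#include <assert.h>

/* System computing power: sum over the p workstations of S_i / S_0,
 * where S_0 is the master's speed. With identical machines this
 * reduces to p, so speedup comparisons stay fair on a mixed cluster. */
double computing_power(const double *S, int p) {
    double power = 0.0;
    for (int i = 0; i < p; i++)
        power += S[i] / S[0];
    return power;
}
```

For example, a master at speed 2.0 with workers at 2.0, 1.0 and 1.0 gives a computing power of 3.0 rather than 4: the ideal heterogeneous speedup on such a cluster is 3.0, not the workstation count.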
Further, the P2 implementation using dynamic allocation of subtexts produces better results than the P1 one in the case of a heterogeneous cluster. Finally, the experimental results show that the P3 and P4 implementations have the best performance of all in the case of a heterogeneous cluster. These implementations give smaller execution times and higher speedups than the P1 and P2 ones when the network becomes heterogeneous, i.e. after the 3rd workstation.

Fig. 1. Experimental execution times (in seconds) for text size of 13MB and k=3 using several pattern lengths (left) and m=10 using several values of k (right)

Fig. 2. Speedup of parallel approximate string matching with respect to the number of workstations for text size of 13MB and k=3 using several pattern lengths (left) and m=10 using several values of k (right)

We now examine the performance of the P2, P3 and P4 parallel implementations. From the results, we see a clear reduction in the computation time of the algorithm when we use the three parallel implementations. For instance, with k=3 and several pattern lengths, we reduce the average computation time from 95.08 seconds in the sequential version to 17.206, 16.176 and 16.040 seconds in the distributed implementations P2, P3 and P4 respectively, using 8 workstations. In other words, from Figure 1 we observe that for constant total text size there is the expected inverse relation between the parallel execution times and the number of workstations. Further, the three master-worker implementations achieve reasonable speedups for all workstations. For example, with k=3 and several pattern lengths, the speedup curves increase up to about 5.52, 5.86 and 5.91 for the distributed methods P2, P3 and P4 respectively on the 8 workstations, which have a computing power of 5.92, 5.93 and 5.97.

Scalability Issue To study the scalability of the three proposed parallel implementations P2, P3 and P4, we set up the experiments in the following way. We simply double the old text size: the new text collection is around 27MB. Results from these experiments are depicted in Figures 3 and 4. The results show that the three parallel implementations still scale well even though the problem size has been doubled.
The average execution times for k=3 and several pattern lengths similarly decrease to 34.063, 32.298 and 32.10 seconds for the P2, P3 and P4 implementations respectively when the number of workstations is increased to 8. Moreover, the speedup factors of the three methods also increase nearly linearly as workstations are added. Finally, the best performance results are obtained with the P3 and P4 load balancing methods.
Fig. 3. Experimental execution times (in seconds) for text size of 27MB and k=3 using several pattern lengths (left) and m=10 using several values of k (right)

Fig. 4. Speedup of parallel approximate string matching with respect to the number of workstations for text size of 27MB and k=3 using several pattern lengths (left) and m=10 using several values of k (right)

4 Conclusions

In this paper, we have presented four parallel and distributed approximate string matching implementations and their performance results on a low-cost cluster of heterogeneous workstations. We have observed from this extensive study that the P3 and P4 implementations produce better performance results in terms of execution times and speedups than the others. Higher gains in performance are expected for a larger number of varying-speed workstations in the network. Variants of the approximate string matching algorithm can be implemented directly on a cluster of heterogeneous workstations using the four text distribution methods reported here. We plan to develop a theoretical performance model in order to confirm the experimental behaviour of the four implementations on a heterogeneous cluster. Further, this model can be used to predict the execution time and similar performance metrics for the four approximate string matching implementations on larger clusters and problem sizes.
References

1. O. Beaumont, A. Legrand and Y. Robert, The master-slave paradigm with heterogeneous processors, Report LIP RR-2001-13, 2001.
2. H. D. Cheng and K. S. Fu, VLSI architectures for string matching and pattern matching, Pattern Recognition, vol. 20, no. 1, pp. 125-141, 1987.
3. J. Cringean, R. England, G. Manson and P. Willett, Network design for the implementation of text searching using a multicomputer, Information Processing and Management, vol. 27, no. 4, pp. 265-283, 1991.
4. D. Lavenier, Speeding up genome computations with a systolic accelerator, SIAM News, vol. 31, no. 8, pp. 6-7, 1998.
5. D. Lavenier and J. L. Pacherie, Parallel processing for scanning genomic databases, in Proc. PARCO 97, pp. 81-88, 1997.
6. K. G. Margaritis and D. J. Evans, A VLSI processor array for flexible string matching, Parallel Algorithms and Applications, vol. 11, no. 1-2, pp. 45-60, 1997.
7. P. D. Michailidis and K. G. Margaritis, On-line approximate string searching algorithms: Survey and experimental results, International Journal of Computer Mathematics, vol. 79, no. 8, pp. 867-888, 2002.
8. P. D. Michailidis and K. G. Margaritis, Performance evaluation of load balancing strategies for approximate string matching on a cluster of heterogeneous workstations, Tech. Report, Dept. of Applied Informatics, University of Macedonia, 2002.
9. P. D. Michailidis and K. G. Margaritis, String matching problem on a cluster of personal computers: Experimental results, in Proc. of the 15th International Conference Systems for Automation of Engineering and Research, pp. 71-75, 2001.
10. P. D. Michailidis and K. G. Margaritis, String matching problem on a cluster of personal computers: Performance modeling, in Proc. of the 15th International Conference Systems for Automation of Engineering and Research, pp. 76-81, 2001.
11. G. Navarro, A guided tour to approximate string matching, ACM Computing Surveys, vol. 33, no. 1, pp. 31-88, 2001.
12. N. Ranganathan and R. Sastry, VLSI architectures for pattern matching, International Journal of Pattern Recognition and Artificial Intelligence, vol. 8, no. 4, pp. 815-843, 1994.
13. R. Sastry, N. Ranganathan and K. Remedios, CASM: a VLSI chip for approximate string matching, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 8, pp. 824-830, 1995.
14. P. H. Sellers, The theory and computation of evolutionary distances: pattern recognition, Journal of Algorithms, vol. 1, pp. 359-373, 1980.
15. M. Snir, S. Otto, S. Huss-Lederman, D. W. Walker and J. Dongarra, MPI: The complete reference, The MIT Press, Cambridge, Massachusetts, 1996.
16. T. K. Yap, O. Frieder and R. L. Martino, Parallel computation in biological sequence analysis, IEEE Transactions on Parallel and Distributed Systems, vol. 9, no. 3, pp. 283-293, 1998.