A Practical Performance Comparison of Parallel Sorting Algorithms on Homogeneous Network of Workstations


Haroon Rashid and Kalim Qureshi*
COMSATS Institute of Information Technology, Abbottabad, Pakistan
*Math. and Computer Science Department, Kuwait University, Kuwait

Three parallel sorting algorithms have been implemented and compared in terms of their overall execution time: the odd-even transposition sort, parallel merge sort and parallel rank sort. A homogeneous cluster of workstations has been used to compare the implemented algorithms, and the MPI library has been selected to establish the communication and synchronization between the processors. The time complexity of each parallel sorting algorithm is also presented and analyzed.

Keywords: Parallel sorting algorithms, performance analysis, network parallel computing.

1. Introduction

Sorting is one of the most important operations in database systems, and its efficiency can drastically influence overall system performance. To speed up database systems, parallelism is applied to the execution of the data administration operations. Workstations connected via a local area network make it possible to reduce application processing time [1]. Because of the importance of the distributed computing power of workstations or PCs connected in a local area network [2], we have been studying the performance of various scientific applications on such platforms [1-3]. Dedicated parallel machines are used for parallel database systems, and a lot of research has already addressed the issues related to dedicated parallel machines [4]. Little research, however, has been carried out on the performance of parallel sorting algorithms on clusters of workstations.

2. Parallel Sorting Algorithms

In this paper, three parallel sorting algorithms are implemented and evaluated. These algorithms are:
1. Odd-even transposition sort.
2. Parallel rank sort.
3. Parallel merge sort.

2.1 Odd-Even Transposition

The odd-even transposition sort algorithm [5,6] starts by distributing sub-lists of n/p elements (p is the number of processors) to all the processors. Each processor then sequentially sorts its sub-list locally. The algorithm then operates by alternating between an even and an odd phase, hence the name odd-even. In the even phase, each even-numbered processor (processor i) communicates with the next odd-numbered processor (processor i+1). In this communication step, the two sub-lists of the communicating pair are merged together; the upper half of the merged list is then kept by the higher-numbered processor and the lower half is put in the lower-numbered processor. Similarly, in the odd phase, each odd-numbered processor (processor i) communicates with the previous even-numbered processor (processor i-1) in exactly the same fashion as in the even phase. It is clear that the whole list will be sorted in a maximum of p phases. Figure 1 shows an illustration of the odd-even transposition algorithm.

Figure 1: Odd-even transposition sort, sorting 12 elements using 4 processors.
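The phase loop described above maps directly onto MPI's send-receive primitives. Below is a minimal C/MPI sketch of the algorithm as described, assuming n is divisible by p; the compare_split helper and all other identifiers are illustrative and are not taken from the paper's implementation.

/* Illustrative sketch of odd-even transposition sort (section 2.1).
   Assumes n is divisible by p; names are hypothetical, not the authors' code. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Merge my sorted block with the partner's sorted block (both of length m)
   and keep the lower half if keep_low is set, otherwise the upper half. */
static void compare_split(int *mine, const int *theirs, int *tmp, int m, int keep_low) {
    int i, j, k;
    if (keep_low) {
        for (i = 0, j = 0, k = 0; k < m; k++)
            tmp[k] = (mine[i] <= theirs[j]) ? mine[i++] : theirs[j++];
    } else {
        for (i = m - 1, j = m - 1, k = m - 1; k >= 0; k--)
            tmp[k] = (mine[i] >= theirs[j]) ? mine[i--] : theirs[j--];
    }
    memcpy(mine, tmp, m * sizeof(int));
}

/* local holds this processor's n/p keys, already distributed by the master. */
void odd_even_sort(int *local, int m, int rank, int p) {
    int *recv = malloc(m * sizeof(int));
    int *tmp  = malloc(m * sizeof(int));
    qsort(local, m, sizeof(int), cmp_int);          /* local sequential sort */
    for (int phase = 0; phase < p; phase++) {
        /* even phase pairs (0,1),(2,3),...; odd phase pairs (1,2),(3,4),... */
        int partner = (phase % 2 == rank % 2) ? rank + 1 : rank - 1;
        if (partner < 0 || partner >= p) continue;  /* no partner this phase */
        MPI_Sendrecv(local, m, MPI_INT, partner, 0,
                     recv,  m, MPI_INT, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* lower-numbered processor keeps the lower half, as described above */
        compare_split(local, recv, tmp, m, rank < partner);
    }
    free(recv); free(tmp);
}

After p phases, the concatenation of the local arrays in rank order forms the sorted list; a master processor could collect it with MPI_Gather.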

Time complexity of odd-even transposition: At first glance, parallelizing the bubble sort algorithm appears to improve performance by a factor of p. However, careful analysis of the complexity reveals that the gain is actually much larger than that. Below is the analysis of the time complexity of the odd-even transposition sorting algorithm [5].

The performance of the sequential bubble sort algorithm is:

\sum_{i=1}^{n} i = 1 + 2 + 3 + \cdots + n = \frac{n(n+1)}{2} = O(n^2)

The performance of the odd-even transposition algorithm is:

\sum_{i=1}^{n/p} i = 1 + 2 + 3 + \cdots + \frac{n}{p} = \frac{\frac{n}{p}\left(\frac{n}{p}+1\right)}{2} = O(bn^2), \quad \text{where } b = \frac{1}{p^2}

This means that, theoretically speaking, the sorting time is reduced by a factor of 1/p^2.

2.2 Parallel Merge Sort

The merge sort algorithm uses a divide-and-conquer strategy to sort its elements [7]. The list is divided into two equally sized lists, and the generated sub-lists are divided further until each element is obtained individually. The elements are then merged together as pairs to form sorted lists of length 2, and these lists are merged repeatedly until the whole list is reconstructed. This algorithm can be parallelized by distributing n/p elements (where n is the list size and p is the number of processors) to each slave processor. Each slave sequentially sorts its sub-list (e.g. using sequential merge sort) and then returns the sorted sub-list to the master. Finally, the master is responsible for merging all the sorted sub-lists into one sorted list. Figure 2 shows an illustration of the parallel merge sort algorithm.

Figure 2: Parallel merge sort algorithm, sorting 12 elements using 4 processors.

Time complexity of parallel merge sort: Sequential merge sort has time complexity O(n log n). When the merge sort algorithm is parallelized, the time complexity reduces to O((n/p) log(n/p)), as stated in [5].
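The master/slave scheme of section 2.2 can be expressed compactly with MPI collectives. The sketch below is an illustrative reconstruction, not the authors' code; it assumes n is divisible by p and that rank 0 acts as the master.

/* Illustrative sketch of parallel merge sort (section 2.2): scatter n/p keys,
   sort locally, gather, then the master merges the p sorted runs. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Merge two sorted runs a[0..la) and b[0..lb) into out. */
static void merge(const int *a, int la, const int *b, int lb, int *out) {
    int i = 0, j = 0, k = 0;
    while (i < la && j < lb) out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < la) out[k++] = a[i++];
    while (j < lb) out[k++] = b[j++];
}

/* list is significant only at the master (rank 0). */
void parallel_merge_sort(int *list, int n, int rank, int p) {
    int m = n / p;
    int *local = malloc(m * sizeof(int));
    MPI_Scatter(list, m, MPI_INT, local, m, MPI_INT, 0, MPI_COMM_WORLD);
    qsort(local, m, sizeof(int), cmp_int);            /* slave: local sort */
    MPI_Gather(local, m, MPI_INT, list, m, MPI_INT, 0, MPI_COMM_WORLD);
    if (rank == 0) {                                  /* master: merge p runs */
        int *tmp = malloc(n * sizeof(int));
        int sorted = m;                               /* prefix already sorted */
        for (int r = 1; r < p; r++) {
            merge(list, sorted, list + sorted, m, tmp);
            sorted += m;
            memcpy(list, tmp, sorted * sizeof(int));
        }
        free(tmp);
    }
    free(local);
}

The final loop performs the p-1 sequential merges at the master, which is the merging and communication step whose cost grows with p, as the results section discusses.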

2.3 Parallel Rank Sort

In the sequential rank sort algorithm (also known as enumeration sort), each element in the list to be sorted is compared against the rest of the elements to determine its rank among them [8]. This sequential algorithm can easily be parallelized by letting the master processor distribute the list to all the processors and assigning each slave processor n/p elements (where n is the list size and p is the number of processors). Each processor is responsible for computing the ranks of its n/p elements. The ranks are then returned by the slaves to the master, who in turn is responsible for constructing the whole sorted list (see figure 3).

Figure 3: Parallel rank sort algorithm, sorting 12 elements using 4 processors.

Time complexity of parallel rank sort: In the sequential version of the rank sort algorithm, each element is compared to all the other elements, so the complexity of the algorithm can be expressed as:

O(n^2)

When this algorithm is parallelized, it can easily be seen that the complexity reduces to:

O(n^2/p)

This means that if n processors are used, the sorting time becomes almost linear, O(n).

3. Implementation

Each of the parallel algorithms stated above is compared to its sequential implementation and evaluated in terms of its overall execution time, speedup and efficiency. The speedup measures the gain of parallelizing an application versus running the application sequentially and can be expressed as:

Speedup = Execution time using one processor / Execution time using p processors    (1)

The efficiency, on the other hand, indicates how well the multiple processors are utilized in executing the application and can be expressed as:

Efficiency = Speedup / Total number of processors    (2)

The C programming language was used to develop the sorting algorithms, and MPI library routines were used to handle the communication and synchronization between all the processors. The performance of the sorting algorithms was evaluated on a homogeneous cluster of SUN workstations running the SunOS operating system. Each sorting algorithm was evaluated on 2, 4, 6, 8, 10 and 12 machines, and the speedup and efficiency were calculated from the recorded execution times. An array of 10,000 random integers was used to test the parallel algorithms.
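As an illustration of how the measurements behind equations (1) and (2) can be taken, the sketch below times a parallel run with MPI_Wtime and derives both metrics. It is an assumed harness, not the paper's code; t_seq stands for the previously measured single-processor execution time, and report_metrics is a hypothetical helper.

/* Assumed measurement harness for equations (1) and (2); MPI_Wtime is the
   standard MPI wall-clock timer. */
#include <mpi.h>
#include <stdio.h>

void report_metrics(double t_seq, void (*parallel_sort)(void)) {
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    MPI_Barrier(MPI_COMM_WORLD);            /* start all processors together */
    double t0 = MPI_Wtime();
    parallel_sort();                        /* algorithm under test */
    MPI_Barrier(MPI_COMM_WORLD);            /* wait until every rank is done */
    double t_par = MPI_Wtime() - t0;

    if (rank == 0) {
        double speedup    = t_seq / t_par;  /* equation (1) */
        double efficiency = speedup / p;    /* equation (2) */
        printf("p=%d  time=%.4f s  speedup=%.2f  efficiency=%.2f\n",
               p, t_par, speedup, efficiency);
    }
}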

3.2 Results and Discussions

Odd-even transposition: Figure 4 shows the total execution time of the odd-even transposition sorting algorithm. It can easily be seen that the parallel algorithm is by far faster than the sequential bubble sort algorithm. The speedup of the odd-even transposition sorting algorithm is displayed in figure 5, and its efficiency in figure 6.

Figure 4: Total execution time of the odd-even transposition sort algorithm.
Figure 5: Speedup of the odd-even transposition sort algorithm.
Figure 6: Efficiency of the odd-even transposition sort algorithm.

Parallel merge sort: Parallel merge sort is the most efficient of the implemented algorithms. Figure 7 displays its total execution time. It shows that sorting with up to 8 processors helps reduce the total time required to sort the elements. However, increasing the number of processors beyond 8 results in lower performance than the sequential merge sort algorithm, as shown in figure 7. This is due to the communication overhead incurred when the processors merge their results into one sorted list. The speedup and efficiency of the parallel merge sort algorithm are displayed in figures 8 and 9 respectively.

Figure 7: Total execution time of the parallel merge sort algorithm.
Figure 8: Speedup of the parallel merge sort algorithm.
Figure 9: Efficiency of the parallel merge sort algorithm.

Parallel rank sort: Running the parallel rank sort algorithm on 2 processors to sort 10,000 integers is slower than the sequential implementation, due to the communication overhead needed to distribute the whole unsorted list to all the processors. However, the benefit of parallelization kicks in as the number of processors increases. Running parallel rank sort on 2 processors should improve performance only when the number of elements to be sorted is considerably greater than 10,000 (e.g. 1,000,000 elements). Figure 10 shows the total execution time of the parallel rank sort algorithm compared to its sequential implementation. When the algorithm runs on 6 processors, it improves the total execution time by a factor slightly greater than 2. However, the large amount of communication overhead and data transfer required prevents the performance from increasing beyond this factor. The speedup of the parallel rank sort algorithm is displayed in figure 11, and its efficiency in figure 12.

The limitation of parallel rank sort is the memory it requires in order to sort its elements. Each processor needs a copy of the whole unsorted list in order to rank its portion of the elements. A further memory requirement is an array proportional to the unsorted list size, which enables the algorithm to sort lists with repeated elements. The parallel rank sort algorithm can therefore be considered a memory-intensive algorithm.

Figure 10: Total execution time of the parallel rank sort algorithm.
Figure 11: Speedup of the parallel rank sort algorithm.
Figure 12: Efficiency of the parallel rank sort algorithm.
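The rank computation, together with the tie-breaking rule that makes the output array work for repeated elements, can be sketched as follows. This is an assumed reconstruction of the technique described above, not the authors' code; rank_chunk is a hypothetical helper that each slave would call on its n/p-element portion.

/* Illustrative reconstruction of the per-processor rank computation in
   parallel rank sort (section 2.3). Each slave ranks its m = n/p elements
   against a full copy of the list; sorted_out has n slots, the array
   "proportional to the unsorted list size" mentioned above. */
void rank_chunk(const int *list, int n, int first, int m, int *sorted_out) {
    for (int k = 0; k < m; k++) {
        int i = first + k;              /* global index of the element to rank */
        int rank = 0;
        for (int j = 0; j < n; j++) {
            /* Count smaller elements; break ties between equal elements by
               original position, so every rank is unique even when the list
               contains duplicates. */
            if (list[j] < list[i] || (list[j] == list[i] && j < i))
                rank++;
        }
        sorted_out[rank] = list[i];     /* unique slot even with duplicates */
    }
}

Each slave would return its (rank, value) pairs to the master, which writes them into the final array; the double loop makes the O(n^2/p) cost per processor explicit.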

4. Conclusions

Three parallel sorting algorithms have been developed and executed on a homogeneous cluster of machines: the odd-even transposition sorting algorithm, the parallel rank sort algorithm and the parallel merge sort algorithm. Figure 13 shows a comparison between the three parallel sorting algorithms when sorting 10,000 integers on 2, 4, 6, 8, 10 and 12 workstations.

From figure 13 it is obvious that parallel merge sort is the fastest sorting algorithm, followed by the odd-even transposition sorting algorithm and then the parallel rank sorting algorithm. The parallel rank sort algorithm is the slowest because each processor needs its own copy of the unsorted list, which in turn creates serious communication overhead. A solution has also been developed and successfully tested that allows parallel rank sort to sort a list of integers with repeated elements. The odd-even transposition sorting algorithm comes in second place because of the time it takes to initially sort the elements locally on each processor using sequential bubble sort, which has a performance of O(n^2). The odd-even transposition sorting algorithm could be improved by adopting a faster, O(n log n) sequential sorting algorithm (e.g. sequential merge sort or quicksort) for the local sort on each processor.

Figure 13: A comparison of the total execution time required for sorting 10,000 integers using parallel merge sort, parallel rank sort and odd-even transposition.

References

[1] Kalim Qureshi and Haroon Rashid, "A Practical Performance Comparison of Parallel Matrix Multiplication Algorithms on Network of Workstations", IEE Transaction Japan, Vol. 125, No. 3, 2005.
[2] Kalim Qureshi and Haroon Rashid, "A Practical Performance Comparison of Two Parallel Fast Fourier Transform Algorithms on Cluster of PCs", IEE Transaction Japan, Vol. 124, No. 11, 2004.
[3] Kalim Qureshi and Masahiko Hatanaka, "A Practical Approach of Task Partitioning and Scheduling on Heterogeneous Parallel Distributed Image Computing System", Transaction of IEE Japan, Vol. 120-C, No. 1, Jan. 2000, pp. 151-157.
[4] K. Sado and Y. Igarashi, "Some Parallel Sorts on a Mesh-Connected Processor Array and Their Time Efficiency", Journal of Parallel and Distributed Computing, 3, pp. 398-410, 1999.
[5] D. Bitton, D. DeWitt, D.K. Hsiao and J. Menon, "A Taxonomy of Parallel Sorting", ACM Computing Surveys, 16, 3, pp. 287-318, September 1984.
[6] Y.D. Song and B. Shirazi, "A Parallel Exchange Sort Algorithm", Southern Methodist University, IEEE, 1989.
[7] B.R. Iyer and D.M. Dias, "System Issues in Parallel Sorting for Database Systems", Proc. Int. Conference on Data Engineering, pp. 246-255, 2003.
[8] F. Meyer auf der Heide and A. Wigderson, "The Complexity of Parallel Sorting", SIAM Journal of Computing, 16, 1, pp. 100-107, February 1999.