Modeling Parallel Applications Performance on Heterogeneous Systems




Jameela Al-Jaroodi, Nader Mohamed, Hong Jiang and David Swanson
Department of Computer Science and Engineering
University of Nebraska-Lincoln
Lincoln, NE 68588-0115
[jaljaroo, nmohamed, jiang, dswanson]@cse.unl.edu

Abstract

Current technologies have made it possible to execute parallel applications across heterogeneous platforms. However, the available performance models do not provide adequate methods to calculate, compare and predict the applications' performance on these platforms. In this paper, we discuss an enhanced performance evaluation model for parallel applications on heterogeneous systems. In our analysis, we include machines of different architectures, specifications and operating environments. We also discuss the enabling technologies that facilitate such heterogeneous applications. The model is then validated through experimental measurements using an agent-based parallel Java system, which facilitates simultaneous utilization of heterogeneous systems for parallel applications. The model provides good evaluation metrics that allow developers to assess and compare the performance of parallel heterogeneous applications.

Key Words

Parallel applications, heterogeneous systems, cluster, and performance model.

1. Introduction

High performance, parallel and distributed applications are becoming increasingly resource-intensive, requiring high speed processors, large memory capacity, huge and fast storage systems, and fast, reliable interconnections. However, most applications are confined within a single architecture due to the machine-dependent nature of the development environments. Recently, some effort has been invested in providing parallel programming capabilities that can simultaneously span multiple heterogeneous platforms. Such environments could provide efficient parallel executions by allowing heterogeneous applications to be matched to suitable heterogeneous platforms.
One strong direction in developing these environments is using Java, thus enabling the utilization of heterogeneous systems in executing parallel applications. The success of such efforts has also led to the need for some formal analytical models to evaluate and compare the performance of applications utilizing these systems.

The heterogeneity of a system can be defined in different ways depending on the varying characteristics. In one scenario, for example, a network of workstations (NOW) may be considered heterogeneous due to the varying load on each machine, so the available resources on each workstation are not always the same. In a second scenario, a heterogeneous system is a collection of different machines with varying architectures, different numbers of processors and operating environments. In this paper, heterogeneity refers to the second scenario.

Execution of parallel applications on heterogeneous systems can benefit from a suitable model that measures and evaluates their performance in order to manage and schedule resources more effectively. The existing performance models addressing this issue are mostly restricted to heterogeneous systems that contain similar (homogeneous) components, but with varying loads on each component. In addition, almost all models assume single-thread machines that execute exactly one task of the application at any given time (see Section 3). This paper presents an enhanced model for measurements that accommodates heterogeneous systems with varying machine configurations and different numbers of processors per machine.

To enable heterogeneous parallel applications, portable and machine-independent systems and development tools are needed. Java is considered a very suitable programming language for such applications because it is machine independent: Java code can be compiled once and the resulting bytecode can be executed on any machine with any operating system without any changes. Therefore, much effort has been put into providing Java-based parallel systems [1].
In this paper, we introduce an example agent-based system for parallel Java applications [2]. This system uses a Java object-passing interface (JOPI) [9] and allows simultaneous execution of the application's parallel processes on heterogeneous machines. We will use JOPI for our experiments to evaluate and validate the enhanced performance model.

This paper first discusses the enabling technology, including the agent-based run-time environment and JOPI, in Section 2. Section 3 describes some of the measurement metrics for parallel applications and the enhancements added to accommodate the performance evaluation of heterogeneous applications. In Section 4, we present the experimental evaluations using some benchmark applications. Finally, Section 5 concludes the paper with some remarks about current and future work.

Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS'03)

2. Heterogeneous Systems

In the context of this paper, we define a heterogeneous system (HS) as a collection of machines of varying architectures, numbers of processors and operating environments that are connected via a local area network (LAN). However, the prototype system and analysis introduced later in the paper also apply to heterogeneous systems connected via other types of networks. An example heterogeneous system is a multi-node Linux-based cluster connected to an IRIX-based multi-processor machine (e.g. SGI Origin 2000) and a collection of Windows-based servers.

Some research has been conducted to study and analyze the performance of heterogeneous applications, but most of it is based on theoretical analysis and considers heterogeneity in terms of varying loads on similar machines [15, 5, 10, 14]. This may be attributed to the fact that, at the time, the available technology did not provide an easy way to deploy and execute applications on varying platforms simultaneously. However, some research has proposed models and techniques to implement parallel applications for heterogeneous networks, such as in [3]. Although many development tools and languages can be used for such parallel applications and environments, Java's machine independence currently makes it a well-suited development tool for parallel and distributed applications that span heterogeneous platforms.
Standard Java technologies such as the Java virtual machine (JVM) [13] and JINI [7] provide a variety of features to develop and implement distributed Java applications. However, some key features are lacking in the JVM when it is directly applied as an underlying infrastructure for constructing and executing parallel Java applications on heterogeneous systems. These missing features, needed by different parallel Java programming models, include: loading user programs onto the remote JVMs of the heterogeneous system nodes; managing the heterogeneous system resources; scheduling user jobs; security; job and thread naming; and user commands and collective operations.

A number of research groups have worked on providing parallel capabilities in Java [1], which is portable by nature. However, little work has been done on utilizing heterogeneous platforms to execute parallel applications and on analyzing their performance. This section discusses an example Java-based environment that facilitates executing parallel Java applications on heterogeneous systems. The main advantages of utilizing heterogeneous systems are:

1. Achieving high performance results at a relatively low cost compared to multiprocessor parallel machines.
2. Easier expandability and upgrade of the hardware by adding more machines that are not necessarily the same as the existing ones, or by replacing existing machines with newer ones.
3. Allowing for better utilization of the different architectures by scheduling application components on the machines most suitable for them. For example, if some sections of the application are tightly coupled and require frequent communications, these sections can be scheduled on an MPP or SMP machine, while other sections that are relatively independent can be scheduled on a cluster of workstations in the system.

We implemented an example parallel Java system that supports these requirements. The system utilizes objects for information (data and code) exchange and can simultaneously span multiple heterogeneous platforms.
The next sub-sections discuss this system, and Section 4 shows the experiments done using it.

2.1 An Agent-Based Infrastructure

The agent-based prototype system, discussed in detail in [2], is designed to satisfy the requirements for supporting heterogeneous parallel applications. The system provides a pure Java infrastructure based on a distributed memory model. This makes the system portable, secure and capable of handling different programming models on heterogeneous systems. The system also supports multithreaded execution on multi-processor machines. It has a number of components, such as the software agents and client services, that collectively provide middleware services for a high performance Java environment on clusters and heterogeneous networked systems. The main functions of the system are to deploy, schedule, and support the execution of the Java code, in addition to managing, controlling, monitoring, and scheduling the available resources on clusters or on a collection of heterogeneous networked systems. Java serializable objects [13] are used as the communication unit among the system's components, where each object represents a specific type of request or reply. The client services and environment APIs provide commands for users to interact with the environment. The system also provides features for remote deployment, job monitoring and control, job and thread naming, and security.

2.2 The Java Object-Passing Interface

The agent-based infrastructure is capable of supporting different parallel and distributed programming models. Examples include distributed shared object APIs for Java (work in progress) and the Java object-passing interface (JOPI), which is discussed in more detail in [9]. In addition, distributed applications can utilize this infrastructure to facilitate their operation. JOPI was implemented on top of the services provided by this infrastructure. JOPI is an object-passing interface with primitives similar to those in MPI, where information is exchanged by means of objects instead of messages. Using JOPI, users can initiate point-to-point communications in synchronous and asynchronous modes, group communications and synchronization, in addition to other supporting functions and attributes. JOPI utilizes most of the features provided by the agent-based infrastructure, including scheduling mechanisms, deployment and execution of user classes, control of user threads and available resources, and the security mechanisms.

Writing parallel programs using JOPI is generally simpler than using C and MPI, mainly due to the object-oriented nature of Java: the user can define a class for solving the sequential version first and test it, then use it as the main object in the parallel version. A few additional methods and classes can be added to the original classes to facilitate the decomposition and distribution of the problem, the exchange of intermediate results, and the reconstruction of the solution. Finally, the main class implements the parallelization policy, such as deciding on the size of sub-problems and load balancing. Passing objects in JOPI preserves the object-oriented advantages of Java and simplifies the parallelization process and the information exchange mechanisms.

3. The Performance Model

With the technology available to develop and execute heterogeneous parallel applications on heterogeneous systems, it is necessary to provide a model that gives a better understanding of the performance of such applications. There has been a long debate on what is the best method to represent the performance of a parallel application. Generally, the speedup and efficiency metrics are considered adequate given an appropriate description of how exactly they are measured and reported. In some cases, however, speedup can be misleading because it may give a distorted picture. The one metric that almost all agree on is the elapsed time (response time) [4]. Elapsed time is directly measured for the application and it encapsulates all affecting attributes such as system time and communication time. In addition, it represents what the user can relate to as a performance measure. However, it is not useful when trying to predict the performance of the application with different settings such as larger input sets or more processors. In addition, most measurements and analyses are done at the application level rather than at a lower level, to provide a practical view of how parallel applications will perform on a heterogeneous system and what the user can expect to achieve with different system configurations or algorithms.

The other issue that needs to be considered is the type of environment used to execute the parallel application. Most of the models and metrics developed earlier are based on the homogeneous environment and thus do not account for the possible differences between the participating machines. For example, speedup is measured as the elapsed time on one processor, T_1, divided by the time on p processors, T_p. In this case, if some of the processors are slower than others, the effect will not be evident. Some research has been conducted to consider the heterogeneity of parallel applications, and some basic metrics were defined in [15, 5, 10, 14].
In [15], the authors introduced a model to measure and evaluate the performance of parallel applications on heterogeneous networks. However, the heterogeneity was defined based on the different loads on the participating machines. Therefore, the model assumed a cluster of similar machines connected by a uniform network, where each machine can execute exactly one task (process) at any given point in time. However, current technology and development tools allow for utilizing heterogeneous systems, which these models do not consider. The remainder of this section briefly discusses the model introduced in [15] and extends it analytically to accommodate heterogeneous systems, where participating machines have multiple processors, different architectures and varying CPU/memory/IO performance. In addition, the enhanced model incorporates machines that execute multiple tasks simultaneously.

The power of a machine M_i is defined by the amount of work the machine can complete in unit time. The power weight W_i(A) of machine M_i with application A is given as the amount of work M_i can complete relative to the fastest available machine in the system. Thus, the power weight of a machine can be calculated by either equation (1) or (2) as follows:

$$W_i(A) = \frac{S_i(A)}{\max_{j=1}^{m}\{S_j(A)\}} \qquad (1)$$

$$W_i(A) = \frac{\min_{j=1}^{m}\{T(A, M_j)\}}{T(A, M_i)} \qquad (2)$$

where S_i(A) is the speed of machine M_i (in MIPS or FLOPS) executing application A, and T(A, M_i) is the elapsed execution time for application A on machine M_i.

Here, m is the total number of machines in the heterogeneous system HS. Using the power weight of the machines provides a more accurate measure of the effect each machine will have on the system, thus giving a clear view of how the system performs. Using this definition, [15] derives the speedup SP, the efficiency E, and the system heterogeneity H as

$$SP = \frac{\min_{j=1}^{m}\{T(A, M_j)\}}{T(A, HS)} \qquad (3)$$

$$E = \frac{SP}{\sum_{j=1}^{m} W_j(A)} \qquad (4)$$

$$H = \frac{\sum_{j=1}^{m} (1 - W_j(A))}{m} \qquad (5)$$

where T(A, HS) is the elapsed time of application A on the heterogeneous system HS.

Based on the above model, we introduce a number of assumptions to extend the model for a truly heterogeneous system. The major difference is that any machine can execute multiple parallel tasks simultaneously. The original model assumes that each machine in the system executes a single task at any point in time; therefore, the power weight can be calculated using the machine's speed. However, in the current situation, each machine can have one or more processors and can simultaneously execute multiple tasks of the application.

1. The power weight calculations remain the same, using the elapsed time of the application on a single processor of that machine. It is also possible to treat all the processors used on that machine as one unit, thus basing the power weight on the execution time using all the processors. However, this would require recalculating the power weights of all machines every time the number of processors used changes.
2. Machines are dedicated, thus no owner compute time is involved. However, other effects on time such as system and idle time and communication costs are considered in the model as part of the total elapsed time.
3. The measurements are application dependent to a limited extent. In many cases the effect of the application remains negligible as long as the application and its data fit in main memory during execution, so that no page swapping is involved. Without the overhead of memory paging, applications will behave similarly.
4. To calculate the power weights of machines we rely on the execution time of the application, not the speed of the machine. The time gives a more accurate measure of the machine's overall performance because it accounts for all affecting attributes such as actual CPU time, idle time and I/O time.
5. The model is applied at the application level to reflect the user's perception of the application performance. In addition, this simplifies the model by incorporating all time considerations within a single measurement.

This form provides a measure of the relative machine powers with respect to the parallel application used. When multiple processes execute on the machine, the elapsed time will include processing time and additional overhead imposed by both the system and communication. Thus, the efficiency calculations need to be adjusted to accommodate the varying number of processors on each machine. Similarly, calculating the heterogeneity of the system requires accounting for the number of processors used on each machine:

$$E = \frac{SP}{\sum_{j=1}^{m} W_j \cdot n_j} \qquad (6)$$

$$H = \frac{\sum_{j=1}^{m} n_j \cdot (1 - W_j(A))}{m} \qquad (7)$$

where n_j is the number of processors (or tasks) used in machine M_j. By multiplying the number of processors n_j by the power weight W_j of the machine M_j, we can represent the expected power weight of that machine using the n_j processors rather than a single processor. In equation (6), the power weight of a machine M_j is its weight using a single processor. The product n_j * W_j gives an estimate of the power weight when n_j processors are used to execute n_j tasks of the application. This estimate does not account for the communication overhead imposed between the n_j processors within the machine M_j. However, that overhead is relatively small because the processors reside in a single machine and within close proximity of each other. Equation (7) represents the true effect of having machines with different configurations and computing power, hence providing a representative evaluation metric for application performance.
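As a rough illustration of how equations (2), (3) and (6) fit together, the following Python sketch recomputes the power weights, speedup and efficiency for one heterogeneous configuration. The timings are the single-processor and 12-processor matrix multiplication measurements reported in Table 3 (Section 4.2); the function and variable names are assumptions made for this example and are not part of the original system.

```python
def power_weights(single_times):
    """Equation (2): W_i = min_j T(A, M_j) / T(A, M_i)."""
    fastest = min(single_times.values())
    return {name: fastest / t for name, t in single_times.items()}


def speedup(single_times, t_hs):
    """Equation (3): fastest single-processor time divided by T(A, HS)."""
    return min(single_times.values()) / t_hs


def efficiency(sp, weights, procs):
    """Equation (6): SP / sum_j (W_j * n_j)."""
    return sp / sum(weights[name] * n for name, n in procs.items())


# Single-processor elapsed times in seconds (Table 3).
single_times = {"SH": 280.83, "RCF": 807.65, "CSNT": 265.19}
# Processors used per machine and the measured time for that configuration.
procs = {"SH": 6, "RCF": 3, "CSNT": 3}
t_hs = 42.992

weights = power_weights(single_times)
sp = speedup(single_times, t_hs)
eff = efficiency(sp, weights, procs)
# Processor-count-only efficiency, ignoring the power weights.
naive_eff = sp / sum(procs.values())

print(f"weights={weights}")
print(f"speedup={sp:.3f} efficiency={eff:.3f} naive={naive_eff:.3f}")
```

With these inputs the weighted efficiency comes out at roughly 0.64, against roughly 0.51 when the speedup is simply divided by the 12 processors, which is the gap the model is meant to expose.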
The calculations can also be reversed to predict the execution time of the application if part of the configuration has changed. Given the efficiency and power weights of the current configuration, the speedup can be calculated from equation (6). Using this speedup and equation (3), T(A, HS) for the new configuration can be calculated. This value is an approximation of the expected performance with the new configuration. The efficiency value used can either be the same as the current configuration's efficiency or be an estimate of the next possible efficiency based on the available measurements. The efficiency can be estimated by extrapolating the efficiency graph against the number of processors used. This is achievable by calculating the efficiency values for a number of measurements, plotting the results and employing a curve fitting technique on the plot. A simpler linear extrapolation can also be used, but this will give a less accurate estimate.
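The reverse calculation can be sketched in a few lines of Python. The example below reuses the TSP figures from Section 4.1: the efficiency of the 4-2 configuration (77.8%) is carried over to the 6-2 configuration, equation (6) is solved for the speedup, and equation (3) is solved for T(A, HS). The single-processor SH time is taken as 35609.729 s, which is the value implied by the speedups reported in Table 2; that value and the function name are assumptions of this sketch.

```python
def predict_elapsed_time(t_fastest_single, assumed_efficiency, weights, procs):
    """Solve equation (6) for SP, then equation (3) for T(A, HS)."""
    sp = assumed_efficiency * sum(weights[name] * n for name, n in procs.items())
    return sp, t_fastest_single / sp


# Power weights from Table 2; SH is the reference machine.
weights = {"SH": 1.0, "CSNT": 0.9411}
# Target configuration: six SH processors and two CSNT processors.
target = {"SH": 6, "CSNT": 2}

sp_pred, t_pred = predict_elapsed_time(35609.729, 0.778, weights, target)

# Measured time for the same configuration (Table 2), used to check the estimate.
t_measured = 5633.393
error_pct = 100.0 * (t_pred - t_measured) / t_measured

print(f"predicted speedup={sp_pred:.4f}")
print(f"predicted time={t_pred:.1f} s, error={error_pct:.1f}%")
```

Replacing the carried-over efficiency with a value obtained by curve fitting over several measured configurations would follow the same two steps; only the `assumed_efficiency` argument changes.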

Similarly, the expected elapsed time for the application with a different input size can be estimated.

The model provides a uniform method to measure and evaluate parallel application performance on a heterogeneous system. The model relies on the measured elapsed time to calculate the power weight of the machines. Using the power weights and elapsed time of the parallel application, speedup, efficiency and heterogeneity can be calculated. Moreover, the calculations can be reversed to estimate the elapsed time for other problem settings such as larger data sets or different system configurations. However, one shortcoming is that it may require recalculating all the power weights of the machines if the fastest machine is removed or replaced by a faster one.

4. Model Validation

Some benchmark applications were written to evaluate the performance of the heterogeneous system using JOPI. All experiments used the standard JVM SDK 1.3.1 and were executed on homogeneous and heterogeneous clusters. The platforms used are listed in Table 1.

Name   Platform Description
CSNT   3 CPUs, Intel x86 700 MHz, 1.5 GB RAM. OS: Windows 2000 Advanced Server
RCF    Origin 2000, 32 processors, 250 MHz, 4 MB cache, 8 GB RAM. OS: IRIX 6.5.13
SH     Cluster, 24 nodes, dual 1.2 GHz Athlon MP, 256 KB cache, 1 GB RAM per node. OS: Linux

Table 1: A list of machines used in the experiments.

4.1 Traveling Salesman Problem (TSP)

The algorithm is based on branch-and-bound search [8]. This problem required using many of the JOPI primitives to implement an efficient load-balanced solution. Broadcast was used to distribute the original problem object to the processes, and was also used by the processes to broadcast the minimum tour value found. This allows other processes to update their minimum value and so speed up their search. In addition, asynchronous communication is used by the processes to report their results to the master process while continuing to process other parts.
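To make the role of the shared minimum tour value concrete, here is a minimal sequential branch-and-bound sketch in Python. It is not the JOPI implementation used in the experiments; the function name and the small distance matrix are made up for illustration. The `best` record plays the part of the value that the parallel processes broadcast to each other: any partial tour whose cost already reaches it is pruned.

```python
import math


def tsp_branch_and_bound(dist):
    """Depth-first branch-and-bound over tours that start at city 0.

    Pruning on the partial cost assumes non-negative distances.
    """
    n = len(dist)
    # Best complete tour found so far. In a parallel version, this is the
    # value each process would broadcast when it improves on it.
    best = {"cost": math.inf, "tour": None}

    def search(tour, visited, cost):
        if cost >= best["cost"]:
            return  # bound: this branch cannot beat the best known tour
        if len(tour) == n:
            total = cost + dist[tour[-1]][tour[0]]  # close the cycle
            if total < best["cost"]:
                best["cost"], best["tour"] = total, tour[:]
            return
        last = tour[-1]
        for city in range(n):
            if city not in visited:
                visited.add(city)
                tour.append(city)
                search(tour, visited, cost + dist[last][city])
                tour.pop()
                visited.remove(city)

    search([0], {0}, 0)
    return best["cost"], best["tour"]


# Hypothetical 4-city instance used only to exercise the sketch.
dist = [
    [0, 10, 15, 20],
    [10, 0, 35, 25],
    [15, 35, 0, 30],
    [20, 25, 30, 0],
]
cost, tour = tsp_branch_and_bound(dist)
print(cost, tour)
```

In a parallel setting, each process would run `search` on its own subset of branches and tighten its local bound whenever a smaller value arrives from another process; the earlier a good bound is shared, the more branches every process can skip.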
The results obtained (Figure 1) show good speedup measures with a growing number of processors and a fixed problem size (22 cities).

Figure 1. Speedup results for TSP (22 cities) on a homogeneous cluster (plot of speedup against the number of processors, 1 to 20, for TSP on Sandhills, with the ideal speedup shown for reference).

The TSP was also executed on a heterogeneous cluster consisting of nodes from SH and CSNT. In this case the initial calculation of the power weights shows that SH is the fastest machine and CSNT has a power weight of around 0.94 relative to SH. TSP was executed using different configurations of heterogeneous processors from SH and CSNT (see Table 2). The results show that the efficiency at four processors is about 65%, which is low because of the communication overhead, but when more processors of SH (the faster machine) are added, the efficiency increases.

Using the information in Table 2, we can estimate the elapsed time for the application if the configuration changes. For example, to estimate the time for the 6-2 combination (six SH processors and two CSNT processors), we first estimate the expected efficiency. This is done either by using the efficiency from the 4-2 combination (77.8%) or by estimating the value with methods like curve fitting. For illustration, we use the available value; equation (6) then gives the speedup for the new configuration as 0.778 * (6 * 1 + 2 * 0.9411) = 6.1323. Using equation (3), T(A, HS) is calculated to be 5806912.415 ms (about 5806.9 s). Compared to the experimental result (5633.393 s), the percentage error is only 3.1 percent.

No. of Processors
Total   SH   CSNT     Time (sec)    Speedup     Efficiency (%)
1       1    0        35609.729     1           100
1       0    1        37839.885     0.941063    100
2       0    2        18765.547     1.897612    97.7615
2       2    0        18068.563     1.970811    98.5406
4       2    2        14112.931     2.523199    64.9953
6       4    2        7782.309      4.575728    77.7904
8       6    2        5633.393      6.321187    80.1965
PW      1    0.9411

Table 2. Performance measurement for TSP on a heterogeneous system.

4.2 Matrix Multiplication
A dense matrix multiplication (MM) algorithm [6] is used with a load balancing mechanism and synchronous point-to-point communication. The first matrix is divided into blocks of rows and the second is divided into blocks of columns. Each process gets a block of rows and a block of columns and, when its work is done, sends its results to the master process and takes new blocks to compute. Here, a matrix of 1800x1800 floating-point numbers was used, with a stripe size of 300 rows or columns. The initial results on Sandhills (SH), RCF, and CSNT are illustrated in Figure 2. The second part of the experiment was conducted using the MM on a heterogeneous cluster consisting of processors from RCF, SH, and CSNT (see Table 3).

[Figure 2: Speedup for the MM on homogeneous clusters. Panel title: MM on Sandhills (SH), RCF, and CSNT. Series: SH, CSNT, RCF, IDEAL. x-axis: number of processors (1 to 12); y-axis: speedup (0 to 12).]

The results reported were derived from the time measurements of the application (see Figure 3) using the model introduced in Section 3. The power weights (PW) of the machines show that SH has the best response time and is thus taken as the reference for the PW calculations using equation (2) for RCF and CSNT. In addition, the speedup was calculated using equation (3); however, the results make it evident that speedup alone does not provide an accurate representation of the heterogeneity of the system.

The efficiency, on the other hand, was calculated using equation (7); it takes into account the differences in machine performance and gives a better representation of the overall performance. Compared with the experiment on SH alone, the efficiency results show how the different speeds of the machines in the heterogeneous cluster affected the outcome. For example, at six processors on SH we get 86% efficiency, while using three processors from SH and three from CSNT, which has a similar speed to SH, we achieved 81%. However, at 12 processors on SH we achieved 74% efficiency as opposed to 63% using six processors from SH, three from CSNT, and three from RCF, which is much slower than the other two machines. In addition, if we compare the efficiency calculated here with the efficiency calculated by dividing the speedup by the total number of processors used (regardless of their contribution), we find that the latter becomes lower for systems containing more slow machines than fast ones. For example, for the same configuration mentioned above, the direct efficiency (speedup / number of processors) is 51%, which is lower than the one calculated by equation (7). Moreover, if we consider a configuration with machines of similar power weights, the efficiency is similar in both cases. As shown in this experiment, the measurement metrics provided by the model for heterogeneous parallel applications resulted in a better understanding and a more accurate representation of the results.

     No. of processors       Elapsed      Relative    Efficiency
  Total   SH   RCF   CSNT   time (sec)    speedup     (%)
      1    1     0      0     280.83      0.94429     1
      1    0     1      0     807.65      0.32835     1
      1    0     0      1     265.19      1           1
      3    3     0      0     100.93      2.62734     92.7448
      3    0     3      0     416.49      0.63673     64.6401
      3    0     0      3     100.24      2.64566     88.1888
      6    3     0      3     55.708      4.76032     81.612
     12    6     3      3     42.992      6.16831     63.9152
     16   13     0      3     34.591      7.66639     50.1866
     PW 0.94  0.33      1

Table 3. Performance measurement for MM on a heterogeneous system.

[Figure 3. Elapsed time for MM on homogeneous clusters and heterogeneous systems. Panel title: MM on SH, CSNT, RCF and All. Series: SH, CSNT, ALL, RCF. x-axis: number of processors (0 to 16); y-axis: time in milliseconds (0 to 500000).]

4.3 Discussion

As shown by the experiments above, the proposed model gives a better representation and understanding of heterogeneous parallel applications.
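As an illustration, the comparison between the two efficiency measures can be reproduced from the timings in Table 3. The Python sketch below is not the implementation used in the experiments. It assumes that equation (7) amounts to dividing the relative speedup by the sum of the power weights of all processors used, a form inferred here only because it reproduces the efficiency column of Table 3. It also assumes that the reference machine is the one listed with PW = 1 in Table 3 (CSNT). All function and variable names are illustrative.

```python
# Single-processor elapsed times (seconds), taken from Table 3.
SINGLE_TIME = {"SH": 280.83, "RCF": 807.65, "CSNT": 265.19}

# Assumption: the reference machine is the one with PW = 1 in Table 3.
REFERENCE = "CSNT"


def power_weight(machine):
    """Power weight relative to the reference machine: T_ref / T_machine."""
    return SINGLE_TIME[REFERENCE] / SINGLE_TIME[machine]


def relative_speedup(elapsed):
    """Speedup relative to one processor of the reference machine."""
    return SINGLE_TIME[REFERENCE] / elapsed


def weighted_efficiency(config, elapsed):
    """Speedup over the summed power weights (assumed form of equation (7))."""
    total_pw = sum(count * power_weight(m) for m, count in config.items())
    return relative_speedup(elapsed) / total_pw


def direct_efficiency(config, elapsed):
    """Speedup over the processor count, ignoring machine speed."""
    return relative_speedup(elapsed) / sum(config.values())


# The 12-processor heterogeneous configuration from Table 3.
mixed = {"SH": 6, "RCF": 3, "CSNT": 3}
elapsed = 42.992

print(f"weighted efficiency: {weighted_efficiency(mixed, elapsed):.1%}")
print(f"direct efficiency:   {direct_efficiency(mixed, elapsed):.1%}")
```

For the 12-processor configuration this gives roughly 63.9% for the weighted efficiency and 51.4% for the direct one, in line with the 63% and 51% quoted in the text. The gap comes entirely from the three slow RCF processors, which count as full processors in the direct measure but contribute only about a third of a reference processor each in the weighted one.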

The efficiency and heterogeneity of the system can be calculated in a way that reflects the true configuration of the system. In addition, the experiments show how the model fits the experimentally measured values. Finally, the model can also be used to estimate the possible elapsed time for an application with different data sets or different system configurations. This information would be essential in helping decide which machines, and how many, need to be used to achieve the desired levels of efficiency and response time. Moreover, the existence of the agent-based parallel Java enabled these experiments and provided a valuable opportunity to validate the proposed model.

5. Conclusion

The model introduced in this paper extends some available metrics to evaluate the performance of heterogeneous parallel applications. The extensions mainly accommodate the varying platforms and operating environments used and the possibility of having multiple tasks of the parallel application on each machine. Using the enhanced metrics, it is possible to calculate the speedup and efficiency of the application and further estimate the performance for a different problem size or different heterogeneous system configurations. Generally, this model is essential because current technology allows for the simultaneous utilization of heterogeneous systems to execute parallel applications. In this paper, we discussed an example of such a system, the agent-based infrastructure and JOPI, which were used to develop the experiments and evaluate the introduced model. The results show that heterogeneous systems can provide ample computing power without the need to upgrade or change the available systems. In addition, the performance model provides a means of evaluating these applications and representing the results in a way representative of the heterogeneous nature of the environment. However, some automated measurement and profiling tools may be helpful in determining performance values. This is left for future investigation.

6. Acknowledgement

This project was partially supported by a National Science Foundation grant (EPS-0091900) and a Nebraska University Foundation grant (26-0511-0019), for which we are grateful. We would also like to thank other members of the secure distributed information (SDI) group [12] and the research computing facility (RCF) [11] at the University of Nebraska-Lincoln for their continuous help and support.

References

[1] Al-Jaroodi, J., Mohamed, N., Jiang, H. and Swanson, D., A Comparative Study of Parallel and Distributed Java Projects for Heterogeneous Systems, in Workshop on Java for Parallel and Distributed Computing at IPDPS 2002, Ft Lauderdale, Florida, 2002.
[2] Al-Jaroodi, J., Mohamed, N., Jiang, H. and Swanson, D., An Agent-Based Infrastructure for Parallel Java on Heterogeneous Clusters, in proceedings of IEEE International Conference on Cluster Computing (CLUSTER'02), Chicago, Illinois, September 23-26, 2002.
[3] Aly, A. A., Elmaghraby, A. S. and Kamel, K. A., Parallel Programming on top of Networks of Heterogeneous Workstations (NHW), Proceedings of The International Society for Computers and Their Applications, International Conference on Computer Applications in Industry and Engineering (CAINE-98), Las Vegas, Nevada, W. Perrizo (editor), pp. 115-118.
[4] Crowl, L., How to Measure, Present, and Compare Parallel Performance, IEEE Parallel and Distributed Technology, Spring 1994, pp. 9-25.
[5] Donaldson, V., Berman, F. and Paturi, R., Program Speedup in Heterogeneous Computing Network, Journal of Parallel and Distributed Computing 21, pp. 316-322, 1994.
[6] Gunnels, J., Lin, C., Morrow, G. and Geijn, R., Analysis of a Class of Parallel Matrix Multiplication Algorithms, Technical paper, 1998, http://www.cs.utexas.edu/users/plapack/papers/ipps98/ipps98.html
[7] The Jini Community, http://www.jini.org/
[8] Karp, R. and Zhang, Y., Randomized Parallel Algorithms for Backtrack Search and Branch-and-Bound Computation, Journal of the ACM, V. 40, 3, July 1993.
[9] Mohamed, N., Al-Jaroodi, J., Jiang, H. and Swanson, D., JOPI: A Java Object-Passing Interface, in proceedings of the Joint ACM Java Grande-ISCOPE (International Symposium on Computing in Object-Oriented Parallel Environments) Conference (JGI2002), Seattle, Washington, November 2002.
[10] Post, E. and Goosen, H., Evaluating the Parallel Performance of a Heterogeneous System, in proceedings of HPC Asia 2001, Australia, September 2001.
[11] Research Computing Facility at UNL home page, http://rcf.unl.edu
[12] Secure Distributed Information Group at UNL, http://rcf.unl.edu/~sdi
[13] Sun Java web page, http://java.sun.com
[14] Xiao, L., Zhang, X. and Qu, Y., Effective Load Sharing on Heterogeneous Networks of Workstations, in proceedings of IPDPS 2000, Mexico, May 2000.
[15] Zhang, X. and Yan, Y., Modeling and Characterizing Parallel Computing Performance on Heterogeneous Networks of Workstations, in proceedings of the 7th IEEE Symposium on Parallel and Distributed Processing (SPDPS'95), 1995.

Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS'03)