Load Balance Strategies for DEVS Approximated Parallel and Distributed Discrete-Event Simulations

Alonso Inostrosa-Psijas, Roberto Solar, Verónica Gil-Costa and Mauricio Marín
Universidad de Santiago, Chile
Yahoo! Research Latin America, Chile
Universidad Nacional de San Luis, Argentina
Email: gvcosta@unsl.edu.ar

Abstract: DEVS is a formalism for the modeling and analysis of discrete event systems. PDEVS is an extension of DEVS that supports Parallel Discrete Event Simulation (PDES). DEVS has been widely used to model the properties of large scale systems by means of hierarchical and modular model composition. PCD++ is a simulation platform that supports parallel simulation of DEVS models. In PCD++, a simulation can be performed using either optimistic or conservative schemes for the synchronization of causality-related events across processors. However, the allocation of model components to processors is not an automatic process; it can be a time-consuming task that requires knowledge of the communication patterns among model components. In this paper, we propose and evaluate different allocation strategies devised to improve the load balance of parallel DEVS simulations. The experimentation is performed on a Web search engine application whose workload features dynamic and unpredictable user query bursts and high message traffic among processors.

Keywords: Parallel Discrete Event Simulation; PDES; DEVS

I. INTRODUCTION

Parallel discrete event simulation (PDES) has been widely used in different application areas as an efficient tool to study the performance of large scale systems. A PDES program consists of a collection of logical processes (LPs), each simulating a different portion of the model. LPs communicate with each other by exchanging timestamped event messages. A major difficulty of PDES is to efficiently process all events in parallel in global timestamp order. In the literature, there are two major strategies for dealing with causality-related event issues: conservative and optimistic [1]. In conservative algorithms, a process is blocked until all execution conditions are satisfied. In optimistic algorithms, a process continues even if some execution condition is not fulfilled, but mechanisms are included to recover from causality errors. In [2], the performance of optimistic protocols is increased by removing the mechanisms that recover from causality errors, leading to approximate simulations.

DEVS is a simulation formalism for modeling and simulating discrete event systems [3], [4]. We work with a PDES platform called Parallel CD++ (PCD++) [5] that supports parallel simulation of DEVS models. In this paper we propose and evaluate different load balancing strategies devised to automatically allocate the elements of a DEVS model to a distributed set of processors.

The remainder of this paper is organized as follows. Section II presents the background. In Section III we present the WSE application modeled with PCD++. In Section IV we describe the allocation strategies evaluated in this work. In Section V we present our experimental results, and conclusions are discussed in Section VI.

II. BACKGROUND

A. Parallel discrete event simulation

Parallel discrete event simulation (PDES) consists of executing a single simulation program on a parallel computer. The simulation program is decomposed into a set of concurrent processes called logical processes (LPs) that may be executed independently on different processors. Each LP holds a separate set of the state variables of the simulation program.
The communication between LPs is performed by exchanging timestamped messages or events. In PDES, synchronization protocols are used to prevent the parallel simulation from violating the local causality constraint (events must be executed in timestamp order). Synchronization protocols can be classified as conservative or optimistic [1]. In conservative algorithms, a process is blocked until all execution conditions are satisfied. In optimistic algorithms, a process continues even if some execution conditions are not fulfilled, but mechanisms are provided to recover from causality errors. A new approach for approximate parallel simulations is presented in [2]. It relaxes the causality constraints of conservative protocols, permitting out-of-order execution of straggler events and leading to faster execution of simulations.

B. Model Partitioning

Model partitioning policies are used to decompose the simulation model into a set of disjoint sub-models in order to exploit the inherent parallelism of the model. Partitioning a set of entities into P processors may be modeled as a graph partitioning problem. The interaction between entities (message passing) is modeled as a communication graph whose vertices represent entities and whose weighted arcs represent communication costs among entities.

The main goal of model partitioning by means of graph partitioning approaches is to reduce computing and communication times. Finding the optimal graph partition is a computationally expensive problem (NP-hard); instead, many graph partitioning heuristics are used to obtain a good approximation of the optimal solution in a reasonable execution time. An overview of graph partitioning algorithms used for scientific simulations on high-performance parallel computers is presented in [6].
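The paper partitions such communication graphs with METIS (Section IV). As a rough illustration of the underlying idea only, the following Python sketch builds a weighted communication graph from a hypothetical message log and assigns entities to partitions with a simple greedy heuristic. Names such as message_log and greedy_partition are illustrative assumptions; this is not the METIS algorithm nor code from PCD++.

```python
from collections import defaultdict

def build_communication_graph(message_log):
    """Accumulate message counts into an undirected weighted graph.

    message_log is assumed to be an iterable of (src_entity, dst_entity)
    pairs, one per simulated message (hypothetical input format)."""
    weights = defaultdict(int)
    for src, dst in message_log:
        edge = tuple(sorted((src, dst)))
        weights[edge] += 1
    return weights  # {(u, v): message_count}

def greedy_partition(weights, num_parts):
    """Toy heuristic: place the entity with the heaviest remaining traffic
    into the partition where it communicates the most, breaking ties by the
    currently smallest partition. Illustrates the trade-off between cut
    weight and balance; real partitioners (e.g. METIS) do much better."""
    traffic = defaultdict(int)          # total traffic per entity
    for (u, v), w in weights.items():
        traffic[u] += w
        traffic[v] += w
    parts = [set() for _ in range(num_parts)]
    assignment = {}
    for entity in sorted(traffic, key=traffic.get, reverse=True):
        # Affinity of this entity towards each partition built so far.
        affinity = [0] * num_parts
        for (u, v), w in weights.items():
            other = v if u == entity else u if v == entity else None
            if other in assignment:
                affinity[assignment[other]] += w
        best = max(range(num_parts),
                   key=lambda p: (affinity[p], -len(parts[p])))
        parts[best].add(entity)
        assignment[entity] = best
    return assignment  # entity -> partition index
```

The sketch makes explicit the two competing goals discussed above: grouping heavily communicating entities (high affinity) while keeping partition sizes similar (the tie-breaking term).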

C. Service Based Web Search Engines

Modern WSEs are composed of services deployed on computer clusters. Typical services are: the Front Service (FS), which handles submitted user queries by routing them to the appropriate service and also manages the delivery of final results to the user; the Cache Service (CS), which implements a partitioned distributed cache storing previously computed results for the most popular queries; and the Index Service (IS), which holds an index of the document collection and is responsible for delivering partial results to the Front Service. These services are deployed on clusters of computers whose processing nodes are placed in racks connected via network switches. To reduce query response times and increase query throughput, most services are implemented as arrays of P x R processing nodes, where P is the level of partitioning and R is the level of replication of each service partition, as seen in Figure 1. Processing nodes of the FS can communicate with every processing node of the other services. Thus, the communication pattern tends to be uniform, which makes this a difficult application to partition.

[Figure 1. Query processing]

D. DEVS

The Discrete-Event System Specification (DEVS) [3] formalism provides the means for describing discrete-event systems. DEVS provides two different types of elements to model a real system: Atomic and Coupled. Atomic models represent the behavior of a system, whereas coupled models represent its structure. Coupled models are composed of two or more atomic or coupled models [3]. DEVS allows the reuse of existing models. There are several implementations of DEVS; one of them is Parallel CD++ (PCD++), with versions that use conservative and optimistic protocols [5], [7], [8].

III. MODELING WSE WITH PCD++

Coupled models were not considered in the design of our simulation model, since its atomic elements are highly interconnected and their use would increase communication levels (especially if atomic models belonging to the same coupled model were assigned to different LPs).

Our DEVS model of the WSE is shown in Figure 2. Shaded areas are only intended to help the reader identify the different services of the model. The Query Generator generates and delivers user queries through its output ports (out_1, ..., out_n) to the FS replicas, selecting them in a round-robin fashion. Query inter-arrival time is simulated using an exponential distribution. Once an FS replica receives a new query (on its In input port), it immediately sends it to a single replica of the CS through one of its outcs_i_j output ports. The FS replica chooses the i-th partition by computing a hash function on the query terms, and the j-th replica within the selected partition is chosen in a round-robin fashion. Then, the CS replica responds to the FS replica that sent the query with a hit/miss answer.
In case of a hit, the query now contains the results for that search, the FS replica responds to the user with the query results, and the processing of that query is finished. However, if the answer was a miss, the FS replica needs to obtain the results from the IS. The query is sent to the P IS partitions (the replicas within each partition are also selected in a round-robin fashion) through its outis_i_j output ports. Once the FS replica has received the partial results from all the selected IS replicas, it performs a merge operation producing the top-k final document results.
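As a concrete reading of this query flow, the sketch below shows how an FS replica could select the CS partition by hashing the query terms, pick replicas in round-robin order, broadcast a cache miss to all IS partitions, and merge the partial results into the top-k documents. The class and parameter names are hypothetical, the hash function is a stand-in (the model does not specify one for this step), partial results are assumed to be (doc_id, score) pairs, and cache_lookup/index_lookup are caller-supplied callables; this is a minimal illustration of the model's logic, not code from PCD++.

```python
import hashlib
import itertools

class FrontServiceReplica:
    """Simplified sketch of the query routing described above (hypothetical names)."""

    def __init__(self, cache_partitions, cache_replicas,
                 index_partitions, index_replicas, top_k=10):
        self.cache_partitions = cache_partitions
        self.cache_replicas = cache_replicas
        self.index_partitions = index_partitions
        self.index_replicas = index_replicas
        self.top_k = top_k
        # One round-robin counter per CS partition and per IS partition.
        self._cs_rr = [itertools.count() for _ in range(cache_partitions)]
        self._is_rr = [itertools.count() for _ in range(index_partitions)]

    def route(self, query_terms, cache_lookup, index_lookup):
        # CS partition chosen by hashing the query terms, replica by round-robin.
        key = " ".join(sorted(query_terms)).encode()
        cs_part = int.from_bytes(hashlib.sha1(key).digest()[:4], "big") % self.cache_partitions
        cs_rep = next(self._cs_rr[cs_part]) % self.cache_replicas
        hit, cached = cache_lookup(cs_part, cs_rep, query_terms)
        if hit:
            return cached  # hit: answer the user directly with the cached results
        # Miss: broadcast to every IS partition, one replica per partition.
        partials = []
        for is_part in range(self.index_partitions):
            is_rep = next(self._is_rr[is_part]) % self.index_replicas
            partials.extend(index_lookup(is_part, is_rep, query_terms))
        # Merge the partial (doc_id, score) results and keep the top-k documents.
        return sorted(partials, key=lambda doc: doc[1], reverse=True)[:self.top_k]
```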

[Figure 2. DEVS model of a WSE]

IV. PARTITIONING STRATEGIES

The main objective of the partitioning strategies is to achieve a good workload balance, since its consequences are a reduction in execution time, an increase in speedup, and a reduction in network communication (which translates into a lower probability of straggler events). These goals are achieved by allocating to the same LP the simulated entities that communicate the most with each other, and by placing similar numbers of simulated entities at each LP. Each LP is exclusively hosted on a physical processor. We use the following strategies for allocating entities to LPs:

User Defined: This strategy requires knowledge about the application and its communication patterns in order to allocate entities evenly among the different LPs.

RB: A communication graph is built based on the number of queries exchanged by each pair of simulated entities. This graph is partitioned with the METIS software (http://glaros.dtc.umn.edu/gkhome/views/metis) by means of a multilevel recursive bisection approach. As a result, each partition groups the simulated entities that communicate the most.

Hash-Based: The Fowler-Noll-Vo (FNV) non-cryptographic hash function is used. FNV hashes are designed to be fast while maintaining a low collision rate.

RR: Entities are assigned in a round-robin fashion.

RND: Entities are assigned in a random fashion.

SP: We apply a spectral clustering algorithm [9] in order to obtain k partition groups.

In PCD++, the allocation of atomic models to the different LPs is defined by the user in an offline fashion, by explicitly specifying the processor location of each model component. Entity migration among LPs is not allowed, and PCD++ lacks dynamic load balancing capabilities. Under these constraints, the study of different strategies for partitioning the simulation model becomes relevant.
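For the simpler strategies (RR, RND and Hash-Based), a short sketch makes the assignment rules explicit. The FNV-1a constants below are the standard published 32-bit values; everything else (entity names, the 8-LP example) is an illustrative assumption about how entities could be mapped to LPs, not how PCD++ encodes the model. The entity counts in the usage example mirror the <32,32,8,16,15> configuration used in Section V.

```python
import random

FNV1A32_OFFSET = 0x811C9DC5
FNV1A32_PRIME = 0x01000193

def fnv1a32(data: bytes) -> int:
    """Standard 32-bit FNV-1a hash: xor the byte in, then multiply by the prime."""
    h = FNV1A32_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV1A32_PRIME) & 0xFFFFFFFF
    return h

def assign_round_robin(entities, num_lps):
    return {e: i % num_lps for i, e in enumerate(entities)}

def assign_random(entities, num_lps, seed=0):
    rng = random.Random(seed)
    return {e: rng.randrange(num_lps) for e in entities}

def assign_hash_based(entities, num_lps):
    # Each entity is identified by its name; the FNV-1a value modulo the
    # number of LPs decides its placement.
    return {e: fnv1a32(str(e).encode()) % num_lps for e in entities}

# Hypothetical usage: 32 FS + 32x8 CS + 16x15 IS entities placed onto 8 LPs.
if __name__ == "__main__":
    entities = [f"fs{i}" for i in range(32)] \
        + [f"cs{p}_{r}" for p in range(32) for r in range(8)] \
        + [f"is{p}_{r}" for p in range(16) for r in range(15)]
    placement = assign_hash_based(entities, num_lps=8)
    sizes = [list(placement.values()).count(lp) for lp in range(8)]
    print(sizes)  # per-LP entity counts; hashing keeps them roughly balanced
```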

V. EVALUATION

A. Experimental Setting

The partitioning strategies were evaluated by simulating a WSE DEVS model (as in Figure 2) in order to obtain metrics related to the accuracy of the model simulation and to the performance of the parallel simulator itself. A services configuration is a simulation parameter indicating the levels of partitioning and replication of each service of the WSE model (e.g., <3,4,5,6,7> specifies a WSE with 3 FS nodes, 4 CS partitions with 5 replicas per partition, and 6 IS partitions with 7 replicas each). The model used in our experiments was specified with a <32,32,8,16,15> configuration, simulating 528 physical processors.

Our simulation model supports different query arrival rates because, in practice, user queries do not arrive at evenly spaced time intervals. Different query arrival rates were used in order to avoid saturating the simulated services, that is, keeping workloads below 80%. It is important to simulate unsaturated services because the metrics studied in this work cannot be well estimated when services run under overloaded conditions: results become unpredictable and do not follow a stable behavior. To correctly simulate the behavior of a WSE and its relevant costs, the simulator uses actual query logs, document posting lists and ranking times. In particular, we use a log of 36,389,567 queries submitted to the AOL Search service between March 1 and May 31, 2006, pre-processed according to the rules described in [10]. The costs of the most significant operations of a WSE were measured on production hardware [11] and then used by the DEVS model to properly simulate query processing.

B. Validation

Experiments were executed on the Deepthought cluster at the ARS Laboratory at Carleton University, using up to sixteen HP ProLiant DL servers with dual 3.2 GHz Intel Xeon processors and 2 GB of RAM. In the following experiments we use the PCD++ parallel DEVS simulator implementing an approximated optimistic protocol [2] on top of the MPI message passing library.

C. Simulation Accuracy Evaluation

Throughput (processed queries per second) results show that the strategies behave well with models of different sizes. According to the results in Table I, the RB strategy provides the lowest error levels as the number of processors involved in the parallel simulation increases. However, the throughput error of the remaining strategies rises to a maximum of only about 0.21%.

Table I. THROUGHPUT PERCENTAGE ERROR

Strategy      | 2 Nodes | 4 Nodes | 8 Nodes | 16 Nodes
User Defined  | 0.102%  | 0.170%  | 0.186%  | 0.186%
RB            | 0.105%  | 0.134%  | 0.193%  | 0.182%
FNV1A32       | 0.107%  | 0.175%  | 0.193%  | 0.195%
RR            | 0.118%  | 0.166%  | 0.177%  | 0.211%
RND           | 0.106%  | 0.163%  | 0.208%  | 0.189%
SP            | 0.107%  | 0.172%  | 0.192%  | 0.212%

As more processors are involved in the parallel simulation, the probability of straggler events should increase. Nevertheless, this is not always the case with the RB and RND strategies (as shown in Table I), since they tend to group the simulated services with the highest communication interaction into the same partition, reducing overall network communication.

D. Simulator Performance Evaluation

Figure 3 shows the speedup and workload efficiency of our simulation model. All strategies show similar results except RB, mainly because it produces partitions of different sizes. As expected, Figure 3(a) shows that the User Defined and RR strategies obtain the highest speedup levels, since these strategies distribute entities uniformly among LPs. The Hash-Based and SP strategies also present good performance results. Workload efficiency is calculated as

    W_efficiency = (1/P) * sum_{i=1..P} ( w_i / max_j(w_j) )

where w_i is the number of events processed by the LP running on processor i, and P is the number of processors involved in the parallel simulation. A workload efficiency close to 1 means that all LPs tend to do the same amount of work. Figure 3(b) shows that the workload efficiency decreases as more processors are involved in the parallel simulation. RB presents the lowest workload efficiency, since this strategy does not guarantee a uniform assignment of entities to LPs; therefore, some LPs process more events than others, producing a high variance among the workloads of the physical processors. User Defined and RR practically overlap each other and are the strategies with the best workload efficiency, since they distribute the entities among the parallel processors more uniformly than the rest.

Figure 3 also shows that, even though RB has the poorest load balance and speedup levels as more processors are involved in the simulation, it presents the lowest straggler event ratio among the studied strategies (as shown in Figure 4), since it effectively allocates (no matter how unbalanced) the simulated processing services that communicate the most to the same LP, reducing overall communication costs.

[Figure 3. Speedup vs. workload balance: (a) Speedup, (b) Workload efficiency]

[Figure 4. Straggler events ratio]
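The workload efficiency metric can be computed directly from the per-LP event counts. The short sketch below implements the formula above; the event counts in the example are invented for illustration and are not taken from the experiments.

```python
def workload_efficiency(events_per_lp):
    """Workload efficiency as defined above: the average over all LPs of
    w_i / max_j(w_j), where w_i is the number of events processed by LP i.
    Returns 1.0 when every LP processed the same number of events."""
    P = len(events_per_lp)
    w_max = max(events_per_lp)
    return sum(w / w_max for w in events_per_lp) / P

# Hypothetical event counts for 4 LPs (illustrative numbers only).
print(workload_efficiency([1000, 1000, 1000, 1000]))  # 1.0, perfectly balanced
print(workload_efficiency([1600, 900, 800, 700]))     # 0.625, unbalanced
```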

VI. CONCLUSIONS

In this work we have proposed and evaluated different static partitioning strategies devised to automatically allocate DEVS model components to different processors in the PCD++ parallel simulation framework. Currently, this task is performed manually by the user, who writes a script specifying on which processor to place each model component. Clearly, this manual approach does not scale to large models composed of thousands of components, such as Web search engine performance evaluation models, and even for small models it requires the user to get involved in the details of the parallelization. We have observed that the automatic allocation strategies proposed in this paper ease this difficulty and produce fairly well balanced workloads compared with manual allocations.

The experimental results were obtained using a WSE model. They show that efficient performance is achieved by most of the evaluated strategies. Regarding performance, the RR and User Defined strategies are the best choices for implementing an automatic allocation strategy in PCD++ for DEVS. In particular, we performed our experimentation using the approximate optimistic protocol, since its implementation is very simple. Nevertheless, the proposed allocation strategies are also directly applicable to the other conservative and optimistic protocols in PCD++, as all of them are built on the concept of LPs executing concurrently on a set of distributed cluster processors, and they all present the same requirements in terms of the convenience of performing load-balanced parallel simulation. We have also observed that the best allocation strategies tend to improve the accuracy of simulation results under the approximate optimistic synchronization protocol.

REFERENCES

[1] R. M. Fujimoto, Parallel and Distributed Simulation Systems, 1st ed. New York, NY, USA: John Wiley & Sons, Inc., 1999.
[2] M. Marin, V. Gil-Costa, C. Bonacic, and R. Solar, "Approximate parallel simulation of web search engines," in SIGSIM-PADS '13, 2013, pp. 189-200.
[3] B. P. Zeigler, T. G. Kim, and H. Praehofer, Theory of Modeling and Simulation, 2nd ed. Orlando, FL, USA: Academic Press, Inc., 2000.
[4] G. Wainer, Discrete-Event Modeling and Simulation: A Practitioner's Approach. CRC Press, Taylor and Francis, 2009.
[5] Q. Liu and G. Wainer, "Parallel environment for DEVS and Cell-DEVS models," SIMULATION, vol. 83, no. 6, pp. 449-471, 2007.
[6] K. Schloegel, G. Karypis, and V. Kumar, "Graph partitioning for high-performance scientific simulations," in Sourcebook of Parallel Computing, 2003, pp. 491-541.
[7] Q. Liu, "Distributed optimistic simulation of DEVS and Cell-DEVS models with PCD++," Master's thesis, Carleton University, 2006.
[8] S. J., "New algorithms for the parallel CD++ simulation environment," Master's thesis, Carleton University, 2007.
[9] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Advances in Neural Information Processing Systems. MIT Press, 2001, pp. 849-856.
[10] Q. Gan and T. Suel, "Improved techniques for result caching in web search engines," in WWW, 2009, pp. 431-440.
[11] V. G. Costa, M. Marín, A. Inostrosa-Psijas, J. Lobos, and C. Bonacic, "Modelling search engines performance using coloured petri nets," Fundam. Inform., vol. 131, no. 1, pp. 139-166, 2014.