Dynamic Load Balance for Approximate Parallel Simulations with Consistent Hashing

Roberto Solar
Yahoo! Labs Santiago, Chile
rsolar@yahoo-inc.com

Veronica Gil-Costa
Universidad Nacional de San Luis, Argentina
gvcosta@unsl.edu.ar

Mauricio Marin
CEBIB, Universidad de Santiago, Chile
mauricio.marin@usach.cl

ABSTRACT

Parallel simulation is a powerful tool to evaluate the performance of large-scale systems. However, when it comes to simulating large-scale Web search engines, the parallel simulation execution can introduce imbalance among processors because event occurrence is driven by user behavior, which is unpredictable, making events take place in different parts of the system in an irregular manner. In this paper, we study the impact of load balance strategies on the performance of a parallel simulation strategy. In particular, we present a consistent hashing load balance algorithm aimed at reducing queue waiting times and evenly distributing the cost of executing events among processors; more importantly, migration of logical processes only occurs between neighboring processors. We use a Web search engine composed of services deployed on a cluster of processors as the application case study. Our simulations are driven by actual query log traces. We evaluate our proposed load balance algorithm in terms of running time, communication, memory consumption and accuracy. Results show that the proposed load balance strategy is capable of reducing the execution times of simulations while maintaining the quality of results.

Author Keywords
Load balance, scheduling, approximate parallel simulation.

ACM Classification Keywords
I.6.8 SIMULATION AND MODELING: Parallel.

1. INTRODUCTION

Performance evaluation of new algorithms designed for large-scale Web search engines (WSE) is a difficult task. In many cases, their assessment on the actual system running in production is not practical due to its possible effects on the stable operation of the search engine.
On the other hand, reproducing the search engine computations on a replicated infrastructure can have significant costs in large search engines. Typically, a large search engine is composed of a collection of services, which are software components executing operations such as (a) calculation of the top-k documents that best match a user query; (b) routing queries to the appropriate services and blending of results coming from them; (c) construction of the result Web page for queries; (d) advertising related to query terms; (e) query suggestions, among many other operations.

Parallel discrete-event simulation provides a practical alternative to evaluate new algorithms without interfering with the system currently in production. It has been widely used in the past years to evaluate the performance of large-scale systems [4, 4]. In fact, there are various proposals to make parallel simulation more efficient [2, 22, 23]. During the last few years, discrete-event simulation has been used as a supporting tool for designing, implementing and evaluating large-scale Web search engines [3, 5, 7, 8]. However, these simulation models usually assume that computations on the distributed search engine services are fairly well balanced across cluster processors, an assumption that can be unrealistic in scenarios where user query traffic dynamically changes in intensity and content.

In [6], a new parallel discrete-event simulation approach for performance evaluation of WSEs, called approximate parallel simulation, is introduced.
It is a key-based approach designed to simulate the life cycle of queries in large-scale Web search engines. A key is used to identify a particular event and its associated processing code. Keys are evenly distributed among the processors by means of a hash function. Although this approach accelerates the execution of simulations, at the cost of a small loss of precision in the final results, it can be prone to load imbalance because the performance of WSE simulations can be affected by unpredictable user query behavior, which tends to overload a subset of keys.

In this work, we propose to tackle the imbalance problem by introducing a consistent hashing load balance algorithm into the approximate parallel simulator described in [6]. The load balance algorithm works at the communication layer of the simulation framework. Our strategy introduces some overhead during the execution of the simulations, since it is necessary to periodically collect data in order to decide whether the load balance algorithm has to be executed, which in turn can cause the migration of logical processes. Nevertheless, we show that in large-scale systems where millions of events are simulated, the proposed load balance algorithm improves overall performance when keys become unevenly distributed among processors.

There are various works studying the load balance problem in parallel simulations [, 9, 2]. However, we have found that our consistent hashing based approach is a suitable strategy to follow in the parallel approximate simulations of [6], as it enables a natural mapping between the key values and the consistent hash ring that defines the processors in charge of serving dynamically changing ranges of key values. To the best of our knowledge, the consistent hashing scheme has not been previously employed to balance the load of parallel discrete-event simulations. Our experimental results show that our strategy is capable of quickly improving the overall load balance of the parallel simulation.

The remainder of this paper is organized as follows. Related work is presented in Section 2. Section 3 describes our application domain. Section 4 presents the proposed load balance algorithm. Experimental results are presented in Section 5 and final conclusions in Section 6.

2. RELATED WORK

The authors in [] study the problem of load balance for conservative distributed simulations by using a set of benchmark network models. They propose a dynamic load balance algorithm based on the CPU-queue length, which indicates the workload at each processor. The load balance process is composed of two components: the load balancer, which is in charge of determining the re-allocation of processes, and process migration, which is in charge of executing the decision made by the load balancer. Furthermore, two load balance strategies are proposed: centralized and multi-level. Both methods work in a similar way: a load balancer periodically requests the load of each processor in order to compute the imbalance and determine the exchange of work.

The authors in [2] introduce a dynamic partitioning algorithm for optimistic distributed simulation of synthetic workload simulation models. This load balance algorithm operates on the basis of computation and communication, which are possible sources of imbalance.
Imbalances are detected by estimating three metrics: the capacity of each host, the computation load generated by each LP and the communication load between LPs. These estimations can trigger two different procedures: a load balance cycle or a communication refinement cycle. The communication refinement cycle consists of grouping the LPs that communicate the most into the same processors. The load balance cycle consists of transferring excess load from overloaded to underused processors.

In [2] the authors present a scheme to balance the communication and computational load during the execution of distributed high-level architecture (HLA) based simulations. The load balance scheme consists of a system based on a hierarchical architecture that constantly monitors resources and simulations while decreasing the monitoring overhead. Once a load imbalance is detected, the system defines a load redistribution according to partitioning policies and re-configures the load by means of a low-latency federate migration technique. A load balance cycle consists of triggering the resource monitor in order to detect imbalances. If a load repartitioning is required, load redistribution is invoked and migration moves are performed.

Two dynamic load balance algorithms are presented in [9]. They are devised to balance the computational and communication load in a Time Warp simulation of digital logic circuits. Furthermore, the authors also present a protocol which selects a load balance algorithm by using a multi-state Q-learning approach. In that paper, a depth-first search is used to initially distribute the LPs among processors. The computing balance mechanism is as follows: each processor sends computation and communication values to a master processor, and the master processor builds a bipartite graph using computing values as processors and communication values as edges. A bipartite graph matching algorithm is used to match the overloaded and under-used processors.
Overloaded processors send a set of L LPs to their corresponding under-used processor in order to balance the workload. The communication balance mechanism consists of grouping pairs of the LPs that communicate the most into the same processor.

A load balance scheme that combines both static partitioning and dynamic load balance for distributed conservative simulation is presented in [8]. A supply-chain model is used to evaluate the proposed partitioning strategies. The static partitioning scheme consists of grouping multiple simulation objects into LPs prior to the simulation. To this end, the simulation objects are mapped to the processors of the given graph, and the edges represent the interaction between simulation objects. A graph partitioning package (METIS¹ or Scotch²) is used in order to group simulation objects into LPs while minimizing load imbalances and maximizing lookaheads. The dynamic scheme consists of dynamically scheduling LPs to threads. LPs with lower simulation time have higher priority for being scheduled. There are two centralized pools: an active pool and a passive pool. Processors grab LPs from the active pool and return LPs to the passive pool. Finally, when all LPs in the active pool are consumed, the two pools are swapped.

More similar to our work, but in the context of real system implementations (no simulation is involved), the authors in [] propose a dynamic load balance algorithm upon consistent hashing in order to cope with sudden increases in query traffic and processor failures in Caching Service processors. The load balance algorithm is based on the Sender Initiated Diffusion (SID) algorithm. Each overloaded partition distributes its excess load to under-used neighboring partitions by using the consistent hashing approach. No process migration is involved in the load balance strategy.

¹ http://glaros.dtc.umn.edu/gkhome/views/metis
² http://www.labri.fr/perso/pelegrin/scotch/

3. SIMULATING WEB SEARCH ENGINES

We focus on Web search engines composed of three services deployed in clusters of computers interconnected by a high-speed network: Front-Service (FS), Caching-Service (CS) and Index-Service (IS). The Front-Service receives queries from users, routes them to the corresponding CS or IS, and is in charge of blending partial results coming from the IS to present the top-k most relevant documents to the users. The Caching-Service keeps in memory the most frequent queries along with their top-k document results. This service avoids recomputing the top-k results for popular queries. Finally, the IS computes the top-k document results for user queries by accessing an index and computing score algorithms like WAND [2].

Figure 1. Life cycle of a query on a Web search engine.

Figure 1 shows the steps executed within the life cycle of a query on a Web search engine. The Front-Service (FS) is composed of several processors, where each processor is hosted by a dedicated cluster. Each FS processor receives user queries and sends back the top-k results for each query to the users. After a query arrives at an FS processor, it selects a Caching-Service (CS) processor. These processors keep the top-k results for frequent queries previously issued by users, so response time for these queries is very small. The query-results set for the CS is kept partitioned into P disjoint subsets and each partition is replicated R times to increase query throughput. A simple LRU strategy can be used to implement the CS processors. A hash function on the query terms is used to select a CS partition, whereas the respective replica can be selected in a round-robin manner.

If the query is not cached, it is sent to the Index-Service (IS) cluster. The IS contains a distributed index built from the document collection held by the search engine (e.g., HTML documents from a big sample of the Web). This index is used to speed up the determination of which documents contain the query terms and to calculate scores for them. The k documents with the highest scores are selected as the query answer. The index and its respective document collection are partitioned into P partitions and each partition is kept replicated R times to enable high query throughput. Partitions help to reduce response time for individual queries since time is proportional to the number of documents that contain the query terms. Queries are sent to all partitions and the results are merged to obtain the global top-k results. Figure 1 also shows an overall description of the message traffic among the Web search engine services.
These messages must pass through a number of communication switches to reach their destinations. Switches are usually organized in a fat-tree topology. In addition, each processor is expected to be a multi-core processor. Thus, search engine service processors are multi-threaded components where it is necessary to take into consideration the effects of concurrency control strategies on performance.

The implementation of the parallel approximate simulator [6], and of its corresponding tool developed in [], is based on multi-threading and a bulk-synchronous message passing strategy to automatically conduct simulation time advance. The parallel execution of simulations is simplified, as no rollbacks are considered to correct erroneous computations. The simulation is designed according to the Bulk Synchronous Parallel (BSP) computing model [24]. Under the BSP model, computation is organized as a sequence of supersteps. During a superstep, processors may perform computations on local data and/or send messages to other processors. At the end of a superstep there is always a synchronization barrier, which ensures that messages sent during the current superstep are available for processing at their destinations at the next superstep. The underlying communication library ensures that all messages will be available at their destinations before starting the next superstep. In each processor there is one master thread that synchronizes with all other P master threads to execute the BSP supersteps and exchange messages. Then, in each processor and superstep, the remaining T threads synchronize with the master thread to start the next superstep, though they may immediately exchange messages during the current superstep as they share the same processor main memory.

Additionally, a key-based approach is used in [6]. Each event has a key which determines the operation to be performed (e.g., ranking, merge, etc.)
and, by means of a hash function, the simulator obtains the processor identifier and the thread responsible for simulating that operation. In this context, the threads are in charge of performing the computations of what are called logical processes (LPs) in parallel discrete-event simulation (PDES) parlance [7]. The typical content of a key is a string like class=classname; instance=id, e.g., class=cacheservice; partition=3; replica=28. The only requirement is to specify the class identification field to let the simulation framework instantiate the right object.

As is well known, processing events in parallel during periods of time where no messages from other processors (remote threads or LPs) are received can lead to the problem of missing the arrival of event messages at the right simulation time. The approach presented in [6] simply ignores such situations but proposes a strategy to significantly reduce the arrival of those straggler messages.

The approximate parallel simulations of [6] are performed in a cluster of P processors where each processor contains T threads, so that LPs are evenly distributed over the T × P threads. Each thread/LP has its own instance of a sequential simulation kernel, which includes a local event list and functions related to event processing and message handling.

4. PROPOSAL

We present a dynamic load balance strategy based on consistent hashing [3] for approximate parallel simulation of Web search engines. Keys are mapped into a ring managed as a set of buckets. Each bucket is composed of a set of equal-sized partitions. Each partition is associated with one LP. The buckets can be associated with a variable number of partitions, but the total number of partitions is kept fixed. Buckets can only exchange partitions with neighboring buckets along the ring. There is one bucket per processor.

Our load balance strategy consists of adjusting the flow of events (messages) between processors by changing the number of partitions of adjacent buckets, in order to balance the computation load while reducing LP migrations among processors. To this end, we implement a centralized scheduler in charge of collecting information about the messages passing between processors. The scheduler evaluates load imbalances (by using a threshold mechanism) and re-configures the consistent hashing ring (by moving the limits of the buckets, which requires migration of LPs between processors). The main advantage of our load balance strategy is that migrations only occur between neighboring processors of the consistent hashing ring, which keeps the amount of data transferred during the redistribution of the workload small.

Figure 2 shows an example of how the proposed algorithm works. Figure 2(a) shows the initial state of the consistent hashing ring. The ring is composed of a set of 4 buckets and each bucket has 4 partitions (evenly distributed). Table 1 shows the bucket configuration before executing the load balance algorithm. The second column shows the number of partitions assigned to each bucket and the third column shows the total workload reported by each processor. In this example, P0 has four partitions which report a workload of {4,3,3,3} respectively.
P1 has four partitions which report a workload of {1,1,1,1} respectively. P2 has four partitions with a workload of {3,4,3,4} respectively, and P3 has four partitions reporting a workload of {3,2,3,2}.

Processor   # of Partitions   Workload
0           4                 13
1           4                 4
2           4                 14
3           4                 10

Table 1. Bucket configuration before executing the load balance algorithm.

In this case, our algorithm calculates an imbalance of 27%, computed as 1 − E_f, where the efficiency E_f = 73% according to formula (1) of Section 4.1. Then, if we assume a maximum imbalance threshold of %, the load balance algorithm is triggered. The algorithm therefore attempts to balance the workload by changing the number of partitions per bucket. The number of partitions per bucket is adjusted so that the percentage of imbalance is kept below the threshold value. Figure 2(b) and Table 2 show the configuration of the consistent hashing ring after executing the load balance algorithm. Applying the efficiency formula (1) to this configuration tells us that the system is 93% balanced, i.e., there is an imbalance of 7%.

Figure 2. Overview of the consistent hashing load balance algorithm.

Processor   # of Partitions   Workload
0           3                 10
1           6                 10
2           3                 11
3           4                 10

Table 2. Bucket configuration after executing the load balance algorithm.

4.1 Load balance algorithm

We use a key-based approach, where each LP is associated with a unique key and any departing event must specify the destination LP by means of a key. We have designed a fine-grained load balance algorithm which takes advantage of this approach on the consistent hashing scheme [3]. At the end of each superstep, the communication layer is in charge of routing each event stored in the output message queue. The approach in a standard consistent hash ring consists of applying a hash function to the key of events, which is then used to determine the physical destination (processor identifier).
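Before contrasting this with the approach taken here, the figures quoted for the Figure 2 example can be checked with a short sketch. The Python code below is an illustration only, not the simulator's implementation: the names `efficiency` and `load_balance` are hypothetical, the per-partition weights are those quoted in the text (with the second bucket assumed to hold weight 1 per partition, which is consistent with the 73% figure), and the greedy pass follows the remapping loop that Algorithm 1 formalizes later in this section, assuming one table entry per partition visited in ring order.

```python
def efficiency(load):
    """Formula (1): average of load[i] / max(load) over all processors."""
    peak = max(load)
    return sum(x / peak for x in load) / len(load)


def load_balance(weights, num_procs):
    """Greedy remapping pass in the spirit of Algorithm 1.

    weights: per-entry weights of the global table, in ring order.
    Returns (owner, load): the processor assigned to each entry and the
    resulting workload per processor.
    """
    m = sum(weights) / num_procs  # average workload per processor
    load = [0] * num_procs
    owner = []
    proc = 0
    for w in weights:
        # Advance to the next processor when taking this entry would leave
        # the current one farther from the average than it already is.
        if abs(load[proc] + w - m) > abs(load[proc] - m) and proc < num_procs - 1:
            proc += 1
        load[proc] += w
        owner.append(proc)
    return owner, load


# Per-partition workloads of the Figure 2 example, four partitions per bucket.
# The second bucket's values (1 each) are an assumption, see the lead-in.
weights = [4, 3, 3, 3, 1, 1, 1, 1, 3, 4, 3, 4, 3, 2, 3, 2]

before = [sum(weights[i:i + 4]) for i in range(0, len(weights), 4)]
owner, after = load_balance(weights, 4)
partitions = [owner.count(p) for p in range(4)]

print("before:", before, round(efficiency(before), 2))
print("after: ", partitions, after, round(efficiency(after), 2))
```

Under these assumptions the sketch reports an efficiency of about 0.73 for the initial per-bucket loads [13, 4, 14, 10] and about 0.93 after the pass, with buckets of 3, 6, 3 and 4 partitions, which agrees with the 73% and 93% quoted above. Only the two partitions bordering the second bucket change owner, so migration stays between ring neighbors.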
In our approach we proceed differently. Each processor maintains a local hash table named T_l[index] = <proc, weight>, where index is the hash value of the next event key in the output message queue, weight is the number of events that will be triggered by this event when it is simulated in its destination LP, and proc is the processor holding this LP.

Figure 3. Mapping keys into the local hash table.

Keys are string values; therefore, we first apply a hash function in order to transform a string key (k_e.str) into an integer key (k_int). To this end, we use the Fowler-Noll-Vo hash function [6]: k_int = FNV1a(k_e.str), because it is fast and maintains a low collision rate. The integer key is mapped into the hash table T_l by means of a second hash function index = h(k_int) (based on the modulo function, see Figure 3).

The hash table is filled up by processing the destination string keys of departing messages (events) during N supersteps. At the end of each superstep, the communication layer iterates over the output message queue and determines the destination processor of each event by accessing the hash table T_l. If there is no entry for a given event e and a given destination string key k_e.str, we compute the destination processor of the event for the first time as T_l[index].proc = hash(k_e.str) and we set T_l[index].weight = weight(e.type). If the hash entry is not empty, the destination processor is retrieved from the hash table and the element T_l[index].weight is increased by the respective weight(e.type) units.

The value of weight(e.type) is the number of new events that must take place inside the destination LP when the event e is simulated in its LP, where each value is amplified by an estimation of the running time demanded by the internal event. For the experimentation presented in this paper, those values are listed in Table 3 as an example. The first column shows the most important event IDs simulated in our model, which are related to the operations executed by our WSE test application. The events with ID FS_NEWQ represent queries arriving from users (step 1 of Figure 1). The FS_MERGE events represent partial results of queries coming from the Index Service partitions. The Front Service waits for all P partial results before executing the merge operation (step 5 of Figure 1). Events with ID CS_HIT represent a cache hit for a given query, and the CS_NOHIT events represent a cache miss, in which case the query has to be sent to the Index Service (step 3 of Figure 1). Events with ID IS_T represent messages coming from the Front Service to the Index Service (step 4 of Figure 1). Events with ID CS_T represent messages coming from the Front Service to the Caching Service to search for the query in the cache memory (step 2 of Figure 1). Events with ID CS_U represent messages coming from the Front Service to the Caching Service to update the cache memory (step 6 of Figure 1).

The second column in Table 3 shows the number of events triggered by an event of the first column. For example, weight(FS_NEWQ) = 5 means that the event FS_NEWQ triggers 5 new events, and FS_MERGE events trigger N_IS + 4 new events, where N_IS is the number of Index Service partitions. The third column shows the average execution time spent in simulating the corresponding events. We executed a benchmark of the simulation executions in order to obtain these values.

Event: FS_NEWQ, FS_MERGE, CS_HIT, CS_NOHIT, IS_T, CS_T / CS_U
Value (# of triggered events): 5., N_IS+4., 5., 5., 5., 5.
Time: .9.37.2.2.4.5

Table 3. Benchmark of the simulated events.

At the end of N supersteps, each processor sends its local hash table T_l to the master processor (top of Figure 4). Each processor only sends information regarding non-empty entries of T_l. The master processor collects all hash entries into a global hash table T_g (middle of Figure 4). Then, the master processor determines whether there is load imbalance by applying the following efficiency formula:

$$E_f = \left( \sum_{i=0}^{P-1} \frac{load[proc_i]}{\max(load[0, \ldots, P-1])} \right) / P \qquad (1)$$

where load[proc_i] is the sum of the weights corresponding to events processed by the processor proc_i. If the efficiency value is below a certain threshold, the load balance algorithm is triggered.

Figure 4. Load balance process.

The load balance algorithm consists of moving the limits between adjacent buckets of the consistent hashing ring, in order to adjust the workload per processor. The goal of the algorithm is to achieve load[proc_i] values close to the average workload reported by all processors.

Algorithm 1 Load balance algorithm
 1: procedure LOAD_BALANCE(T_g)
 2:     m ← avg_load(T_g)
 3:     clear(load)
 4:     proc ← 0
 5:     for entry ∈ T_g do
 6:         if (|load[proc] + entry.weight − m| > |load[proc] − m| AND proc < (P − 1)) then
 7:             proc ← proc + 1
 8:         end if
 9:         load[proc] ← load[proc] + entry.weight
10:         entry.proc ← proc        ▷ Remapping.
11:     end for
12:     bcast(T_g)
13:     clear(weight(T_g))
14:     migration()
15: end procedure

Algorithm 1 shows the main steps executed by the load balance algorithm. We first compute the average workload per processor, $m = \frac{1}{P}\sum_{e \in T_g} e.weight$ (line 2). Then, the load balance algorithm iterates over the global consistent hashing table T_g (line 5) and assigns hash entries to processors, keeping their workload close to the average m. In line 6, the algorithm calculates what the workload of the current processor proc would be if it were assigned the next entry of the global table (load[proc] + entry.weight). If the difference between this new estimated workload and the average m is greater than the distance of the current workload to the average, the algorithm proceeds to the next processor identifier. In other words, no more entries are assigned to the current processor. Next, the processor workload is increased by entry.weight units (line 9) and the processor assigned to the entry is updated (line 10). Finally, the algorithm broadcasts T_g (only entry values and processor identifiers) and clears the weight values of the global consistent table T_g. At this point the migration process is executed.

5. EXPERIMENTS

In this section we present the experimental results obtained from the execution of the distributed simulator with a set of different configurations.

5.1 Simulation Model

In this work we use a simulation model that describes a Web search engine composed of the following elements:

- A set of front-end service (FS) processors.
- A set of caching service (CS) processors (partitioned and replicated).
- A set of index service (IS) processors (partitioned and replicated).
- A Query Generator in charge of injecting queries from the AOL query log into the FS by using different query arrival rates.
- A two-level caching strategy. The first level consists of a result cache which stores the most frequent queries with their top-k results. The second level consists of a location cache [5] in charge of storing the partition identifiers of the index service capable of placing documents in the top-k results for frequent queries.

5.2 Settings

Experiments were performed on a 32-core platform with 64GB Memory (6x4GB), 333MHz and a disk of 5GB, 2x AMD Opteron 628, 2.GHz, 8C, 4M L2/2M L3, 333 Mhz Maximum Memory, Power Cord, 25 volt, IRSM 273to C3. The operating system is Linux CentOS supporting 64 bits. The simulator was developed by using C++ (gcc-4.5.3), BSPonMPI (.2)³, and OpenMP.

We used a log of 36,389,567 queries submitted to the AOL Search service between March 1 and May 31, 2006. We pre-processed the query log following the rules applied in (Gan and Suel 2009) (removal of stopwords, term normalization, deletion of duplicated terms and the assumption that two queries are identical if they contain the same words, no matter the order). The resulting log had 6,9,873 queries, where 6,64,76 are unique queries and the vocabulary consists of ,69,7 distinct query terms. We also used a sample (.5TB) of the UK-Web obtained in 25 by the Yahoo! search engine, over which an inverted index with 26,, terms and 56,53,99 documents was constructed. We executed the queries against this index in order to get the empirical cost distributions for our models. The document ranking method used was WAND [2] and the cache policy was LRU.
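The query log pre-processing rules listed above can be made concrete with a small sketch. This is an assumption-laden illustration in Python, not the procedure of Gan and Suel: the stopword list is a placeholder, term normalization is reduced to lowercasing and whitespace splitting, and `canonical_query` is a hypothetical name.

```python
# Placeholder stopword list (an assumption; the list actually used is not given).
STOPWORDS = frozenset({"a", "an", "and", "for", "in", "of", "the", "to"})


def canonical_query(raw, stopwords=STOPWORDS):
    """Return an order-insensitive canonical form of a query string."""
    # Term normalization, reduced here to lowercasing and whitespace splitting.
    terms = raw.lower().split()
    # A set drops duplicated terms; the filter drops stopwords.
    kept = {t for t in terms if t not in stopwords}
    # Sorting makes two queries with the same words compare equal.
    return " ".join(sorted(kept))


queries = ["The Weather in Santiago", "santiago weather", "weather weather Santiago"]
unique = {canonical_query(q) for q in queries}
print(unique)
```

All three example strings collapse to the same canonical form, so under these rules they would count as a single unique query.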
Experiments were performed using P = {1, 2, 4, 8, 16, 32} cores. Table 4 shows the three different service configurations (namely, the number of processors per service) used for the simulation of the model described in Section 3. The column CS corresponds to the number of partitions assigned to the Caching Service, and CSr refers to the number of replicas assigned to the Caching Service. The same nomenclature is used for the Index Service. Additionally, we used three different query arrival rates: Low query traffic (the mean value of the exponential distribution is .6 5 ), Medium (the mean value of the exponential distribution is .6 6 ) and High (the mean value of the exponential distribution is .6 7 ). Results were obtained for the simulation of Q =. queries. Furthermore, the load balance algorithm uses a maximum imbalance threshold of 5% and the consistent hashing table has a fixed number of 8.92 entries.

³ http://bsponmpi.sourceforge.net

5.3 Results

Performance evaluation

configuration FS CS CSr IS ISr
small 5 2 6 4
medium 7 35 5 9 2
large 9 5 2 2 26

Table 4. Web service configuration used in the experiments.

Figure 5 shows the time required to simulate Q =. queries with a large system configuration (simulating a total of 3229 services) when query traffic is low. For the proposed load balance algorithm we set N = and perform experiments varying the number of supersteps elapsed before computing the imbalance of the simulation (N*2, N*4, N*6 and N*8). The x-axis shows the number of processors and the y-axis the running time in seconds. Results show that with few processors, the baseline approach achieves lower running time. The load balance approach fails to improve the execution time with few processors, mainly because of the computations required to determine the imbalance of the processors and also because of the communication cost of transmitting the hash tables T_l and T_g. However, the load balance approach outperforms the baseline algorithm by 75% when running the simulation with 32 processors. The number of supersteps elapsed before computing the imbalance of the simulation has a bigger impact on the performance of the load balance algorithm when using few processors (2 or 4 processors), but with a larger number of processors this impact is negligible.

Figure 5. Running times for a large WSE service configuration with low query traffic.

Figure 6 shows the results achieved by the baseline approach and our load balance algorithm when simulating a large WSE service configuration under high query traffic. With this experiment we evaluate the performance of the load balance algorithm when simulations are saturated. In other words, with this experiment we saturate the communication network and the incoming message buffers in each processor, among others. Results show that under such extreme conditions, our proposed load balance algorithm is capable of improving the performance of the baseline approach.
With P = 2 the load balance approach improves on the baseline algorithm by 22%. This improvement tends to rise up to 7% with P = 32.

Figure 6. Running times for a large WSE service configuration with high query traffic. (Axis: Running Time; series: N*2, N*4, N*6, N*8.)

In Figure 7 we compare the performance achieved by the baseline approach and the load balance algorithm with different query traffic levels and different WSE service configurations using P = 32. The x-axis shows the label CH, which corresponds to the Consistent Hashing algorithm, and the label B, which corresponds to the baseline algorithm. The y-axis shows normalized running times to better illustrate the performance differences reported by the algorithms. To this end, we divide the running time reported in each experiment by the maximum reported in that set of experiments. In other words, we group the experiments according to the query traffic (Low, Medium and High), and each running time reported by the algorithms in a particular group is divided by the maximum in that group. As expected, the large system configuration requires a larger running time to finish the processing of Q queries. Moreover, under high query traffic, the execution time required to simulate a large system configuration is 3% slower than the time required to simulate a medium service configuration. This reflects the fact that under such conditions the simulation executions are saturated, producing additional overheads and delays. However, in general, results show that in all cases the load balance approach is capable of outperforming the baseline by 7% on average.

Communication evaluation

Figure 8 and Figure 9 show the percentage of the total execution time spent on communication. In Figure 8 the experiment was performed with a large service configuration when query traffic is low. Results show that the baseline approach with P = 2 spends most of its time performing computation tasks and only 9% of the time on communication.
However, as we increase the number of processors the time spent on communication grows, reaching 94% of the total execution time. On the other hand, the load balance algorithm spends up to 29% of the total execution time on communication tasks, including sending hash tables. But with P = 32, the load balance algorithm reports lower communication costs: only 8% of the total time is spent in communication operations. That is because, as we increase the number of processors, the baseline communication can be seen as a spaghetti of messages coming and going between different processors, whereas the load balance algorithm tends to organize computation among processors.

Figure 7. Normalized running time for P = 32. (Axes: Algorithms, Normalized Running Time; series: Large, Medium, Small; groups: Low, Medium, High.)

Figure 8. Percentage of time required to perform communication operations for a large system configuration with low query traffic. (Axis: Communication percentage; series: N*2, N*4, N*6, N*8.)

In Figure 9 we set query traffic to high, and we observe results similar to those obtained when query traffic is low. The baseline approach with P = 2 spends 26% of the total time on communication operations. With P = 16 the baseline spends 8% of the time in communication operations and with P = 32 this percentage increases to 9% of the total execution time. The communication operations of the load balance algorithm require 48% of the total execution time for P = 2, and for P = 32 this percentage increases to 8% of the total execution time.

Figure 9. Percentage of time required to perform communication operations for a large system configuration with high query traffic. (Axis: Communication percentage; series: N*2, N*4, N*6, N*8.)

Memory evaluation

Regarding memory consumption, Figure 10 shows the maximum number of bytes used by both the baseline and load balance algorithms with different numbers of processors at any instant of the execution. Results show that both algorithms report similar memory usage. Thus, although the load balance algorithm uses a local and a global hash table to compute the imbalance of processors, the additional amount of memory required to run these operations is not significant.

Figure 10. Maximum memory usage. (Axis: Bytes; series: CONSISTENT HASHING, BASELINE.)
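One way to see why the local and global hash tables add little memory is that each table has a fixed number of entries, independent of how many events are simulated. The following Python sketch illustrates this under stated assumptions; it is not the implementation evaluated in this paper. The function names, the use of `lp_id % num_buckets` in place of a real hash function, and the imbalance measure (maximum load relative to mean load) are all hypothetical choices made for illustration.

```python
from typing import Dict, List


def local_table(lp_costs: Dict[int, float], num_buckets: int) -> List[float]:
    """Accumulate the cost of each LP hosted on one processor into a
    fixed-size table. The modulo is a stand-in for a real hash function
    (assumption)."""
    table = [0.0] * num_buckets
    for lp_id, cost in lp_costs.items():
        table[lp_id % num_buckets] += cost
    return table


def global_table(local_tables: List[List[float]]) -> List[float]:
    """Merge the local tables entry by entry; the result has the same
    fixed size as each local table."""
    return [sum(column) for column in zip(*local_tables)]


def imbalance(loads: List[float]) -> float:
    """Hypothetical imbalance measure: how far the most loaded processor
    is above the mean load, as a fraction of the mean."""
    mean = sum(loads) / len(loads)
    return (max(loads) - mean) / mean if mean > 0 else 0.0


if __name__ == "__main__":
    # Two processors, eight buckets; costs are made-up numbers.
    t0 = local_table({0: 4.0, 1: 2.0, 8: 1.0}, num_buckets=8)
    t1 = local_table({2: 1.0, 3: 2.0}, num_buckets=8)
    merged = global_table([t0, t1])
    print(len(t0), len(merged))            # both tables keep the fixed size
    print(imbalance([sum(t0), sum(t1)]))   # compared against the threshold
```

In this sketch the storage per processor is two tables of `num_buckets` floats, so the overhead depends only on the table size and not on the number of LPs or events.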
Number of stragglers and accuracy evaluation

Figure 11 and Figure 12 show the number of straggler events (events processed in incorrect order) with a large service configuration under low and high query traffic, respectively. In both cases (but with a higher impact when query traffic is high), the baseline approach reports fewer straggler events. When events arrive at an LP, the approximate parallel simulation algorithm checks whether these events are stragglers. If an event is considered a straggler (its execution time is lower than the clock of the simulation kernel), the clock of the simulation kernel is backed up to the time of the straggler event and the event is stored in the FEL (List of Future Events). However, these straggler events may not be executed in the current superstep. Then, when the load balance algorithm detects an imbalance and some LPs are migrated to other processors, some events stored in the FEL of the migrated LPs can be detected as stragglers, because their execution time is compared against the clock of the simulation kernel of the new processor. Thus, some events may be counted as stragglers more than once.

When the query traffic is low, the load balance approach reports on average 2% more straggler events than the baseline approach (Figure 11). This percentage increases to 3% when query traffic is high, as shown in Figure 12. Thus, simulating a saturated WSE with a high flow of queries is more likely to affect the accuracy of the approximate parallel simulations. However, as shown in Table 5, the accuracy is not drastically affected.

We evaluate the accuracy of the load balance approach by computing the relative error between the results achieved by the sequential simulation and those of its corresponding parallel approach (the baseline) and of our load balance approach. We calculated the root mean square error of the deviation, which is a measure of the differences between the values obtained by the sequential simulation and the values reported by the parallel simulations with load balance. It is defined as $\epsilon_m = \sqrt{\sum_i (x_i - \bar{x})^2 / (n(n-1))}$, and we calculated the relative error $e_r$, defined as $\epsilon_m / \bar{x}$. We used the following metrics: throughput, average query response time (QRT) and hit ratio. Results presented in Table 5 show a small relative error for all metrics. The most sensitive one is the average query response time, which presents a relative error $e_r$ =.2, small enough to indicate that the load balance approach is capable of maintaining a good estimation of the metrics obtained by the simulations.

Figure 11. Number of straggler events under low query traffic and using a large system configuration. (Axis: Stragglers; legend: C. Hashing.)

Metric Consistent Hashing
Throughput .4 .45
QRT .24 .2
Hit cache .5 .69
Table 5.
Accuracy of the simulations using the baseline and the load balance approach with a large service configuration and high query traffic.

Figure 12. Number of straggler events under high query traffic and using a large system configuration. (Axis: Stragglers; legend: C. Hashing.)

6. CONCLUSION

We proposed to use a load balance algorithm to improve the performance of approximate parallel simulations as presented in [16]. The presented algorithm aims to minimize the number of LP migrations between processors by applying a consistent hashing approach which forms a ring of buckets. Each processor holds a subset of the LPs belonging to each bucket. The migration of LPs is performed between neighboring processors of the consistent hashing ring. To compute the imbalance, our load balance algorithm takes into account the number of events processed by each LP and the time required to simulate those events. The experimental results show that our load balance algorithm is capable of quickly restoring the parallel simulation to a situation of efficient performance. The best results are observed for a large number of processors operating under saturation caused by high traffic flows of events in real-life simulation models of large Web search engines.

ACKNOWLEDGMENTS

This research was partially funded by Basal funds FB, Conicyt, Chile; and Mincyt-Conicyt CH24.

REFERENCES

1. Boukerche, A., and Das, S. K. Reducing null messages overhead through load balancing in conservative distributed simulation systems. J. Parallel Distrib. Comput. 64, 3 (Mar. 2004), 330–344.
2. Broder, A. Z., Carmel, D., Herscovici, M., Soffer, A., and Zien, J. Y. Efficient query evaluation using a two-level retrieval process. In CIKM (2003), 426–434.
3. Cacheda, F., Carneiro, V., Plachouras, V., and Ounis, I. Performance analysis of distributed information retrieval architectures using an improved network simulation model. Inf. Process. Manage. 43, 1 (2007), 204–224.
4. Calheiros, R. N., Ranjan, R., Beloglazov, A., De Rose, C. A.
F., and Buyya, R. Cloudsim: A toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms. Softw. Pract. Exper. 41, 1 (2011), 23–50.

5. Chowdhury, A., and Pass, G. Operational requirements for scalable search systems. In Proceedings of the Twelfth International Conference on Information and Knowledge Management (2003), 435–442.
6. Fowler, G., Noll, L., Vo, K., and Eastlake, D. The fnv non-cryptographic hash algorithm. Tech. rep., IETF Internet-draft (March 22), 2.
7. Fujimoto, R. Parallel discrete event simulation. Comm. ACM 33, 10 (Oct. 1990), 30–53.
8. Gan, B. P., Low, Y.-H., Jain, S., Turner, S., Cai, W., Hsu, W. J., and Huang, S. Y. Load balancing for conservative simulation on shared memory multiprocessor systems. In Parallel and Distributed Simulation, 2000. PADS 2000. Proceedings. Fourteenth Workshop on (2000), 139–146.
9. Gao, Z., Liu, D., Yang, Y., Zheng, J., and Hao, Y. A load balance algorithm based on nodes performance in hadoop cluster. In Network Operations and Management Symposium (APNOMS), 2014 16th Asia-Pacific (Sept 2014), 1–4.
10. Gil-Costa, V., Lobos, J., Solar, R., and Marín, M. Ameds-tool: an automatic tool to model and simulate large scale systems. In Proceedings of the 2014 Summer Simulation Multiconference, SummerSim 2014, Monterey, CA, USA, July 6-10, 2014 (2014), 2.
11. Gómez-Pantoja, C., Rexachs, D., Marín, M., and Luque, E. A fault-tolerant cache service for web search engines: Radic evaluation. In Proceedings of the 18th International Conference on Parallel Processing, Euro-Par'12, Springer-Verlag (Berlin, Heidelberg, 2012), 298–310.
12. Grande, R. E. D., and Boukerche, A. Dynamic balancing of communication and computation load for hla-based simulations on large-scale distributed systems. Journal of Parallel and Distributed Computing 71, 1 (2011), 40–52.
13. Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., and Lewin, D. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web. In Proceedings of the Twenty-ninth Annual ACM Symposium on Theory of Computing, STOC '97, ACM (New York, NY, USA, 1997), 654–663.
14. Kunkel, J. Using Simulation to Validate Performance of MPI(-IO) Implementations. In Supercomputing (2013), 181–195.
15. Marín, M., Ferrarotti, F., Mendoza, M., Gómez-Pantoja, C., and Gil-Costa, V. Location cache for web queries. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM 2009, Hong Kong, China, November 2-6, 2009 (2009), 1995–1998.
16. Marin, M., Gil-Costa, V., Bonacic, C., and Solar, R. Approximate parallel simulation of web search engines. In Proceedings of the 2013 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation (2013), 189–200.
17. Marin, M., Gil-Costa, V., and Gomez-Pantoja, C. New caching techniques for web search engines. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (2010), 215–226.
18. Markatos, E. On caching search engine query results. Comput. Commun. 24, 2 (2001), 137–143.
19. Meraji, S., Zhang, W., and Tropper, C. A multi-state q-learning approach for the dynamic load balancing of time warp. In Principles of Advanced and Distributed Simulation (PADS), 2010 IEEE Workshop on (May 2010), 1–8.
20. Peschlow, P., Honecker, T., and Martini, P. A flexible dynamic partitioning algorithm for optimistic distributed simulation. In Principles of Advanced and Distributed Simulation, 2007. PADS '07. 21st International Workshop on (June 2007), 219–228.
21. Quaglia, F., and Baldoni, R. Exploiting intra-object dependencies in parallel simulation. Inf. Process. Lett. 70, 3 (1999), 119–125.
22. Quaglia, F., and Beraldi, R. Space uncertain simulation events: Some concepts and an application to optimistic synchronization. In Proceedings of the Eighteenth Workshop on Parallel and Distributed Simulation (2004), 181–188.
23. Rao, D. M., Thondugulam, N. V., Radhakrishnan, R., and Wilsey, P. A. Unsynchronized parallel discrete event simulation. In Proceedings of the 30th Conference on Winter Simulation (1998), 1563–1570.
24. Valiant, L. G. A bridging model for parallel computation. Commun. ACM 33, 8 (Aug. 1990), 103–111.