Journal of Computational Information Systems 7: 13 (2011) 4740-4747
Available at http://www.jofcis.com

A Load-Balancing Algorithm for Cluster-based Multi-core Web Servers

Guohua YOU, Ying ZHAO

College of Information Science and Technology, Beijing University of Chemical Technology, Beijing 100029, China

Abstract

The demand for high-performance web servers has led to the adoption of multi-core, cluster-based web servers. Furthermore, a large share of dynamic requests is changing the traditional web environment, so the load-balancing algorithm is crucial to cluster-based web servers. However, traditional load-balancing algorithms consider neither the service time distribution of dynamic requests nor the characteristics of multi-core web servers. This paper proposes a new load-balancing algorithm that assigns dynamic requests according to their service time distribution and keeps the load balanced in multi-core web servers by means of a Genetic Algorithm. Simulation experiments have been conducted to evaluate the new algorithm, and the results show that it is fairer and has better performance.

Keywords: Dynamic Requests; Web Server; Cluster; Multi-core; Genetic Algorithm

1. Introduction

With the rapid development of the Internet industry, people increasingly rely on the web for their daily activities. Consequently, web servers play a crucial role in the world of information and business-oriented services. To meet users' demands, web servers must be built for higher performance. One of the most popular schemes for addressing this problem is the cluster-based web server [1, 2]. Fig. 1 depicts its architecture, which mainly consists of a web switch and a set of web servers [3]. In this design, we assume that all web server nodes are homogeneous. Conceptually, the web switch acts as a centralized global scheduler that assigns requests according to the load-balancing algorithm. Furthermore, with the emergence of multi-core technology, most web servers have adopted multi-core CPUs in recent years to improve hardware performance.
A multi-core system integrates two or more processing cores onto one silicon chip [4]. In this design, every processing core has its own private L1 cache and a shared L2 cache [5]. All the processing cores share the main memory and the system bandwidth. Fig. 2 shows the architecture of a multi-core system. When cluster-based web servers employ multi-core CPUs, new problems appear. Each node of the cluster runs a service application consisting of multiple threads, which serve the requests. When the multi-threaded application on a multi-core node serves dynamic requests, it easily gives rise to the ping-pong effect [6], which greatly degrades the performance of the multi-core system. To eliminate the ping-pong effect, we introduce CPU affinity: the capability of binding a process or thread to a specific CPU core [7].

Corresponding author. Email address: alan_you@163.com (Guohua YOU).
1553-9105 / Copyright 2011 Binary Information Press, December 2011

Some work [8-12] has been done on improving the performance of applications in multi-core systems through CPU affinity. The processing of dynamic requests is complicated: some dynamic requests are very simple, while others are very complex. The service times of dynamic requests therefore differ greatly and usually obey a heavy-tailed distribution [13]. Traditional load-balancing algorithms for cluster-based web servers, such as Round-Robin (RR), Content Aware Policy (CAP) [14] and Weighted Round-Robin (WRR), do not consider the characteristics of multi-core web servers or the service time distribution of dynamic requests. We therefore propose a load-balancing algorithm that addresses the above-mentioned problems and effectively improves the performance of cluster-based multi-core web servers.

Fig. 1 Web Server Cluster Architecture    Fig. 2 Architecture of Multi-core CPUs [5]

The remainder of the paper is organized as follows: Section 2 describes the new load-balancing algorithm. Section 3 introduces the simulation experiments and presents a performance evaluation. Finally, Section 4 presents our conclusions.

2. New Load-balancing Algorithm

2.1. Description of the Algorithm

In a website, although there are many dynamic requests, the types of dynamic requests are limited. Many dynamic requests differ only in their parameters (for example, http://www.testexample.com/web.aspx?name=tom and http://www.testexample.com/web.aspx?name=mike), while the requested page is the same one. We consider dynamic requests that request the same page to be of the same type. Generally, at a multi-core node of the cluster, incoming requests are assigned to threads from a thread pool. When dynamic requests of the same type are assigned to threads, these threads execute the same code. Thus, these threads have shared data.
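As an illustration of this typing rule, a classifier can key each dynamic request on its URL with the query string stripped, so that requests differing only in their parameters (like the two example URLs above) fall into the same queue. The sketch below is our illustration under that assumption, not code from the paper; the function names are ours.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def request_type(url):
    """Requests for the same page are the same type: key on scheme, host and
    path, discarding the query string (?name=tom vs. ?name=mike)."""
    parts = urlsplit(url)
    return (parts.scheme, parts.netloc, parts.path)

def classify(urls):
    """Assign each incoming dynamic request to the request queue of its type."""
    queues = defaultdict(list)
    for url in urls:
        queues[request_type(url)].append(url)
    return queues

queues = classify([
    "http://www.testexample.com/web.aspx?name=tom",
    "http://www.testexample.com/web.aspx?name=mike",
    "http://www.testexample.com/other.aspx",
])
# The two web.aspx requests share one queue; other.aspx gets its own.
```

Threads drawn from the pool to drain one such queue all execute the same page-generation code, which is exactly the shared-data situation described above.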
Furthermore, according to the thread-scheduling strategy of the multi-core system, the OS always tries to assign these threads to different processing cores to keep the load balanced between cores [15], so the shared data is continually transferred back and forth between the L1 caches of the different processing cores; this is the ping-pong effect.

Fig. 3 Load-balancing Algorithm

Because the web server nodes in the cluster are homogeneous, we can assume that each node has the same number of threads, so we can regard all the multi-core CPUs in the cluster as one Super Multi-core CPU that includes every core of the cluster. Consequently, we can assign the threads that serve the same type of requests to the same core by the hard affinity method. Moreover, to balance the load among the different cores, we calculate the thread allocation strategy with a Genetic Algorithm. Load balance between cores in turn ensures load balance between the nodes of the cluster. As shown in Fig. 3, when dynamic requests arrive at the classifier from the TCP queue, they are classified by their URLs, and requests of the same type are assigned to the same request queue. The weight of each request queue is calculated from the access frequency and the mean service time of that kind of dynamic request, both of which can be obtained from the log file [16]. In a multi-threaded system, CPU time is allocated to every thread equally during a CPU cycle, so the number of threads represents a proportion of CPU capacity. The number of threads that serve a request queue can therefore be calculated from the weight of the queue. To avoid the ping-pong effect, all threads that process the same request queue are assigned to the same core. After the thread allocation scheme is determined, the dynamic requests in the request queues are assigned to these threads, which then begin to execute. The execution results generate the new dynamic pages, which are scheduled by I/O management and sent to the network as responses. In this design, we deploy the new load-balancing algorithm in the web switch.

2.2.
Calculation of Algorithm Parameters

2.2.1. Weight of a Dynamic Request Queue

After obtaining from the log files the number of visits of each kind of request within a specified time interval, we can calculate the percentage C_i of the access count of request queue i in the total access count of all request queues. Likewise, we can calculate the average service time T_i of the requests in request queue i. The weight W_i of request queue i is then given by

    W_i = C_i * T_i    (1)

where sum_{i=1}^{M} C_i = 1 and M is the total number of request queues.

2.2.2. Number of Threads for a Request Queue

According to the weight W_i of a request queue, the number λ_i of threads that serve the queue can be calculated by the following formula:

    λ_i = ( W_i / sum_{j=1}^{M} W_j ) * H    (2)

where H is the total number of threads in the thread pool and M is the total number of request queues. λ_i is the number of threads used to handle request queue i, rounded to an integer.

2.2.3. Load Balance between Cores

To avoid the ping-pong effect between threads, the threads serving the same request queue should be assigned to the same processing core as a whole. After these threads are allocated to the processing cores in this way, the numbers of threads on the cores may still differ greatly. This raises a new problem: load balance between cores. To keep the load balanced, we must allocate the threads serving the same request queue to the same processing core while keeping the numbers of threads on the different cores even. We solve this problem by means of the Genetic Algorithm. Suppose the Super Multi-core CPU has N cores and the number of request queues is M. We define the chromosome of the Genetic Algorithm as (R_1, R_2, ..., R_M), where each R_i is an integer and 0 <= R_i <= N - 1. R_i is a gene of the chromosome and represents the serial number of the core to which the threads serving request queue i are assigned, so a chromosome stands for a thread assignment solution. For brevity, we call the threads that serve the same request queue a Service Thread Group (STG); the STG that serves request queue i is STG_i, and the number of threads in STG_i is λ_i, as calculated by Formula (2). From a chromosome, we can acquire all the
STGs that are allocated to core i. If we define the number of STGs on core i as B_i, then we can enumerate all the STGs on core i: STG_{A_1}, STG_{A_2}, ..., STG_{A_k}, ..., STG_{A_{B_i}}, with 1 <= k <= B_i, where STG_{A_k} is the Service Thread Group that serves request queue A_k. If the number of threads on core i is X_i, then X_i can be calculated by the following formula:

    X_i = sum_{k=1}^{B_i} λ_{A_k}    (3)

So we obtain the number of threads on every core: X_1, X_2, ..., X_N. We define D(X) as the variance of X_1, X_2, ..., X_N. If D(X) is large, the numbers of threads on the different cores differ greatly; to keep the load balanced between cores, a lower value of D(X) is preferable. Therefore, we define the fitness function of the Genetic Algorithm as

    f(e) = 1 / ( D(X) + 1 )    (4)

where e is a chromosome. Because a lower value of D(X) helps keep the load balanced between cores, and D(X) may sometimes be zero, we use the reciprocal of D(X) + 1 as the fitness function; a larger value of f(e) thus means better load balance between cores. The Genetic Algorithm has the following procedure:

(1) Initial Population: a population is a collection of chromosomes. The population size L can be determined experimentally. The initial population is usually generated randomly: we generate L chromosomes by assigning a random integer ranging from 0 to N - 1 to every gene of each chromosome.

(2) Calculation of the Fitness Value: we use fitness function (4) and the method above to calculate the fitness value of every chromosome in the population.

(3) Selection: we select the fitter chromosomes by the roulette-wheel method. The greater a chromosome's fitness value, the larger its probability of being chosen. We repeat the selection operation as many times as there are chromosomes.

(4) Crossover: for a randomly selected pair of chromosomes, we decide whether to perform crossover based on the crossover probability.
If crossover is allowed, a new pair of chromosomes is generated by exchanging portions of the two old chromosomes.

(5) Mutation: we randomly choose a chromosome from the population. Each gene of the chromosome may change with a very small probability; if a mutation happens, the gene is assigned a random integer ranging from 0 to N - 1 that differs from the original value of the gene.

(6) Termination Condition: in the Genetic Algorithm, the generational steps (2), (3), (4) and (5) are repeated. The chromosome with the largest fitness value is recorded at each iteration. If the largest fitness value does not change for five consecutive iterations, the iterations end. Finally, we obtain the recorded chromosome with the largest fitness value. From this chromosome, we can
obtain the best thread assignment solution, which keeps the load effectively balanced between cores; the load balance between cores in turn ensures load balance between the nodes of the cluster.

3. Experiments and Evaluation

3.1. Experimental Setup

To validate the new load-balancing algorithm, we developed a simulation program called RSSP and deployed it in the web switch. The classifier in RSSP parses each request's URL and assigns requests with the same URL to the same request queue. After classification, RSSP calculates the weight of each request queue and decides the number of threads that serve each queue. The thread assignment scheme is then determined by the Genetic Algorithm. After the threads are assigned to the cores on the basis of the thread assignment solution, the scheduler in RSSP assigns the requests to the corresponding cores. We use the hard affinity method to accomplish the thread assignment. To simulate the generation of dynamic pages, we created 50 DLL files in place of 50 dynamic pages. When a request is assigned to a thread, the corresponding DLL file is loaded and executed to simulate the generation procedure of the dynamic page. The functions of these DLL files are distinct, so their execution times differ. Furthermore, because the threads that execute the same dynamic web page have shared data, we set shared data in these DLL files; its default size is 2 KB. To simulate users' visiting behavior on websites, we designed a request-sending module that automatically sends requests to the cluster-based web servers with a specified arrival process. In our experiment, the default arrival process follows a heavy-tailed distribution.

3.2. Results Evaluation

In this section, we discuss the results of the simulation experiments. We varied the load intensity and measured the cluster's response time, throughput and scalability for three scheduling policies: WRR, CAP and RSSP.
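The allocation pipeline of Section 2.2 can be condensed into a short sketch: queue weights from Formula (1), thread counts from Formula (2), and a Genetic Algorithm that maximizes the fitness of Formula (4) by minimizing the variance of the per-core thread counts of Formula (3). This is our sketch, not the RSSP source; the population size `L`, crossover probability `pc`, mutation probability `pm` and the one-point crossover operator are illustrative assumptions the paper does not fix.

```python
import random
from statistics import pvariance

def thread_counts(C, T, H):
    """Formulas (1)-(2): W_i = C_i * T_i, lambda_i = W_i / sum(W) * H, rounded."""
    W = [c * t for c, t in zip(C, T)]
    total = sum(W)
    return [round(w / total * H) for w in W]

def fitness(chrom, lam, N):
    """Formula (4): 1 / (D(X) + 1), where X_i sums the lambda of every STG
    whose gene assigns it to core i (Formula (3))."""
    X = [0] * N
    for queue, core in enumerate(chrom):
        X[core] += lam[queue]
    return 1.0 / (pvariance(X) + 1.0)

def allocate(lam, N, L=30, pc=0.8, pm=0.05, patience=5):
    """GA over chromosomes (R_1..R_M): R_i is the core serving queue i's STG.
    Stops when the best fitness is unchanged for `patience` iterations."""
    M = len(lam)
    pop = [[random.randrange(N) for _ in range(M)] for _ in range(L)]
    best, best_f, stall = None, -1.0, 0
    while stall < patience:
        fits = [fitness(c, lam, N) for c in pop]
        gen_best = max(range(L), key=lambda i: fits[i])
        if fits[gen_best] > best_f:
            best, best_f, stall = pop[gen_best][:], fits[gen_best], 0
        else:
            stall += 1
        # Roulette-wheel selection: probability proportional to fitness.
        pop = [random.choices(pop, weights=fits)[0][:] for _ in range(L)]
        # One-point crossover on consecutive pairs, with probability pc.
        for i in range(0, L - 1, 2):
            if M > 1 and random.random() < pc:
                cut = random.randrange(1, M)
                pop[i][cut:], pop[i + 1][cut:] = pop[i + 1][cut:], pop[i][cut:]
        # Mutation: re-draw a gene to a different core, with probability pm.
        for chrom in pop:
            if N > 1 and random.random() < pm:
                g = random.randrange(M)
                chrom[g] = random.choice([c for c in range(N) if c != chrom[g]])
    return best
```

For example, `allocate(thread_counts(C, T, H), N)` returns one core index per request queue; the web switch would then pin each Service Thread Group to its assigned core.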
3.2.1. Mean Response Time

Fig. 4(a) shows the mean response time curves of the three policies. For all algorithms, the average response time curves are exponential: relatively smooth at first, then rising sharply as the number of clients increases. The average response time of the CAP algorithm is similar to that of RSSP, but they differ greatly at higher numbers of clients. As can be seen in Fig. 4(a), WRR has a higher average response time than CAP and RSSP for all client counts, because one or more web servers in the cluster under the WRR scheme reach the CPU bottleneck sooner owing to its poorer load balance.

3.2.2. Throughput Evaluation

Fig. 4(b) illustrates the throughput of these algorithms in the cluster. Generally, throughput rises at first as the number of clients increases, then peaks when the CPU becomes the bottleneck on the web servers. From Fig. 4(b), we can see that WRR peaks earlier than CAP and RSSP because it reaches the CPU bottleneck more easily. CAP shows throughput comparable to that obtained by
RSSP. However, because of its better load balance, RSSP can respond to more clients than the WRR and CAP schemes and can reach higher throughput.

3.2.3. Scalability of the Three Scheduling Policies

One of the important characteristics of a cluster is scalability, and the load-balancing algorithm greatly influences the scalability of the cluster, so we evaluate scalability in terms of maximum throughput, measured as the number of servers changes. Fig. 4(c) demonstrates the scalability of the cluster with increasing server nodes. RSSP has better scalability than WRR and CAP owing to its better load balance. As the number of nodes increases, the throughput curve for CAP flattens because of the overhead of the CAP algorithm. WRR has worse throughput than RSSP and CAP because of its inefficient request assignment scheme and its blindness to the types of requests.

Fig. 4 Evaluation of the Three Load-Balancing Algorithms: (a) Mean Response Time; (b) Throughput; (c) Scalability; (d) Mean Response Time as the Size of the Shared Data Changes

3.2.4. Changing the Shared Data

In our experiment, the DLL files contain shared data to simulate dynamic web pages with shared data. We changed the size of the shared data in the 50 DLL files and measured the mean response time of the three strategies, shown in Fig. 4(d). The mean response times of WRR and CAP increase as the size of the shared data grows, whereas the mean response time of RSSP changes little. The reason is that the new load-balancing algorithm dispels the ping-pong effect, so increasing the size of the shared data has only a slight impact on the mean response time of RSSP.
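RSSP's insensitivity to the shared-data size rests on the hard affinity mechanism: every thread of a Service Thread Group is pinned to one core, so the shared data stays in that core's cache instead of ping-ponging between L1 caches. On Linux this binding can be expressed with the scheduler affinity call, as in the sketch below; this is our illustration of the mechanism, not the paper's implementation, and the function name is ours.

```python
import os

def pin_to_core(core_id):
    """Hard affinity: bind the calling process/thread to a single core so that
    the threads of one Service Thread Group share that core's cache.
    Returns the resulting affinity set, or None where the call is unavailable."""
    if not hasattr(os, "sched_setaffinity"):  # Linux-only API
        return None
    os.sched_setaffinity(0, {core_id})  # pid 0 means the caller itself
    return os.sched_getaffinity(0)

# Pin to the first core the scheduler allows this process to use.
if hasattr(os, "sched_getaffinity"):
    first = min(os.sched_getaffinity(0))
    result = pin_to_core(first)
```

A worker thread serving request queue i would call `pin_to_core(R_i)` with the gene value from the Genetic Algorithm's solution before entering its service loop.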
4. Conclusions

To avoid the ping-pong effect and improve load balance in cluster-based multi-core web servers, we propose a new load-balancing algorithm that applies the affinity method to the cluster-based web server and keeps the load balanced between cores and nodes. We describe the principle of the new algorithm and give its calculation formulas. Furthermore, we developed RSSP, a simulation program based on the new method, and used it to run the simulation experiments. We analyzed the key performance indices and compared them with the WRR and CAP strategies. The new algorithm has better performance and avoids the ping-pong effect effectively.

Acknowledgement

This paper has been partially supported by the National Grand Fundamental Research 973 Program of China (No. 2011CB706900).

References

[1] E. Casalicchio, V. Cardellini, and M. Colajanni. Content-aware dispatching algorithms for cluster-based web servers. Cluster Computing, 5: 65-74, 2002.
[2] J. Yang, G. Tan, F. Wang, and D. Pan. Solution to new task allocation problem on multi-core clusters. Journal of Computational Information Systems, 7(5): 1691-1697, 2011.
[3] S. Sharifian, S. A. Motamedi, and M. K. Akbari. A predictive and probabilistic load-balancing algorithm for cluster-based web servers. Applied Soft Computing, 11: 970-981, 2011.
[4] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: a 32-way multithreaded Sparc processor. IEEE Micro, 25: 21-29, 2005.
[5] J. M. Calandrino, J. H. Anderson, and D. P. Baumberger. A hybrid real-time scheduling approach for large-scale multicore platforms. In 19th Euromicro Conference on Real-Time Systems (ECRTS'07), pages 247-258, 2007.
[6] G. You, Y. Zhao. Dynamic requests scheduling model in multi-core web server. In the 9th International Conference on Grid and Cloud Computing (GCC2010), pages 201-206, 2010.
[7] R. Bolla, R. Bruschi.
PC-based software routers: high performance and application service support. In Workshop on Programmable Routers for Extensible Service of Tomorrow (PRESTO'08), pages 27-32, 2008.
[8] A. Chonka, W. Zhou, K. Knapp, and Y. Xiang. Protecting information systems from DDoS attack using multi-core methodology. In Proceedings of the IEEE 8th International Conference on Computer and Information Technology, pages 270-275, 2008.
[9] Y. Lu, J. Tang, J. Zhao, and X. Li. A case study for monitoring-oriented programming in multi-core architecture. In Proceedings of the 1st International Workshop on Multicore Software Engineering (IWMSE'08), pages 47-52, 2008.
[10] R. Islam, W. Zhou, Y. Xiang, and A. N. Mahmood. Spam filtering for network traffic security on a multi-core environment. Concurrency and Computation: Practice and Experience, 21: 1307-1320, 2009.
[11] H. Feng, E. Li, Y. Chen, and Y. Zhang. Parallelization and characterization of SIFT on multi-core systems. In IEEE International Symposium on Workload Characterization (IISWC 2008), pages 14-23, 2008.
[12] C. Terboven, D. an Mey, D. Schmidl, H. Jin, and T. Reichstein. Data and thread affinity in OpenMP programs. In Proceedings of the 2008 Workshop on Memory Access on Future Processors (MAW'08), pages 377-384, 2008.
[13] E. Hernández-Orallo, J. Vila-Carbó. Web server performance analysis using histogram workload models. Computer Networks, 53: 2727-2739, 2009.
[14] M. Andreolini, E. Casalicchio, M. Colajanni, and M. Mambelli. A cluster-based web system providing differentiated and guaranteed services. Cluster Computing, 7(1): 7-19, 2004.
[15] S. B. Siddha. Multi-core and Linux Kernel. http://oss.intel.com/pdf/mclinux.pdf.
[16] S. Sharifian, S. A. Motamedi, and M. K. Akbari. A content-based load balancing algorithm with admission control for cluster web servers. Future Generation Computer Systems, 24: 775-787, 2008.