Intelligent Sensor Placement for Hot Server Detection in Data Centers - Supplementary File

Xiaodong Wang, Xiaorui Wang, Guoliang Xing, Jinzhu Chen, Cheng-Xian Lin and Yixin Chen
The Ohio State University, USA
Michigan State University, USA
Florida International University, USA
Washington University in St. Louis, USA

I. INTRODUCTION

The supplementary file of the paper is organized as follows. Section II presents the derivation of the detection probability and false alarm rate model. Section III gives example outputs of CFD simulations and the governing equations used in CFD. Section IV elaborates on the spatial temperature interpolation scheme and the details of the lightweight sensor placement algorithm. Section V presents the CFD model validation, as well as additional simulation results. Section VI presents additional testbed results based on the simplified rack model.

II. HOT SERVER DETECTION MODEL

In this section, we present the detailed derivation of the detection probability and false alarm rate model in the paper.

In the paper, we denote the measurement noise strength measured by sensor $i$ as $N_i$, which follows the zero-mean normal distribution with a variance of $\sigma^2$, i.e., $N_i \sim N(0, \sigma^2)$. Since the measurement of a signal by a sensor is in energy form, which in our case is the amount of heat the sensor senses, the noise term is also in energy form. The noise energy is therefore added to the final temperature reading, and the final measured temperature, $T_m$, from a sensor at location $(x_i, y_i, z_i)$ can be presented as

$$T_m(x_i, y_i, z_i) = T_r(x_i, y_i, z_i) + N_i^2 \tag{1}$$

where $T_r$ is the real temperature at that location without noise. Assuming there are $n$ sensors within the data fusion region of a monitored spot, the detection probability of the hot server existence at the monitored point can be calculated as

$$P_D = P\left(\frac{1}{n}\sum_{i=1}^{n}\left(T_r(x_i, y_i, z_i) + N_i^2\right) > \eta\right) = P\left(\sum_{i=1}^{n}\left(\frac{N_i}{\sigma}\right)^2 > \frac{n\eta - \sum_{i=1}^{n} T_r(x_i, y_i, z_i)}{\sigma^2}\right) \tag{2}$$

where $\eta$ is the detection threshold of overheating. Because of the measurement noise from the sensor device, $\eta$ is different from the real temperature threshold for a hotspot, denoted as $C$.
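As a numerical sanity check of Equation (2), the following Python sketch estimates the detection probability by simulation. It assumes, as Equation (2) suggests, that the fused reading is the average of the n noisy sensor readings. The function names, the temperatures and the threshold are illustrative and do not come from the paper. For an even number of sensors it also evaluates the chi-square tail in closed form, using the fact that the sum of n squared standard normal variables is chi-square distributed with n degrees of freedom.

```python
import math
import random


def chi2_tail_even(x, n):
    """P(X > x) for a chi-square variable X with an even number n of degrees of freedom."""
    if n <= 0 or n % 2 != 0:
        raise ValueError("this closed form only covers even, positive n")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, n // 2):
        term *= half / k          # (x/2)^k / k!
        total += term
    return math.exp(-half) * total


def detection_probability(true_temps, eta, sigma):
    """Closed-form tail probability of Equation (2) for an even number of sensors."""
    n = len(true_temps)
    x = (n * eta - sum(true_temps)) / sigma ** 2
    return chi2_tail_even(x, n)


def simulate_detection(true_temps, eta, sigma, trials=100_000, seed=0):
    """Monte Carlo estimate: average the readings T_r + N_i^2 and compare with eta."""
    rng = random.Random(seed)
    n = len(true_temps)
    hits = 0
    for _ in range(trials):
        fused = sum(t + rng.gauss(0.0, sigma) ** 2 for t in true_temps) / n
        hits += fused > eta
    return hits / trials


if __name__ == "__main__":
    temps = [30.0, 31.0, 29.5, 30.5]   # hypothetical real temperatures (deg C)
    print(detection_probability(temps, eta=31.0, sigma=1.0))
    print(simulate_detection(temps, eta=31.0, sigma=1.0))
```

The two printed values should agree up to sampling error, which is a quick way to confirm that the threshold is applied to the averaged reading rather than to each sensor individually.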
With a high noise level in the measurements, a detection system is likely to raise a false alarm when there is no real event. In our case, we define the false alarm rate, when there is actually no hot server, as follows:

$$P_F = P\left(\frac{1}{n}\sum_{i=1}^{n}\left(N_i^2 + C\right) > \eta\right) = P\left(\sum_{i=1}^{n}\left(\frac{N_i}{\sigma}\right)^2 > \frac{n(\eta - C)}{\sigma^2}\right) \tag{3}$$

We assume Gaussian noise, i.e., $N_i/\sigma \sim N(0, 1)$. Therefore, $\sum_{i=1}^{n}(N_i/\sigma)^2$ follows the Chi-square distribution [1] with $n$ degrees of freedom, whose cumulative distribution function is denoted as $\chi_n(\cdot)$. Hence, Equations (2) and (3) can be rewritten as follows:

$$P_D = 1 - \chi_n\left(\frac{n\eta - \sum_{i=1}^{n} T_r(x_i, y_i, z_i)}{\sigma^2}\right) \tag{4}$$

$$P_F = 1 - \chi_n\left(\frac{n(\eta - C)}{\sigma^2}\right) \tag{5}$$

III. CFD SIMULATION RESULT EXAMPLES

In this section, we present the governing transport equations used in CFD, as well as two CFD simulation result examples based on the simplified server room model and the finer-grained server rack model, respectively.

A. Governing Transport Equations in CFD

Below are the governing transport equations represented in the conservation law form:

$$\frac{\partial (\rho\phi)}{\partial t} + \frac{\partial (\rho U_j \phi)}{\partial x_j} = \frac{\partial}{\partial x_j}\left(\Gamma_{\phi,\mathrm{eff}}\frac{\partial \phi}{\partial x_j}\right) + S_\phi \tag{6}$$

where $\phi$ represents different parameters such as mass, velocity, temperature or turbulence properties; $\rho$ is the fluid (air) density; $t$ is the time for transient simulations; $x_j$ is the coordinate variable for x, y or z with $j$ being 1, 2 or 3; $U_j$ is the velocity in the different directions; $\Gamma$ is the diffusion coefficient; and $S$ is the source for the particular variable. For example, when $\phi$ is the air temperature, $S$ stands for the volumetric heat rate from a source component. The four terms of the equation represent the transient, convection, diffusion, and source parts of the transport phenomena taking place in the spatial domain [2]. The partial differential equations listed in Equation (6) form a system in which all the transport equations are coupled and must be solved simultaneously.

B. Example Temperature Map of Simplified Server Room Model

Figure 1 shows a colored temperature map of the server room model from the main file after solving the transport equations in Fluent. This is a scenario in which all the server racks are running at maximum power. We can see that the back sides of the server racks are significantly hotter than the front sides. More details of the CFD modeling of this example can be found in [3].

Fig. 1. Temperature (°C) map in the server room with all servers running at full power. The middle graph is the top view temperature map at the plane of z = .5 m. The top graph is the temperature map of cross section plane A. The bottom graph is the temperature map of cross section plane B. Solid-line boxes represent server racks.

C. Example Temperature Map of Finer-grained CFD Rack Model

Figure 2 shows a temperature map from the CFD analysis at the back of the two-rack cluster with the finer-grained CFD rack model. The servers on those two racks are all DELL 1U servers. The marked servers are configured with overheating power consumption (2x of the server maximum labeled power) in the CFD simulation. Note that some places on the rack show a relatively low temperature signature. This is because there are no servers mounted at the corresponding rack slots (i.e., the shaded slots in Figure 2).

Fig. 2. Temperature (°C) map at the back of the two-rack cluster with a finer-grained CFD model. Each box is a server on the rack. The triangles illustrate the possible overheating servers, which are the targets for overheating monitoring and detection in our experimental setup. All the possible overheating servers are configured with overheating power consumption in this example. No server is placed at the shaded slots.

IV. CFD-GUIDED SENSOR PLACEMENT

In this section, we introduce the interpolation technique we used to interpolate the missing temperature values from CFD results. We also elaborate on the lightweight sensor placement in detail and present the pseudo code of the algorithm.

A. Spatial Temperature Interpolation

CFD essentially discretizes the space into a finite number of cells and calculates the temperature for each cell. The result from CFD is the final stable temperature at the center of each cell. The granularity of the data set depends on the density of the grid generated when the geometric model is built in Gambit. There is a trade-off between granularity and performance. A higher density grid means a finer granularity of the data and more accurate results. However, it also requires more computational resources, i.e., computation time and memory. Therefore, in our design, we choose a granularity of approximately cm for our model. This choice results in a one-time running expense of Fluent of less than two hours.

For locations whose temperature is unknown, we choose to interpolate the data. There are various kinds of spatial interpolation methods [4]. The measuring target in our project is the spatial temperature, which is continuous and short-range correlated. Therefore, we choose the Inverse Distance Weighting (IDW) [5] technique to interpolate the temperature. The IDW method estimates the value of an attribute at unsampled points by a linear combination of the values at sampled points, weighted by an inverse function of the distance from each sampled point to the interpolation target point. This method is sufficient because of the continuity of the spatial temperature distribution.

The sampled points in our problem are the points at the center of each cell from the discretization phase of the CFD simulation. For these points, the CFD simulation outputs the final temperature. The unsampled points are the points for which the CFD result does not have explicit temperature outputs because of the granularity of the discretized CFD model. Denote the unsampled point whose temperature is to be interpolated as $l$ and the known sampled points as $l_i$. The interpolation is calculated as

$$T(l) = \sum_{i=1}^{n} \lambda_i T(l_i) \tag{7}$$

where $\lambda_i$ is the weight of each element, which can be expressed as

$$\lambda_i = \frac{1/d_i^p}{\sum_{j=1}^{n} 1/d_j^p} \tag{8}$$

where $d_i$ is the distance between $l$ and $l_i$, $p$ is a power parameter, and $n$ is the number of sampled points used for the estimation. In our approach, we adopt the same range as the fusion radius $R$ and include all the known temperature points from CFD within that range of the interpolation target point.

The assumption behind IDW is that sampled points closer to the unsampled target are more similar to it in value than points further away. The main factor affecting the accuracy of IDW is the value of the power parameter. The choice of the power parameter $p$ is arbitrary. The most popular choice of $p$ is 2, and the resulting method is often called the Inverse Square Distance (IDS). Hence, we adopt 2 as the power parameter.

B. Lightweight Sensor Placement (LSP) Algorithms

In our LSP algorithm, we first reduce the search space of the entire problem by dividing the entire search domain, i.e., the server room, into several clusters. For example, from the server room figure in the main file we see that servers can be grouped based on their geographical locations such that the distance between two groups is greater than the desired fusion range. It is not feasible to calculate a solution in CSA while some sensors are placed outside the fusion region of any monitored location. Therefore, we group the monitored locations into clusters according to the rule that, for a monitored location $l_i$ to be in a cluster, at least one other location in the same cluster should be within a $2R$ distance of $l_i$, where $R$ is the fusion radius. The reason for choosing $2R$ as the clustering parameter is that if a monitored location $l_i$ is not within the $2R$ range of any other monitored location in a cluster $C$, no sensor can cover $l_i$ and any location in $C$ at the same time. Therefore, $l_i$ should not be put in cluster $C$.
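The interpolation of Equations (7) and (8) and the 2R clustering rule can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the data layout are assumptions, a target that coincides with a sampled point simply returns that sample, and the clustering rule is read here as single-linkage grouping (connected components under the 2R distance).

```python
import math
from collections import deque


def idw_temperature(target, samples, radius, p=2):
    """Inverse Distance Weighting estimate at `target`.

    `samples` is a list of ((x, y, z), temperature) pairs, e.g. CFD cell centers.
    Only samples within `radius` of the target contribute to the estimate.
    """
    weight_sum, weighted_temp = 0.0, 0.0
    for point, temp in samples:
        d = math.dist(target, point)
        if d > radius:
            continue
        if d == 0.0:
            return temp                      # target sits exactly on a sampled point
        w = 1.0 / d ** p                     # numerator of Equation (8)
        weight_sum += w
        weighted_temp += w * temp
    if weight_sum == 0.0:
        raise ValueError("no sampled point within the given radius")
    return weighted_temp / weight_sum        # Equation (7) with normalized weights


def cluster_locations(locations, R):
    """Group monitored locations; each member is within 2R of another member."""
    unvisited = set(range(len(locations)))
    clusters = []
    while unvisited:
        start = min(unvisited)
        unvisited.remove(start)
        queue, members = deque([start]), [start]
        while queue:
            i = queue.popleft()
            near = [j for j in unvisited
                    if math.dist(locations[i], locations[j]) <= 2 * R]
            for j in near:
                unvisited.remove(j)
                queue.append(j)
                members.append(j)
        clusters.append(sorted(members))
    return clusters


if __name__ == "__main__":
    cfd_samples = [((1.0, 0.0, 0.0), 20.0), ((2.0, 0.0, 0.0), 40.0)]  # hypothetical
    print(idw_temperature((0.0, 0.0, 0.0), cfd_samples, radius=3.0))
    spots = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0), (5.5, 0.0, 0.0)]
    print(cluster_locations(spots, R=0.6))
```

With p = 2 the nearer sample dominates the estimate, which matches the short-range correlation argument made for temperature above.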
After the clusters are formed, we calculate a new search space $S_k$ for cluster $C_k$ by the following equations:

$$S_{k,u}^{\min} = \min_{l_i \in C_k} l_{i,u} - R, \qquad S_{k,u}^{\max} = \max_{l_i \in C_k} l_{i,u} + R \tag{9}$$

where $u$ is one of the coordinate subscripts, x, y, or z. The server room figure in the main file shows an illustration of the clusters (dotted boxes), in which the search space is significantly reduced.

We now introduce in detail the heuristic algorithms for solving the placement problems, based on the clusters and the CSA solver. In our algorithms, we add sensors to the entire monitoring domain one by one. The question is which cluster should take in the new sensor. We calculate the increment of the global average detection probability obtained by adding the new sensor to each particular cluster. The increment of the average detection probability, when adding the new sensor to a specific cluster, is calculated by the CSA solver within the search space of that cluster. After evaluating all the clusters with the new sensor added, we choose the cluster that yields the largest increment of the global average probability and add the new sensor to it. Note that in this step we do not need to recompute the result for every cluster, because the clusters that were not picked for the last sensor have already been calculated for the new sensor number in the last round. Therefore, the computation time is significantly reduced.

Algorithm 1 LIGHTWEIGHT SENSOR PLACEMENT(D)
Input: Sensor number N; cluster information C_k, S_k, k ∈ K; cfd_data
Output: Placement solution D
    d_j = 0 for all j ∈ K
    P_j[N] = 0 for all j ∈ K
    for i = 1 to N do
        ΔP = 0
        for j = 1 to K do
            n = d_j + 1
            if n > n_j then
                [P_j[n], D_j[n]] = CSA(n, S_j, C_j, cfd_data)
                n_j = n
            end if
            ΔP' = P_j[n] − P_j[d_j]
            if ΔP' > ΔP then
                k = j
                ΔP = ΔP'
            end if
        end for
        d_k = d_k + 1
        D = ∪_{j=1..K} D_j[d_j]
    end for
    return D

Another reason that the LSP algorithms can decrease the computational complexity is that the benefit of adding a sensor to a specific cluster is greater when fewer sensors were previously in that cluster.
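The search-space bounds of Equation (9) and the greedy loop described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: `solve` is a hypothetical stand-in for the CSA solver (any callable returning an average detection probability and a list of sensor positions for a cluster), the gain of a cluster is taken as the plain difference between its cached results, and all names are invented for the example.

```python
def search_space(cluster, R):
    """Axis-aligned bounds of Equation (9): per-axis minimum - R and maximum + R."""
    lows = tuple(min(loc[u] for loc in cluster) - R for u in range(3))
    highs = tuple(max(loc[u] for loc in cluster) + R for u in range(3))
    return lows, highs


def lightweight_placement(num_sensors, clusters, solve):
    """Assign sensors one by one to the cluster with the largest probability gain.

    `solve(n, cluster)` is a hypothetical placeholder for the CSA solver. It must
    return (average detection probability, list of sensor positions) when n
    sensors are placed for `cluster`.
    """
    K = len(clusters)
    counts = [0] * K                               # d_j in the pseudo code
    cache = [{0: (0.0, [])} for _ in range(K)]     # cache[j][n] = solver result
    for _ in range(num_sensors):
        best_gain, best_j = float("-inf"), None
        for j in range(K):
            n = counts[j] + 1
            if n not in cache[j]:                  # solve only if not done in an earlier round
                cache[j][n] = solve(n, clusters[j])
            gain = cache[j][n][0] - cache[j][counts[j]][0]
            if gain > best_gain:
                best_gain, best_j = gain, j
        counts[best_j] += 1
    placement = [pos for j in range(K) for pos in cache[j][counts[j]][1]]
    return counts, placement


if __name__ == "__main__":
    def toy_solver(n, cluster):
        # Toy model with diminishing returns: one sensor "covers" one location.
        return min(1.0, n / len(cluster)), list(cluster[:n])

    small = [(0.0, 0.0, 0.0)]
    large = [(5.0, 0.0, 1.0), (5.0, 1.0, 1.0), (6.0, 0.0, 2.0)]
    print(search_space(large, R=0.5))
    print(lightweight_placement(3, [small, large], toy_solver))
```

The cache is what keeps the number of solver calls low: after each round only the cluster that received the sensor needs a fresh solve for its next sensor count.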
The algorithms therefore favor adding sensors to clusters that have fewer sensors, which reduces the number of variables when using the CSA solver.

For the problem of detection probability maximization, the corresponding algorithm terminates when all the available sensors are placed. For the sensor number minimization problem, the corresponding algorithm terminates when all the targets reach the required detection probability. The pseudo code of the algorithm to solve the detection probability maximization problem is listed in Algorithm 1. We do not show the pseudo code of the algorithm for the sensor number minimization problem, since it requires only a small modification to Algorithm 1, namely a different termination condition.

Fig. 3. CFD temperature data calibration (temperature in °C versus rack ID; series: Measurement, CFD, Calibrated CFD).
Fig. 4. Average detection probability compared with global optimal solution (versus number of sensors).
Fig. 5. Average detection probability under different sensor number.
Fig. 6. Average detection probability under different fusion range (fusion range R in m).

V. CFD VALIDATION AND SIMULATION RESULTS

In this section, we first present the validation of our CFD model based on real measurements. We then present more probability maximization simulation results based on the server room model presented in the main file.

A. CFD Model Validation

We run the CFD computation for the normally operating server room with a normal workload (server power dissipation) and compare the CFD result with the real temperature data collected from all the racks, as shown in Figure 3. We see that the computational result from CFD deviates slightly from the real temperatures in the room. The discrepancy between the CFD temperature results and the measurements is mainly caused by two factors. First, it is caused by the simplification of using 4 boxes to represent the entire rack in our rack model. Second, it is caused by model input errors, such as the error of the flow-rate indicator. We need to calibrate the data from CFD before using it as input to the sensor placement algorithm. To reduce the impact of system input errors, such as a flow-rate indicator reading that is lower than the real value, we first derive an offset temperature change $\delta T$ by the least square error method based on these sample data. The offset value is then compensated back to the CFD results for all locations.
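The calibration step can be illustrated with a short Python sketch. It assumes that the offset is a single constant added to every CFD value, in which case the least-squares solution is simply the mean difference between the measurements and the CFD results. The function names and the sample temperatures are hypothetical.

```python
from statistics import fmean


def calibration_offset(measured, simulated):
    """Constant offset that minimizes sum((m - (s + offset))^2) over all samples."""
    if len(measured) != len(simulated) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    return fmean(m - s for m, s in zip(measured, simulated))


def calibrate(simulated, offset):
    """Apply the offset to CFD temperatures at any set of locations."""
    return [s + offset for s in simulated]


if __name__ == "__main__":
    measured = [25.0, 27.0, 30.0]    # hypothetical sensor readings (deg C)
    simulated = [23.0, 26.0, 27.0]   # hypothetical CFD results at the same racks
    offset = calibration_offset(measured, simulated)
    print(offset, calibrate(simulated, offset))
```

Setting the derivative of the squared error with respect to the offset to zero gives the mean residual, which is why no iterative fitting is needed in this single-parameter case.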
The calibrated CFD data in Figure 3 is the compensated data at the corresponding locations. We see that the discrepancy between the CFD results and the measurements is reduced to 2 °C on average. The improved CFD rack model described in Section IV.B of the main file can further improve the accuracy of the CFD simulation results.

B. Detection Probability Maximization

In this subsection, we show the simulation results of the different placement schemes for the problem of detection probability maximization with a given number of sensors to deploy in the server room shown in the main file.

In the first experiment, we explore the average detection probability under different sensor numbers. We first compare the performance of the different placement schemes with the global optimal results. The global optimal results are derived by running the CSA solver on the placement problem for the entire room without dividing the servers into clusters. Because of the tremendously high computational complexity, we are only able to run the optimal solution with up to three sensors. The results are shown in Figure 4. We can see that our scheme shows the closest performance to the global optimal solution, with only an approximately 7% difference, while the other two placement schemes are further away from the global optimal solution. This demonstrates that our scheme can better approximate the global optimal solution.

Figure 5 shows the detection probability with more sensors. Compared with Uniformly Random, our scheme shows a higher detection probability. This is because our scheme uses the theoretical analysis results from CFD as the basis of the placement algorithm, while Uniformly Random does not have any theoretical guarantee when placing the sensors. Our scheme also shows an incremental increase in the detection probability over CFD+Proportional. This is because, when choosing a cluster in which to place each sensor, we consider the global maximum improvement, while CFD+Proportional only considers the optimal placement in each individual cluster.
The reason that the improvement is not significant is that the size of each cluster and the number of monitored servers in each cluster in our server room are approximately the same, which leads to approximately the same number of sensors pre-assigned to each cluster. However, in a server room where the cluster size, as well as the number of servers in each cluster, differs significantly, the performance improvement is likely to be more significant. This is because proportionally assigning a different number of sensors to each cluster may leave a small cluster without enough sensors to cover all the overheating servers, as the majority of the sensors are unnecessarily assigned to a big cluster, leading to poor detection performance for small clusters.
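The starvation effect described above can be seen in a small Python sketch of proportional pre-assignment. The paper does not spell out the exact rule used by CFD+Proportional, so the largest-remainder rounding below, the function name, and the cluster sizes are assumptions made only for illustration.

```python
def proportional_allocation(num_sensors, cluster_sizes):
    """Split sensors across clusters in proportion to the number of monitored servers.

    Uses largest-remainder rounding (an assumption, not taken from the paper).
    """
    total = sum(cluster_sizes)
    if total <= 0:
        raise ValueError("cluster sizes must sum to a positive number")
    quotas = [num_sensors * size / total for size in cluster_sizes]
    counts = [int(q) for q in quotas]              # integer part of each quota
    leftover = num_sensors - sum(counts)
    by_fraction = sorted(range(len(quotas)),
                         key=lambda j: quotas[j] - counts[j], reverse=True)
    for j in by_fraction[:leftover]:               # hand out the remaining sensors
        counts[j] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical room: one cluster with 2 monitored servers, one with 18.
    print(proportional_allocation(3, [2, 18]))
    # Balanced clusters receive equal shares.
    print(proportional_allocation(4, [4, 4]))
```

In the unbalanced case the small cluster can end up with no sensor at all, whereas a gain-driven choice per sensor would give it one as soon as that yields the larger improvement.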

In the second experiment, we explore the performance of the different placement approaches with different fusion radius settings. The results are shown in Figure 6. As the fusion radius increases, the average detection probability increases at first for all three approaches. This is because, with a larger fusion radius, more sensors can be involved in the decision making for an overheating location. However, when the fusion radius is larger than 1 m, the average detection probability begins to decrease. This is because several distant sensors are included in the fusion region of monitored locations when the fusion radius is large, which distorts the fusion result. Among all three placement schemes, our scheme performs best, with an approximately 90% average detection probability when the fusion radius is 1 m. Uniformly Random performs the worst, with at most a 30% average detection probability.

VI. HARDWARE TESTBED RESULTS

In this section, we show the experimental results from our hardware testbed in the server room based on the CFD analysis with the simplified rack model. This set of testbed experiments is conducted on the same testbed as shown in the main file. Based on the simplified CFD server rack model, which consists of four blocks as explained in Section 4 of the main file, the testbed cluster has 8 blocks, i.e., 8 monitored locations in total for the two racks.

Fig. 7. Front and back sides of the server racks used in hardware experiments (the heater used to emulate an overheating server block is highlighted).
Fig. 8. Average detection probability with different temperature thresholds.
Fig. 9. Reported temperature when the heater is placed in different locations (temperature threshold is 3 °C).
To emulate the overheating scenario of each block, we place a small fan heater at the back of the rack to heat up the air around the monitored location, one location at a time, as shown in Figure 7. The labeled peak power usage of the heater is 1,500 W. Because the heater and a real overheating server block still have different thermal behaviors, the CFD results for the entire server room cannot be directly applied. Therefore, we calibrate the CFD model by using TelosB motes to collect the temperature data in the overheating scenarios. The calibrated model is then used to guide our placement algorithm to find near-optimal sensor placement solutions. We compare our scheme with a baseline called Uniform, which is similar to the uniform-distance placement strategy adopted in many real data centers [6], [7]. Four sensors are used for the two racks in both methods. In Uniform, two sensors are placed on each rack, with one at 1/4 of the rack height and the other at 3/4 of the rack height.

Figure 8 shows the average detection probability under different overheating temperature thresholds. Each point is the average of 2 repeated experiments. The results show that our scheme significantly improves detection performance under all thresholds due to its better sensor placement. When the threshold is 3 °C, the detection probability of our scheme is twice that of Uniform. Figure 9 shows the temperatures reported by the sensors when the heater is placed at each of the 8 monitored locations, for the case in which the temperature threshold is 3 °C. We can see that all the overheating scenarios are correctly detected (temperature exceeding the threshold) by our scheme, due to its sensor placement being optimized for maximum detection probability across all 8 monitored blocks. In contrast, Uniform, with its commonly used uniform sensor placement strategy, can only detect some of the overheating scenarios.

VII. CONCLUSIONS AND FUTURE WORK

This supplementary file presents additional details of the hot server detection model and the design of the intelligent sensor placement algorithm, as well as additional simulation and testbed results of the paper "Intelligent Sensor Placement for Hot Server Detection in Data Centers".

REFERENCES

[1] W. Feller, An Introduction to Probability Theory and Its Applications. John Wiley & Sons, Inc., 1968.
[2] S. V. Patankar, Numerical Heat Transfer and Fluid Flow. Hemisphere Publishing Corporation, New York, 1980.
[3] X. Wang, X. Wang, G. Xing, J. Chen, C.-X. Lin, and Y. Chen, "Towards optimal sensor placement for hot server detection in data centers," in ICDCS, 2011.
[4] J. Li and A. D. Heap, A Review of Spatial Interpolation Methods for Environmental Scientists. Geoscience Australia, Canberra, 2008.
[5] E. H. Isaaks and R. M. Srivastava, An Introduction to Applied Geostatistics. Oxford University Press, 1989.
[6] C. Bash, C. Patel, and R. Sharma, "Dynamic thermal management of air cooled data centers," in ITHERM, 2006.
[7] C. Bash and G. Forman, "Cool job allocation: measuring the power savings of placing jobs at cooling-efficient locations in the data center," in USENIX, 2007.