Sustainable Computing: Informatics and Systems

Sustainable Computing: Informatics and Systems 3 (2013)

Maximizing the detection probability of overheating server components with sensor placement based on thermal dynamics

Xiaodong Wang a,*, Xiaorui Wang a, Guoliang Xing b, Cheng-Xian Lin c
a The Ohio State University, USA
b Michigan State University, USA
c Florida International University, USA

Article info
Article history: Received 4 September 2012; Accepted 29 January 2013
Keywords: Data center; CFD; Sensor placement; Thermal monitoring; Overheating detection

Abstract

Server overheating has become a well-known issue in today's data centers that host a large number of high-density servers. The current practice of server overheating detection is to monitor the server inlet temperature with the temperature sensor on the server enclosure, or the CPU temperature with on-die thermal sensors. However, this is in contrast to the fact that different components in a server may have different overheating thresholds, which are closely related to their respective thermal failure rates and expected lifetimes. Moreover, the thermal correlation between the inlet (or CPU) and other server components can be different for every server model. As a result, relying on the single inlet or CPU temperature for server overheating detection is overly simplistic, which may lead to either degraded detection performance or false alarms that can result in excessive cooling power, leading to unnecessarily low inlet temperature. In this paper, we propose a model-based approach that leverages thermal dynamics to intelligently choose sensor placement locations for precise overheating server component detection. We first formulate the detection problem as a constrained optimization problem.
We then adopt Computational Fluid Dynamics (CFD) to establish the thermal model and analyze the thermal status of the server enclosure under various overheating conditions, such as inlet overheating, fan failures and CPU overloading. Based on the CFD analysis, we apply data fusion and advanced optimization techniques to find a near-optimal solution for sensor placement locations, such that the probability of detecting different overheating components is significantly improved. Our empirical results on a real rack server testbed demonstrate the detection performance of our solution. Extensive simulation results also show that the proposed solution outperforms other commonly used overheating monitoring solutions in terms of detection probability and error rate.

© 2013 Elsevier Inc. All rights reserved.

* Corresponding author. E-mail addresses: wangxi@ece.osu.edu, xwang43@gmail.com (X. Wang), xwang@ece.osu.edu (X. Wang), glxing@cse.msu.edu (G. Xing), lincx@fiu.edu (C.-X. Lin).

1. Introduction

In recent years, server overheating has become one of the most important concerns in large-scale data centers. Due to considerations such as real estate and integrated management, data centers continue to increase their computing capabilities by deploying high-density servers (e.g., blade servers). As a result, the increasingly high server and thus power densities can lead to some serious problems. First, the reduced server space may result in a greater probability of thermal failures for various components within the servers, such as processors, hard disks, and memories. Such failures may cause undesired server shutdowns and service disruption. Second, even though some components may not fail immediately, their lifetimes may be significantly reduced due to overheating. It is reported in [1–3] that the lifetime of an electronic device decreases exponentially with the increase of the operating temperature.
Finally, the generated heat dissipation can also lead to negative environmental implications. Therefore, it is important for each server component to run at a temperature below its overheating threshold. However, in today's data centers, how to precisely detect whether any component in a server is overheating remains an open question. The current practice of detecting and monitoring an overheating server can be divided into two categories. The first category is a coarse-grained approach that only uses the temperature at a proxy component, e.g., the CPU [4], or at a fixed location, e.g., the server inlet, for server overheating monitoring. This is in contrast to the fact that different components in a server may have different overheating thresholds, which are closely related to their respective thermal failure rates and expected lifetimes. Relying on a single

threshold at the server inlet or at the proxy component is therefore overly simplistic, because the thermal correlation between the inlet (or the proxy component) and each server component can be different for every server model. As a result, monitoring only the inlet temperature or a proxy component, such as the CPU, may lead to either missed detection of overheating for components other than the CPU, resulting in a degraded system lifetime, or false alarms that result in excessive cooling power to unnecessarily lower the inlet temperature. The second category of server thermal monitoring approaches assumes that each component has its own built-in thermal sensor. Extensive research [5–8] on server thermal management has recently been conducted based on this assumption. Unfortunately, today's high-density servers are not equipped with a thermal sensor on every component. In most servers, only the processors have on-die sensors, while some memory chips may also have built-in sensors. Therefore, it is important to provide a mechanism for measuring the temperatures of other components (e.g., hard disk, network chips), such that the previously proposed thermal management schemes can work effectively. More importantly, even if every component has its own thermal sensor, those sensors are used only for the control loops of those components in an isolated way. As a result, they cannot provide a system-level thermal picture that can help the fan system of the server and the cooling systems in the data center to efficiently cool down overheating components. Furthermore, low-end sensors used in server components commonly have measurement noise and hardware biases that may lead to failed detection or false alarms. Recent studies [9,10] have shown that the collaborative data fusion of multiple sensors can significantly improve the detection accuracy.
Therefore, it is preferable to have server-level thermal monitoring with multiple sensors that can precisely detect overheating components. In this paper, we propose to leverage the thermal dynamics in a server to intelligently place sensors for precise overheating server components detection. Our sensor placement solution features a model-based approach, which adopts Computational Fluid Dynamics (CFD) as a theoretical foundation to establish the thermal model and analyze the thermal status of the server enclosure under various overheating conditions. CFD is a powerful mechanical fluid dynamic analysis approach and is widely used to analyze the fluid dynamics in various engineering fields, such as aircraft engine design and thermal analysis for buildings. CFD has already been used by computer system packaging designers to make intelligent decisions on server component layout design, but not yet for sensor placement in the server box. While CFD-based thermal monitoring has shown promise, a key limitation of CFD is its high computation overhead. As a result, CFD cannot be effectively used to report thermal emergencies in real time. In this work, we propose to use CFD to analyze the thermal dynamics offline and then optimally place sensors based on the analysis results to conduct online overheating detection. Such an integrated approach can enable us to achieve the benefits of both the systematic modeling of thermal dynamics (from CFD), as well as online measurement calibration and fast responsiveness (from sensors). Our solution provides a way to equip external sensors on the existing servers deployed in data centers for more accurate overheating monitoring. The proposed solution can also be used on future servers to place more sensors on the motherboard during the design phase. 
In our integrated thermal monitoring solution, we first use CFD to model the thermal environment of a given rack server box under different overheating conditions, including inlet overheating, fan failure and CPU overloading. We then calculate the most correlated regions in the server box for each specific component by correlation analysis. Accordingly, for a given number of sensors, we seek to place them in the server box such that the overheating components can be detected with the maximum detection probability, while the error rate of the detection can be bounded. We formulate this problem as a constrained optimization problem. Based on the CFD analysis, we design a heuristic algorithm to find a near-optimal sensor placement solution. In our algorithm, we apply data fusion techniques to allow sensors to make collaborative detection decisions of server component overheating.

Specifically, the contributions of this paper are four-fold.

(1) While the current thermal monitoring solutions rely on simplistic sensor placement, i.e., a single sensor at the inlet or the CPU, we propose a novel sensor placement scheme to intelligently place sensors for maximized overheating detection probabilities of each server component of interest.

(2) We use CFD analysis as a theoretical foundation to design our proposed sensor placement scheme. Our CFD analysis models the thermal dynamics of a rack server box in various overheating scenarios, including inlet overheating, CPU overloading, and fan failure.

(3) We formulate optimal sensor placement as a constrained optimization problem and propose a heuristic algorithm to find a near-optimal solution. Temperature correlation analysis is conducted to find the most correlated regions for each server component.

(4) We evaluate our sensor placement scheme in a real-world rack server box. Both our empirical and simulation results demonstrate that our placement solution can significantly improve hot server detection performance.
The remainder of this paper is organized as follows. Section 2 highlights the distinction of our work by discussing related work. Section 3 presents the data fusion model, the formulation of the server overheating detection problem, as well as the temperature threshold setting for each component. Section 4 introduces the fundamentals of the Computational Fluid Dynamics approach and provides an example of how to model a rack server box. Section 5 elaborates on how to use the analytical results from CFD in our sensor placement problem and proposes a heuristic algorithm to solve the problem. In Section 6, we introduce our experiment methodology and then evaluate our sensor placement scheme using both simulation and experiments on a hardware testbed. In Section 7, we discuss an interesting variant problem formulation, as well as a potential application of our sensor placement scheme. Section 8 concludes the paper and discusses possible future work.

2. Related work

Thermal management for computer systems has been widely studied in the past. Skadron et al. have proposed a temperature-aware microprocessor management tool, HotSpot [11], which uses thermal resistances and capacitances to model the temperature of microprocessors. Performance and thermal behaviors of storage systems are extensively studied in [12], which identifies the knob for temperature optimization of high-speed disks. Lin et al. [8] have proposed a software thermal management scheme for DRAM memory, which has been implemented on real machines. However, few studies have been done on the joint thermal monitoring and management across different system components. Jeohwang et al. have modeled the thermal profile for an operating server system and a rack in [13] to provide a bridge between the individual component thermal status and the data center thermal profile.
A joint energy, thermal and cooling management technique (JETC) is proposed in [14] to optimize the cooling and operating energy for both CPU and memory. Different from all the previous work that addresses a single component individually, our work focuses on the

joint thermal monitoring of multiple components in a single rack server system. Data center thermal management has also attracted a lot of research effort. El-Sayed et al. [15] studied how to safely raise the operating temperature set point of the data center cooling system such that more cooling power can be saved. An automated, online, predictive thermal management scheme for data centers is also proposed in [16]. Workload scheduling according to the data center thermal profile has been studied in [17]. Another important aspect of data center thermal management, namely temperature and thermal prediction, has also been studied. A fast prediction framework for data center transient temperature is proposed in [18]. A prediction system based on online thermal sensor readings that reaches a fast and accurate data center temperature prediction is proposed in [19]. Chen et al. [20] proposed to combine both online sensor readings and CFD analysis results for data center temperature prediction. Although our work presented in this paper focuses on the thermal monitoring issue for overheating server components, it is actually complementary to all the above-mentioned datacenter-level thermal management studies. One of the goals of a datacenter-level thermal management system is to run the data center cooling system more efficiently. A higher risk of overheating is often introduced by such a cooling management system. With our thermal monitoring system at the server component level, overheating issues can be more efficiently monitored and captured. Sensors have been deployed to conduct thermal management in computer systems. The existing thermal management with sensors can be categorized into two classes. The first class is to deploy sensors in server rooms and large data centers for environment temperature monitoring.
For example, a hybrid wired and wireless sensor network is used in [21] for data center thermal monitoring. Sensors are also used in [9] to detect the overheating servers at the single system level. The second class is to deploy sensors inside or around different computer components for a specific component thermal monitoring. For example, the current CPU temperature thermal management schemes deploy on-die thermal sensors to monitor the CPU temperature at runtime [22]. Temperature sensor circuits have also been adopted in the DRAM design to provide thermal monitoring for memory chips [23]. Chip level thermal profile is also studied in [24] by using runtime temperature sensor readings. Our work is different from all the aforementioned research. We use Computational Fluid Dynamics (CFD) and temperature correlation of different components to guide sensor placement, such that the efficiency of the thermal emergency detection can be maximized. Different sensor deployment approaches for improved monitoring and detection performance have also been studied before. A sensor placement scheme based on the Multivariate Gaussian Process model is proposed in [25]. Though it provides informative monitoring results, an offline training stage before the actual deployment is required. This is not feasible for thermal monitoring of production server systems because thermal emergency should not be created for the collection of the training data. A fast sensor placement approach for fusion-based target detection is also proposed in [10] to minimize the number of deployed sensors while achieving assured detection performance. Different from the aforementioned work, we propose a new model-based sensor deployment approach, which leverages the theoretical computational results from CFD to maximize the detection performance of server component thermal emergency. 3. 
Overheating server component detection

In this section, we first introduce the detection model for overheating server components. We then formulate overheating server component detection as a constrained optimization problem. Lastly, we introduce how to set the overheating temperature threshold for each component.

3.1. Overheating component detection model

In the design of a computer system, it is always desirable to optimize the cooling efficiency of the system. However, due to the difference in functionalities and the variance in manufacturing processes, each component in the system usually requires a different safe operating environment temperature. Therefore, in order for the computer system to operate more efficiently and safely, the operating environment temperature of each component should be monitored separately based on its own requirement. Ideally, an individual thermal monitoring and cooling mechanism should be provided for each single component. For example, the current design of CPUs incorporates on-die thermal sensors, such that the temperature of the CPU chip can be monitored at runtime. Moreover, a heat sink is usually attached on top of the CPU chip to increase the air flow rate over the CPU, such that the cooling efficiency can be improved. Unfortunately, there is usually no such on-die sensor embedded in other components, such as memory chips and network chips. Therefore, new techniques are needed to monitor the operating environment of all the components, such that their overheating conditions can be detected and reported promptly. In this paper, we propose to place additional sensors into the computer system box to monitor the operating environment temperatures of all the components in the computer system. With all the components and cooling equipment running, the thermal environment inside a computer box is complex, which could cause more noise in the sensor readings.
Furthermore, the number of sensors that can be placed into a high-density server box is limited, as one wants to maximize the space utilization for all kinds of server components and avoid complex wiring and costly installation in the already compact server box. Thus, the additional sensor nodes added to the server box should collaborate with each other to maximize their utility. To address these challenges, we adopt data fusion [26], a widely adopted collaborative sensing technique, to jointly process noise data from multiple sensors. It is clear that temperatures at distant locations from a component are less likely to be correlated with the ambient temperature of that component. Therefore, we define a fusion region for each monitored component as a disc with a fusion radius R, where each monitored component is located at the center of that disc. The sensors within the fusion region of a monitored component should collaborate to make the overheating detection decision for that component. Moreover, because of the complex air flows inside the system, temperatures at different locations within the fusion region have different correlation with the ambient temperature of the monitored component. For example, based on the air flow direction, the temperatures at locations behind the CPU are more correlated with CPU ambient temperature, compared with the temperatures at the locations in front of the CPU. Therefore, we further define a correlation threshold Th(i, j) for each pair of location i and component location j. To contribute to the ambient temperature monitoring for component j, sensors should be placed at location i within the fusion radius of component j, where the correlation value should be larger than Th(i, j). To decide the ambient temperature at the monitored component location, we adopt a data fusion scheme which calculates the average temperature of all the reported temperatures from sensors that meet the above two criteria. 
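The two membership criteria and the averaging step described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the `Sensor` type, the `correlation` callable, and the scalar `corr_threshold` (standing in for Th(i, j)) are hypothetical names introduced here.

```python
import math
from dataclasses import dataclass


@dataclass
class Sensor:
    x: float        # sensor position inside the server box (hypothetical units)
    y: float
    reading: float  # reported temperature in degrees Celsius


def fused_temperature(sensors, component_xy, fusion_radius, correlation, corr_threshold):
    """Average the readings of sensors that satisfy both fusion criteria.

    correlation is a callable (sensor) -> correlation between the sensor
    location and the component's ambient temperature; it is assumed to be
    supplied by an offline analysis. Returns None when no sensor qualifies.
    """
    cx, cy = component_xy
    group = [
        s for s in sensors
        if math.hypot(s.x - cx, s.y - cy) <= fusion_radius  # inside the fusion disc
        and correlation(s) > corr_threshold                 # correlated strongly enough
    ]
    if not group:
        return None
    return sum(s.reading for s in group) / len(group)
```

A sensor that is close to the component but weakly correlated with it is excluded in the same way as a distant sensor, so only readings that pass both tests contribute to the fused value.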
We compare the average temperature value with a detection threshold $\eta_j$. If the average temperature is higher than the threshold, the component is declared to be operating in an overheating environment. The ambient temperature, $T_j$, of component $j$ can be derived from the temperature reading, $T_i$, at the location $(x_i, y_i)$ of sensor $i$. The

approach we use to derive the temperature $T_j$ is explained in Section 5.2. For now, we just denote this derivation as $T_j = f_j(T_i)$. Measurement noise is usually included in the sensor readings. Denote the measurement noise strength measured by sensor $i$ as $N_i$, which follows the zero-mean normal distribution with a variance of $\sigma^2$, i.e., $N_i \sim \mathcal{N}(0, \sigma^2)$ [25]. We assume that all the temperature sensors are identical, such that they follow the same measurement distribution. The final reported temperature for the location of component $j$ can be presented as

$$T_j = f_j(T_i) + N_i^2 \tag{1}$$

where $N_i^2$ is the noise in energy form. The noise is taken out from the transformation since it is additive to the real temperature readings. Assuming there are $n_j$ sensors within the data fusion group of a component at location $j$, the detection probability of the overheating component $j$ in a specific overheating scenario can be calculated as

$$P_{Dj} = P\left(\frac{1}{n_j}\sum_{i=1}^{n_j}\left(f_j(T_i) + N_i^2\right) > \eta_j\right) \tag{2}$$

where $\eta_j$ is the detection threshold of overheating for the component at location $j$. Because of the measurement noise from the sensor device, $\eta_j$ includes both the real temperature threshold for a component, denoted as $C_j$, and the measurement noise. With a high noise level from the measurement, a detection system is likely to report a false alarm when there is no real event. In our case, we define the false alarm rate when the environment of the monitored component is actually not overheating as follows:

$$P_{Fj} = P\left(\frac{1}{n_j}\sum_{i=1}^{n_j}\left(N_i^2 + C_j\right) > \eta_j\right) \tag{3}$$

We assume Gaussian noise, i.e., $N_i/\sigma \sim \mathcal{N}(0, 1)$. Therefore, $\sum_{i=1}^{n_j}(N_i/\sigma)^2$ follows the Chi-square distribution with $n_j$ degrees of freedom, whose cumulative distribution function is denoted as $\chi_{n_j}(\cdot)$. Hence, Eqs.
(2) and (3) can be modified as follows:

$$P_{Dj} = 1 - \chi_{n_j}\left(\frac{n_j\eta_j - \sum_{i=1}^{n_j} f_j(T_i)}{\sigma^2}\right) \tag{4}$$

$$P_{Fj} = 1 - \chi_{n_j}\left(\frac{n_j(\eta_j - C_j)}{\sigma^2}\right) \tag{5}$$

3.2. Problem formulation

We assume that there are $M$ components in a computer server, whose operating ambient temperatures need to be monitored. Given $N$ sensors ($N \le M$), we need to find the placement of these $N$ sensors such that we can detect the overheating emergency at any of the $M$ locations with the highest possible confidence. We assume $N \le M$ because it is preferable to place as few sensors as possible in the server box for thermal monitoring purposes, considering the complexity and high cost of the wiring design on the motherboard. Our goal is to maximize the average detection probability of all the monitored locations

$$\max \; \frac{1}{M}\sum_{j=1}^{M} P_{Dj} \tag{6}$$

subject to the following constraint

$$P_{Fj} \le \alpha, \quad 1 \le j \le M \tag{7}$$

where $\alpha$ is the tolerable detection false alarm rate bound. We note that the false alarm rate needs to be bounded in many practical scenarios in order to reduce the waste of system resources.

For a certain sensor placement, $P_{Fj} \le \alpha$ is a necessary condition in our problem. By Eq. (5), we convert the constraint in Eq. (7) to $\eta_j \ge \sigma^2\chi_{n_j}^{-1}(1-\alpha)/n_j + C_j$, a constraint for the detection threshold $\eta_j$ at monitored location $j$, where $\chi_{n_j}^{-1}(\cdot)$ is the inverse function of $\chi_{n_j}(\cdot)$. Using this equation, we can obtain the threshold that satisfies the false alarm rate bound while maximizing the detection probability. From Eq. (4) we know that $P_{Dj}$ decreases when $\eta_j$ increases. Therefore, to maximize the detection probability, we remove the inequality in the constraint and only use the lower bound. Hence, $\eta_j$ can be calculated as

$$\eta_j = \frac{\sigma^2\chi_{n_j}^{-1}(1-\alpha)}{n_j} + C_j \tag{8}$$

3.3. Component temperature threshold

Before solving the problem in Section 3.2, we need to set the overheating threshold for each component in the system.
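To see why the component threshold is a prerequisite, it may help to look at how Eqs. (4) and (8) of Section 3.2 would be evaluated once a value of $C_j$ is available. The sketch below is an illustration only: it uses a simple series for the Chi-square distribution function and bisection for its inverse (suitable for a small number of sensors and moderate arguments), and every numeric value a caller passes in is a hypothetical input rather than a figure from the paper.

```python
import math


def chi2_cdf(x, dof):
    """Chi-square CDF via the series for the regularized lower
    incomplete gamma function P(dof/2, x/2)."""
    if x <= 0.0:
        return 0.0
    a, z = dof / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    for n in range(1, 10000):
        term *= z / (a + n)
        total += term
        if term < total * 1e-15:
            break
    return min(1.0, total * math.exp(-z + a * math.log(z) - math.lgamma(a)))


def chi2_ppf(p, dof):
    """Inverse Chi-square CDF by bisection, for 0 < p < 1."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, dof) < p:   # grow the bracket until it contains p
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, dof) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def detection_threshold(c_j, n_j, sigma2, alpha):
    """Eq. (8): lowest threshold that keeps the false alarm rate at alpha."""
    return sigma2 * chi2_ppf(1.0 - alpha, n_j) / n_j + c_j


def detection_probability(mapped_temps, eta_j, sigma2):
    """Eq. (4); mapped_temps holds f_j(T_i) for the n_j fused sensors."""
    n_j = len(mapped_temps)
    return 1.0 - chi2_cdf((n_j * eta_j - sum(mapped_temps)) / sigma2, n_j)
```

As a consistency check on the sketch, feeding `detection_probability` mapped temperatures that all equal $C_j$ reproduces the false alarm bound $\alpha$, which is what Eqs. (5) and (8) imply.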
Among all the factors that contribute to the lifetime of semiconductor devices, operating junction temperature, i.e., the highest temperature inside the semiconductor device, is a critical deciding factor. With a higher junction temperature, devices tend to fail sooner. There has been research [1,11] studying the temperature-induced failure mechanisms of semiconductor devices. In most of the models studied, the operating junction temperature shows an exponential impact on the failure rate of a device, which is:

$$\text{failure rate} \propto \exp\left(-\frac{E_a}{kT_J}\right) \tag{9}$$

where $k$ is Boltzmann's constant, $8.6 \times 10^{-5}$ eV/K. $E_a$ and $T_J$ are the activation energy of electromigration and the operating junction temperature, respectively. The common activation energy for Al and Al with silicon is 0.6 eV. Hardware components from manufacturers often come with a warranty time. For example, both Intel and AMD sell their products with a three-year warranty package. Note that this warranty time indicates the time period during which the device should work properly without hard intrinsic failures, even running under extreme conditions within the specification. However, as a common practice, computer systems usually serve for a longer period of time than three years with upgrades to some components, such as adding new disks for larger storage space. To extend the working time, we need to lower the operating ambient temperature threshold of each component. Given the extended lifetime requirement $t'$ and the lifetime requirement $t$ under warranty, we can use Eq. (9) to calculate the new operating junction temperature threshold $T'_J$ as

$$\frac{1}{T'_J} = \frac{k}{E_a}\ln\left(\frac{t'}{t}\right) + \frac{1}{T_J} \tag{10}$$

In this work, we use sensors to monitor the temperature of the operating environment, which is the ambient temperature of a working component. The ambient temperature $T_A$ can be calculated using the junction temperature $T_J$ in Eq. (9) as

$$T_A = T_J - P\,\theta_{JA} \tag{11}$$

where $P$ is the operating power of the device and $\theta_{JA}$ is the junction-to-ambient thermal resistance [27].
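The derating in Eqs. (10) and (11) can be worked through numerically. The sketch below is a minimal illustration: the function and parameter names are hypothetical, temperatures are converted to kelvin before Eq. (10) is applied, and the more precise constant 8.617e-5 eV/K is used for $k$.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann's constant in eV/K


def derated_junction_temp_c(t_junction_c, warranty_years, target_years, activation_ev=0.6):
    """Eq. (10): junction temperature threshold (deg C) for an extended lifetime."""
    t_j_kelvin = t_junction_c + 273.15
    inv_new = (BOLTZMANN_EV_PER_K / activation_ev) * math.log(target_years / warranty_years)
    inv_new += 1.0 / t_j_kelvin
    return 1.0 / inv_new - 273.15


def ambient_threshold_c(t_junction_c, power_w, theta_ja_c_per_w):
    """Eq. (11): ambient temperature threshold from a junction threshold."""
    return t_junction_c - power_w * theta_ja_c_per_w


# 75 deg C junction limit, 3-year warranty, 7-year target lifetime.
new_limit = derated_junction_temp_c(75.0, 3.0, 7.0)
```

With a 75 °C junction limit, a three-year warranty and a seven-year target, `new_limit` comes out at roughly 61 °C. Converting it to an ambient threshold with `ambient_threshold_c` additionally requires the device power, which has to be taken from the component's datasheet.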
Based on all the above derivations and related values from data sheets of different components, we set the operating environment temperature threshold $C_j$ for component $j$ in our work by one of the following three methods:

(1) Directly taken from the datasheet. For some of the components in the computer system, the maximum operating environment temperature is listed in the datasheet or the manual. Fig. 1 is the platform used in our experiment. It is a 2U DELL rack server equipped with an AMD Opteron 2222SE Dual-Core processor. The maximum operating temperature listed on the datasheet for this type of CPU is 69 °C.

(2) Converted from the junction temperature threshold. For example, the maximum junction temperature and the junction-to-ambient thermal resistance for the Lattice ispMACH CPLD chip in our system are 75 °C and 41.8 °C/W, respectively. Applying Eqs. (10) and (11) with a lifetime requirement of 7 years, we can get the ambient threshold as 60 °C.

(3) For the unknown type of chips or the chips whose datasheets are not available, we use 43 °C, the default System Board Ambient Temperature setting required by OpenManage, DELL's server management tool.

Fig. 1. The DELL PowerEdge 2950 2U rack server used in our hardware testbed. The yellow boxes are the chips whose operating environment temperatures need to be monitored. The red dashed box in the lower picture highlights the front panel assembly of the server. The red dashed box in the upper picture highlights the temperature sensor used by the DELL server to monitor the temperature at the inlet. Besides the CPU and memory, the chips that need to be monitored for temperature are indexed and highlighted with yellow boxes. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of the article.)

3.4. Overheating detection probability maximization for combined overheating scenarios

In Section 3.2, we have formulated the detection probability maximization problem under a specific overheating scenario such as inlet overheating or CPU overloading. However, in practice, there is usually no means to know what kind of overheating scenario is going to happen at a future time. Therefore, it is important to prepare the system for multiple possible overheating scenarios. One simplistic way to achieve this goal is to consider every possible overheating scenario one by one and deploy sensors for every scenario. Although this approach does not require any change to our previous detection model, it can result in a large number of sensors if the number of overheating scenarios is large. This kind of monitoring system is not desirable because the space in the server box for deploying additional sensors is limited. To mitigate this problem, we propose to maximize the average detection probability across multiple different overheating scenarios. Assuming we have $K$ possible overheating scenarios, to get the average detection probability across multiple overheating scenarios, the overheating detection probability model in Eq. (2) needs to be modified as:

$$P_{Dj} = \frac{1}{K}\sum_{k=1}^{K} P\left(\frac{1}{n_j}\sum_{i=1}^{n_j}\left(f_j^k(T_i) + N_i^2\right) > \eta_j\right) \tag{12}$$

where $f_j^k(\cdot)$ is the temperature mapping from sensor location $i$ to component location $j$ in the overheating scenario $k$. Similar to Eq. (4), under the Gaussian noise assumption, we can transform Eq. (12) to the following equation:

$$P_{Dj} = \frac{1}{K}\sum_{k=1}^{K}\left(1 - \chi_{n_j}\left(\frac{n_j\eta_j - \sum_{i=1}^{n_j} f_j^k(T_i)}{\sigma^2}\right)\right) \tag{13}$$

Based on the above overheating detection probability model for overheating component detection under multiple overheating scenarios, we can formulate the probability maximization problem as:

$$\max \; \frac{1}{M}\sum_{j=1}^{M} P_{Dj} \tag{14}$$

subject to the same constraint as shown in Eq. (7). As shown in the experimental results presented in Section 6.5, this formulation leads to a smaller number of sensors to be placed in a server with the desired overheating detection probability.

4. CFD modeling for server box and components

In this section, we first introduce Computational Fluid Dynamics (CFD), the tool we use to analyze the thermal environment inside the server box.
We then provide an example to demonstrate how to model a server box and each of its components in practice using Fluent [28], a widely used CFD modeling software package.

4.1. CFD modeling

CFD is a fluid mechanics approach that analyzes properties of fluid flows based on numerical methods and algorithms. CFD analysis gives great insight into the flow pattern and distribution of a targeted environment. Compared with the traditional experimental method of studying the flow pattern distribution, such as using flow sensors, CFD has significant advantages. First, CFD can reach a high resolution in the space and time domains, while the traditional method usually can only study a limited number of points and time instants. Second, CFD can be applied to virtually any problem using realistic operating condition setups, while experimental methodology can only work on limited conditions and environments. Third, the scale of CFD simulation can cover a wide range, while the traditional method usually only works on a laboratory-scale model. The key for CFD modeling is to solve the governing transport equations represented in the following conservation law form:

$$\frac{\partial(\rho\phi)}{\partial t} + \frac{\partial(\rho U_j \phi)}{\partial x_j} = \frac{\partial}{\partial x_j}\left(\Gamma_{\phi,\mathrm{eff}}\frac{\partial\phi}{\partial x_j}\right) + S_\phi \tag{15}$$

where $\phi$ represents different parameters such as mass, velocity, temperature or turbulence properties; $\rho$ is the fluid (air) density; $t$ is the time for transient simulations; $x_j$ is the coordinate variable for x, y or z with $j$ being 1, 2 or 3; $U_j$ is the velocity in different directions; $\Gamma_{\phi,\mathrm{eff}}$ is the diffusion coefficient; and $S_\phi$ is the source for the particular variable. For example, when $\phi$ is the air temperature, $S_\phi$ stands for the volumetric heat rate from a source component. The four equation terms represent the transient, convection, diffusion, and source parts of the transport phenomenon in the spatial domain [29].
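To make the role of the four terms in Eq. (15) concrete, the following sketch advances a one-dimensional version of the equation by one explicit time step on a uniform grid. It is a toy illustration under strong assumptions (constant density, constant positive velocity, constant diffusion coefficient, first-order upwind convection) and is unrelated to how a production CFD package solves the full three-dimensional system; all names and values are hypothetical.

```python
def step_transport_1d(phi, inlet_phi, velocity, gamma, source, rho, dx, dt):
    """Advance the 1D form of Eq. (15) by one explicit time step.

    Convection uses first-order upwind differencing (velocity > 0),
    diffusion uses central differencing, the left boundary is a fixed
    inlet value and the right boundary is a zero-gradient outlet.
    For stability, dt * (velocity / dx + 2 * gamma / (rho * dx**2))
    should not exceed 1.
    """
    n = len(phi)
    new_phi = phi[:]
    for i in range(n):
        left = phi[i - 1] if i > 0 else inlet_phi
        right = phi[i + 1] if i < n - 1 else phi[i]
        convection = rho * velocity * (phi[i] - left) / dx
        diffusion = gamma * (right - 2.0 * phi[i] + left) / dx ** 2
        # transient term balances convection, diffusion and the source
        new_phi[i] = phi[i] + dt / rho * (-convection + diffusion + source[i])
    return new_phi
```

When `phi` is read as air temperature, a non-zero entry in `source` plays the part of a heat-dissipating component: repeated calls carry the added heat downstream with the flow while diffusion smooths it, which is the qualitative behaviour the convection, diffusion and source terms describe.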

The partial differential equations in Eq. (15) form a system in which all the transport equations are coupled and must be solved simultaneously. For a complicated environment such as a server enclosure, closed-form solutions for the airflow and heat transfer of the entire system are hard to find. Therefore, the most fundamental consideration in CFD is how to treat a continuous fluid in a discretized fashion, such that numerical methods can be applied to find the solutions. Most CFD software packages apply the control volume method to find numerical solutions.

4.2. Example of server box CFD modeling

Building a CFD model of a continuous fluid requires discretizing the spatial domain into small cells. One way to perform this discretization is to generate a volumetric grid. After the discretization, the necessary boundary conditions and suitable algorithms need to be applied to solve the transport equations above. Several popular software packages, such as Fluent, FLOTHERM, Flovent and Phoenics, can be used for CFD modeling. In our project, we use Fluent, a widely used CFD software package from ANSYS Inc., to perform the geometry meshing and solution finding.

The CFD model we establish in this example is for the DELL PowerEdge 2950 server box, shown in Fig. 1. In the first step, we use Gambit, a grid generator, to build the geometry of this server. We choose different geometric shapes and unite or split them to establish the geometric model of the entire server based on its measured dimensions. We then add further geometric shapes into the server box geometry to model the server components, such as the system fans and the CPU sink, according to their locations and corresponding dimensions.
After all components are added to the geometric model, we need to specify the different boundary types, such as the server walls, the fans, and the inlets/outlets of the server box. The last step is to divide the entire geometric model into smaller cells by applying geometry meshing in Gambit. The grid size is a user-specified parameter. A finer grid yields more accurate CFD modeling, but it also increases the computational burden in the following stage, when the transport equations are solved by numerical methods. We use 1 mm as the grid size to mesh the geometry. Although the CFD geometry model takes some time to generate because of the complicated component layout in the server box, it is a one-time effort that can be reused for the analysis of all the different overheating conditions of the same server, which is feasible for an offline sensor placement approach.

After meshing the entire server in Gambit, we export the grid to the second software package, Fluent, to solve the transport equations in Eq. (15). Fluent requires all the boundary conditions of our geometric model to be specified. For example, we need to specify the power dissipation of each heat-dissipating component, such as the CPU, memory, disk and all the other system chips. We also need to specify the inlet temperature and the system fan speed. After all the parameters are set up, the standard k-epsilon two-equation turbulence model is chosen to simulate the turbulent flow. Each simulation of one running condition takes about 20 min to finish. Fig. 2 shows a colored cross-section temperature map after solving the transport equations in Fluent. In this scenario, all the components run at the power settings specified in their datasheets.

Fig. 2. Colored temperature map (°C) of the DELL server running CPU intensive benchmarks. The small black boxes indicate all the chips whose temperatures need to be monitored. The large box in the middle is the CPU sink. The four vertical short lines in the middle represent the four system fans. The four horizontal thin blocks underneath the CPU sink represent the memory modules. The temperature of the memory closest to the CPU sink is also required to be monitored. The disk is on the left side of the graph.

5. CFD-guided sensor placement

In this section, we introduce how to use the results from the CFD analysis to guide sensor placement inside the server box, with the goal of maximizing the overheating detection probability for all the components. We then introduce a heuristic algorithm for solving this detection probability maximization problem.

5.1. Overview of our approach

Using CFD tools for sensor placement in the server box primarily involves two steps. In the first step, we establish a geometric model of the server box in Gambit, mesh the geometry, and export the grid to Fluent. We then measure the incoming air temperature and air flow rate at the inlet of the server. These measurements, along with the power consumption of each component and the fan speed, are the input parameters to Fluent. We repeat the first step while tuning the actuating parameter of the overheating scenario to obtain multiple CFD analysis results. For example, in an overheating scenario caused by inlet overheating, we set the inlet temperature to several different values and run the CFD analysis for each. Based on the CFD results with different inlet temperatures, we obtain the temperature correlation between any spatial location, defined by the CFD grid, and each component location. We also use the CFD data to obtain an approximation function for each pair of spatial location and targeted component location, such that the temperature at the targeted location can be calculated from the temperature at any highly correlated spatial location.
In the second step, we feed the results from the CFD analysis, including the overheating scenario temperature data and the correlation data, to our optimization algorithm to find the best locations for sensor placement. We assume that our sensor placement needs to monitor the temperature of the point above the center of each component's top face. To solve the placement problem efficiently, we develop our algorithm based on the Constrained Simulated Annealing approach [30]. The algorithm is explained in detail in the following sections.

5.2. Component ambient temperature function and correlation

In Section 3.1, we denote the temperature of the component at location j, as reported from sensor i, by the relationship $T_j = f_j(T_i)$. Because of the complex fluid dynamics and thermal distribution in the server box, the temperature at location i can be very different from the temperature at location j, even if the physical distance between the two locations is short. Therefore, we need a function mapping from $T_i$ to $T_j$, such that the temperature reading from the sensor at location i can be used to report the component temperature $T_j$. We use the CFD analysis results from the last section to derive this mapping. We first repeat the CFD analysis with different parameter settings. For example, in the inlet overheating scenario, the inlet temperature is changed between runs of the CFD analysis. Based on the temperature data from all the CFD runs, we establish a second-order polynomial model to approximate the relationship between any temperature $T_i$ and the component temperature $T_j$ as:

$$T_j = a_{j,i} T_i^2 + b_{j,i} T_i + c_{j,i} \qquad (16)$$

We have also introduced in Section 3.1 that our sensor placement scheme only places sensors at locations that have high temperature correlations with the monitored targets. Therefore, we use the same set of CFD data as in the above function approximation to calculate the spatial correlation between the temperature $T_i$ and the component temperature $T_j$. Pearson's correlation is a widely adopted metric [31] that measures the degree of association between two variables. Assuming that we have n sets of CFD data with different inlet temperature settings, we can calculate Pearson's correlation $r(T_i, T_j)$ by

$$r(T_i, T_j) = \frac{\sum_{k=1}^{n} (T_i^k - \bar{T}_i)(T_j^k - \bar{T}_j)}{\sqrt{\sum_{k=1}^{n} (T_i^k - \bar{T}_i)^2}\,\sqrt{\sum_{k=1}^{n} (T_j^k - \bar{T}_j)^2}} \qquad (17)$$

where $T_i^k$ is the temperature at location i in the k-th CFD data set and $\bar{T}_i$ is its mean over the n data sets. The polynomial function approximation and the correlation values are all inputs to the algorithm in the next section.

5.3. Sensor placement algorithm

Procedure 1. CFD-guided sensor placement (D)
Input: Sensor number N, component location lists x[K] and y[K], CFD data, correlation data r_data, overheating threshold list C[K]
Output: Placement solution D
1. for j = 1 to K do
2.   x[j]_min = x_j - R; x[j]_max = x_j + R
3.   y[j]_min = y_j - R; y[j]_max = y_j + R
4. end for
5. x_min = min_j(x[j]_min); x_max = max_j(x[j]_max)
6. y_min = min_j(y[j]_min); y_max = max_j(y[j]_max)
7. (P, D)
8.   = CSA(N, x_min, x_max, y_min, y_max, C[K], CFD data, r_data)
9. return D

Our goal is to find the optimal sensor placement locations in the server box to maximize the average overheating detection probability for all the monitored component locations. We propose to use a nonlinear programming solver based on the Constrained Simulated Annealing (CSA) algorithm [30]. CSA is an extension of the conventional Simulated Annealing algorithm for solving the global constrained optimization problem with discrete variables. Theoretically, CSA can reach a global optimal solution by converging asymptotically to a constrained global optimum with probability 1. However, a limitation of CSA is that its computational complexity grows exponentially with the number of variables and the size of the solution search space [30,10]. Therefore, before applying CSA, we first reduce the search space of the algorithm by calculating the plausible search space according to the component locations. In our sensor placement problem, we propose to use the sensors within the fusion range of a component location to collaboratively decide whether the operating environment temperature of that component is overheating. A sensor location is therefore plausible for a component only if it lies within the fusion range R of that component. We aggregate the plausible search spaces of all the components by finding the maximum and minimum possible x and y values of a sensor. The aggregated region is then used as the search space for the sensor placement algorithm. The pseudo code of this algorithm is listed in Procedure 1.

Fig. 3. Comparison at multiple locations in the server between temperature measurements on the testbed and CFD simulation results. The testbed runs the same CPU intensive workload as in Fig. 2.

Lines 1-6 calculate the plausible solution search region. Based on the CFD and correlation analysis, i.e., CFD data and r_data, lines 7-8 use the CSA solver to find the placement solution D that maximizes the detection probability P.
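The inputs to the placement algorithm can be illustrated with a short sketch: a least-squares fit of the second-order model in Eq. (16), the Pearson correlation of Eq. (17), and the aggregated search region computed by lines 1-6 of the procedure. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `fit_quadratic`, `pearson` and `search_region` are hypothetical, the fit uses plain normal equations, and the CSA solver itself is not shown.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equally long temperature series (Eq. 17)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = (math.sqrt(sum((x - mean_a) ** 2 for x in a))
           * math.sqrt(sum((y - mean_b) ** 2 for y in b)))
    return num / den


def fit_quadratic(t_i, t_j):
    """Least-squares (a, b, c) such that t_j ~ a*t_i^2 + b*t_i + c (Eq. 16)."""
    rows = [[t * t, t, 1.0] for t in t_i]
    # Augmented normal equations [A^T A | A^T y] for the three coefficients.
    m = [[sum(r[p] * r[q] for r in rows) for q in range(3)]
         + [sum(r[p] * y for r, y in zip(rows, t_j))] for p in range(3)]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= f * m[col][c]
    # Back substitution.
    x = [0.0, 0.0, 0.0]
    for p in (2, 1, 0):
        x[p] = (m[p][3] - sum(m[p][q] * x[q] for q in range(p + 1, 3))) / m[p][p]
    return tuple(x)


def search_region(components, fusion_range):
    """Bounding box of all per-component fusion ranges (procedure lines 1-6)."""
    x_min = min(x - fusion_range for x, _ in components)
    x_max = max(x + fusion_range for x, _ in components)
    y_min = min(y - fusion_range for _, y in components)
    y_max = max(y + fusion_range for _, y in components)
    return x_min, x_max, y_min, y_max


if __name__ == "__main__":
    # Made-up temperatures from five hypothetical CFD runs.
    sensor = [20.0, 25.0, 30.0, 35.0, 40.0]
    chip = [31.0, 37.5, 44.8, 52.9, 61.0]
    print(pearson(sensor, chip))
    print(fit_quadratic(sensor, chip))
    print(search_region([(10, 20), (50, 5)], 5))
```

A candidate location would then be kept only if its correlation with a component is high, and the fitted coefficients would be used to translate a sensor reading into an estimate of that component's ambient temperature.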
The algorithm then outputs the placement solution D.

6. Evaluation

In this section, we first validate our CFD model by comparing the CFD analysis results with real sensor measurements. We then introduce the experiment setup and the methodology used for the performance evaluation on our hardware testbed. After that, the overheating component detection performance is evaluated in both simulation and hardware testbed experiments in three individual overheating scenarios (inlet overheating, fan failure, and CPU overloading) and in a combined overheating scenario built from these three.

6.1. Model validation and experiment methodology

To validate our server model in the CFD analysis, we place 19 sensors into the server box. The server is placed in an isolated server room with a dedicated air conditioning system. We measure the temperature under a normal server running condition, in which the server runs the SPEC CPU2006 benchmarks at an average inlet temperature of 19.6 °C, with a 0.5 °C fluctuation caused by the air conditioning actuation. The measurements are taken when the server is in a stable thermal state, with the sensors placed in the closed enclosure. The sensors we use for the real temperature measurements are TelosB sensor motes [32]. We choose this type of sensor because we can collect the temperature readings wirelessly without opening the server enclosure. We note that our approach does not depend on a particular sensor type and can use either wired or wireless communication (though wireless sensors can be less intrusive to the already complicated server environment). Fig. 3 shows the comparison between the CFD analysis temperature results and the testbed measurement results.
We can see that the temperature difference between the CFD analysis and the real measurements is about 6.3% on average, which shows that our computational CFD result is sufficiently close to the real temperature measurements. If a smaller type of sensor is used, the difference can be further reduced.

We evaluate five different sensor placement strategies across all the experiments. CFD-guided sensor placement is the approach we propose in this work, which places sensors based on the analytical results from the CFD analysis. Chip Best is the placement resulting from a best-effort approach. To obtain this best performance, we first place sensors at all the exact chip locations in the overheating experiment, one for each chip, to collect the temperature data. Then, for a given number of N sensors (fewer than the M chips), we find, among all possible combinations, the combination of N locations that results in the best detection performance. Note that it is infeasible to use Chip Best in a real implementation, because it needs to test all the different combinations of sensor/chip pairing and select the best one. Unlike Chip Best, Chip Average calculates the average detection performance of all the possible combinations. Random is a simple heuristic strategy that places sensors randomly in the server box; its result is the average of 10 runs of random placement. Uniform Grid divides the server box into uniform-sized grid cells and places one sensor randomly in each cell.

Fig. 4. Server temperature map of a partial inlet overheating scenario. The red dashed boxes are the chips whose environment temperatures exceed their individual overheating thresholds. Triangles indicate the sensors placed by the CFD-guided approach when the given sensor number is four. The black crosses indicate the four sensors placed by the baseline Chip Best approach. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of the article.)

In all of our experiments, we evaluate the average detection probability and the error rate of the different placement approaches. The average detection probability is defined as the number of overheating chips that are detected divided by the total number of overheating chips. The error rate consists of both false alarms and missed detections. For all of our testbed results, we run each overheating experiment 10 times and calculate the average value of each performance metric.
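The metrics and the two chip-location baselines described above can be sketched as follows. This is a hedged illustration rather than the evaluation code used in the paper: the text does not state how the error rate is normalized, so dividing the number of false alarms plus missed detections by the number of monitored chips is an assumption made here, and the function names and the `score` callback are hypothetical.

```python
from itertools import combinations


def detection_probability(overheating, detected):
    """Detected overheating chips divided by all overheating chips."""
    if not overheating:
        raise ValueError("no overheating chips: the metric is undefined")
    return len(overheating & detected) / len(overheating)


def error_rate(monitored, overheating, detected):
    """False alarms plus missed detections, per monitored chip.

    Assumption: normalizing by the number of monitored chips is one
    plausible choice; the source does not specify the denominator.
    """
    false_alarms = detected - overheating
    missed = overheating - detected
    return (len(false_alarms) + len(missed)) / len(monitored)


def chip_best_and_average(chips, n, score):
    """Chip Best and Chip Average over all n-chip sensor locations.

    score is a caller-supplied function mapping a tuple of chip
    locations to a detection probability.
    """
    scores = [score(combo) for combo in combinations(chips, n)]
    return max(scores), sum(scores) / len(scores)


if __name__ == "__main__":
    monitored = {"a", "b", "c", "d"}
    overheating = {"a", "b", "c"}
    detected = {"a", "b", "d"}
    print(detection_probability(overheating, detected))
    print(error_rate(monitored, overheating, detected))
```

The exhaustive enumeration in `chip_best_and_average` also makes the cost of Chip Best visible: the number of combinations grows combinatorially with the number of chips and sensors.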
Simulation results are not averaged, because the CFD temperature results do not vary when the experiment settings remain the same.

6.2. Inlet overheating detection

In this subsection, we evaluate the detection performance under a partial inlet overheating condition. Partial inlet overheating is often hard to capture with the single inlet temperature sensor on the front panel assembly in Fig. 1. Ideally, one could adjust the air conditioning system in the room (e.g., reduce its blowing range) to emulate inlet overheating caused by the cooling system. However, because our access to the air conditioning system in the room is limited, we use a hair dryer to blow warm air into the server at the lower left corner of the front inlet to emulate partial inlet overheating in our testbed experiment. To calculate the spatial temperature correlation and the target temperature function, CFD analysis is conducted in different scenarios with different inlet overheating temperatures. As a result, the sensor placement solution computed by our algorithm can handle the dynamics of different inlet overheating scenarios, even though we test only a subset of those scenarios. Fig. 4 shows the temperature distribution of the server box under the highest partial inlet overheating temperature. We can see that 9 chips (red dashed frames in the figure) out of the 11 monitored chips are overheating in this scenario.

Fig. 5. Average detection probability of the proposed CFD-guided solution and the baselines in the inlet overheating case (simulation).

Fig. 5 shows the average detection probability in the partial inlet overheating scenario. We see that the CFD-guided approach has the highest overheating detection probability. Compared with Chip Best, CFD shows a maximum performance advantage of about 22% when the sensor number is 2.
This is mainly because a sensor placed by Chip Best at the exact location of one chip cannot always provide temperature monitoring for other chips, as chips are usually not placed close to each other. Although Chip Best may show acceptable overheating component detection performance when the number of sensors is large, this performance is hard to achieve without testing all the combinations of sensor locations for the given number of sensors. Without exhaustively testing all the combinations, one can only choose chip locations randomly, leading to the detection performance of the Chip Average scheme. We see that the CFD-guided placement outperforms Chip Average at all sensor numbers in the experiment, with a highest performance gain of 45% when the sensor number is 2. The other two baselines, Random and Uniform Grid, perform significantly worse than CFD-guided, Chip Best, and Chip Average, since they are only heuristic approaches.

To illustrate the difference between CFD-guided and Chip Best, a placement example with 4 sensors is given in Fig. 4. We see that the CFD placement does not place sensors on any of the chips. Instead, it places sensors in between chips, such that each sensor can cover more chips, leading to better detection results. Fig. 6 shows the average error rate in this scenario. We see that the CFD-guided placement shows significantly lower error rates than the other two chip-location placement schemes. This demonstrates that, with the analytical results from the CFD analysis, the placement can cover more targets, which leads to fewer missed detections.

Fig. 6. Average detection error rate of the proposed CFD-guided solution and the baselines in the inlet overheating case (simulation).

Fig. 7. Average detection probability of the proposed CFD-guided solution and the baselines in the inlet overheating case (testbed).

Figs. 7 and 8 show the detection probability and the detection error rate on the hardware testbed. We extract the sensor placement locations from the simulations and place all the sensors into the server box accordingly. Because of the limited space, we place only up to five sensors per scheme into the server box. Since we evaluate three different sensor placement schemes, the maximum number of sensors placed in the server at the same time is 15. From the results we see that the detection probability and the detection error rate on the hardware testbed match the simulation results well. Among the three schemes, CFD-guided shows the best detection performance and Chip Average the worst.

Fig. 9. Server temperature map in a scenario with single fan failure. The red dashed frames are the chips whose environment temperatures exceed their individual operating temperature thresholds. The black solid triangles indicate the sensors placed by the proposed CFD-guided approach when the given sensor number is two. The black crosses indicate the two sensors placed by the baseline Chip Best approach. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of the article.)

6.3. Fan failure detection

In this experiment, we conduct both simulation and hardware testbed experiments on a fan failure scenario. To ensure the safe operation of the system, we disable only a single fan in the system. To calculate the spatial temperature correlation and the target temperature function, several runs of CFD analysis with different fan speeds are conducted.
Similar to the inlet overheating scenario discussed above, our sensor placement solution can handle the dynamics of different fan failure scenarios, because the CFD analysis is conducted with different fan speeds. Fig. 9 shows the colored temperature map of the server with a single fan disabled. The missing line at one of the fan positions represents the failed fan. We see that 4 chips (marked with red frames) out of the 11 monitored chips are operating in an overheating environment.

The average overheating detection probability from simulation is shown in Fig. 10. We see that the CFD placement approach requires only two sensors to reach 100% overheating component detection for all four overheating locations, while Chip Best requires three sensors. The placements with two sensors by these two approaches are marked in Fig. 9. We see that the CFD placement tries to cover all the overheating chips in the right corner by putting only one sensor in the middle of the chips. Compared with Chip Average, CFD shows significantly better performance, with a 60% higher detection probability. As expected, the Uniform Grid and Random schemes perform much worse than the other placement schemes.

Fig. 10. Average detection probability of the proposed CFD-guided solution and the baselines in the scenario with single fan failure (simulation).

Fig. 11 shows the average error rate of the fan failure scenario in simulations. We see that, despite some random errors, CFD outperforms the other two baseline approaches. Chip Average shows the worst performance among the three approaches.

Fig. 8. Average detection error rate of the proposed CFD-guided solution and the baselines in the inlet overheating case (testbed).

Fig. 11. Average error rate of the proposed CFD-guided solution and the baselines in the scenario with single fan failure (simulation).

Fig. 12. Average detection probability of the proposed CFD-guided solution and the baselines in the scenario with single fan failure (testbed).

Figs. 12 and 13 show the detection probability and the detection error rate on the hardware testbed, based on the sensor placement locations extracted from the simulation. From Fig. 12 we see that CFD performs similarly to Chip Best, and both of them still outperform the Chip Average scheme significantly. Fig. 13 shows the average error rate in this fan failure case. We see that CFD performs just a little worse than Chip Best, but still much better than Chip Average. The degraded performance in this fan failure scenario is most likely caused by the model inaccuracy of the CFD analysis. Disabling a fan makes the thermal fluid dynamics more complex than in the other scenarios, leading to an increase in the modeling error. Note again that Chip Best is not feasible in a real implementation, because it needs to test all the different combinations of sensor/chip pairing and select the best one.

6.4. CPU overloading detection

In this section, we present the simulation results for the overheating scenario induced by CPU overloading. With the widely adopted DVFS technique, CPU power is well known to be a cubic function of CPU frequency [33]. By overclocking the CPU frequency to 1.5× the maximum value listed on the data sheet, a roughly 3× overloaded power consumption can easily be reached (1.5³ ≈ 3.4). Unfortunately, the platform we use in our hardware experiment does not support CPU overclocking. Therefore, we show only the simulation results in this section for the detection performance under 3× CPU overloading. To calculate the spatial temperature correlation and the target temperature function, several runs of CFD analysis with different CPU power settings are conducted.
Note again that our sensor placement solution is designed to handle the dynamics of different CPU overloading scenarios. Fig. 14 shows the colored temperature map for the 3× power CPU overloading scenario. Although the color pattern is quite similar to the result in Fig. 2, i.e., a normal run with the benchmark workload, it shows significantly higher temperatures than the normal run. The highest temperature reaches about 120 °C. Six of the 11 monitored chips are found to be working under overheating conditions. The placement results with three sensors are illustrated in Fig. 14 for both the CFD-guided placement and Chip Best. We see that the CFD placement places sensors in the middle of the cluster of overheating chips, such that more chips can be covered by the limited number of sensors.

Fig. 14. Server temperature map in the scenario of CPU overloading at 3× the power consumption listed on the data sheet. The red dashed boxes are the chips whose environment temperatures exceed their individual operating temperature thresholds. The black solid triangles indicate the sensors placed by the proposed CFD-guided approach when the given sensor number is two. The black solid crosses indicate the two sensors placed by the baseline Chip Best approach.

Fig. 15 shows the average detection probability in this CPU overloading scenario. We can see that the CFD placement consistently shows the best detection probability and outperforms both Chip Best and Chip Average. With a sensor number of 2, the detection probability of CFD is twice that of Chip Average. The average error rate of the component overheating detection with CPU overloading is shown in Fig. 16.
We see that the CFD placement outperforms both Chip Best and Chip Average for all numbers of sensors.

Fig. 13. Average error rate of the proposed CFD-guided solution and the baselines in the scenario with single fan failure (testbed).

Fig. 15. Average detection probability in the scenario of CPU overloading at 3× power.

Fig. 16. Average error rate in the scenario of CPU overloading at 3× power.

6.5. Detection performance in combined overheating scenarios

We have evaluated our sensor placement scheme in three individual overheating scenarios: partial inlet overheating, fan failure, and overheating under CPU overloading. As the type of overheating condition is usually unknown before it actually occurs, we need to prepare the system to monitor any of the possible overheating conditions. In this section, we evaluate the overheating detection performance of the different sensor placement schemes in a combined overheating scenario. The combined overheating scenario consists of the previous three individual overheating scenarios. We prepare the system by deploying sensors to monitor overheating components in any of the above three overheating conditions, based on the formulation in Section 3.4. Specifically, we use the CFD analyses from all three previous overheating scenarios as input and run our sensor placement algorithm, aiming to maximize the average overheating detection probability across the three overheating scenarios.

Fig. 17. Average detection probability in the combined overheating scenarios (simulation).

We conduct the evaluation first in simulation and then on our testbed. Figs. 17 and 18 are the simulation results, showing the average detection probability and the average error rate, respectively, for this combined scenario. We see that CFD has almost the same detection performance as the Chip Best approach. Compared with its detection performance in each of the individual overheating scenarios (as shown in the previous sections), CFD performs slightly worse in the combined scenario. This is mainly because the optimization algorithm needs to consider all the overheating scenarios at the same time and make tradeoffs between them. However, as discussed before, Chip Best needs to test all the different combinations of sensor/chip pairing and select the best one, which is infeasible in a real implementation. Compared with Chip Average, the CFD-guided approach still performs significantly better in both the detection probability and the detection error rate.

Fig. 19. Average detection probability in the combined overheating scenarios (testbed).
We then test the different sensor placement strategies on our testbed. The overheating detection probability and the detection error rate of the hardware experiment are shown in Figs. 19 and 20, respectively. As explained before, since we are unable to overclock the CPU to create a CPU overloading event, a single round of each experiment consists of two overheating scenarios, partial inlet overheating and fan failure overheating. From the results we see that the CFD placement performs slightly better than Chip Best. Both of them consistently perform better than the Chip Average placement. The hardware results differ slightly from the simulation results because of the deviation in the CFD modeling process.

Fig. 18. Average error rate in the combined overheating scenarios (simulation).

Fig. 20. Average error rate in the combined overheating scenarios (testbed).

7. Discussion

In this section, we first discuss a closely related problem, the sensor number minimization problem. We then discuss possible future work based on the overheating component monitoring system using our sensor placement scheme.

7.1. Sensor number minimization

The main design goal of this paper is to optimize the deployment locations of a given number of sensors to maximize the average overheating detection probability of all the major components within the server box. While probability maximization is important for overheating detection, it is sometimes also of interest to know the minimum number of sensors required to reach a targeted overheating detection probability, especially in our server box component overheating detection application. This is because, with all the existing components and wires, the space within the server box is usually very compact, and thus the space available for deploying additional sensors is usually limited. The framework proposed in this work for detection probability maximization can be easily modified to serve the purpose of sensor number minimization. To formulate the sensor number minimization problem, we add an additional constraint on the targeted detection probability. More specifically, the formulation is:

$$\arg\min_{(x_i, y_i),\ 1 \le i \le N} N \qquad (18)$$

subject to the following constraints

$$P_{Fj}(S_N) \le \alpha, \quad 1 \le j \le M \qquad (19)$$

$$P_{Dj}(S_N) \ge \beta, \quad 1 \le j \le M \qquad (20)$$

where $S_N$ is the list of locations of all the N sensors. To solve this problem, we can use the same algorithm proposed in Section 5.3. Basically, we need to find the smallest number of sensors that provides the required detection probability in the constraint of Eq. (20) and also meets the false alarm rate constraint in Eq. (19). Since this problem is essentially a variant of the proposed detection probability maximization problem and can be solved with a similar algorithm, we do not repeat the experiment results for it in this paper.

7.2. Other potential applications

We have shown that our proposed sensor placement scheme can be used to deploy sensors that monitor and detect overheating server components under an unknown overheating scenario, using combined overheating scenario monitoring. We now discuss how to integrate server-level thermal monitoring into another potential application, overheating root cause diagnosis. Although it is important to capture the overheating components, it is often more desirable to further determine the reason for the overheating. In other words, it is often more desirable to diagnose the root cause of the overheating phenomenon, such that actions, such as increasing the fan speed or lowering the inlet temperature, can be taken to correct the abnormal overheating behavior of the equipment.
To accomplish this goal, in addition to the temperature sensors, we can further deploy other types of sensors, such as flow and acoustic sensors, using the same sensor placement framework proposed in this work. With the additional types of sensors, we can further characterize the working behavior of each cooling-related component and condition, such as the server fans, the inlet flow speed, and the flow passage across the server. By characterizing and monitoring the working conditions of these components, we can determine whether they are working properly to provide the desired cooling capabilities. We plan to integrate sensor placement with overheating diagnosis in our future work.

8. Conclusions

Efficient thermal monitoring is critical for today's server systems to ensure safe operation and continuous service. It is also important for each server component to maintain a desirable service lifetime. However, the current practice of server thermal monitoring simply relies on either sensors placed at the server inlet or on-die thermal sensors that are available only on some components, such as the CPU, the memory, or both, which may lead to degraded overheating detection performance for certain components. In this paper, we have presented a novel solution that places additional sensors into the server box for overheating server component detection, based on CFD analysis of the thermal and fluid dynamics inside the server box. Our sensor placement scheme applies the Constrained Simulated Annealing algorithm with a reduced search space to find a sensor placement with maximized overheating component detection probability. Our solution also adopts data fusion techniques to make the overheating detection decision collaboratively, resulting in improved detection performance. We evaluate our CFD-based sensor placement strategy with a real-world 2U rack server in different component overheating scenarios.
Our results show that the proposed placement strategy achieves significantly better overheating detection performance than several well-designed baselines. Extensive simulation results also demonstrate the effectiveness of our CFD-guided sensor placement scheme.

Acknowledgements

This work was supported, in part, by the US National Science Foundation under grants CCF , CNS , CNS (CAREER Award), and CNS (CAREER Award), and by the US Office of Naval Research under grant N (Young Investigator Program).

Xiaodong Wang is currently a Ph.D. student in the Department of Electrical and Computer Engineering at The Ohio State University. Before joining The Ohio State University, he was a Ph.D. student at University of Tennessee, Knoxville. He is the recipient of the first Min Kao Fellowship of Electrical Engineering and Computer Science Department at University of Tennessee, Knoxville from 2007 to He also received the ESPN Graduate Student Fellowship and the Chancellor's Award for Extraordinary Professional Promise Award from University of Tennessee, Knoxville, in 2010 and 2011, respectively. He received his M.S. in Computer Engineering from University of Tennessee, Knoxville in 2009 and B.S. degree in Electrical Engineering from Shanghai Jiao Tong University, China, in In 2007, he worked at PDF Solutions Inc. as a Data Analysis Engineer.

Xiaorui Wang received the Ph.D. degree from Washington University in St. Louis in He is an associate professor in the Department of Electrical and Computer Engineering at The Ohio State University.
He is the recipient of the US Office of Naval Research (ONR) Young Investigator (YIP) Award in 2011, the US National Science Foundation (NSF) CAREER Award in 2009, the Power-Aware Computing Award from Microsoft Research in 2008, and the IBM Real-Time Innovation Award in He also received the Best Paper Award from the 29th IEEE Real-Time Systems Symposium (RTSS) in He is an author or coauthor of more than 60 refereed publications. From 2006 to 2011, he was an assistant professor at the University of Tennessee, Knoxville, where he received the EECS Early Career Development Award, the Chancellor's Award for Professional Promise, and the College of Engineering Research Fellow Award in 2008, 2009, and 2010, respectively. In 2005, he worked at the IBM Austin Research Laboratory, designing power control algorithms for high-density computer servers. From 1998 to 2001, he was a senior software engineer and then a project manager at Huawei Technologies Co. Ltd., China, developing distributed management systems for optical networks. His research interests include power-aware computer systems and architecture, real-time embedded systems, and cyber-physical systems. He is a member of the IEEE and the IEEE Computer Society.

Guoliang Xing received the B.S. degree in electrical engineering and the M.S. degree in computer science from Xi'an Jiao Tong University, China, in 1998 and 2001, respectively, and the M.S. and D.Sc. degrees in computer science and engineering from Washington University in St. Louis, in 2003 and 2006, respectively. He is an assistant professor in the Department of Computer Science and Engineering at Michigan State University. From 2006 to 2008, he was an assistant professor of computer science at City University of Hong Kong.
He is an NSF CAREER Award recipient in He received the Best Paper Award at the 18th IEEE International Conference on Network Protocols (ICNP) in His research interests include wireless sensor networks, mobile systems, and cyber-physical systems.

Cheng-Xian Lin is currently an Associate Professor in the Department of Mechanical and Material Engineering at FIU. His prior positions include Associate Professor at the University of Tennessee, Knoxville and Summer Faculty Fellow at the Air Force Research Laboratory in WPAFB. He earned his Ph.D. in Mechanical Engineering (Thermal Engineering) from Chongqing University, China. He has authored and co-authored over 150 papers in peer-reviewed journals and conference proceedings. His current research interests include Computational Fluid Dynamics, Heat Transfer, Thermal Management, Energy Efficiency and Renewable Energy in Built Environments. He is a member of the ASME and ASHRAE.