Prediction Is Better Than Cure: CFD Simulation For Data Center Operation


This paper was written to support and reflect a seminar presented at the ASHRAE Winter Meeting on January 21st, 2014, by Future Facilities.


Table of Contents

Introduction
Why Do Data Centers Fail to Achieve The Design Goals?
    Figure 1. Room Filled with Notional Equipment to Reflect Design Assumptions
    Figure 2. Uniform Load: All Cabinets 4.25kW/Cabinet (408kW Total)
    Figure 3. Uniform Airflow Requirement: 240 l/s/Cabinet (23 m³/s Total)
    Figure 4. All IT Equipment Operates in the ASHRAE Temperature Compliance Recommended Range
    Figure 5. Typical Enterprise Varied Equipment Configuration
    Figure 6. Cabinet Loads Vary from 0kW to 13.2kW
    Figure 7. ASHRAE Temperature Compliance Is Not Achieved for Varied Equipment
Do Core DCIM Tools Address This Loss In Performance?
Prediction Is Better Than Cure
    Simplifications In The Modeling Toolset
        Figure 8. Small Datacom Hall with Eight 5kW Cabinets
        Figure 9. Comparison of Elevation of Temperature in Cold Aisle
        Figure 10. Flow in the Raised Floor Predicted by RANS CFD (Left) and PFM CFD (Right)
        Figure 11. Flow from In-Row Coolers Predicted by RANS CFD (Left) and PFM CFD (Right)
    Simplifications/Assumptions When Creating the Model
        Figure 12. Temperatures When 1U Servers Are Above the Blade Center
        Figure 13. Temperatures When 1U Servers Are Below the Blade Center
Modeling for Operation
In Conclusion

Introduction

Advancing technology has brought us more powerful computer hardware for our data centers. In addition, it brings the opportunity to instrument and monitor the datacom halls to better understand where the IT load is and the resulting environment, at least in terms of air temperature at an array of selected locations. Instrumentation and monitoring is, without doubt, a step forward that helps operators in their quest to understand and control the energy use of their undoubtedly complicated asset. But is measurement enough?

Data center operators typically populate their white space slowly over time. They do so assuming that the design intent (characterised by very high-level parameters such as total IT kW and IT kW/cabinet or IT kW/sqft) is always valid. The problem with this assumption is that it does not reflect the reality of the operational configurations that will exist in the future. In essence, a data center designer has a challenging task: to design

- The infrastructure for a room of unknown and varying electronics
- With unknown and varying load
- To be placed in an ill-defined climate, the local climate varying throughout the year.

So, the designer can only consider:

- A range of design scenarios for simplified configurations (e.g. Day 1, 50% and 100% load), using generic assumed loads and selected ambient conditions
- A range of failure scenarios, to check that the design is resilient in the event of the unexpected.

One thing we know is this: the configurations considered by the designer are unlikely ever to occur in practice!

This is not to say that CFD used in design to assess the design based on these assumptions is futile. On the contrary, such an assessment is the key to checking, selecting and optimising the underlying concepts and strategies. Further, it allows sensitivity studies for variations in IT load density, IT equipment type and configuration, in order to avoid design flaws that are subsequently exposed by small deviations from the design assumptions during operation. However, such design assessment can never be considered a true prediction, because the configurations that will occur over time will be unique to that facility, and probably to a particular day.

Given that these conceptual design simulations cannot guarantee performance in normal operation, are the advances in measurement and monitoring the saving grace?

Why Do Data Centers Fail to Achieve The Design Goals?

First, we must answer the question: what factors cause a data center to fail? We have discussed that configurations may vary. Consider a room cooled by perimeter down-flow units supplying air via a raised floor to contained cold aisles. In a design scenario, the room with uniformly loaded racks/cabinets might look something like that shown in Figure 1.

Figure 1. Room Filled with Notional Equipment to Reflect Design Assumptions

The design assumption is that the power distribution and heat production in the room are uniform, Figure 2. Similarly, the assumption is that the airflow requirement of the IT equipment is uniform, Figure 3. A design scenario CFD simulation is run to determine whether sufficient cooling is delivered to each item of IT equipment, given a specified cooling capacity and control scenario. One way of measuring whether there is sufficient cooling for the IT equipment is to check ASHRAE temperature compliance, to see whether the conditions fall within the recommended range during normal operation. For the above scenario, the design cooling strategy achieves conditions for all equipment that satisfy ASHRAE temperature compliance for normal operation, as indicated by the cabinets being colored green, Figure 4.

Figure 2. Uniform Load: All Cabinets 4.25kW/Cabinet (408kW Total)

Figure 3. Uniform Airflow Requirement: 240 l/s/Cabinet (23 m³/s Total)

Figure 4. All IT Equipment Operates in the ASHRAE Temperature Compliance Recommended Range
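As a sanity check on these design numbers (our own arithmetic, not part of the original figures), the stated totals imply 96 cabinets, and a simple sensible-heat balance, assuming standard air properties, shows the temperature rise across each cabinet that the uniform airflow assumption implies:

$$
\dot{V}_{\mathrm{total}} = 96 \times 0.240\ \mathrm{m^3/s} \approx 23\ \mathrm{m^3/s},
\qquad
P_{\mathrm{total}} = 96 \times 4.25\ \mathrm{kW} = 408\ \mathrm{kW}
$$

$$
\Delta T = \frac{P}{\rho\, c_p\, \dot{V}}
= \frac{4250\ \mathrm{W}}{1.2\ \mathrm{kg/m^3} \times 1005\ \mathrm{J/(kg\,K)} \times 0.240\ \mathrm{m^3/s}}
\approx 14.7\ \mathrm{K}
$$

An air-side temperature rise of roughly 15 K is plausible for IT equipment, so the uniform load and airflow assumptions are at least mutually consistent.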

In practice, the equipment distribution installed in most enterprise data centers will not be a uniform, homogeneous layout. Consider a more typical installation of a variety of different equipment types, Figure 5.

Figure 5. Typical Enterprise Varied Equipment Configuration

The total power/heat load in the room is the same, but the distribution is non-uniform, with some cabinets/racks having no power and others more than three times the average power, Figure 6.

Figure 6. Cabinet Loads Vary from 0kW to 13.2kW

A similar variation applies to the airflow requirement. This non-uniformity and deviation from the conceptual configuration has consequences: some IT equipment is now receiving air that meets only the Allowable (orange) range on ASHRAE's temperature compliance scale, rather than the Recommended range. Further, some inlet temperatures fall outside both the Recommended and Allowable ranges, as indicated in red, Figure 7.

Figure 7. ASHRAE Temperature Compliance Is Not Achieved for Varied Equipment

In simple terms, configured in this way, the data center cannot be filled to capacity without risking a loss of availability. Consequently, to avoid the risk of equipment overheating, the management team will stop installing new equipment as soon as existing equipment starts to exhibit temperature warnings/alarms. The result? Capacity is lost.
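The compliance test just described is easy to express programmatically. The sketch below is a minimal illustration of ours, not code from the paper; it classifies a single inlet temperature against the recommended and Class A1 allowable dry-bulb envelopes (18-27°C and 15-32°C in the 2011 ASHRAE Thermal Guidelines), and the function names are our own.

```python
# Minimal sketch: classify an IT inlet temperature against ASHRAE
# thermal guideline envelopes (2011 dry-bulb values assumed; a full
# compliance check would also consider the humidity limits).

RECOMMENDED = (18.0, 27.0)   # degC, recommended range
ALLOWABLE_A1 = (15.0, 32.0)  # degC, Class A1 allowable range

def ashrae_compliance(inlet_temp_c: float) -> str:
    """Return the compliance band for a single inlet temperature."""
    lo, hi = RECOMMENDED
    if lo <= inlet_temp_c <= hi:
        return "recommended"    # green cabinets in Figure 4
    lo, hi = ALLOWABLE_A1
    if lo <= inlet_temp_c <= hi:
        return "allowable"      # orange cabinets in Figure 7
    return "non-compliant"      # red cabinets in Figure 7

# Example: the varied layout of Figure 6 produces a spread of inlets.
for t in (21.5, 28.0, 33.4):
    print(f"{t:5.1f} degC -> {ashrae_compliance(t)}")
```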

Do Core DCIM Tools Address This Loss In Performance?

In reality, many enterprise-scale data centers undergo changes on a daily basis, and the impact of change can result in equipment being put at risk in the way described. To manage these frequent changes to complex modern datacom environments, the industry has turned to Data Center Infrastructure Management, or DCIM for short.

A key aspect of DCIM is monitoring. Monitoring systems are becoming more prolific, largely because of improvements in price, availability and access to the data they provide. The data is used to provide specific alarms, but it also provides views of power and temperature throughout the IT hall. While some temperature data is given for the IT inlets (the IT equipment is what you really care about), many systems have sensors at locations other than the inlets to the IT equipment. In such a configuration, the sensors will not necessarily indicate that there is a problem at all. For example, when recirculation is the source of overheating and the limited number of sensors means that no sensor lies in the recirculation path, no alarm may ever be issued.

More important still, even if every inlet were monitored, the data provided by sensors can only tell you what is happening now or what has happened in the past. Critically, it does not tell you what will happen in the future when you make your next installation. In practice, the fact that equipment already installed is cool is no guarantee that there is sufficient air to cool new equipment. In simple terms, core DCIM tools such as monitoring simply look at the past or the present, and NOT the future.

Perhaps worse still, it is quite often the case that when a new installation is made, the adverse effect may be on items of IT equipment in another location entirely. As a result, previously installed IT equipment that has been operating satisfactorily may suddenly be adversely affected. Without any foresight, the first signs of this are equipment alarms indicating that the environment being experienced is close to the limits for hardware use. In fact, many data centers start to see thermal alarms from the IT equipment as early in their life as 60-70% of the design capacity. To avoid this lost capacity, deployment decisions need to be assessed in terms of their future impact.

Prediction Is Better Than Cure

Given that strategic measurements indicate only the symptoms, that is, they look backwards not forwards, the obvious approach is to use the same simulation tool used in design, CFD, as an extension of the design process into the operational configuration. The difference now is that, unlike the design scenario where CFD is used to model conceptual configurations, the model must consider the actual configurations, allowing for the as-built facility and infrastructure, the IT systems, and the deployment practices actually in use. Simulation represents the only practical way to predict the likely performance, short of building a mock-up or installing the equipment in test mode in the real facility. However, it is important to realise that the model must not only place the real equipment in the chosen locations, but must also reflect the actual installation and practices.

When using CFD as a prediction tool in operation, there are many details that, if ignored, may lead you to the wrong deployment decision. These fall into two categories:

1. Simplifications in the modeling toolset
2. Simplifications/assumptions when creating the model

Simplifications In The Modeling Toolset

Given the computational expense of traditional CFD tools, it is tempting to identify elements of the physics that can be ignored, and thus simplify and speed up the simulation process. Consider a small equipment room, Figure 8.

Figure 8. Small Datacom Hall with Eight 5kW Cabinets

The room is cooled by a perimeter down-flow unit distributing cool air via a raised floor plenum to two short rows of 5kW cabinets. The simulation was performed with and without thermal buoyancy accounted for (i.e. with and without the physics that hot air rises), Figure 9. It is clear that, even in the high flows produced in a data center, the changes in flow and temperature distribution in the cold aisle are significant, such that the IT equipment inlet conditions are very different.

Figure 9. Comparison of Elevation of Temperature in Cold Aisle

Now consider another alternative: using a simpler methodology for determining the flow. The examples below compare a traditional finite volume RANS (Reynolds Averaged Navier Stokes) solution and a potential flow solution. Figure 10 compares the flow predicted in the raised floor.

Figure 10. Flow in the Raised Floor Predicted by RANS CFD (Left) and PFM CFD (Right)

The streamlines show that using a potential flow solution (which, by its nature, does not conserve momentum) results in a very different airflow pattern, typically characterized by a lack of recirculation or separation in the flow. This results in a very different airflow distribution through the perforated tiles. In the body of the room the picture is similar: Figure 11 shows the flow from in-row cooling units.

Figure 11. Flow from In-Row Coolers Predicted by RANS CFD (Left) and PFM CFD (Right)

The failure to conserve momentum, or to predict separation and recirculation, results in very different flow patterns and, consequently, very different IT equipment inlet temperatures. In summary, to model real operational scenarios, it is important to include the full physics and conserve all variables.
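To make the two contrasts concrete, here is a textbook-level summary in our own notation, not equations taken from the paper. A full RANS model solves a momentum equation in which buoyancy can appear as a Boussinesq source term:

$$
\rho\left(\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\mathbf{u}\right)
= -\nabla p + \mu_{\mathrm{eff}}\,\nabla^{2}\mathbf{u}
- \rho\,\beta\,(T - T_{\mathrm{ref}})\,\mathbf{g}
$$

A potential flow model instead reduces the velocity field to the gradient of a scalar potential, with incompressibility giving Laplace's equation:

$$
\mathbf{u} = \nabla\phi, \qquad \nabla^{2}\phi = 0
$$

Because $\mathbf{u}$ is the gradient of a scalar, the flow is irrotational everywhere ($\nabla\times\mathbf{u}=\mathbf{0}$) and no momentum equation is solved, which is exactly why separation, recirculation and buoyancy-driven motion cannot appear in the PFM results of Figures 10 and 11.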

Simplifications/Assumptions When Creating the Model

Data centers are very complex; it is impossible to model them in full detail. The big questions are: what do we include, and how do we include it? There are many potential simplifications:

- Obstructions: cables, pipes, ...
- Cooling devices: controls, discharge/fan characteristics
- Airflow distribution: perforated tile characteristics, cable penetrations, containment details
- Equipment configuration: rack construction, installation details, operational characteristics, ...

In conceptual models, many of these are grossly simplified: for example, cables modeled as a distributed resistance. Another classic simplification is to lump all the IT equipment in a rack/cabinet together, capturing only the total heat load and bulk airflow. But is this reasonable when hoping to predict the true performance in operation?

Consider a rack/cabinet that is to have a blade center and three 1U servers installed. Does it matter how the two equipment types are arranged in this cabinet?

Option 1: The three 1U servers are placed on top of the blade center, Figure 12. The configuration is poorly blanked and hot air recirculates under the blade center. This results in recirculated air at over 27°C entering the IT equipment inlets.

Option 2: The alternative is to install the 1U servers at the bottom, with the blade center above, Figure 13. In this configuration, while there is still recirculation, the temperature of the recirculated air is less than 22°C.

Clearly, a simplified lumped model cannot capture both conditions; the detail is essential.

Figure 12. Temperatures When 1U Servers Are Above the Blade Center

Figure 13. Temperatures When 1U Servers Are Below the Blade Center
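The sensitivity shown in Figures 12 and 13 follows from a simple mixing relationship: the inlet temperature each device sees is a blend of cold-aisle supply air and recirculated exhaust. The sketch below is our own illustrative model, not the paper's; the recirculation fractions and temperatures are hypothetical numbers chosen only to reproduce the two outcomes described above.

```python
# Illustrative sketch (not from the paper): inlet temperature as a mix
# of supply air and recirculated exhaust. All numbers are hypothetical.

def inlet_temperature(t_supply_c: float, t_exhaust_c: float,
                      recirc_fraction: float) -> float:
    """Mixed inlet temperature for a device drawing a fraction of its
    airflow from recirculated exhaust rather than the cold aisle."""
    return (1.0 - recirc_fraction) * t_supply_c + recirc_fraction * t_exhaust_c

T_SUPPLY = 18.0  # degC cold-aisle supply (assumed)

# Option 1: 1U servers above the blade center; poor blanking lets hot
# exhaust (say 35 degC) recirculate strongly under the blade center.
print(inlet_temperature(T_SUPPLY, 35.0, 0.55))  # ~27.4 degC, over 27 degC

# Option 2: blade center above; the recirculation path is weaker and the
# entrained air is cooler, so the mixed inlet stays below 22 degC.
print(inlet_temperature(T_SUPPLY, 28.0, 0.35))  # ~21.5 degC
```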

Modeling for Operation

So, when we use CFD for operational deployment decisions, it is critical that we:

- Ensure the model includes the key details
- Survey/monitor the facility to check the model (modern monitoring lends itself to more frequent checking), i.e. perform a calibration
- Check whether the model reflects reality sufficiently well for engineering decisions. If it does not, we MUST review and update the model to complete the calibration.

Calibration is the process of measuring/monitoring in the data center to ensure that the modelling simplifications adopted still allow the model to predict reality. The calibration process warrants a separate document to describe it fully, and so is not fully documented here. However, for the purpose of this paper it is sufficient to recognise that calibration is not a matter of actions like fixing grille flows to measured values rather than predicting them, or any similar artificial adjustment of the model. Making such adjustments fundamentally flaws the predictive methodology, since by fixing the flows, the consequences of any change to the configuration can no longer be predicted for any other scenario where the flow may vary.

Calibration is instead the use of the measured/monitored data to establish whether or not the model of the current configuration predicts reality sufficiently accurately to be an adequate reference model on which to base consideration of future changes to the installation. If it does not, the measured/monitored data can be used as an indicator of where to refine the model representation to capture the necessary physics. Once calibrated, the model can be used to test new deployments in order to check their impact on Availability, Capacity and Efficiency.
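As a concrete, purely illustrative reading of this process, a calibration check amounts to comparing sensor readings against the model's predictions at the same locations and applying an engineering acceptance tolerance. The tolerance value, data layout and names below are our assumptions, not prescriptions from the paper.

```python
# Illustrative calibration check (our sketch, not the paper's method):
# compare measured sensor temperatures with the model's predictions at
# the same locations; accept the model as a reference only if the
# discrepancies fall within an engineering tolerance.

MEASURED = {"rack_A1_inlet": 21.3, "rack_B4_inlet": 26.8, "cra_return": 31.0}
PREDICTED = {"rack_A1_inlet": 20.9, "rack_B4_inlet": 24.1, "cra_return": 30.5}

TOLERANCE_K = 1.5  # assumed acceptance band, K

def calibration_report(measured, predicted, tol):
    """Flag sensor locations where the model misses by more than tol."""
    suspect = {}
    for loc, t_meas in measured.items():
        error = abs(t_meas - predicted[loc])
        if error > tol:
            suspect[loc] = error  # where to refine the model
    return suspect

suspect = calibration_report(MEASURED, PREDICTED, TOLERANCE_K)
if suspect:
    print("Model not yet calibrated; refine representation near:", suspect)
else:
    print("Model adequate as a reference for future-change studies.")
```

Note that the check never overwrites the model with the measured values; the data only indicates where the model's representation needs refinement, preserving its predictive ability.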

In Conclusion

- Data centers are prone to losing capacity compared with design, because no design can allow for the infinite number of potential installations;
- For effective operational predictive modeling, sufficient detail must be included in the model for successful calibration;
- Predictive modeling can and should be used alongside traditional DCIM to avoid availability, capacity and efficiency losses.