Application Performance for High Performance Computing Environments

Leveraging the strengths of computationally intensive applications with high performance scale-out file serving in data storage modules

Intel, Mellanox, Dot Hill for HPC Overview
OVERVIEW

Solving the world's toughest problems requires extreme data analysis. Medical researchers model and simulate genomic structures for the betterment of human health. Climatologists simulate and interpret weather patterns to better understand climate change and more accurately predict devastating storms. Simulations and analysis both consume and create data at immense scale. Computing requirements include massively parallel compute power, high bandwidth interconnects, and fast, scalable storage.

Consider the oil and gas industry as an example. The collection and examination of seismic data require robust and scalable processing and storage to provide analysis and reporting to a broad range of users, including geologists, geophysicists, and field personnel. New techniques in marine seismic acquisition and analysis (3D, 4C and 4D), along with evolving seismic technologies such as Time Lapse, Wide Azimuth and Full Azimuth data acquisition, are responsible for a sharp increase in the quantity of data that must be stored and processed. Key applications include HPC applications like seismic interpretation, reservoir characterization, prospect evaluation systems, and petrophysical analysis. Data storage systems must adapt to this dynamic landscape by offering solutions that provide density, scalability and performance, yet remain simple to manage and affordable. Dot Hill Systems, Intel and Mellanox have partnered to offer a massively scalable, high performance infrastructure for use in a broad range of high performance computing platforms.

THE CHALLENGE

Speed

When storage must be accessed by many computing nodes, such as in large analytical environments performing interpretation and characterization of geospatial information, the storage subsystem must deliver superior performance so that the system as a whole can deliver results in a timely manner.
Typical workloads involve many nodes or threads performing sequential access of very large files. While the workload of an individual node or thread is sequential in nature, the composite workload of many nodes or threads causes the underlying access pattern to become highly randomized. The storage subsystem must respond to this environment by maintaining high sequential throughput as the number of simultaneous streams increases.

Capacity

Geophysical data consumes tremendous amounts of storage capacity. Typical systems require hundreds of terabytes to petabytes of storage, with much of it needing to remain readily accessible for long periods of time. These requirements drive the need for storage solutions that provide high capacity, high density, low cost and low overhead.
Scale

To achieve greater speed and capacity, storage systems can either scale up or scale out. Scale-up architectures are ultimately limited by the monolithic design of the system. Scale-out architectures, on the other hand, offer far more opportunity to grow both capacity and performance.

Namespace

The objective of any storage subsystem is a single global namespace from which all storage can be accessed. This greatly simplifies management of the storage, since there is no need to manually balance and manage data among multiple storage namespaces.

THE SOLUTION

The Dot Hill Scale Out, High Performance File Serving solution consists of hardware components, software elements and infrastructure, all architected for balance and scalability. The components of the architecture have been sized and selected to match one another, and the solution is designed with growth in mind. Customers can begin with a modest deployment and then easily add storage modules as their needs expand.

Storage. The backbone of the solution is the AssuredSAN 4004 Storage Array. This workhorse includes many key features that allow it to deliver high performance and reliability. Adaptive caching technologies accommodate dozens of independent streams of data without degrading overall throughput, a critical feature for applications that depend upon reliable, high performance data streaming. In addition, the proven 99.999% availability of AssuredSAN products virtually eliminates downtime, allowing the datacenter to operate smoothly. Individual arrays support up to 96 Large Form Factor (LFF) disk drives; using 4TB 7K RPM SAS disk drives, a single system can provide 384TB of raw capacity.

File Serving. The Intel Enterprise Edition for Lustre* software, hereafter referred to as Lustre*, provides the high performance distributed file system and namespace component of the solution.
Lustre* is an increasingly popular solution in High Performance Computing environments because it enables bandwidth, performance and scaling well beyond the limits of traditional storage technology. In fact, Lustre* is the file system most widely used by the world's top 500 supercomputing sites. The Lustre* architecture consists of Object Storage Servers (OSS), Metadata Servers (MDS) and client nodes connected over a high speed network. Lustre* offers highly scalable network connectivity to Lustre* clients, with higher per-stream performance than NFS or CIFS. The Intel Enterprise Edition for Lustre* software includes a rich set of manageability components designed to simplify many of the tasks associated with configuring, deploying, tuning, managing and maintaining Lustre*.

Networking. Mellanox is an industry leader in high speed networking components and infrastructure. To realize the benefits of a high performance storage subsystem, it is important to couple it with a high speed network to the end clients. Mellanox InfiniBand solutions support the FDR 56 Gbps rate, the highest throughput and lowest latency interconnect available on the market today. In addition, Mellanox network solutions utilize Remote Direct Memory Access (RDMA) protocols to make more efficient use of compute resources. Using RDMA over InfiniBand, data transfer latencies can be reduced by over 90% and CPU efficiency can be elevated to as much as 96%.

Architecture. The solution comes together as described in Figure 1.
Storage Modules. One or more Storage Modules form the basis of the data repository. Each module consists of a Dot Hill AssuredSAN array and two industry standard x64 servers. The servers are connected to the array with common 12 Gbps SAS interconnects, and connect to the file serving network via high speed InfiniBand. These servers host Linux as the OS and the Lustre* File Server (OSS) component. Each server pair is clustered together for high availability. The maximum raw capacity of a single Storage Module is 1344TB (4 chassis of 56 drives with 6TB drives).

Metadata Module. One Metadata Module is required to provide the necessary file locking and file integrity of the distributed environment. This module consists of a pair of clustered servers, loaded with Linux and the Lustre* Metadata (MDS) component. The servers connect to a dedicated AssuredSAN array via SAS interconnects, and to the file serving network via high speed InfiniBand.

Client Connectivity. The client nodes in the network host the end user applications that process the data. These nodes include client software for Lustre* file serving, and connect to the network via high speed InfiniBand.

Networking. InfiniBand switches and adapter cards complete the solution. This infrastructure provides the high bandwidth streaming required in many HPC solutions in general, and in seismic analysis and processing within the Energy Industry in particular.
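To illustrate the client connectivity described above, a client node typically brings the file system online with a standard Lustre* mount over the InfiniBand fabric. The sketch below is a hypothetical /etc/fstab entry; the MGS host name (mgs1), LNet network name (o2ib) and file system name (fsname) are placeholders that depend on the actual deployment:

```
# Hypothetical /etc/fstab entry on a Lustre* client node.
# "mgs1" (the MGS host), "o2ib" (the LNet InfiniBand network) and
# "fsname" are placeholders for site-specific values.
mgs1@o2ib:/fsname  /mnt/lustre  lustre  defaults,_netdev  0 0
```

The _netdev option defers the mount until networking is up, which matters because the client must reach the MGS over InfiniBand before the file system can be mounted.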
Figure 1. Lustre* Solution Architecture

SOLUTION BEST PRACTICES

Each building block of the solution contributes to overall performance, and a successful configuration requires careful planning. In particular, the following steps will help you avoid common pitfalls and ensure success.
Storage Configuration

Redundant Cabling: The 4004 SAS controllers provide 8 SAS host ports across two active-active controllers in a single chassis. There are two key steps. First, ensure each OSS/MDS server is connected to the storage array with a minimum of 2 cables; this ensures storage resources are presented identically to each server in the OSS/MDS clustered pair. Second, ensure each 2-cable pair is balanced across the A and B controllers; this ensures that if one controller, or a path to a controller, fails, the OSS/MDS has an alternate path to the LUNs. Figure 2 depicts this layout.

Figure 2. Redundant cabling

Segregate metadata and object storage: The I/O patterns for metadata and object storage are quite different. Consider how Lustre* clients interact with the MDS and OSS. A Lustre* client must interact with the MDS before any file operations can occur. Client-to-MDS interactions consist of small transactions to open, close, create, and delete files. The MDS is the single coordination point, and a Lustre* client may not access a file on an OSS until it has coordinated via the MDS. The MDS therefore hits the storage controller with many small, random block transfers; high IOPS is the key requirement for an MDS. After gaining access to a file via the MDS, Lustre* clients interact with one or more OSSs to read and write files. Client-to-OSS interactions consist of large, 1MB transfers of file data. In contrast to the MDS, the OSS hits the storage controllers with many streams of sequential, large block transfers to read and write the customer's data; sustained throughput is the key requirement for an OSS. High IOPS and sustained throughput are competing I/O patterns, and at scale a single set of controllers could become a choke point. To ensure robust performance, use separate arrays for metadata and object storage.

o Metadata Arrays: Arrays for metadata should consist of faster rotating disks (10K/15K RPM) or SSDs.
These media sustain the high IOPS that the MDS requires.

o Object Storage Arrays: Arrays for object storage typically consist of higher capacity, slower-rotating disks, usually 7K RPM with capacities of 3, 4, or 6TB. Higher capacity drives provide more storage for the large file sizes produced in HPC environments; however, there may be instances where SFF 10K RPM 1TB drives are more appropriate in an environment requiring a balance of density and speed.

Disk groups/LUNs: In Dot Hill terminology, physical RAID sets are called disk groups. One or more LUNs can be carved out of a disk group and presented to hosts. The disk groups for object storage and metadata have different requirements, and thus require different configurations.

o Metadata Targets (MDT): MDTs must rapidly service random I/O requests for small pieces of information (essentially file records). Additionally, they are the single source of file information, so they need a high level of fault tolerance. RAID 10 provides outstanding I/O rates with excellent redundancy.

o Object Storage Targets (OST): A single Lustre* file can grow to 32 PB, so OSTs need a balance between capacity utilization and redundancy; RAID 6 provides both. An important consideration with RAID 6 is ensuring writes are as efficient as possible. The key data point is that Lustre* uses a 1MB block size. RAID 6 stripes data across the disk group: if N is the total number of disks in a RAID 6 disk group, each stripe consists of N-2 data disks and 2 parity disks, with parity calculated from the data disks in the stripe using XOR. You want a disk group configuration that splits the 1MB Lustre* block transfer evenly across all of the data disks in the disk group. This is called a full stripe write, and full stripe writes are critical to performance in parity-based RAID. To ensure full stripe writes, create disk groups whose data disk count is a power of 2. For instance, a RAID 6 disk group of 10 disks (8 data and 2 parity) with a chunk size of 128KB will split the 1MB Lustre* block size evenly across the 8 data disks.
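The full stripe arithmetic above can be sketched as a quick shell check. The values are the 10-disk RAID 6 example from the text; the script only verifies the geometry on paper, it does not inspect a real array:

```shell
# Verify that a RAID 6 geometry yields full stripe writes for
# Lustre's 1MB transfer size. Values are the example from the text:
# 10 disks total, 2 parity disks, 128KB chunk size.
disks=10
parity=2
chunk_kb=128
lustre_block_kb=1024                  # Lustre* 1MB block size, in KB

data_disks=$((disks - parity))        # N-2 data disks per stripe
stripe_kb=$((data_disks * chunk_kb))  # full stripe size, in KB

if [ "$stripe_kb" -eq "$lustre_block_kb" ]; then
  echo "aligned: $data_disks data disks x ${chunk_kb}KB = 1MB full stripe"
else
  echo "misaligned: ${stripe_kb}KB stripe forces read-modify-write"
fi
```

With 8 data disks and a 128KB chunk the check reports an aligned 1MB full stripe; an 11-disk group (9 data disks) would yield a 1152KB stripe, and every 1MB write would incur a partial-stripe read-modify-write penalty.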
o Management Storage Target (MGT): The MGT has much smaller needs than an OST; it can be as small as 100MB. However, it is an important component of the solution, so redundancy is still a requirement. Create the MGT on the same disk group used by the MDT: the MGT gains the performance and redundancy benefits of RAID 10, and the small capacity used by the MGT won't impact the MDT. An alternative configuration could dedicate a single RAID 1 disk group to the MGT.

o LUNs: Create one LUN on top of each disk group, and utilize the entire capacity of the disk group. The exception, noted above, is a single MGT LUN and MDT LUN combined on the MDT disk group. Give the LUNs descriptive names (e.g., OSS1_OST4) to ease management.

o Host access control: Direct attach SAS is typically used between OSS/MDS servers and the storage arrays; this avoids the cost of extraneous switches and the complex zone management typical of Fibre Channel. Use default mapping with the Dot Hill array; this gives any initiator that connects to the array full access to all LUNs across all ports on both controllers. It also ensures full multipath discovery from the OSS/MDS hosts, and simplifies overall management.

Server Configuration
Multipath: Multiple, redundant paths between the OSS/MDS hosts and the storage array ensure the Lustre* file system remains online in the event of a node or path failure. However, an incorrect multipath configuration can be a subtle source of poor performance, so it's critical to get it right. Dot Hill arrays comply with the SCSI-3 standard for Asymmetric Logical Unit Access (ALUA). ALUA-compliant arrays provide optimal and non-optimal path information to the host during device discovery, but the operating system must be directed to use ALUA. Here are the key steps:

1. Ensure the multipath daemon is installed and set to start at boot.
   a. Linux command: chkconfig multipathd on
2. Ensure the correct entries exist in the /etc/multipath.conf file on each OSS/MDS host. Create a separate device entry for the Dot Hill array. There are 4 key attributes that should be set:
   a. prio=alua
   b. failback=immediate
   c. vendor="DotHill"
   d. product=<the correct product ID>
      i. Run the Linux command multipath -v3 to obtain the exact vendor and product IDs.
3. Instruct the multipath daemon to reload the multipath.conf file, or reboot the server.
   a. Linux command: service multipathd reload
4. Verify that the multipath daemon used ALUA to obtain the optimal/non-optimal paths.
   a. Linux command: multipath -v3 | grep alua
   b. You should see output stating that ALUA was used to configure the path priorities. Example:
      Oct 01 14:28:43 sdb: prio = alua (controller setting)
      Oct 01 14:28:43 sdb: alua prio = 130

Linux I/O Tuning: The Linux OS can have a substantial impact on the performance of the Lustre* solution. The following tunable settings should be investigated to obtain optimal performance.

o Linux block I/O scheduler: There are 3 common schedulers.

Completely Fair Queuing (CFQ): CFQ is the default setting, and attempts to provide fairness in I/O scheduling via user defined classes and priorities.
The CFQ setting can interfere with the storage array's ability to read incoming I/O patterns and perform its own optimizations. Don't use this setting.

Deadline: Deadline scheduling attempts to provide guaranteed latencies. Latency measurements start at the time the I/O arrives at the scheduler (consider how many I/O hops happen before and after the scheduler). Furthermore, deadline scheduling performs I/O interleaving, temporarily blocking some I/O in order to combine batches in increasing logical block address (LBA) order. This scheduler can also interfere with the storage array's ability to read and optimize I/O patterns. Don't use this setting.
Noop: The Noop scheduler implements a simple first-in, first-out (FIFO) queue, leaving the storage controller free to perform its own I/O optimizations. Use this setting.

File system tuning: A number of tunable settings can be experimented with using the Linux blockdev command. One of them is the block device read-ahead size. The default read-ahead size is 256 sectors, i.e., 256 × 512-byte sectors = 128KB. Experiment with larger settings to determine whether they improve performance.

Benchmarking

Test each unit in isolation: While it's tempting to immediately run applications from the Lustre* clients, a more methodical approach will expose the sources of performance bottlenecks before the solution scales. Test first at the lowest layer, verify results are satisfactory, and then add another layer and test again. Sequentially add layers, testing and verifying results at each layer. Usually you'll see a slight drop in performance as each layer is added. An exception is when you start testing at the highest layers, the Lustre* OSS/MDS and clients: testing at these layers measures aggregated performance, so overall performance will look better.

1. Individual LUNs: The sgpdd-survey tool from the Lustre* I/O kit can be used to send I/O to the MDT/OST LUNs. Other tools include aio-stress and IOmeter. CAUTION: These tests are destructive, and should not be run against LUNs containing user data.

   Raw LUN I/O: Send direct I/O to each OST and MDT. Run I/O to one LUN at a time, and look for obvious differences in throughput and latency between like LUNs (i.e., compare all OST LUNs). Obvious differences may signal a single disk error within a particular disk group. Tip: Use the Linux raw command to bind a block device to a raw character device. This ensures no kernel/OS caches or buffers are used.

   Block I/O: Next, run the same tests against the same LUNs using the block device. You will see a slight performance drop because the OS is buffering.
Remember to choose a block device whose path flows through the owning controller. Again, compare I/O between LUNs, and look for obvious differences.

   Multipath I/O: Next, run I/O through the multipath device. Once more, compare I/O between LUNs, and look for obvious differences.

2. OSS/MDS layer:

   obdfilter-survey: This tool, from the Lustre* I/O kit, can be executed from each OSS server in local file system mode. This mode tests the local file system on each OSS that sits on top of the OSTs; it tests above the hardware, but below the RPC layer. CAUTION: Premature termination of an obdfilter-survey run could leave your file system in a non-pristine state.

   mds-survey: This tool, also from the Lustre* I/O kit, tests the MDS metadata layer. It allows the tester to specify a number of threads, and measures the latencies of file create, stat, and delete operations. CAUTION: Premature termination of an mds-survey run could leave your file system in a non-pristine state.

3. Lustre* clients:
   IOzone: IOzone can test many different I/O patterns against the Lustre* file system from a single client. It can also operate in distributed mode to test many clients simultaneously.

4. LNet self-test: LNet self-test provides two utilities that can be used to performance-test the Lustre* network. These tests should be run after all lower layers have been thoroughly tested and validated.

Utilize trusted solutions

Intel Enterprise Edition for Lustre* (IEEL): This package provides two main benefits.

o Ease of installation: Lustre* is complex; significant expertise with Linux, networking, and storage is required to successfully download and deploy the open source kit. The Intel Enterprise Edition software reduces this complexity. The Intel Manager for Lustre* (IML) software guides you through installation and configuration, two of the most difficult aspects of Lustre*. Standing up a Lustre* configuration is a much more attainable goal with Intel's software.

o Excellent management: IML provides useful graphs representing OST heat maps, I/O read/write patterns across OSTs, metadata patterns, and memory/CPU utilization on the OSS. A REST API allows remote scripting and monitoring using wget, curl, or Python.

Mellanox switch: Mellanox switches and HBAs provide an end to end, high speed, reliable network for HPC environments.

o Ease of installation: Our network was up and running in less than 30 minutes. We used a serial connection via one of the Lustre* clients to configure the initial IP address and turn on management. After that, we pointed a browser at the switch, turned on the subnet manager, and our network was up and running. Mellanox also provides a rich CLI for command line configuration and monitoring.

SOLUTION BENEFITS

Availability. Redundant, fail-over components provide high availability of the storage subsystem.

Protection. Enterprise class components and drives, along with RAID technology, protect valuable data sets.

Scalability.
Add Storage Modules as needed to expand to petabytes of usable capacity.

Performance. Individual Storage Modules offer 6 GB/s of throughput, which scales linearly as modules are added. InfiniBand network paths can deliver up to 56 Gbps of throughput to each compute node.

Neutrality. Avoid vendor lock-in by deploying open source solutions.

Manageability. The Intel Enterprise Edition for Lustre* software includes a rich set of management tools.

Namespace. The unified namespace offered by Lustre* eliminates the need to micromanage storage pools.
Capacity. Each individual Storage Module can be configured with as much as 1344TB of raw capacity. Scale out the solution with multiple Storage Modules to obtain petabytes of usable capacity.

SOLUTION COMPONENTS

Dot Hill AssuredSAN 4854 SAS RAID Storage Arrays for primary data
Dot Hill AssuredSAN 4524 SAS RAID Storage Array for metadata
LSI SAS9300-8e SAS HBAs
Industry standard x64 servers for file serving and metadata
Enterprise Linux
Intel Enterprise Edition for Lustre* software
Mellanox ConnectX-3 InfiniBand adapter cards
Mellanox InfiniBand switches

About Dot Hill Systems

Dot Hill has been delivering smart, simple storage solutions for 29 years and has shipped over 600,000 units worldwide. Dot Hill's solutions combine a flexible and extensive hardware platform with an easy to use management interface to deliver highly available and scalable SAN solutions. AssuredSAN arrays provide a high performance storage solution ideal for the large datasets and multiple compute nodes common to the oil and gas exploration industry. The AssuredSAN 4004 features 12Gb SAS host connections, 99.999% availability, and scales up to 384 terabytes in a single array. Visit Dot Hill at www.dothill.com and our partner portal at partners.dothill.com.

About Intel

Intel (NASDAQ: INTC) is a world leader in computing innovation. The company designs and builds the essential technologies that serve as the foundation for the world's computing devices. Visit Intel at lustre.intel.com.

About Mellanox Technologies

Mellanox Technologies is a leading supplier of end to end InfiniBand and Ethernet interconnect solutions and services for servers and storage. Mellanox interconnect solutions increase data center efficiency by providing the highest throughput and lowest latency, delivering data faster to applications and unlocking system performance capability.
Mellanox offers a choice of fast interconnect products: adapters, switches, software, cables and silicon that accelerate application runtime and maximize business results for a wide range of markets including high performance computing, enterprise data centers, Web 2.0, cloud, storage and financial services. Visit Mellanox at www.mellanox.com.

© 2014 Dot Hill Systems Corporation. All rights reserved. Dot Hill Systems Corp., Dot Hill, the Dot Hill logo, and AssuredSAN are trademarks or registered trademarks of Dot Hill Systems. All other trademarks are the property of their respective companies in the United States and/or other countries. 10.14.14