Cray DVS: Data Virtualization Service




Stephen Sugiyama and David Wallace, Cray Inc.

ABSTRACT: Cray DVS, the Cray Data Virtualization Service, is a new capability being added to the XT software environment with the Unicos/lc 2.1 release. DVS is a configurable service that provides compute-node access to a variety of file systems across the Cray high-speed network. The flexibility of DVS makes it a useful solution for many common situations at XT sites, such as providing I/O to compute nodes from NFS file systems. A limited set of use cases will be supported in the initial release, but additional features will be added in the future.

KEYWORDS: DVS, file systems, network

1. Introduction

The Cray Data Virtualization Service (Cray DVS) is a network service that provides compute nodes transparent access to file systems mounted on service nodes. DVS projects file systems mounted on service nodes to the compute nodes: the remote file system appears to be local to applications running on the compute nodes. This solves several common problems:

- Access to multiple file systems in the data center
- Resource overhead of file system clients on the compute nodes
- File system access scaling to many thousands of nodes

Lustre is the standard file system on the Cray XT platform and is optimized for high-performance parallel I/O [Sun 2008]. Cray's Lustre implementation is intended to be used for temporary scratch storage. Most sites have one or more external file systems that contain more permanent data. Rather than copying data between file systems, DVS can be used to provide compute-node access to that data.

DVS was developed with high-performance I/O in mind. The DVS client is very lightweight in its use of system resources, minimizing its impact on the compute-node memory footprint and its contribution to OS jitter. The protocol used in the DVS communication layer takes advantage of the high-performance network.

Most file systems are not capable of supporting hundreds, thousands, or tens of thousands of file system clients. With DVS, there may be thousands of DVS clients that connect to a small number of DVS servers; the underlying file system sees only the aggregated requests from the DVS servers.

2. Architecture

The Cray XT series of supercomputers contains compute nodes and service nodes.

[Figure 1. Node Specialization: a compute partition of compute nodes and a service partition of login, network, system, and I/O nodes]

Compute nodes do not have direct support for peripheral devices, such as disk storage, and run a limited number of operating system services. Service nodes have adapter slots for connections to external devices and run a full Linux distribution. All nodes in the system are connected through a high-speed network. This division of nodes and OS functionality helps minimize OS jitter [Wallace 2007] and maximize application performance.

DVS is composed of two main components: the DVS client, which runs on compute or service nodes, and the DVS server, which runs on I/O service nodes. The file systems configured on the DVS server are projected to the nodes running the DVS client. The DVS client is lightweight. DVS is not a file system; it packages up client I/O requests and forwards them to the DVS server. When DVS is configured without data caching, file data does not compete with the application for compute-node memory.

The DVS server component provides a distribution layer on top of the local file system or file systems. The local file system can be located on direct-attached disk, such as the Linux ext3 or XFS file systems, or it can be remotely located when mounted on the service node via a service like NFS or GPFS.

DVS can be configured in several ways to suit various needs. The table in Figure 2 describes the main possible configurations:

                                                  Serial        Parallel
  DVS with Simple File System (ext3, NFS)         Serial DVS    Parallel DVS
  DVS with Cluster File System (Lustre, GPFS)     Serial DVS    Clustered Parallel DVS

  Figure 2. DVS Configurations

- Serial DVS means that a single DVS server projects a file system to all clients using that file system.
- Parallel DVS means that multiple DVS servers each project a portion (stripe) of a file system to all clients using that file system.
- Clustered Parallel DVS means that multiple DVS servers project a common cluster file system to all clients using that file system.

These configurations are described in more detail below.

2.1 Serial DVS

[Figure 3. Serial DVS]

A simple Serial DVS configuration is illustrated in Figure 3. There are multiple DVS clients, each communicating with a DVS server that projects a file system mounted on that server. The file system could be a simple (non-clustered) file system such as XFS, or it could be a remote file system mounted via NFS. Multiple file systems can be projected to the clients through one or more DVS servers. For example, two separate NFS file systems could be made available to the clients either by mounting both on a single DVS server or by using two DVS servers, each mounting one of the NFS file systems. Of course, the approach using multiple servers will have better performance.
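As a concrete illustration of projecting a server-side file system to a client node, the sketch below issues a Serial-DVS-style mount with the Linux mount(2) call. This is only a hedged example: the "dvs" file system type and the option names used here (path= for the path on the DVS server, nodename= for the server node) are assumptions made for illustration, not the documented Cray mount syntax, and in practice such mounts would normally come from the system boot configuration rather than from application code.

/* Hypothetical sketch of a Serial DVS client mount (illustrative names only). */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* Project the NFS file system mounted at /nfs_home on the (assumed) DVS
     * server node "nid00008" onto /home on this compute node. */
    if (mount("/nfs_home", "/home", "dvs", 0,
              "path=/nfs_home,nodename=nid00008") != 0) {
        perror("mount dvs");
        return 1;
    }
    return 0;
}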

2.2 Clustered Parallel DVS

[Figure 4. Clustered Parallel DVS: multiple DVS servers in front of a cluster file system (Lustre OSS and MDS nodes)]

A Clustered Parallel DVS configuration is illustrated in Figure 4. In this arrangement multiple DVS servers project a single clustered file system to the clients. The clients can spread their I/O traffic among the servers using a deterministic mapping (see the sketch at the end of this section), and the DVS servers talk to the cluster file system. It is also possible to configure the clients to use a subset of the available servers. This may be useful to balance the client load on the servers when there are a large number of clients.

One benefit of Clustered Parallel DVS is that it reduces the number of clients that communicate with the backing file system. This is important for file systems such as GPFS, which only supports a limited number of clients. Another benefit of using DVS on top of a cluster file system is that coherency and lock traffic is eliminated, because any given file is always accessed through the same file system client (the DVS server).

2.3 Parallel DVS

[Figure 5. Parallel DVS]

A Parallel DVS configuration is illustrated in Figure 5. In this configuration, multiple DVS servers are each backed by a simple file system. There are two ways to operate a Parallel DVS configuration: whole-file and striped-file. In whole-file mode, files are distributed among the servers; any given file is located on one of the servers. Whole-file mode spreads the I/O load among the servers. In striped-file mode, DVS divides the I/O into stripes that are transferred to the servers and written to the backing file systems; when reading data, the client reassembles the stripes transferred from each server. Striped-file mode spreads the I/O across multiple storage devices for parallel data movement. Parallel DVS with striped files can achieve very high performance when configured with many servers.
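Both the deterministic mapping of Clustered Parallel DVS and the whole-file placement of Parallel DVS amount to choosing one server per file in a way that every client can compute independently. The sketch below shows one simple way this could be done; it is purely illustrative, since the paper does not specify Cray's actual mapping, and the hash function, server list, and node IDs are assumptions.

/* Illustrative per-file server selection: hash the file's inode number onto
 * the subset of DVS servers this client is configured to use, so all clients
 * pick the same server for the same file without any coordination. */
#include <stdint.h>
#include <stdio.h>

static unsigned pick_server(uint64_t inode, const unsigned *servers, unsigned nservers)
{
    /* Simple multiplicative hash of the inode number (assumed scheme). */
    return servers[(inode * 2654435761ULL) % nservers];
}

int main(void)
{
    /* This client is configured to use 4 of the available DVS server nodes. */
    const unsigned my_servers[] = { 12, 13, 14, 15 };

    for (uint64_t inode = 100; inode < 105; inode++)
        printf("inode %llu -> DVS server nid %u\n",
               (unsigned long long)inode, pick_server(inode, my_servers, 4));
    return 0;
}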

3. Implementation

DVS has two main components: the DVS client and the DVS server. The DVS client fits under the Linux Virtual File System (VFS) on the application node; the DVS server fits over the VFS on the service node; the client and server communicate over the Cray SeaStar high-speed network.

[Figure 6. DVS Software Stack: on the compute node, the user application sits above the Linux VFS, the DVS client, Portals, and the SeaStar driver; on the service node, the DVS server sits above the Linux VFS, NFS or a local file system, Portals, the SeaStar driver, and a NIC driver connecting to the external network]

The main software components are shown in Figure 6. Applications perform I/O through the Linux VFS. The VFS allows Linux to support many file systems through a common interface. Applications use standard Linux I/O calls, such as open(), read(), and write(); the VFS vectors those I/O requests to the appropriate file system code. Under the VFS is the DVS client. The DVS client is a Linux kernel module that appears as a file system to the Linux VFS. The client packages up the I/O requests into messages that are sent across the network to the DVS server. The DVS communication subsystem is layered on the transport mechanism of the host system. On Cray XT systems an RDMA interface to the Portals protocol is used to interface with the SeaStar network [Brightwell 2005]. Large requests, such as those greater than 1 MB, are handled efficiently.

The DVS server also operates in Linux kernel space. It communicates with the DVS client and performs I/O through the Linux VFS. Any file system mounted via the Linux VFS can be used by DVS. Figure 6 shows NFS, but it could have been ext3, XFS, or any other Linux file system. Metadata, such as file names, access times, etc., are managed by the underlying file system.

In a Parallel DVS configuration with striped files, DVS manages the striping of data across multiple servers. This operation is illustrated in Figure 7. The application reads and writes a stream of data. The DVS client divides the stream into stripes and sends the stripes to the appropriate DVS server to be stored in the local file system. On reading data, the client requests the stripes from the servers and reassembles the stream.

[Figure 7. Parallel DVS I/O: the application's I/O stream (S0-0, S1-0, S0-1, S1-1, S0-2, ...) is split into per-server streams (S0-0, S0-1, S0-2, ... and S1-0, S1-1, ...)]

A few items to note:

- Striping is a fixed mapping computed by DVS without any metadata.
- The server holding the starting stripe is selected by a hash to avoid hot-spots.
- DVS server nodes do read-ahead and write aggregation in parallel.
- Files on the local file systems contain only partial data: a DVS client is needed to reassemble the stripes into whole files.

Although DVS has the ability to stripe files across multiple servers to perform parallel I/O, it is not a cluster file system. It does not maintain any metadata. It does not perform any locking. It does not do any client cache coherency. All disk allocation and disk I/O is performed by the underlying file system. This simplicity allows DVS to scale and to achieve high performance.
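The striping rules listed above describe the scheme only at a high level; the sketch below illustrates the kind of fixed, metadata-free mapping they imply. The stripe size, hash function, and layout of stripes within each server's backing file are assumptions made for this example, not Cray's actual implementation.

/* Hypothetical sketch of a DVS-style deterministic striping map: the server
 * holding stripe 0 is chosen by hashing the file ID (to avoid hot-spots), and
 * successive stripes rotate round-robin across the servers, so any client can
 * compute the mapping with no metadata lookup. */
#include <stdint.h>
#include <stdio.h>

#define STRIPE_SIZE (1024 * 1024)   /* assumed 1 MB stripe */

/* Hash the file identifier to pick the server holding stripe 0. */
static unsigned starting_server(uint64_t file_id, unsigned nservers)
{
    return (unsigned)((file_id * 2654435761ULL) % nservers);
}

/* Map a byte offset within the file to (server index, offset within that
 * server's backing file). */
static void map_offset(uint64_t file_id, uint64_t offset, unsigned nservers,
                       unsigned *server, uint64_t *server_offset)
{
    uint64_t stripe = offset / STRIPE_SIZE;
    unsigned start  = starting_server(file_id, nservers);

    *server = (unsigned)((start + stripe) % nservers);
    /* Each server stores every nservers-th stripe contiguously. */
    *server_offset = (stripe / nservers) * STRIPE_SIZE + (offset % STRIPE_SIZE);
}

int main(void)
{
    unsigned server;
    uint64_t soff;

    /* Example: file id 42, 4 DVS servers, offsets spanning three stripes. */
    for (uint64_t off = 0; off < 3ULL * STRIPE_SIZE; off += STRIPE_SIZE) {
        map_offset(42, off, 4, &server, &soff);
        printf("file offset %8llu -> server %u, server offset %llu\n",
               (unsigned long long)off, server, (unsigned long long)soff);
    }
    return 0;
}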

4. Current Results

DVS has been running in field trials at several sites over the past six months. It has demonstrated scaling to over 7000 client nodes projecting an NFS file system from a single DVS server. Obviously, the bandwidth available to these clients is restricted by the single communication channel, but this illustrates the capability of DVS to operate at a scale that is out of reach for NFS.

The following charts are recent performance tests of DVS. Tests were performed using the bringsel benchmark [Kaitschuck 2007]. The test system was a Cray XT4 running Unicos/lc 2.1.07 pre-release software. The storage hardware is a DDN S2A8500 RAID connected to the system via 2 Gb/s Fibre Channel links. The DVS servers on the XT4 service nodes were mounted with ext2 file systems. These directly-mounted file systems were used to eliminate the overhead of NFS from the measurements.

[Figure 8. DVS Server Performance]

The chart in Figure 8 shows the performance of the DVS server performing I/O to the mounted ext2 file system. This measures I/O without the DVS client or the network. The server achieves approximately 160 MB/s for data sizes ranging from 4 KB to 1 MB. This transfer rate is close to the limit of the 2 Gb/s Fibre Channel connection.

[Figure 9. Single-Client DVS Performance]

The chart in Figure 9 shows the performance of a single client performing I/O that is forwarded to the DVS server and on to the mounted ext2 file system. I/O performance runs from approximately 60 MB/s for small 4 KB blocks up to approximately 130 MB/s for block sizes of 128 KB and larger.

[Figure 10. Multi-Client DVS Performance]

The chart in Figure 10 shows the performance of multiple clients forwarding to a single DVS server that has an ext2 file system mounted. Each point on the chart shows the transfer rate of that particular client, so the aggregate transfer rate is the sum of each set. For example, the 4-client line shows that each of the four clients achieves approximately 40 MB/s. Note that the aggregate transfer rate is approximately 160 MB/s for all client counts from 4 to 20.
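The charts above, and the striped-file results that follow, all sweep the I/O block size and report the sustained transfer rate. As a rough illustration of that methodology only, the minimal sketch below writes a fixed amount of data at a range of block sizes and prints the resulting bandwidth; it is not the bringsel or ioperf code used for these measurements, and the target path is a hypothetical DVS mount point.

/* Minimal block-size sweep benchmark sketch (illustrative stand-in only). */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define TOTAL_BYTES (256UL * 1024 * 1024)   /* data written per block size */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/dvs/testfile";  /* hypothetical DVS mount */
    char *buf = malloc(1024 * 1024);
    memset(buf, 0xA5, 1024 * 1024);

    for (size_t bs = 4 * 1024; bs <= 1024 * 1024; bs *= 2) {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        double t0 = now_sec();
        for (unsigned long done = 0; done < TOTAL_BYTES; done += bs)
            if (write(fd, buf, bs) != (ssize_t)bs) { perror("write"); return 1; }
        fsync(fd);                      /* include the time to flush to the server */
        double t1 = now_sec();
        close(fd);

        printf("block %7zu B: %.1f MB/s\n",
               bs, TOTAL_BYTES / (t1 - t0) / (1024.0 * 1024.0));
    }
    free(buf);
    return 0;
}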

[Figure 11. Striped Parallel DVS Performance: read and write transfer rates (0-1200 MB/s) versus block size (2 KB to 1 MB) for one-stripe and two-stripe configurations]

The chart in Figure 11 shows the performance of a single DVS client to one and two DVS servers set up in a striped Parallel DVS configuration. The application was ioperf, a simple program that reads and writes blocks of data. The data block size was varied from 2 KB to 1 MB for both reads and writes. At the larger block sizes the configuration with data striping across two servers shows a significant performance increase over the single-server configuration for both reads and writes. Some of the very high rates achieved are due to the read-ahead and caching done on the server.

5. Future Directions

DVS is very flexible and has many possible uses. The upcoming Unicos/lc 2.1 software release with the Cray Linux Environment (CLE) will be the first release to support DVS. In this first release only the Serial DVS to NFS file systems configuration will be supported, but DVS has many capabilities that can be further developed in the future.

Expanded file system support. Anyone who has worked with a variety of file systems will realize that different file systems have widely different characteristics. It isn't possible to treat NFS just like GPFS, for example. DVS needs to be tuned to work well with each file system. Future releases may provide support for more file systems, such as GPFS, Lustre, QFS, PanFS, etc.

Parallel DVS and Clustered Parallel DVS. These configurations are available but not yet supported.

DVS failover. Adding support for DVS server failover will increase file system availability.

External DVS servers. Instead of porting a file system to the XT system, the DVS server could be ported to an external system. This would require an intermediate network module to run on a service node and forward DVS traffic between the internal and external fabrics.

Stacked DVS. DVS servers can support a large number of clients with a small number of servers. Looking farther out into the future, stacking DVS on top of DVS is a potential way to increase scalability for exascale systems.

6. Summary

DVS is a high-performance I/O forwarding service that provides transparent compute-node access to service-node mounted file systems. DVS can solve the problem of accessing a diverse set of file systems on XT compute nodes. The implementation is lightweight and scales to many thousands of nodes. On Cray XT systems DVS takes advantage of the high-performance SeaStar network. Early performance measurements show good results. Although the first release of DVS supports a limited set of features, DVS is flexible and can be configured or enhanced to address many file system issues.

Acknowledgments

The authors would like to thank Kitrick Sheets and Tim Cullen for their work on DVS and John Kaitschuck for his efforts in benchmarking.

References

[Brightwell 2005] Implementation and Performance of Portals 3.3 on the Cray XT3, R. Brightwell, T. Hudson, K. Pedretti, R. Riesen, K.D. Underwood, Cluster Computing 2005.

[Kaitschuck 2007] Bringsel: A Tool for Measuring Storage System Reliability, Uniformity, Performance and Scalability, John Kaitschuck and Mathew O'Keefe, CUG 2007.

[Sun 2008] Lustre Operations Manual, Sun Microsystems, 2008.

[Wallace 2007] Compute Node Linux: Overview, Progress to Date & Roadmap, David Wallace, CUG 2007.

About the Authors

Stephen Sugiyama is an Engineering Manager with Cray Inc. and can be reached at sss@cray.com. David Wallace is a Technical Program Lead with Cray Inc. and can be reached at dbw@cray.com.