Improving Time to Results for Seismic Processing with Paradigm and DDN
DDN Whitepaper | ddn.com
James Coomer and Laurent Thiers
2014 DataDirect Networks. All Rights Reserved.




Executive Summary

Companies in the oil and gas industry continue to push the limits of technology to gain a competitive advantage in their quest for the richest and most profitable reservoirs. For the fastest time to results, Paradigm Echos and DataDirect Networks provide cost-effective processing and storage solutions that deliver the highest speed, precision, capacity, and scalability. This whitepaper summarizes Paradigm Echos workload testing performed on DDN storage, taking advantage of SSDs and high-performance networks to improve the time to value for a seismic processing problem at lower cost. Targeted workload testing enables analysis of the bottlenecks in the interactions between the software, the network fabric and the storage systems, and removing these bottlenecks leads to improved efficiency. For example, if the number of jobs per compute node can be increased, we can either improve overall performance for a given number of systems or produce the same overall performance with fewer systems.

The Need for I/O Improvements in Seismic Processing Environments

Companies and partners engaged in oil and gas exploration and production recognize the value of speed and precision, investing tens of billions of dollars in new projects. With so much money at stake, there is little room for error when collecting the most precise and accurate information on where to drill, how much to bid on a site, and how to maximize reservoir performance. The third critical piece of the process requires companies to apply the most sophisticated modeling and simulation technology to transform field data into valuable geophysics. While technology advances within the industry are adding much-improved precision, they are also significantly increasing data and processing volumes. Wide-, multi- and rich-azimuth methods use multi-sensor arrays and sophisticated techniques to produce higher-fidelity images. The deployment of these richer formats results in a vast increase in the amount of data a company must process and manage. Furthermore, new analytics techniques allow for continued advancement in the interpretation of seismic data, both for new data coming in and for historical oil field data. This means more data needs to be analyzed at greater speed.

In the field of seismic data acquisition and processing, the conversion of raw field recordings to geological images is a highly intensive process in both data and compute. The requirement to process more data with fewer resources (power and capital cost) tends to push compute threads ever more densely onto the available compute clients, and this in turn impacts the I/O subsystems, both network and storage. As core counts and memory bandwidths steadily rise, the challenge of shifting data fast enough to feed the compute nodes becomes increasingly important.

Paradigm Echos: The Standard for Seismic Processing and Data Analysis

For over 25 years, Paradigm Echos has been the oil and gas industry's leading system for seismic processing. Echos is a software suite that links seismic data with processing parameters and sequences, with both interactive and batch components.

The suite of seismic processing modules provided transforms converted wave recordings into interpretable images of the subsurface; these modules are generally run in an embarrassingly parallel fashion, with many simultaneous instances running on a large seismic cluster. Companies prefer the widely used Echos because of its breadth of geophysical applications, its unique combination of production and interactive seismic data processing, its maturity and stability, and its versatile programming environment for client-guided customization.

DDN: Driving Big Data Storage Solutions for Seismic Processing

High-performance storage solutions from DDN meet the most pressing storage challenges of the seismic processing industry, helping oil and gas companies achieve both their scientific goals and their long-term business goals. To support the most demanding environments, DDN solutions deliver game-changing performance, capacity and scalability for ingest, processing and visualization. For example, DDN's Storage Fusion Architecture (SFA) combines SATA, SAS and solid-state disks into a simply managed, massively scalable, multi-petabyte platform that can be tailored to a balance of throughput and capacity. This combination of modularity and performance allows users in seismic processing environments to configure systems that scale up and/or scale out in independent dimensions. This is why over half of the top producers worldwide are leveraging DDN storage solutions in their big data oil and gas workflows.

PARADIGM and DDN

Seismic data sets can be very large, with raw recording data ranging into the tens or hundreds of terabytes. The ability to efficiently view a particular seismic data set in different orders or domains, without creating multiple copies of that data set, is a key design feature of the seismic file handling capabilities of the Echos system. Initial loading of raw seismic recording data is usually done in the order in which the data is recorded, based on sequential shot numbering with a secondary ordering by the receivers associated with each shot. However, this primary- and secondary-key (shot-receiver) ordering may not be the most effective order for certain subsurface analysis techniques. The Common Depth Point (CDP) method is a seismic data acquisition and processing technique that transforms field recordings from seismic surveys into pseudo-cross-sectional images of the earth's geologic layering beneath the surface locations. Such images are key to the geophysical analysis used to pinpoint likely drilling locations for oil and gas reservoirs.

Three micro-workloads are used in this study, all components of Paradigm 2011 Echos. All three involve heavy I/O, but in differing ways. Reading and writing a shot-receiver ordered data set efficiently in sequential order (SEQIN) is a critical I/O workload for conditioning seismic data for further analysis. Another important I/O workload is reading a very large shot-receiver ordered data set in CDP order (STRIDE), which can result in file system access patterns that are challenging for general-purpose storage systems; an un-tuned system can limit read performance to just a few MB/sec. The third micro-workload (SEQOUT) performs sequential writes rather than reads. The seismic file handling code underlying the Echos software is multi-threaded, with different tasks for reading, processing, timing, etc. In the Echos processing system, a logical seismic file consists of multiple actual files on the file system, which enables specialized handling of the different logical elements that make up the seismic file.
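To make the difference between the SEQIN and STRIDE access patterns concrete, the sketch below reads a shot-receiver ordered trace file first sequentially and then in a strided, CDP-like order. It is illustrative only: the file name, trace size and traces-per-shot values are assumptions made for the example, not Echos internals.

```python
import os
import time

TRACE_BYTES = 4096                 # assumed size of one trace record on disk
TRACES_PER_SHOT = 512              # assumed number of receivers per shot
PATH = "shot_ordered_traces.dat"   # hypothetical shot-receiver ordered data set

def read_sequential(path):
    """SEQIN-like pass: read traces in the order they were recorded."""
    with open(path, "rb", buffering=0) as f:
        while f.read(TRACE_BYTES):
            pass

def read_cdp_order(path):
    """STRIDE-like pass: gather one receiver position across all shots,
    producing a large seek between each small read."""
    n_traces = os.path.getsize(path) // TRACE_BYTES
    n_shots = n_traces // TRACES_PER_SHOT
    with open(path, "rb", buffering=0) as f:
        for receiver in range(TRACES_PER_SHOT):
            for shot in range(n_shots):
                f.seek((shot * TRACES_PER_SHOT + receiver) * TRACE_BYTES)
                f.read(TRACE_BYTES)

for fn in (read_sequential, read_cdp_order):
    start = time.time()
    fn(PATH)
    mb = os.path.getsize(PATH) / 1e6
    print(f"{fn.__name__}: {mb / (time.time() - start):.1f} MB/s")
```

On spinning disk the second pass typically collapses to a small fraction of the sequential rate; this is the behaviour the STRIDE micro-workload exposes and that the SFX read caching discussed later is intended to relieve.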

The SFA Advantage

Currently supporting 70% of the Top 10 supercomputers in the world, DDN's SFA technology combines 15 years of hardware optimization in high performance computing with sophisticated I/O software (SFAOS). The result is a highly efficient system that delivers the highest throughput, density and IOPS of any system on the market. The SFA product line is divided into the high-end SFA12KX and the mid-range SFA7700. Both run SFAOS, which leverages Storage Fusion Xcelerator (SFX) for hybrid deployments of traditional SAS drives with SSDs. SFX integrates application-centric intelligence with flash media to accelerate awkward, read-intensive workloads.

The GS12K appliance is the highest-performing file system appliance in the world, with a single appliance delivering over 20GB/s to the clients. The system uses the SFAOS embedded technology to run complete parallel file system services within the controller. This approach removes the need for the external file servers typically required for parallel file systems, allowing for far better single-client performance than can be reached with scale-out NAS offerings. GS12K supports very large file systems with extreme density: a single appliance can start with just 60 drives but be scaled up to 10PB in just two racks, and if a larger file system is required, further appliances can be added into the single namespace. The GS12K also supports advanced enterprise features such as Windows support, snapshots, replication, HSM, and NFS/CIFS exports.

GS12K Features

Client Access and Connectivity
- Native Linux and Windows clients
- Highly available NFS v3

Client Connectivity
- 16 x 10/40GbE host ports or 16 x 56Gb InfiniBand host ports
- Support for InfiniBand RDMA I/O
- Client-to-gateway load balancing
- Many-to-one and one-to-many client-to-gateway access

Volume Scalability
- Up to 3.45PB of usable capacity (840 drives)
- Optional metadata accelerator drives
- Over 10PB of aggregated capacity

Performance
- 20GB/s per GS12K system (read & write)
- Aggregated performance up to 200GB/sec
- 100,000s of file operations/sec

Data Protection
- Data: RAID 6 (8+2)
- Metadata: RAID 60
- 256 snapshots per file system
- DirectProtect data integrity
- Synchronous file system replication

Management
- Centralized configuration and management
- Single monitoring solution
- Online file system growth
- Object-based file locking
- Monitoring and event notification system

Optimization of SFA with Paradigm

DDN's history of deploying systems for seismic processing includes significant benchmarking and tuning with the largest oil and gas companies in the world. This report discusses the results of one such major engagement, in which a large GS12K environment is deployed with 40 Gigabit Ethernet. We present three sets of results, each of which is of significance to oil and gas seismic processing systems:

1. GS12KE baseline: the maximum data throughput of the GS12K in a 40GE environment, as measured by synthetic benchmarks
2. The potential of 40GE versus 10GE in heavy-I/O seismic environments
3. The efficiency of Paradigm Echos I/O in large-scale systems

Data and Systems

The benchmark system comprised a single GS12K system connected via 8 x 40GE ports to a 40GE backbone. Each GS12K uses five SS8460 disk enclosures with between 300 and 400 NL-SAS drives for data; data protection was provided via RAID 6. GRIDScaler metadata was held on a small dedicated set of drives separate from the data drives. The clients varied across the tests, but all were mid-range clock speed Intel systems connected with either 10 Gigabit Ethernet or 40 Gigabit Ethernet.

GS12KE Baseline

The IOR benchmark (http://sourceforge.net/projects/ior-sio/) is a popular parallel I/O benchmark used to characterise storage systems and file systems. IOR allows the benchmark engineer to run multiple threads reading or writing from many clients simultaneously. A number of parameters are tuneable to avoid caching, alter the size of the data transfers issued by the IOR tasks, and change the nature of the I/O topology. In the following case, we present results from executing IOR across 16 clients simultaneously. Each client is connected to the network with 40GE and runs between 8 and 32 threads, with each thread reading or writing, at varying transfer sizes, to a single shared file.
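As a rough illustration of how such a parameter sweep can be driven (a sketch only: the hostfile, mount point and transfer sizes shown are assumptions, not the exact configuration used in these tests), a small wrapper can launch IOR against a single shared file while varying the tasks per client and the transfer size:

```python
import subprocess

CLIENTS = 16                            # 40GE-connected client nodes
HOSTFILE = "clients.txt"                # hypothetical MPI hostfile listing the 16 clients
TESTFILE = "/mnt/gridscaler/ior.dat"    # hypothetical GRIDScaler mount point

for tasks_per_client in (8, 16, 32):
    for xfer in ("1m", "4m", "16m"):    # assumed transfer sizes for the sweep
        cmd = [
            "mpirun", "-np", str(CLIENTS * tasks_per_client),
            "--hostfile", HOSTFILE,
            "ior",
            "-w", "-r",                 # time writes, then reads
            "-t", xfer,                 # transfer size per I/O request
            "-b", "4g",                 # data written per task
            "-e",                       # fsync after writes so data reaches the storage
            "-o", TESTFILE,             # omit -F so all tasks share a single file
        ]
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

In practice a run would also take steps to defeat client caching (for example by having a different task read back each block), in line with the caching considerations noted above.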

Paradigm Echos Microbenchmarks

The three I/O-intensive benchmarks were compared across client configurations to determine the best-value configuration. In particular, a single-socket Intel system was compared with a dual-socket Intel system, and a system configured with dual-port 10GE was compared with a system fitted with 40 Gigabit Ethernet. The relative merits of the configurations in terms of I/O performance could then be balanced against other factors, such as CPU and memory bandwidth bottlenecks, to arrive at the ideal system. For all benchmarks, we also show a dotted line indicating the throughput achieved by a single-threaded IOZONE sequential I/O benchmark on the same systems. Note that with GRIDScaler, the single-thread results are actually quite representative of maximum throughput, as GRIDScaler delivers extremely strong single-threaded I/O, particularly for writes.

In the results below for the write benchmark, it is clear that the 10 Gigabit node achieves a good proportion of the available bandwidth, with job counts above eight per node hitting the I/O ceiling. For 40 Gigabit Ethernet the efficiency is still very strong but, of course, allows higher scaling. Only with two-socket nodes does the I/O performance really merit the additional cost of the 40G interface, where throughputs above 3GB/s are seen by the application.

Figure 1: Echos write benchmark showing the measured write performance when loading a single client with multiple copies of the I/O benchmark.

The sequential read benchmark shows a similar overall pattern, with the two-socket node clearly delivering benefits over a single-socket node. The maximum read performance is achieved with just eight concurrent jobs per node and reaches very close to 4GB/s, within 80% of the IOZONE measurements.
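The single-client scaling measurements above (loading one client with multiple copies of the I/O benchmark) can be approximated with a simple harness that starts several sequential writers on one node and reports the aggregate rate. This is a minimal sketch under assumed paths and sizes, not the Echos benchmark itself:

```python
import os
import time
from multiprocessing import Pool

MOUNT = "/mnt/gridscaler"          # hypothetical parallel file system mount
GB_PER_JOB = 8                     # data each job writes
CHUNK = 4 * 1024 * 1024            # 4 MiB per write call

def one_job(job_id):
    """Write GB_PER_JOB of data sequentially, then fsync and clean up."""
    path = os.path.join(MOUNT, f"seq_write_job_{job_id}.dat")
    buf = os.urandom(CHUNK)
    with open(path, "wb", buffering=0) as f:
        for _ in range((GB_PER_JOB * 1024**3) // CHUNK):
            f.write(buf)
        os.fsync(f.fileno())
    os.remove(path)

if __name__ == "__main__":
    for jobs in (1, 2, 4, 8, 16):
        start = time.time()
        with Pool(jobs) as pool:
            pool.map(one_job, range(jobs))
        aggregate = jobs * GB_PER_JOB * 1024 / (time.time() - start)
        print(f"{jobs:2d} concurrent jobs -> aggregate {aggregate:,.0f} MiB/s")
```

Plotting the aggregate rate against the job count reproduces the shape of the curves discussed here: throughput climbs until the network or file system ceiling is reached, after which additional jobs only add contention.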

Finally, the Strided Read benchmarks were tested. In this case, the added dimension of a more onerous I/O task for the file system and storage subsystem has an impact on the efficiency of the application I/O for both 40G and 10G networked systems. Even so, the application benchmark still achieves over 2.5GB/s for the 40G-connected client.

The above throughput tests demonstrate strong scaling to around eight concurrent jobs before the I/O and network bottleneck significantly constrains the workload. In fact, for other reasons, including CPU and memory bandwidth considerations, the chosen optimal value was three jobs per node. With three concurrent jobs per node, we then investigated how many nodes could be supported efficiently by a single GS12K appliance. A strong compromise was found to be 15 nodes, each running three jobs. In this configuration, the measured throughput on the storage system indicated that the I/O benchmarks attained close to the appliance maximum of ~20GB/s for sequential reads.

Measured throughput for 45 concurrent jobs across 15 nodes:

  Sequential Write: 13,155 MB/s
  Sequential Read:  18,153 MB/s
  Strided Read:      6,756 MB/s

One more preliminary test was performed to establish the impact of SSDs, particularly on the Strided Read benchmark, where the I/O pattern is evidently more onerous for the storage system. To this end, we took advantage of the SFX read cache facility within the GS12K. A smaller test system was used to compare the performance of 100 spindles alone with 100 spindles accelerated by 12 SSDs.

Strided Reads:

  No SFX:          1,639 MB/s
  SFX Accelerated: 5,110 MB/s

The smaller test system used four clients, each running three instances of the Strided Read benchmark. It is clear that the use of SFX read caching on the SSDs eliminates the disk contention that was the major bottleneck for the benchmark on spinning disk. This result is important because SFX allows a relatively small expenditure on a handful of SSDs to have a large impact on runtimes.
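A quick back-of-the-envelope check, using only the figures reported above, shows the per-node and per-job rates implied by the 45-job run and the speedup SFX delivers on the strided workload:

```python
# Figures taken from the measurements reported in this paper.
nodes, jobs = 15, 45
seq_read = 18_153            # MB/s, sequential read, 45 jobs across 15 nodes
strided_no_sfx = 1_639       # MB/s, strided reads, 100 spindles alone
strided_sfx = 5_110          # MB/s, strided reads, 100 spindles + 12 SSDs

print(f"Sequential read per node: {seq_read / nodes:,.0f} MB/s "
      f"({seq_read / jobs:,.0f} MB/s per job)")
print(f"Fraction of the ~20 GB/s appliance maximum: {seq_read / 20_000:.0%}")
print(f"SFX speedup on strided reads: {strided_sfx / strided_no_sfx:.1f}x")
```

At roughly 1,200 MB/s of sequential read per node and better than a 3x gain on strided reads from a dozen SSDs, the cost trade-offs discussed in the Summary below follow directly from these measurements.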

Summary

The requirement for increasing levels of I/O performance in large seismic processing environments is clear. The GS12K delivers very strong, scalable performance with a small footprint and enterprise features. Each GS12K appliance delivers 20GB/s to the clients, and each single client can achieve over 4GB/s. A good portion of the theoretical bandwidth is attainable by the I/O-intensive regions of Paradigm Echos, particularly alongside a strong 40GE interconnect. We show that synthetic benchmarks are a good indicator of Echos performance and can be used to assess the suitability of an I/O subsystem in a seismic environment.

In addition to providing a system that helps minimize the hardware environment, it was also of benefit to reduce the number of file systems. In such a large-scale production environment, there is always a balance between the manageability of an extremely large file system and the usability of multiple file systems. Today we provide file systems comprising four GS12K appliances for production systems, each housing 400 drives. The economics at procurement time drive the choice of disk capacity, but currently, with 4TB drives, this provides single file systems of over 5PB.

The key benefit of a true parallel file system over scale-out NAS is that very high single-client performance can be delivered and sustained while many hundreds of clients are working concurrently with intensive seismic I/O. Single-client I/O performance is particularly important because, with increasing core counts per system, this I/O quickly becomes the major bottleneck. If a client is limited to 10 Gigabit Ethernet, or the file system does not achieve near-wire-speed access, the number of concurrent jobs that can be run efficiently per node decreases, hugely impacting the total cost of ownership of the system and increasing the time to value.

This report concludes that the GS12K provides a highly efficient platform for the I/O-heavy aspects of Paradigm Echos, demonstrating an ability to scale to large file systems of over 5PB with only four GS12K appliances, and to support real-world production workloads involving dense computation and high-bandwidth networks. We also introduce SFX as a cost-effective way to improve application throughput without major investment in a large SSD estate. As evidenced by the testing and results presented in this paper, oil and gas companies can be confident in leveraging the Paradigm Echos and DDN solution to accelerate discovery and time to results for seismic processing in their own implementations.

DDN About Us

DataDirect Networks (DDN) is the world leader in massively scalable storage. Our data storage and processing solutions and professional services enable content-rich and high-growth IT environments to achieve the highest levels of systems scalability, efficiency and simplicity. DDN enables enterprises to extract value and deliver business results from their information. Our customers include the world's leading online content and social networking providers, high performance cloud and grid computing, life sciences, media production, and security and intelligence organizations. Deployed in thousands of mission-critical environments worldwide, DDN's solutions have been designed, engineered and proven in the world's most scalable data centers to ensure competitive business advantage for today's information-powered enterprise. For more information, go to www.ddn.com or call +1.800.837.2298.

2014 DataDirect Networks, Inc. All Rights Reserved. DDN, GS12K, SFA7700, SFA, SFX, Storage Fusion Architecture and Storage Fusion Xcelerator are trademarks of DataDirect Networks. Other names and brands may be claimed as the property of others. Version-1 10/14