Performance Analysis of Flash Storage Devices and their Application in High Performance Computing
Nicholas J. Wright
With contributions from R. Shane Canon, Neal M. Master, Matthew Andrews, and Jason Hick
Outline
- Why look at flash memory?
- Performance evaluation of individual flash devices
- What applications is flash going to be good for?
- Flash in a parallel filesystem
- Summary
Data Driven Science
- The ability to generate data is challenging our ability to store, analyze, and archive it.
- Some observational devices grow in capability with Moore's Law.
- Data sets are growing exponentially; petabyte (PB) data sets will soon be common:
  - Climate: next IPCC estimates 10s of PBs
  - Genome: JGI alone will have 0.5 PB this year and will double each year
  - Particle physics: LHC projects 16 PB / yr
  - Astrophysics: LSST and others estimate 5 PB / yr
- Redefine the way science is done? One group generates data, a different group analyzes it.
  - The first Climate 100 paper came from a different group than the one that collected the data.
Data Trends at NERSC
[Chart: data volume trends at NERSC]
I/O Performance Challenges
- Performance Crisis #1: Disks are outpaced by compute speed. To achieve reasonable aggregate bandwidth, many spindles are needed: 10^3 spindles hold 1 PB but deliver only ~0.1 TB/s!
- Performance Crisis #2: Data motion on an exascale machine will be expensive, both in terms of energy and performance!
[Chart: energy cost in picojoules for a DP FLOP, register access, 1 mm on-chip, 5 mm on-chip, off-chip/DRAM, local interconnect, and cross-system data movement, now vs. 2018]
Memory Capacity Trends
- Technology trends: memory density doubles every 3 years; processor logic every 2.
- Storage costs ($/MB) drop more gradually than logic costs.
[Chart: Cost of Computation vs. Memory. Source: David Turek, IBM]
Hardware Trends are Exacerbating the Issue
- Data volumes are exploding!
- Memory capacity per flop is decreasing.
- I/O bandwidths are not keeping pace.
- Data movement is expensive.
- Will NVRAM save the day? Let's evaluate flash storage!
Flash Memory - Ubiquitous
Flash - What Is It Good For?
- Read: word level, ~20 us
- Write: erase/write at block level, ~200 us
- $/GB
- Finite number of erase cycles
- Lots of open questions: PCI vs. SATA vs. ...? SLC vs. MLC?
- A write requires a block erase, so performance depends on the previous I/O pattern (a toy illustration follows below).
- The correct algorithms are needed in software at all levels.
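A minimal toy sketch (Python) of why the previous I/O pattern matters: a naive block-mapped FTL must copy every still-valid page out of a block before erasing it, so the cost of one small overwrite depends on how full the target block already is. The block geometry and the copy-before-erase policy here are illustrative assumptions, not the behavior of any device evaluated in this work.

```python
# Toy model of why flash write performance depends on the previous I/O pattern.
# Assumed geometry: a block holds PAGES_PER_BLOCK pages; a page can only be
# rewritten after its whole block is erased. A naive block-mapped FTL therefore
# turns one small overwrite into a copy-erase-write of a full block.

PAGES_PER_BLOCK = 64          # assumed geometry, not from any specific device
PAGE_SIZE_KB = 4

def naive_write_cost(valid_pages_in_block: int) -> int:
    """KB physically written to service one 4 KB logical overwrite when the
    target block still holds `valid_pages_in_block` pages that must be copied
    out before the erase (write amplification)."""
    copied = valid_pages_in_block * PAGE_SIZE_KB   # relocate still-valid pages
    return copied + PAGE_SIZE_KB                   # plus the new page itself

# Sequential-style pattern: the block is mostly free, little copying needed.
print("fresh block: ", naive_write_cost(valid_pages_in_block=0), "KB written")
# Random overwrites into an almost-full block: heavy copying, ~64x amplification.
print("nearly full: ", naive_write_cost(valid_pages_in_block=63), "KB written")
```

Real controllers use log-structured mapping, over-provisioning, and garbage collection to hide most of this cost, which is exactly why measured performance depends on the preceding I/O pattern (see the degradation experiments later).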
Devices Evaluated
- 3 PCI-e SLC:
  - Virident tachion 400GB (8x)
  - FusionIO ioDrive Duo 2x 160GB (4x)
  - Texas Memory Systems RamSan-20 450GB (4x)
- 2 SATA MLC:
  - Intel X-25M 160GB
  - OCZ Colossus 250GB
Reference: "Performance Analysis of Commodity and Enterprise Class Flash Devices," Neal M. Master, Matthew Andrews, Jason Hick, Shane Canon & Nicholas J. Wright, PDSI Workshop, Supercomputing 2010, New Orleans.
IOZone Experiments
- Bandwidth:
  - Vary block size: 2^n KB, n = 2-8 (4 KB - 256 KB)
  - Vary concurrency: 2^n threads, n = 0-7 (1-128)
  - Vary I/O patterns: sequential write/re-write, sequential read/re-read, random write, mixed random write/read, random read
- IOPS:
  - 4 KB block size
  - Vary concurrency: 2^n threads, n = 0-7 (1-128)
(A sketch of a driver script for this sweep follows below.)
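A sketch of how the sweep above could be driven. This is not the authors' actual harness: the mount point /mnt/flash and the per-thread file size are assumptions, and the iozone flags (-i, -r, -s, -t, -F, -I) are used in their commonly documented sense.

```python
# Hypothetical driver for the IOZone sweep described above.
#   -i 0/1/2  write+rewrite / read+reread / random read+write tests
#   -r        record (block) size,  -s  per-thread file size
#   -t N -F   throughput mode with N threads and one file per thread
#   -I        use O_DIRECT to bypass the page cache
import subprocess

MOUNT = "/mnt/flash"          # assumed mount point of the device under test
FILE_SIZE = "1g"              # assumed per-thread file size

for n_block in range(2, 9):                  # 4 KB .. 256 KB record sizes
    block_kb = 2 ** n_block
    for n_threads in range(0, 8):            # 1 .. 128 threads
        threads = 2 ** n_threads
        files = [f"{MOUNT}/iozone.{i}" for i in range(threads)]
        cmd = ["iozone", "-I",
               "-i", "0", "-i", "1", "-i", "2",
               "-r", f"{block_kb}k", "-s", FILE_SIZE,
               "-t", str(threads), "-F", *files]
        print(" ".join(cmd))                 # or subprocess.run(cmd, check=True)
```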
PCI-e Bandwidths
[3-D surface plot: Virident tachion read bandwidth (MB/s) vs. I/O block size (4-256 KB) and number of threads (1-128)]
Bandwidth Summary
[Bar chart: measured vs. company-reported read and write bandwidth (MB/s) for TMS RamSan-20 (450GB), Virident tachion (400GB), Fusion IO ioDrive Duo (single slot, 160GB), Intel X-25M (160GB), and OCZ Colossus (250GB)]
IOPS - Read
[Line chart: read IO/s (thousands) vs. number of threads (1-128) for each device]
IOPS - Write
[Line chart: write IO/s (thousands) vs. number of threads (1-128) for each device]
Flash Device Evaluation - IOPS
[Bar chart: peak read and write IOPS (thousands) for each device]
Degradation Experiment
- Create a file using dd from /dev/urandom that fills X% of the drive, X = 30, 50, 70, 90.
- Using FIO, randomly write to the file:
  - 4 KB blocks to measure IOPS
  - 128 KB blocks to measure bandwidth
(A sketch of the setup commands follows below.)
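A hypothetical reconstruction of the setup: the mount point, device capacity, fio queue depth, and runtime are assumptions not given on the slide. The dd and fio options are standard, but the exact invocations used by the authors are guesses.

```python
# Hypothetical reconstruction of the degradation experiment setup.
import subprocess

DEVICE_MOUNT = "/mnt/flash"        # assumed mount point
DRIVE_GB = 400                     # e.g. Virident tachion capacity

def fill_file(fill_percent: int) -> str:
    """Create a file of random data occupying fill_percent of the drive."""
    path = f"{DEVICE_MOUNT}/fill_{fill_percent}"
    megabytes = DRIVE_GB * 1024 * fill_percent // 100
    subprocess.run(["dd", "if=/dev/urandom", f"of={path}",
                    "bs=1M", f"count={megabytes}"], check=True)
    return path

def random_write(path: str, block: str) -> None:
    """Random writes with fio: 4k blocks for IOPS, 128k blocks for bandwidth."""
    subprocess.run(["fio", "--name=degrade", f"--filename={path}",
                    "--rw=randwrite", f"--bs={block}",
                    "--ioengine=libaio", "--iodepth=32", "--direct=1",
                    "--time_based", "--runtime=3600"], check=True)

for pct in (30, 50, 70, 90):
    f = fill_file(pct)
    random_write(f, "4k")      # IOPS measurement
    random_write(f, "128k")    # bandwidth measurement
```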
Degradation - IOPS
[Line chart: random-write IO/s (thousands) over 60 minutes for each device]
Degradation IOPS Summary
[Bar chart: steady-state random-write IO/s as a percentage of peak, at 30%, 50%, 70%, and 90% drive fill, for each device]
Degradation - Bandwidth
[Line chart: random-write bandwidth (MB/s) over 50 minutes for each device]
Degradation Bandwidth Summary
[Bar chart: steady-state random-write bandwidth as a percentage of peak, at 30%, 50%, 70%, and 90% of capacity, for each device]
Parallel Filesystems
GPFS Flash Filesystem
- Hardware: 8 Virident devices, 2 per node; dual-socket Nehalem 2.67 GHz; 24 GB memory; QDR-IB
- Software: v1.0 Virident driver; GPFS v3.2
- Configuration: 4 NSD servers, 2 cards per server; 256K block size (default); metadata stored with data; scatter block-allocation algorithm
- All measurements made with IOR (a sketch of the run commands follows below).
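A sketch of how the IOR runs could be launched. The GPFS mount point and per-task block size are hypothetical, and the transfer sizes simply mirror the charts that follow (256 KB transfers for bandwidth, 4 KB random transfers for IOPS); the authors' exact IOR command lines are not shown in the slides.

```python
# Hypothetical sketch of driving the IOR measurements above.
# IOR flag meanings assumed from common usage:
#   -a POSIX  API,  -b  per-task block size,  -t  transfer size,
#   -o  test file on the GPFS mount,  -z  random offsets,  -w/-r  write/read
import subprocess

GPFS_MOUNT = "/gpfs/flash"            # assumed mount point
TESTFILE = f"{GPFS_MOUNT}/ior_testfile"

def run_ior(tasks: int, transfer: str, block: str, random: bool) -> None:
    cmd = ["mpirun", "-np", str(tasks), "ior",
           "-a", "POSIX", "-w", "-r",
           "-t", transfer, "-b", block, "-o", TESTFILE]
    if random:
        cmd.append("-z")              # random offsets for the IOPS runs
    subprocess.run(cmd, check=True)

# Bandwidth scan: 256 KB transfers, 8..256 MPI tasks
for tasks in (8, 16, 32, 64, 128, 256):
    run_ior(tasks, transfer="256k", block="1g", random=False)

# IOPS scan: 4 KB transfers with random offsets
for tasks in (8, 16, 32, 64, 96, 256):
    run_ior(tasks, transfer="4k", block="1g", random=True)
```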
GPFS: Bandwidth Measurements
[Bar chart: sequential and random read/write bandwidth (MB/s) vs. number of MPI tasks (8-256)]
GPFS: IOPS Measurements
[Bar chart: sequential and random read/write IOPS vs. number of MPI tasks (8-256)]
GPFS Block Size Variation: Read
[Line chart: sequential read bandwidth (MB/s) vs. I/O block size (4K-16M) for GPFS filesystem block sizes of 16K, 256K, and 1M]
GPFS Block Size Variation: Write
[Line chart: sequential write bandwidth (MB/s) vs. I/O block size (4K-16M) for GPFS filesystem block sizes of 16K, 256K, and 1M]
Performance for Two Devices Simultaneously
[Bar chart: read and write performance ratio for one device, two devices used simultaneously, and two devices in RAID0; ~38% degradation noted]
GPFS Unaligned I/O Performance
[Bar chart: unaligned read and write performance as a percentage of aligned performance, for a single device and for GPFS]
Lustre - GPFS Comparison
[Bar charts: read and write bandwidth (MB/s) and IOPS for Lustre vs. GPFS]
Applications for Flash?

| Device          | Price ($) | Capacity (GB) | Bandwidth (GB/s) | IOPS    | $/GB   | $/(GB/s) | $/IOP |
|-----------------|-----------|---------------|------------------|---------|--------|----------|-------|
| SATA MLC Flash  | 120       | 64            | 0.2              | 8,600   | $1.88  | $600     | $0.01 |
| SATA SLC Flash  | 740       | 64            | 0.2              | 5,000   | $11.56 | $3,700   | $0.15 |
| PCI SLC Flash   | 11,500    | 640           | 1.2              | 140,000 | $17.97 | $9,583   | $0.08 |
| SATA HDD        | 80        | 2,000         | 0.07             | 90      | $0.04  | $1,143   | $0.89 |
| High-Perf Array | 250,000   | 240,000       | 5                | 100,000 | $1.04  | $50,000  | $2.50 |
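The three cost columns are just ratios of device price to capacity, bandwidth, and IOPS; a quick Python check of the arithmetic (input values copied from the table; small differences are rounding):

```python
# Recompute the derived cost metrics from the table above.
devices = {
    # name:            (price_usd, capacity_gb, bandwidth_gbs, iops)
    "SATA MLC Flash":   (120,       64,          0.2,           8_600),
    "SATA SLC Flash":   (740,       64,          0.2,           5_000),
    "PCI SLC Flash":    (11_500,    640,         1.2,           140_000),
    "SATA HDD":         (80,        2_000,       0.07,          90),
    "High-Perf Array":  (250_000,   240_000,     5,             100_000),
}

for name, (price, cap, bw, iops) in devices.items():
    print(f"{name:16s}  ${price/cap:8.2f}/GB  "
          f"${price/bw:10,.0f}/(GB/s)  ${price/iops:6.2f}/IOP")
```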
Graph500 with NAND Flash
Graph500: Traversing massive graphs with NAND Flash
Roger Pearce (1,2), Maya Gokhale (1), Nancy M. Amato (2)
(1) Center for Applied Scientific Computing, Lawrence Livermore National Laboratory
(2) Department of Computer Science and Engineering, Texas A&M University
June 2011
This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-PRES-487136.
Graph500 with NAND Flash: Implementation
- Semi-external: for G = (V, E), O(|V|) data fits in main memory
  - Read-only from NAND flash; output and algorithm data kept in memory
  - e.g., store the graph in external memory, keep BFS data in memory
- Asynchronous traversal technique [SC 2010]
  - Exploit fine-grained path parallelism
  - Tolerate data latencies to graph storage (NAND flash)
  - Re-order vertex visitation to improve page-level locality
  - Allow over-subscription of thread-level parallelism (256 threads) to maximize NAND flash IOPS
- Graph stored as CSR on the flash device; read-only
(A minimal sketch of the semi-external idea follows below.)
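A minimal sketch of the semi-external layout described above: the CSR arrays live on the flash device and are memory-mapped read-only, while the O(|V|) BFS state stays in DRAM. This is a plain level-synchronous BFS for illustration, not the authors' asynchronous traversal technique; the file names and dtypes are assumptions.

```python
# Semi-external BFS sketch: graph on flash, per-vertex state in memory.
import numpy as np
from collections import deque

def semi_external_bfs(offsets_path: str, edges_path: str,
                      num_vertices: int, source: int) -> np.ndarray:
    # CSR graph data stays on flash; the OS pages it in on demand.
    offsets = np.memmap(offsets_path, dtype=np.uint64, mode="r",
                        shape=(num_vertices + 1,))
    edges = np.memmap(edges_path, dtype=np.uint32, mode="r")

    # Per-vertex algorithm state is the only thing kept in DRAM.
    level = np.full(num_vertices, -1, dtype=np.int32)
    level[source] = 0
    frontier = deque([source])

    while frontier:
        v = frontier.popleft()
        for u in edges[offsets[v]:offsets[v + 1]]:   # these reads hit flash
            if level[u] == -1:
                level[u] = level[v] + 1
                frontier.append(u)
    return level
```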
Graph500 with NAND Flash: Experimental Setup and Results
- Hardware (Kraken): single node, 32-core Opteron 6128 @ 2.0 GHz, 512 GB DRAM, 4x 640GB Fusion-io MLC in software RAID0, Red Hat Enterprise Linux 5.6
- Approximate system cost: $71K ($25K for the base system, $46K for 2.56 TB of Fusion-io NAND flash)
- Using Fusion-io: 8x larger problem with 50% performance loss compared with DRAM only
  - DRAM + Fusion-io: Scale 34, 55.6 MTEPS
  - DRAM only: Scale 31, 104.6 MTEPS
Summary
- Bandwidths per device are impressive, but compare them with a RAIDed set of regular HDDs.
- IOPS numbers are very impressive: ~100x HDD, and RAIDing won't help there.
- Large numbers of threads are needed to saturate the devices (Pearce, Gokhale & Amato, SC10).
- The previous I/O pattern can affect performance.
- Parallel filesystem software needs tweaking for use with flash:
  - Read BW: OK; Write BW: ~40% of peak
  - Read IOPS: 18% of peak; Write IOPS: 13% of peak
Going Forward
"When you've got a hammer in your hand, everything looks like a nail."
"The first law of technology says we invariably overestimate the short-term impact of new technologies while underestimating their longer-term effects." - Dr. Francis Collins
Flash Going Forward
- $/GB is unlikely to match regular disk.
- $/IOP is already significantly better.
- Energy costs will be less than a regular HDD.
- Usefulness will depend upon the data access pattern.
Reliability: Too Few Electrons Per Gate
Source: "The Inevitable Rise of NVM in Computing," Jim Handy, Non-Volatile Memory Seminar, Hot Chips Conference, August 22, 2010, Memorial Auditorium, Stanford University.
Flash Technology Trends
Source: Ed Doller, V.P. and Chief Memory Systems Architect, Non-Volatile Memory Seminar, Hot Chips Conference, August 22, 2010, Memorial Auditorium, Stanford University.