Performance Analysis of Flash Storage Devices and their Application in High Performance Computing
Nicholas J. Wright
With contributions from R. Shane Canon, Neal M. Master, Matthew Andrews, and Jason Hick
Outline
- Why look at flash memory?
- Performance evaluation of individual flash devices
- What applications is flash going to be good for?
- Flash in a parallel filesystem
- Summary
Data Driven Science
- The ability to generate data is challenging our ability to store, analyze, and archive it.
- Some observational devices grow in capability with Moore's Law.
- Data sets are growing exponentially; petabyte (PB) data sets will soon be common:
  - Climate: next IPCC estimates 10s of PBs
  - Genome: JGI alone will have 0.5 PB this year and will double each year
  - Particle physics: LHC projects 16 PB / yr
  - Astrophysics: LSST and others estimate 5 PB / yr
- Redefine the way science is done? One group generates data, a different group analyzes it.
  - The first Climate 100 paper came from a different group than the one that collected the data.
Data Trends at NERSC
[Chart: data volume trends at NERSC]
I/O Performance Challenges
- Performance Crisis #1: Disks are outpaced by compute speed. To achieve reasonable aggregate bandwidth, many spindles are needed: 10^3 spindles hold 1 PB but deliver only ~0.1 TB/s!
- Performance Crisis #2: Data motion on an exascale machine will be expensive, both in terms of energy and performance!
[Chart: energy cost in picojoules for a DP FLOP, register access, 1 mm on-chip, 5 mm on-chip, off-chip/DRAM, local interconnect, and cross-system data movement, now vs. 2018]
Memory Capacity Trends
- Technology trends: memory density doubles every 3 years; processor logic every 2.
- Storage costs ($/MB) drop more gradually than logic costs.
[Chart: Cost of Computation vs. Memory. Source: David Turek, IBM]
Hardware Trends are Exacerbating the Issue
- Data volumes are exploding!
- Memory capacity per flop is decreasing.
- I/O bandwidths are not keeping pace.
- Data movement is expensive.
- Will NVRAM save the day? Let's evaluate flash storage!
Flash Memory - Ubiquitous
Flash - What Is It Good For?
- Read: word level, ~20 us
- Write: erase/write at block level, ~200 us
- $/GB
- Finite number of erase cycles
- Lots of open questions: PCI vs. SATA vs. ...? SLC vs. MLC?
- A write requires a block erase, so performance depends on the previous I/O pattern (a toy illustration follows below).
- The correct algorithms are needed in software at all levels.
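A minimal toy sketch (Python) of why the previous I/O pattern matters: a naive block-mapped FTL must copy every still-valid page out of a block before erasing it, so the cost of one small overwrite depends on how full the target block already is. The block geometry and the copy-before-erase policy here are illustrative assumptions, not the behavior of any device evaluated in this work.

```python
# Toy model of why flash write performance depends on the previous I/O pattern.
# Assumed geometry: a block holds PAGES_PER_BLOCK pages; a page can only be
# rewritten after its whole block is erased. A naive block-mapped FTL therefore
# turns one small overwrite into a copy-erase-write of a full block.

PAGES_PER_BLOCK = 64          # assumed geometry, not from any specific device
PAGE_SIZE_KB = 4

def naive_write_cost(valid_pages_in_block: int) -> int:
    """KB physically written to service one 4 KB logical overwrite when the
    target block still holds `valid_pages_in_block` pages that must be copied
    out before the erase (write amplification)."""
    copied = valid_pages_in_block * PAGE_SIZE_KB   # relocate still-valid pages
    return copied + PAGE_SIZE_KB                   # plus the new page itself

# Sequential-style pattern: the block is mostly free, little copying needed.
print("fresh block: ", naive_write_cost(valid_pages_in_block=0), "KB written")
# Random overwrites into an almost-full block: heavy copying, ~64x amplification.
print("nearly full: ", naive_write_cost(valid_pages_in_block=63), "KB written")
```

Real controllers use log-structured mapping, over-provisioning, and garbage collection to hide most of this cost, which is exactly why measured performance depends on the preceding I/O pattern (see the degradation experiments later).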
Devices Evaluated
- 3 PCI-e SLC:
  - Virident tachion 400GB (8x)
  - FusionIO ioDrive Duo 2x 160GB (4x)
  - Texas Memory Systems RamSan-20 450GB (4x)
- 2 SATA MLC:
  - Intel X-25M 160GB
  - OCZ Colossus 250GB
Reference: "Performance Analysis of Commodity and Enterprise Class Flash Devices," Neal M. Master, Matthew Andrews, Jason Hick, Shane Canon & Nicholas J. Wright, PDSI Workshop, Supercomputing 2010, New Orleans.
IOZone Experiments
- Bandwidth:
  - Vary block size: 2^n KB, n = 2-8 (4 KB - 256 KB)
  - Vary concurrency: 2^n threads, n = 0-7 (1-128)
  - Vary I/O patterns: sequential write/re-write, sequential read/re-read, random write, mixed random write/read, random read
- IOPS:
  - 4 KB block size
  - Vary concurrency: 2^n threads, n = 0-7 (1-128)
(A sketch of a driver script for this sweep follows below.)
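A sketch of how the sweep above could be driven. This is not the authors' actual harness: the mount point /mnt/flash and the per-thread file size are assumptions, and the iozone flags (-i, -r, -s, -t, -F, -I) are used in their commonly documented sense.

```python
# Hypothetical driver for the IOZone sweep described above.
#   -i 0/1/2  write+rewrite / read+reread / random read+write tests
#   -r        record (block) size,  -s  per-thread file size
#   -t N -F   throughput mode with N threads and one file per thread
#   -I        use O_DIRECT to bypass the page cache
import subprocess

MOUNT = "/mnt/flash"          # assumed mount point of the device under test
FILE_SIZE = "1g"              # assumed per-thread file size

for n_block in range(2, 9):                  # 4 KB .. 256 KB record sizes
    block_kb = 2 ** n_block
    for n_threads in range(0, 8):            # 1 .. 128 threads
        threads = 2 ** n_threads
        files = [f"{MOUNT}/iozone.{i}" for i in range(threads)]
        cmd = ["iozone", "-I",
               "-i", "0", "-i", "1", "-i", "2",
               "-r", f"{block_kb}k", "-s", FILE_SIZE,
               "-t", str(threads), "-F", *files]
        print(" ".join(cmd))                 # or subprocess.run(cmd, check=True)
```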
PCI-e Bandwidths
[3-D surface plot: Virident tachion read bandwidth (MB/s) vs. I/O block size (4-256 KB) and number of threads (1-128)]
Bandwidth Summary
[Bar chart: measured vs. company-reported read and write bandwidth (MB/s) for TMS RamSan-20 (450GB), Virident tachion (400GB), Fusion IO ioDrive Duo (single slot, 160GB), Intel X-25M (160GB), and OCZ Colossus (250GB)]
IOPS - Read
[Line chart: read IO/s (thousands) vs. number of threads (1-128) for each device]
IOPS - Write
[Line chart: write IO/s (thousands) vs. number of threads (1-128) for each device]
Flash Device Evaluation - IOPS
[Bar chart: peak read and write IOPS (thousands) for each device]
Degradation Experiment
- Create a file using dd from /dev/urandom that fills X% of the drive, X = 30, 50, 70, 90.
- Using FIO, randomly write to the file:
  - 4 KB blocks to measure IOPS
  - 128 KB blocks to measure bandwidth
(A sketch of the setup commands follows below.)
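A hypothetical reconstruction of the setup: the mount point, device capacity, fio queue depth, and runtime are assumptions not given on the slide. The dd and fio options are standard, but the exact invocations used by the authors are guesses.

```python
# Hypothetical reconstruction of the degradation experiment setup.
import subprocess

DEVICE_MOUNT = "/mnt/flash"        # assumed mount point
DRIVE_GB = 400                     # e.g. Virident tachion capacity

def fill_file(fill_percent: int) -> str:
    """Create a file of random data occupying fill_percent of the drive."""
    path = f"{DEVICE_MOUNT}/fill_{fill_percent}"
    megabytes = DRIVE_GB * 1024 * fill_percent // 100
    subprocess.run(["dd", "if=/dev/urandom", f"of={path}",
                    "bs=1M", f"count={megabytes}"], check=True)
    return path

def random_write(path: str, block: str) -> None:
    """Random writes with fio: 4k blocks for IOPS, 128k blocks for bandwidth."""
    subprocess.run(["fio", "--name=degrade", f"--filename={path}",
                    "--rw=randwrite", f"--bs={block}",
                    "--ioengine=libaio", "--iodepth=32", "--direct=1",
                    "--time_based", "--runtime=3600"], check=True)

for pct in (30, 50, 70, 90):
    f = fill_file(pct)
    random_write(f, "4k")      # IOPS measurement
    random_write(f, "128k")    # bandwidth measurement
```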
Degradation - IOPS
[Line chart: random-write IO/s (thousands) over 60 minutes for each device]
Degradation IOPS Summary
[Bar chart: steady-state random-write IO/s as a percentage of peak, at 30%, 50%, 70%, and 90% drive fill, for each device]
Degradation - Bandwidth
[Line chart: random-write bandwidth (MB/s) over 50 minutes for each device]
Degradation Bandwidth Summary
[Bar chart: steady-state random-write bandwidth as a percentage of peak, at 30%, 50%, 70%, and 90% of capacity, for each device]
Parallel Filesystems
GPFS Flash Filesystem
- Hardware: 8 Virident devices, 2 per node; dual-socket Nehalem 2.67 GHz; 24 GB memory; QDR-IB
- Software: v1.0 Virident driver; GPFS v3.2
- Configuration: 4 NSD servers, 2 cards per server; 256K block size (default); metadata stored with data; scatter block-allocation algorithm
- All measurements made with IOR (a sketch of the run commands follows below).
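A sketch of how the IOR runs could be launched. The GPFS mount point and per-task block size are hypothetical, and the transfer sizes simply mirror the charts that follow (256 KB transfers for bandwidth, 4 KB random transfers for IOPS); the authors' exact IOR command lines are not shown in the slides.

```python
# Hypothetical sketch of driving the IOR measurements above.
# IOR flag meanings assumed from common usage:
#   -a POSIX  API,  -b  per-task block size,  -t  transfer size,
#   -o  test file on the GPFS mount,  -z  random offsets,  -w/-r  write/read
import subprocess

GPFS_MOUNT = "/gpfs/flash"            # assumed mount point
TESTFILE = f"{GPFS_MOUNT}/ior_testfile"

def run_ior(tasks: int, transfer: str, block: str, random: bool) -> None:
    cmd = ["mpirun", "-np", str(tasks), "ior",
           "-a", "POSIX", "-w", "-r",
           "-t", transfer, "-b", block, "-o", TESTFILE]
    if random:
        cmd.append("-z")              # random offsets for the IOPS runs
    subprocess.run(cmd, check=True)

# Bandwidth scan: 256 KB transfers, 8..256 MPI tasks
for tasks in (8, 16, 32, 64, 128, 256):
    run_ior(tasks, transfer="256k", block="1g", random=False)

# IOPS scan: 4 KB transfers with random offsets
for tasks in (8, 16, 32, 64, 96, 256):
    run_ior(tasks, transfer="4k", block="1g", random=True)
```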
GPFS: Bandwidth Measurements
[Bar chart: sequential and random read/write bandwidth (MB/s) vs. number of MPI tasks (8-256)]
GPFS: IOPS Measurements
[Bar chart: sequential and random read/write IOPS vs. number of MPI tasks (8-256)]
GPFS Block Size Variation: Read
[Line chart: sequential read bandwidth (MB/s) vs. I/O block size (4K-16M) for GPFS filesystem block sizes of 16K, 256K, and 1M]
GPFS Block Size Variation: Write
[Line chart: sequential write bandwidth (MB/s) vs. I/O block size (4K-16M) for GPFS filesystem block sizes of 16K, 256K, and 1M]
Performance for Two Devices Simultaneously
[Bar chart: read and write performance ratio for one device, two devices used simultaneously, and two devices in RAID0; ~38% degradation noted]
GPFS Unaligned I/O Performance
[Bar chart: unaligned read and write performance as a percentage of aligned performance, for a single device and for GPFS]
Lustre - GPFS Comparison
[Bar charts: read and write bandwidth (MB/s) and IOPS for Lustre vs. GPFS]
Applications for Flash?

| Device          | Price ($) | Capacity (GB) | Bandwidth (GB/s) | IOPS    | $/GB   | $/(GB/s) | $/IOP |
|-----------------|-----------|---------------|------------------|---------|--------|----------|-------|
| SATA MLC Flash  | 120       | 64            | 0.2              | 8,600   | $1.88  | $600     | $0.01 |
| SATA SLC Flash  | 740       | 64            | 0.2              | 5,000   | $11.56 | $3,700   | $0.15 |
| PCI SLC Flash   | 11,500    | 640           | 1.2              | 140,000 | $17.97 | $9,583   | $0.08 |
| SATA HDD        | 80        | 2,000         | 0.07             | 90      | $0.04  | $1,143   | $0.89 |
| High-Perf Array | 250,000   | 240,000       | 5                | 100,000 | $1.04  | $50,000  | $2.50 |
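The three cost columns are just ratios of device price to capacity, bandwidth, and IOPS; a quick Python check of the arithmetic (input values copied from the table; small differences are rounding):

```python
# Recompute the derived cost metrics from the table above.
devices = {
    # name:            (price_usd, capacity_gb, bandwidth_gbs, iops)
    "SATA MLC Flash":   (120,       64,          0.2,           8_600),
    "SATA SLC Flash":   (740,       64,          0.2,           5_000),
    "PCI SLC Flash":    (11_500,    640,         1.2,           140_000),
    "SATA HDD":         (80,        2_000,       0.07,          90),
    "High-Perf Array":  (250_000,   240_000,     5,             100_000),
}

for name, (price, cap, bw, iops) in devices.items():
    print(f"{name:16s}  ${price/cap:8.2f}/GB  "
          f"${price/bw:10,.0f}/(GB/s)  ${price/iops:6.2f}/IOP")
```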
Graph500 with NAND Flash
Graph500: Traversing massive graphs with NAND Flash
Roger Pearce (1,2), Maya Gokhale (1), Nancy M. Amato (2)
(1) Center for Applied Scientific Computing, Lawrence Livermore National Laboratory
(2) Department of Computer Science and Engineering, Texas A&M University
June 2011
This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-PRES-487136.
Graph500 with NAND Flash: Implementation
- Semi-external: for G = (V, E), O(|V|) data fits in main memory
  - Read-only from NAND flash; output and algorithm data kept in memory
  - e.g., store the graph in external memory, keep BFS data in memory
- Asynchronous traversal technique [SC 2010]
  - Exploit fine-grained path parallelism
  - Tolerate data latencies to graph storage (NAND flash)
  - Re-order vertex visitation to improve page-level locality
  - Allow over-subscription of thread-level parallelism (256 threads) to maximize NAND flash IOPS
- Graph stored as CSR on the flash device; read-only
(A minimal sketch of the semi-external idea follows below.)
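A minimal sketch of the semi-external layout described above: the CSR arrays live on the flash device and are memory-mapped read-only, while the O(|V|) BFS state stays in DRAM. This is a plain level-synchronous BFS for illustration, not the authors' asynchronous traversal technique; the file names and dtypes are assumptions.

```python
# Semi-external BFS sketch: graph on flash, per-vertex state in memory.
import numpy as np
from collections import deque

def semi_external_bfs(offsets_path: str, edges_path: str,
                      num_vertices: int, source: int) -> np.ndarray:
    # CSR graph data stays on flash; the OS pages it in on demand.
    offsets = np.memmap(offsets_path, dtype=np.uint64, mode="r",
                        shape=(num_vertices + 1,))
    edges = np.memmap(edges_path, dtype=np.uint32, mode="r")

    # Per-vertex algorithm state is the only thing kept in DRAM.
    level = np.full(num_vertices, -1, dtype=np.int32)
    level[source] = 0
    frontier = deque([source])

    while frontier:
        v = frontier.popleft()
        for u in edges[offsets[v]:offsets[v + 1]]:   # these reads hit flash
            if level[u] == -1:
                level[u] = level[v] + 1
                frontier.append(u)
    return level
```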
Graph500 with NAND Flash: Experimental Setup and Results
- Hardware (Kraken): single node, 32-core Opteron 6128 @ 2.0 GHz, 512 GB DRAM, 4x 640GB Fusion-io MLC in software RAID0, Red Hat Enterprise Linux 5.6
- Approximate system cost: $71K ($25K for the base system, $46K for 2.56 TB of Fusion-io NAND flash)
- Using Fusion-io: 8x larger problem with 50% performance loss compared with DRAM only
  - DRAM + Fusion-io: Scale 34, 55.6 MTEPS
  - DRAM only: Scale 31, 104.6 MTEPS
Summary
- Bandwidths per device are impressive, but compare them with a RAIDed set of regular HDDs.
- IOPS numbers are very impressive: ~100x HDD, and RAIDing won't help there.
- Large numbers of threads are needed to saturate the devices (Pearce, Gokhale & Amato, SC10).
- The previous I/O pattern can affect performance.
- Parallel filesystem software needs tweaking for use with flash:
  - Read BW: OK; Write BW: ~40% of peak
  - Read IOPS: 18% of peak; Write IOPS: 13% of peak
Going Forward
"When you've got a hammer in your hand, everything looks like a nail."
"The first law of technology says we invariably overestimate the short-term impact of new technologies while underestimating their longer-term effects." - Dr. Francis Collins
Flash Going Forward
- $/GB is unlikely to match regular disk.
- $/IOP is already significantly better.
- Energy costs will be less than a regular HDD.
- Usefulness will depend upon the data access pattern.
Reliability: Too Few Electrons Per Gate
Source: "The Inevitable Rise of NVM in Computing," Jim Handy, Non-Volatile Memory Seminar, Hot Chips Conference, August 22, 2010, Memorial Auditorium, Stanford University.
Flash Technology Trends
Source: Ed Doller, V.P. and Chief Memory Systems Architect, Non-Volatile Memory Seminar, Hot Chips Conference, August 22, 2010, Memorial Auditorium, Stanford University.