Reliability of NVM devices for I/O Acceleration on Supercomputing Systems

Reliability of NVM devices for I/O Acceleration on Supercomputing Systems. Hitoshi Sato (*1), Shuichi Ihara (*2), Satoshi Matsuoka (*1). *1 Tokyo Institute of Technology, *2 DataDirect Networks Japan

Requirements for Next-gen I/O Subsystems. TSUBAME2 is a pioneer in the use of local flash storage, with 200TB of SSDs in productive operation since 2010 and tiered, hybrid storage environments combining local flash with external Lustre file systems. Industry status: various flash devices are emerging and are available at reasonable cost. New approaches: Flat Buffers, Burst Buffers, etc.

Issues for Introducing Flash Devices on Next-gen Supercomputers. Flash devices have widely varying performance characteristics. How much flash-device reliability is required for the burst buffers of next-gen supercomputers? The goal is to maximize throughput, IOPS, capacity, reliability, etc., while minimizing the cost of introducing flash devices on large-scale systems.

Today's Talk: a preliminary evaluation of the endurance required for SSD devices to support supercomputing I/O workloads, and Lustre I/O monitoring for the I/O workload analysis used to evaluate that endurance.

Towards an Extreme-scale Supercomputing Memory/Storage Architecture. NVM (Non-Volatile Memory, Flash) is a key device for I/O subsystems: capacity, non-volatility, and low energy consumption (vs. DRAM); high throughput, low latency, and low energy consumption (vs. HDD). Various flash devices are available, with throughput from 200 MB/s to over 1 GB/s, IOPS from 10,000 to over 100,000, and interfaces such as PCIe-attached, SATA3, mSATA, M.2, etc.

Reliability Evaluation of Flash Devices. Power-on hours: the device should survive an operation duration of 4-5 years. Endurance: Petabytes Written (PBW) or Terabytes Written (TBW), with Deterioration Progress (%) = Total Write / (PBW or TBW). Average Erase Count (AEC): an upper limit on the erase count (EC) is defined for each flash device, with Deterioration Progress (%) = AEC / Upper Limit of EC. Incidence of bad blocks: impacts wear-leveling and GC.
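As a concrete illustration of the two deterioration-progress metrics above, here is a minimal Python sketch; the device figures in the example are hypothetical and not taken from the slides.

# Deterioration-progress metrics as defined above (sketch only; the example
# numbers are hypothetical, not measurements from the slides).

def deterioration_by_writes(total_write_tb, tbw_rating_tb):
    # Deterioration Progress (%) = Total Write / TBW rating
    return 100.0 * total_write_tb / tbw_rating_tb

def deterioration_by_erases(avg_erase_count, erase_count_limit):
    # Deterioration Progress (%) = AEC / Upper Limit of EC
    return 100.0 * avg_erase_count / erase_count_limit

# e.g. a 450 TBW SATA SSD that has absorbed 90 TB of writes
print(deterioration_by_writes(90, 450))     # 20.0 (%)
print(deterioration_by_erases(600, 3000))   # 20.0 (%)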

Endurance of Existing Flash Devices (Terabytes Written): A: 1.2TB (PCIe), 17,000 TB; B: 2.4TB (PCIe), 15,000 TB; C: 800GB (SATA3), 450 TB; D: 512GB (SATA3), 73 TB; E: 480GB (mSATA), 72 TB; F: 480GB (M.2), 72 TB.

Configuration on Supercomputers: two SSD placements are considered, Flat Buffers (node-local SSDs inside each client node) and Burst Buffers (intermediate BB servers sitting between the clients and the parallel file system).

Configuration on Supercomputers (continued): how many SSD devices are needed so that the endurance of the volumes lasts for the whole operation duration? We analyze the required endurance based on Lustre I/O monitoring.

Lustre I/O Monitoring architecture: collectors run on any type of machine (Lustre clients, routers, storage servers, and burst-buffer storage hardware); a flexible definition file (XML) and a virtualization tool abstract the collected stats; the data are stored in OpenTSDB on HBase, a scalable time-series database, enabling detailed I/O analysis and forecasting, data analysis, and feedback of user/job-based stats to users.

Lustre I/O Monitoring: scalable and flexible. There are many Lustre stats, more than 10M-100M: #OST x #clients x stats x sampling rate, and in the future #OST x #clients x stats x JOBID (UID, etc.) x sampling rate. Today we collect the clients' exported stats from the Lustre servers; Lustre 2.1 is still running on the servers, so jobstats are not available. Data are stored in a scalable backend database (OpenTSDB on HBase), with a selectable open-source frontend. All Lustre versions are supported: the /proc/fs/lustre structure changes across Lustre versions, so a shadow XML definition of each version's /proc/fs/lustre layout is used.
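As a rough sketch of what such a collector might look like (not the authors' actual tool), the following Python script scrapes per-export read/write byte counters from a Lustre OSS and pushes them to OpenTSDB's HTTP /api/put endpoint. The /proc path, stats line format, and endpoint URL are assumptions and vary across Lustre versions, which is exactly why the real system uses per-version XML definitions.

# Collector sketch (illustrative only, not the authors' tool): scrape the
# per-client read/write byte counters exported by a Lustre OSS and push them
# to OpenTSDB. The /proc path, stats line format, and endpoint are assumptions
# and differ between Lustre versions.
import glob, json, re, time, urllib.request

OPENTSDB_URL = "http://opentsdb.example:4242/api/put"        # hypothetical host
STATS_GLOB = "/proc/fs/lustre/obdfilter/*/exports/*/stats"   # OSS per-export stats

def scrape():
    now = int(time.time())
    points = []
    for path in glob.glob(STATS_GLOB):
        parts = path.split("/")        # .../obdfilter/<ost>/exports/<client>/stats
        ost, client = parts[5], parts[7]
        with open(path) as f:
            for line in f:
                # e.g. "write_bytes 26 samples [bytes] 4096 1048576 21458944"
                m = re.match(r"(read_bytes|write_bytes)\s+\d+\s+samples\s+"
                             r"\[bytes\]\s+\d+\s+\d+\s+(\d+)", line)
                if m:
                    points.append({
                        "metric": "lustre." + m.group(1),
                        "timestamp": now,
                        "value": int(m.group(2)),
                        # '@' is not a valid OpenTSDB tag character, so sanitize
                        "tags": {"ost": ost, "client": client.replace("@", "_")},
                    })
    return points

def push(points):
    req = urllib.request.Request(OPENTSDB_URL,
                                 data=json.dumps(points).encode(),
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

if __name__ == "__main__":
    while True:
        data = scrape()
        if data:
            push(data)
        time.sleep(10)    # sampling interval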

Endurance Evaluation Scheme: estimate the total write size [PB] over the operation period (1-5 years) based on the Lustre write data rate [GB/s] (average and maximum); map the total write size onto the aggregated SSD volumes; and evaluate the endurance of the aggregated volumes against that total write size.
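A minimal sketch of the first step, assuming a binary GB-to-PB conversion (which reproduces the ~311 PB over 5 years reported later from the 2.07 GB/s average write rate):

# Project the total write size from the monitored Lustre write data rate.
SECONDS_PER_YEAR = 365 * 24 * 3600

def total_write_pb(avg_write_gb_per_s, years):
    # GB -> PB using binary conversion (1 PB = 1024**2 GB)
    return avg_write_gb_per_s * years * SECONDS_PER_YEAR / 1024**2

print(total_write_pb(2.07, 5))   # ~311 PB for the TSUBAME2 average of 2.07 GB/s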

TSUBAME2 System Overview.
Storage: 11PB total (7PB HDD, 4PB tape, 200TB SSD). Computing nodes: 17.1 PFlops (SFP), 5.76 PFlops (DFP), 224.69 TFlops (CPU), ~100TB memory, ~200TB of local SSDs.
Thin nodes: 1408 HP ProLiant SL390s G7 nodes (32 nodes x 44 racks); CPU: Intel Westmere-EP 2.93GHz, 6 cores x 2 = 12 cores/node; GPU: NVIDIA Tesla K20X, 3 GPUs/node; memory: 54GB (96GB); SSD: 60GB x 2 = 120GB (120GB x 2 = 240GB).
Medium nodes: 24 HP ProLiant DL580 G7 nodes; CPU: Intel Nehalem-EX 2.0GHz, 8 cores x 2 = 32 cores/node; GPU: NVIDIA Tesla S1070, NextIO vCORE Express 2070; memory: 128GB; SSD: 120GB x 4 = 480GB.
Fat nodes: 10 HP ProLiant DL580 G7 nodes; CPU: Intel Nehalem-EX 2.0GHz, 8 cores x 2 = 32 cores/node; GPU: NVIDIA Tesla S1070; memory: 256GB (512GB); SSD: 120GB x 4 = 480GB.
Interconnect: full-bisection optical QDR InfiniBand; core switches: 12 Voltaire Grid Director 4700 (324 IB QDR ports each); edge switches: 179 Voltaire Grid Director 4036 (36 IB QDR ports each); edge switches with 10GbE: 6 Voltaire Grid Director 4036E (34 IB QDR ports and 2 x 10GbE ports each).
Storage systems (attached via QDR IB x4 and 10GbE, on DDN SFA10k #1-#6): GPFS with tape (/data0, /data1), Lustre global work spaces #1-#3 (/work0, /work1, /gscr; 2.4PB HDD and 3.6PB of parallel file system volumes), and 1.2PB of home volumes served via cNFS/clustered Samba with GPFS and NFS/CIFS/iSCSI by BlueARC.

Configuration of SSD Volumes. Configuration: the same as TSUBAME2, i.e., 2 SSD devices per node x 1408 nodes = 2816 devices. Assumptions: written data are distributed equally across the aggregated volumes, and the endurance (TBW/PBW) of the aggregated volumes equals the sum of the endurance of the individual devices.
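Under these assumptions, the aggregate endurance is simply the sum of the per-device ratings; a sketch using the endurance figures from the earlier chart:

# Aggregate endurance of the SSD volume under the equal-distribution assumption:
# 2 devices per node on 1408 nodes, endurance ratings (TBW) from the earlier chart.
DEVICE_TBW = {"A": 17000, "B": 15000, "C": 450, "D": 73, "E": 72, "F": 72}
N_DEVICES = 2 * 1408   # = 2816

for dev, tbw in DEVICE_TBW.items():
    aggregate_pbw = N_DEVICES * tbw / 1000.0   # TBW -> PBW
    print("Device %s: %.0f PBW aggregate endurance" % (dev, aggregate_pbw))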

Target Devices (endurance in Terabytes Written, with per-device read/write throughput): A: 1.2TB (PCIe), 17,000 TB, R 1.5 GB/s / W 1.3 GB/s; B: 2.4TB (PCIe), 15,000 TB, R 3.2 GB/s / W 2.8 GB/s; C: 800GB (SATA3), 450 TB, R 500 MB/s / W 450 MB/s; D: 512GB (SATA3), 73 TB, R 540 MB/s / W 520 MB/s; E: 480GB (mSATA), 72 TB, R 500 MB/s / W 400 MB/s; F: 480GB (M.2), 72 TB, R 500 MB/s / W 400 MB/s.

Aggregate Data Rate for TSUBAME2's Lustre Volumes: MAX 16.5 GB/s, AVG 2.07 GB/s.

Estimated Write Volume Based on the Lustre I/O Workload on TSUBAME2: 311 PB over 5 years.

Endurance Based on the TSUBAME2 Configuration (using 2816 SSD devices): aggregate endurance of the SSD volumes (PBW/TBW), higher is better, for Device A (17 PBW, PCIe), Device B (15 PBW, PCIe), Device C (450 TBW, SATA3), Device D (73 TBW, SATA3), Device E (72 TBW, mSATA), and Device F (72 TBW, M.2).

Number of Required Devices to Support the Estimated Total Write Size over the Operation Duration (311 PB for 5 years): A: 1.2TB (PCIe), 19 devices; B: 2.4TB (PCIe), 21; C: 800GB (SATA3), 692; D: 512GB (SATA3), 4,265; E: 480GB (mSATA), 4,265; F: 480GB (M.2), 4,324.
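These counts follow from dividing the estimated total write size by each device's endurance rating; the sketch below approximately reproduces the chart (small deviations are expected, since the slides presumably use more precise ratings and write totals):

# Devices required so that aggregate endurance covers ~311 PB written in 5 years.
import math

TOTAL_WRITE_TB = 311 * 1000   # ~311 PB of writes (decimal TB)
DEVICE_TBW = {"A": 17000, "B": 15000, "C": 450, "D": 73, "E": 72, "F": 72}

for dev, tbw in DEVICE_TBW.items():
    print(dev, math.ceil(TOTAL_WRITE_TB / tbw))
# -> A: 19, B: 21, C: 692, and a few thousand each for D, E, F, close to the chart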

Aggregate Throughput [GB/s], read and write, of the required devices for each device type. Per-device throughput: A: 1.2TB (PCIe), R 1.5 GB/s / W 1.3 GB/s; B: 2.4TB (PCIe), R 3.2 GB/s / W 2.8 GB/s; C: 800GB (SATA3), R 500 MB/s / W 450 MB/s; D: 512GB (SATA3), R 540 MB/s / W 520 MB/s; E: 480GB (mSATA), R 500 MB/s / W 400 MB/s; F: 480GB (M.2), R 500 MB/s / W 400 MB/s.
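The aggregate throughput is simply the device count from the previous chart times the per-device read/write rates; a short sketch, with all counts and rates copied from the slides:

# Aggregate read/write throughput per configuration (count x per-device GB/s).
DEVICES = {            # device: (count, read GB/s, write GB/s)
    "A": (19,   1.5,  1.3),
    "B": (21,   3.2,  2.8),
    "C": (692,  0.50, 0.45),
    "D": (4265, 0.54, 0.52),
    "E": (4265, 0.50, 0.40),
    "F": (4324, 0.50, 0.40),
}
for dev, (n, r, w) in DEVICES.items():
    print("%s: read %.0f GB/s, write %.0f GB/s" % (dev, n * r, n * w))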

Discussion. We can configure SSD-based buffer volumes using commodity SSD devices. Based on the Lustre write I/O workload, several thousand devices are needed; if we use high-end devices (PCIe-attached flash), the number of devices can be consolidated while still delivering the required performance within the limited endurance of the volumes. Further evaluation using more detailed I/O behavior, e.g., emulation of burst buffers, is needed.

Summary. We presented a preliminary evaluation of the endurance required for SSD devices to support supercomputing I/O workloads, based on Lustre I/O monitoring, together with high-performance Lustre I/O monitoring for detailed I/O workload analysis.