Current Status of FEFS for the K computer




Current Status of FEFS for the K computer
Shinji Sumimoto, Fujitsu Limited
Apr. 24, 2012, LUG2012@Austin

Outline

RIKEN and Fujitsu are jointly developing the K computer*. Development continues with system software tuning for completion in June 2012.

Outline of This Talk
- K computer and FEFS overview
- Development status of FEFS
- Performance

* Nickname of the Next Generation Supercomputer developed by RIKEN and Fujitsu

World's No. 1 Again at SC11

- TOP500 list: #1, with performance of over 10 peta*flops
- 4 HPC Challenge Awards at SC11
- SC11 Gordon Bell Prize
- TOP500 award ceremony at SC11

* 10 peta = 10,000,000,000,000,000

System Overview of the K computer

Processor: SPARC64 VIIIfx
- Fujitsu's 45nm technology; 8 cores, 6MB cache memory, and MAC on a single chip
- High performance and high reliability with low power consumption

Interconnect controller: ICC
- 6-dimensional torus/mesh (Tofu interconnect)

System board: highly efficient cooling, with 4 compute nodes
- Water cooling of processors, ICCs, etc.
- Increases component lifetime and reduces electric leakage current through low-temperature water cooling

Rack: high density, 102 nodes per rack
- 24 system boards, 6 IO system boards, system disk, power units
- 10PFlops requires more than 800 racks

Our goals
- Realizing the world's top performance
- Keeping stable system operation across an 80K-node system

System Configuration

[System diagram] Each compute node contains an 8-core processor with shared cache, SX controller, MAC, FMA units, and a Tofu interface; memory bandwidth is 64GB/s (DDR3) and the Tofu interconnect hardware bandwidth is 100GB/s, with hardware barrier support. Compute nodes are linked by the Tofu interconnect (compute and IO planes) to the IO network; management and maintenance servers, the portal, and the frontend connect over a cluster interconnect to the file servers of the local and global file systems.

IO Architecture of the K computer

IO system architecture
- Local storage for job execution: ETERNUS (2.5-inch, RAID5)
- Global storage for shared use: ETERNUS (3.5-inch, RAID6)

The configuration of each file system is optimized for its role:
- Local file system: over 2,400 OSSs (for high parallelism)
- Global file system: over 80 OSSs (for large capacity)

[Diagram] Compute nodes reach IO nodes over the Tofu interconnect; each IO node attaches local disks (local file system) via PCIe/FC RAID and connects through QDR InfiniBand switches to the global file servers and global disks (global file system).

Overview of FEFS

Goals: to realize world-top-class capacity and performance
- File system: 100PB, 1TB/s
- Based on the Lustre file system with several extensions; these extensions will be contributed to the Lustre community.

A layered file system is introduced to match the characteristics of each file layer:
- Temporary fast scratch FS (local) and permanent shared FS (global)
- Staging, which transfers files between the local FS and the global FS, is controlled by the batch scheduler.

[Diagram: configuration of FEFS for the K computer] The local file system (temporary work area) is tuned for performance; the global file system (permanent data) is tuned for capacity, reliability, and ease of use. Both are provided by FEFS cluster file servers.

Lustre Specification and Goals of FEFS

  Feature                                Current Lustre   Our 2012 Goals
  System limits
    Max file system size                 64PB             100PB (8EB)
    Max file size                        320TB            1PB (8EB)
    Max #files                           4G               32G (8E)
    Max OST size                         16TB             100TB (1PB)
    Max stripe count                     160              20k
    Max ACL entries                      32               8191
  Node scalability
    Max #OSSs                            1020             20k
    Max #OSTs                            8150             20k
    Max #clients                         128K             1M
  Block size of ldiskfs (backend FS)     4KB              ~512KB

These goals were contributed to OpenSFS in February 2011.

Lustre Extensions of FEFS

We have extended several Lustre specifications and functions; the feature set combines extended existing Lustre functions, newly added functions, and reused Lustre functions:

- Extended specifications (scalability): max file size, max # of files, max # of clients, max # of stripes, 512KB blocks
- Communication: Tofu support, IB multi-rail, IB/GbE LNET routing
- Interoperability: file distribution, parallel IO, client caches, NFS export, server cache
- Reliability: automatic recovery, journaling/fsck
- High performance: MDS response, IO bandwidth control, IO zoning, OS noise reduction
- Operation: ACL, disk quota, dynamic configuration changes, RAS, QoS, directory quota

FEFS Development Status on the K computer

Currently, almost all of the functions are implemented and in the testing phase.
- Current base version: Lustre 1.8.5 + α

8PF-class scalability testing. Main testing targets:
- User-available memory over 90% of physical memory
- Minimizing the impact of OS jitter on application performance
- IO performance
- RAS

The following slides discuss the memory issues and OS jitter.

To Solve the Memory Issues

Goal: system memory usage < 1.6GB

Minimizing memory usage for the buffer cache on each FEFS client
- Added functionality to limit dirty pages on write operations (see the sketch below).

Minimizing the memory consumption of FEFS for an ultra-large-scale system
- Investigating the causes and fixing them.
- We found several issues in the current basic Lustre design.
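
The dirty-page limit can be pictured as a simple cap checked on the client write path before data is buffered. The sketch below is a conceptual illustration only, assuming such a cap; the identifiers (throttle_dirty, flush_some_dirty_pages, DIRTY_LIMIT_MB) are hypothetical and this is not FEFS or Lustre source code.

    /* Conceptual sketch of a per-client dirty-page cap on the write path.
     * All identifiers are illustrative; this is not FEFS/Lustre code. */
    #include <stdio.h>

    #define PAGE_SIZE_BYTES 4096UL
    #define DIRTY_LIMIT_MB  4UL    /* e.g. a cap shrunk from the 32MB default */

    static unsigned long dirty_pages;   /* pages buffered but not yet written back */
    static const unsigned long dirty_limit_pages =
            DIRTY_LIMIT_MB * 1024UL * 1024UL / PAGE_SIZE_BYTES;

    /* Stand-in for writing some dirty pages back to the OST. */
    static void flush_some_dirty_pages(void)
    {
            dirty_pages = dirty_pages > 64 ? dirty_pages - 64 : 0;
    }

    /* Called before buffering 'npages' of user data for a write. */
    static void throttle_dirty(unsigned long npages)
    {
            while (dirty_pages + npages > dirty_limit_pages)
                    flush_some_dirty_pages();  /* block the writer until under the cap */
            dirty_pages += npages;
    }

    int main(void)
    {
            throttle_dirty(256);  /* buffer 1MB of writes (256 x 4KB pages) */
            printf("dirty pages: %lu (cap %lu)\n", dirty_pages, dirty_limit_pages);
            return 0;
    }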

Memory Issue: Request Buffer (1)

Issue: in Lustre, the request buffer on a client is pre-allocated according to the number of OSTs.
- Buffer size = 8KB x 10 x #OSTs per request
- #OST = 1,000: 80MB per request; #OST = 10,000: 800MB per request

Our approach: on-demand allocation, i.e. allocate a request buffer only when it is required.
- Current Lustre: everything is pre-allocated (used = allocated), so much of the buffer sits unused.
- FEFS: allocate only the buffers for requests actually being sent (a back-of-the-envelope comparison follows below).
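
The scale of the problem follows directly from that formula. As a back-of-the-envelope check, the sketch below simply reproduces the arithmetic using the constants quoted on the slide (8KB per buffer, 10 buffers per OST); it is illustrative and not FEFS code.

    /* Pre-allocated request-buffer memory per request: 8KB x 10 x #OSTs. */
    #include <stdio.h>

    int main(void)
    {
            const unsigned long buf_bytes    = 8UL * 1024UL;       /* 8KB per buffer  */
            const unsigned long bufs_per_ost = 10UL;               /* buffers per OST */
            const unsigned long n_osts[]     = { 1000UL, 10000UL };

            for (int i = 0; i < 2; i++) {
                    unsigned long mb = buf_bytes * bufs_per_ost * n_osts[i]
                                       / (1024UL * 1024UL);
                    printf("#OST=%-5lu -> %4lu MB pre-allocated per request\n",
                           n_osts[i], mb);
            }
            /* ~78MB and ~781MB, i.e. the ~80MB and ~800MB figures on the slide;
             * with on-demand allocation, only the buffers for requests actually
             * being sent exist. */
            return 0;
    }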

Memory Issue: Request Buffer (2)

Issue: when creating a file, the client allocates a request buffer of 24B x (max. OST index).
- OST index = 1,000: 23KB per request; OST index = 10,000: 234KB per request

Our approach (illustrated below):
- Step 1: reduce the buffer size to 24B x #existing OSTs.
- Step 2: minimize it to 24B x #striped OSTs (e.g. two entries for a stripe count of 2).
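
The same kind of saving applies to the 24-byte per-OST field in the create request. The sketch below compares the three sizing policies; only the 24-byte entry size and the 10,000 OST index come from the slide, the other counts are arbitrary examples.

    /* Create-request buffer size under the three sizing policies. */
    #include <stdio.h>

    int main(void)
    {
            const unsigned long entry_bytes   = 24UL;    /* per-OST entry size        */
            const unsigned long max_ost_index = 10000UL; /* highest configured index  */
            const unsigned long existing_osts = 2592UL;  /* OSTs actually present     */
            const unsigned long stripe_count  = 2UL;     /* OSTs this file stripes on */

            printf("current Lustre (max OST index): %6lu B\n", entry_bytes * max_ost_index);
            printf("FEFS step-1 (existing OSTs):    %6lu B\n", entry_bytes * existing_osts);
            printf("FEFS step-2 (striped OSTs):     %6lu B\n", entry_bytes * stripe_count);
            return 0;
    }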

Memory Issue: Granted Cache

Issue: dirty pages on the clients have to be written to the target OST.
- The default client limit is 32MB (max_dirty_mb).
- Each OST keeps free disk space reserved for all clients: required free space = 32MB/client x #clients per OST.
- 10,000 clients: 320GB/OST; 100,000 clients: 3,200GB/OST (3.2TB!), as computed below.

Our approach: shrink the dirty-page limit from 32MB to 1-4MB.
- This causes serious degradation of I/O performance (not acceptable).
- It is supposed to be fixed in Lustre 2.x.
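
The reservation per OST is simply max_dirty_mb multiplied by the number of clients. The sketch below is a minimal calculation with the client counts from the slide (decimal units, matching the slide's figures); it is illustrative only.

    /* Free space each OST must reserve so every client can flush its dirty
     * cache: max_dirty_mb x #clients. */
    #include <stdio.h>

    int main(void)
    {
            const unsigned long long max_dirty_mb = 32ULL;  /* Lustre default */
            const unsigned long long clients[]    = { 10000ULL, 100000ULL };

            for (int i = 0; i < 2; i++) {
                    unsigned long long gb = max_dirty_mb * clients[i] / 1000ULL;
                    printf("%6llu clients -> %4llu GB reserved per OST\n",
                           clients[i], gb);
            }
            /* 320 GB and 3,200 GB (3.2TB), as on the slide.  Shrinking
             * max_dirty_mb to 1-4MB cuts this by 8-32x, but at a serious
             * cost in write performance. */
            return 0;
    }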

To Solve the OS Jitter Problem

- Minimizing network traffic congestion: ll_ping
- Minimizing background daemons: ll_ping, ldlm_poold

Minimizing Network Traffic Congestion by ll_ping

The number of monitoring pings is #clients x #servers. The resulting network congestion and request timeouts cause:
- MPI and file I/O communication degradation
- Application performance degradation through OS jitter

Our solution: stop the periodic (interval) ll_ping on clients.

[Diagram: compute-node clients pinging the MDS and OSS file servers over the I/O network]

Minimizing Background Daemons

ll_ping
- Problem: all clients broadcast monitoring pings to all OSTs (not OSSs) at a regular interval of 25 seconds. 100K clients x 10K OSTs means 1G pings every 25 seconds, and sending and receiving these pings becomes OS jitter.
- Our solution: stop broadcasting periodic pings from clients. Other pings (for recovery, I/O confirmation, etc.) are kept, and node failure is detected by the system functions of the K computer. A rough model follows below.

ldlm_poold
- Problem: the processing time of ldlm_poold on a client grows in proportion to the number of OSTs. The daemon manages the pool of LDLM locks and wakes up at a regular interval of 1 second.
- Our solution: reduce the processing time per wake-up by dividing the daemon's internal operation.
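
A rough model of the client pinger makes the numbers, and the effect of the fix, concrete. The sketch below is purely illustrative: the function name, the enable_periodic_ping flag, and the way the interval is handled are assumptions, not the actual ll_ping implementation.

    /* Toy model: with periodic pings enabled, every client pings every OST
     * once per interval; disabling the broadcast removes that traffic while
     * recovery/IO-confirmation pings (not modelled here) remain. */
    #include <stdio.h>

    #define PING_INTERVAL_SEC 25ULL

    static unsigned long long pings_per_interval(unsigned long long clients,
                                                 unsigned long long osts,
                                                 int enable_periodic_ping)
    {
            return enable_periodic_ping ? clients * osts : 0ULL;
    }

    int main(void)
    {
            const unsigned long long clients = 100000ULL;  /* 100K clients */
            const unsigned long long osts    = 10000ULL;   /* 10K OSTs     */

            unsigned long long on = pings_per_interval(clients, osts, 1);

            printf("periodic pings on : %llu pings per %llu s (~%llu per second)\n",
                   on, PING_INTERVAL_SEC, on / PING_INTERVAL_SEC);
            printf("periodic pings off: %llu\n", pings_per_interval(clients, osts, 0));
            return 0;
    }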

Preliminary Benchmark Results on the 8PF Local File System

Over 1TB/s of IOR bandwidth on the K computer; 1.3TB/s IOR read on 80% of the K computer.

[Plots: IOR distributed write and read bandwidth in GB/s (direct I/O, file per process) versus number of OSSs, 0 to 2,500]

            Measured (2007 OSS)   Full system, estimated (2592 OSS)
  Read      1.31 TB/s             1.68 TB/s
  Write     0.83 TB/s             1.06 TB/s

Performance Evaluation of FEFS (2)
(Collaborative work with RIKEN on the K computer)

Metadata performance of mdtest (unique directory).
- FEFS (K computer) MDS: RX300S6 (X5680 3.33GHz 6-core x2, 48GB, IB(QDR) x2)
- FEFS, Lustre (IA) MDS: RX200S5 (E5520 2.27GHz 4-core x2, 48GB, IB(QDR) x1)

  IOPS      FEFS (K computer)   FEFS (IA)   Lustre 1.8.5 (IA)   Lustre 2.0.0.1 (IA)
  create    34697.6             31803.9     24628.1             17672.2
  unlink    39660.5             26049.5     26419.5             20231.5
  mkdir     87741.6             77931.3     38015.5             22846.8
  rmdir     28153.8             24671.4     17565.1             13973.4

* We will evaluate the performance of the latest Lustre 2.1.0.

Summary and Future Work

We described an overview of FEFS for the K computer, developed by RIKEN and Fujitsu:
- High-speed file I/O and MPI-IO with little impact on job execution time
- Huge file system capacity; capacity and speed scale by adding hardware
- High reliability (service continuity, data integrity) and usability (easy operation and management)

Future work:
- Stable operation on the K computer
- Rebase onto a newer version of Lustre (2.x)
- Contribution of our extensions to the Lustre community

Press Release at SC11

Copyright 2010-2012 FUJITSU LIMITED