
Lustre & Cluster - monitoring the whole thing
Erich Focht, NEC HPC Europe
LAD 2014, Reims, September 22-23, 2014

Overview
- Introduction
- LXFS
- Lustre in a data center
- IBviz: InfiniBand fabric visualization
- Monitoring Lustre & other metrics
- Aggregators
- Percentiles
- Jobs

Introduction
NEC Corporation: HQ in Tokyo, Japan; established 1899; 100,000 employees; business activities in over 140 countries.
NEC HPC Europe: part of NEC Deutschland, with offices in Düsseldorf, Paris and Stuttgart. Activities: scalar clusters, parallel filesystems and storage, vector systems, solutions, services, R&D.

Parallel Filesystem Product: LXFS
Built with Lustre. Since 2007 a parallel filesystem solution integrating validated hardware (servers, networks, high-performance storage) with deployment, management and monitoring software.
Two flavours:
- LXFS Standard: community edition Lustre
- LXFS Enterprise: Intel(R) Enterprise Edition for Lustre* Software, with level 3 support from Intel

Lustre-Based LXFS Appliance
Each LXFS block is a pre-installed appliance. The entire storage cluster configuration lives in one place, the LxDaemon:
- describes the setup and hierarchy
- contains all configurable parameters
- single data source used for deployment, management and monitoring
Deploying the storage cluster is simple and reliable:
- configuration of the components is auto-generated from the data in LxDaemon (see the sketch below)
- storage device setup and configuration
- server setup and configuration
- HA setup: services, dependencies, STONITH
- formatting and Lustre mounting
Managing & monitoring LXFS:
- simple CLI, hiding the complexity of the underlying setup
- health monitoring: pro-active, sends messages to admins, reacts to issues
- performance monitoring: time history, GUI, discover issues and bottlenecks
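As a purely illustrative sketch of the single-data-source idea (the real LxDaemon data model is not public), the Python snippet below derives mkfs.lustre commands for all OSTs from one cluster description; the filesystem name, NIDs, hostnames and device paths are made up.

```python
# Illustrative only: one description of the storage cluster, from which the
# formatting commands for every OST are generated. Real LXFS deployments use
# LxDaemon's internal data model; all names and devices here are invented.
cluster = {
    "fsname": "lxfs",
    "mgsnode": "192.168.100.1@o2ib",
    "osts": [
        {"index": 0, "oss": "oss1", "device": "/dev/mapper/ost0000"},
        {"index": 1, "oss": "oss1", "device": "/dev/mapper/ost0001"},
        {"index": 2, "oss": "oss2", "device": "/dev/mapper/ost0002"},
    ],
}

def format_commands(c):
    """Derive mkfs.lustre commands for all OSTs from the single data source."""
    for ost in c["osts"]:
        yield (f'ssh {ost["oss"]} mkfs.lustre --ost '
               f'--fsname={c["fsname"]} --index={ost["index"]} '
               f'--mgsnode={c["mgsnode"]} {ost["device"]}')

for cmd in format_commands(cluster):
    print(cmd)
```

The point of keeping everything in one structure is that deployment, HA configuration and monitoring all read the same source instead of drifting apart.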

Example: Performance-Optimized OSS Block
Specs: > 6 GB/s write, > 7.5 GB/s read, up to 384 TB in 8U (storage).
Two servers, highly available, active-active:
- 2 x 6-core E5-2620(-v2)
- 2 x LSI 9207 HBA, 1 x Mellanox ConnectX-3 FDR HCA
Storage: NEC SNA460 + SNA060 (built on NetApp E5500), 80 NL-SAS 7.2k drives.
[Diagram: OSS1 serving OST1-6, OSS2 serving OST7-12, attached to the RAID and EBOD enclosures. Performance example: IOR in a noisy environment.]

Concrete Setup: Automotive Company Datacenter
[Diagram: a top-level switch connects two high-bandwidth FDR islands with some 3000+ compute nodes; the Lustre servers attach with 2 x FDR each; island uplinks of 12 x 4 GB/s and a server side of 8 x 6 GB/s give > 20 GB/s aggregated bandwidth.]

Demonstrate 20 GB/s on 8 or 16 Nodes! Check all layers:
- OSS to block devices: vdbench, sgpdd-survey; with Lustre: obdfilter-survey. Beware of reads/writes that only hit the controller cache. Check PCIe and the (SAS) HBAs.
- LNET: lnet-selftest (see the sketch below). Enough peer credits? Enough transactions in flight?
- Lustre client: IOR with stonewalling... but... (see the next slide)
[Plot: read and write bandwidth in MB/s over time (10 s intervals) for a benchmark with 2 hosts, 8 volume groups, 64 threads per block device, in the 4-10 GB/s range.]
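As a hedged illustration of the LNET layer check, here is a small Python wrapper around the usual lnet_selftest workflow (lst new_session / add_group / add_batch / add_test / run / stat / end_session) as documented in the Lustre manual; the NIDs, group names and test size are placeholders, and exact lst options may differ between Lustre versions.

```python
# Minimal sketch of driving an LNET self-test run from Python.
# Placeholder NIDs; run on a node where the lnet_selftest module is available.
import os
import subprocess

env = dict(os.environ, LST_SESSION=str(os.getpid()))  # lst needs a session id

def lst(*args):
    """Run one lst subcommand with the session environment set."""
    return subprocess.run(("lst",) + args, env=env, check=True)

subprocess.run(["modprobe", "lnet_selftest"], check=True)

lst("new_session", "bulk_bench")
lst("add_group", "clients", "192.168.1.[10-17]@o2ib")   # placeholder NIDs
lst("add_group", "servers", "192.168.1.[1-4]@o2ib")     # placeholder NIDs
lst("add_batch", "bulk_rw")
lst("add_test", "--batch", "bulk_rw", "--from", "clients", "--to", "servers",
    "brw", "write", "check=simple", "size=1M")
lst("run", "bulk_rw")

# sample the server-side counters for ~30 s, then tear everything down
stat = subprocess.Popen(["lst", "stat", "servers"], env=env)
try:
    stat.wait(timeout=30)
except subprocess.TimeoutExpired:
    stat.kill()
lst("end_session")
```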

The InfiniBand Fabric: Routes!
[Diagram: the same datacenter topology, highlighting the paths used by 2 client nodes. With the static routing at most 8 paths are used for writes, and the read and write paths differ.]

Demonstrate 20 GB/s on 8 or 16 Nodes!
Given that the routing is static, select the nodes carefully out of a pool of reserved nodes: minimize the oversubscription of InfiniBand links on both paths,
- ... to the OSTs/OSSes that are involved
- ... and back to the client nodes
(a node-selection sketch follows below). Use stonewalling.
Result from 16 clients: write max 24.7 GB/s, read max 26.6 GB/s.
[Plot: bandwidth in MB/s vs. number of threads (128-1024) for the 16 clients.]
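A minimal sketch of such a node selection, assuming we already know which ISLs each candidate client's static routes to the involved OSSes traverse (client names and link labels below are made up): greedily add the client that increases the worst per-ISL route count the least.

```python
from collections import Counter

# Hypothetical input: for each candidate client, the ISLs its static routes
# to the involved OSSes cross (one entry per route). In practice this comes
# from a route counter like the one sketched for IBviz further below.
routes = {
    "cn01": ["leaf1-spineA", "leaf1-spineA"],
    "cn02": ["leaf1-spineB", "leaf1-spineA"],
    "cn03": ["leaf2-spineA", "leaf2-spineB"],
    "cn04": ["leaf2-spineB", "leaf2-spineB"],
    "cn05": ["leaf3-spineA", "leaf3-spineB"],
    "cn06": ["leaf3-spineA", "leaf3-spineA"],
}

def pick_clients(routes, k):
    """Greedily pick k clients so that the maximum number of routes sharing
    any single ISL stays as low as possible."""
    chosen, load = [], Counter()
    candidates = set(routes)
    for _ in range(k):
        best = min(candidates,
                   key=lambda c: max((load + Counter(routes[c])).values()))
        chosen.append(best)
        load.update(routes[best])
        candidates.remove(best)
    return chosen, load

nodes, load = pick_clients(routes, k=4)
print("selected clients:", nodes)
print("routes per ISL:", dict(load))
```

A brute-force search over all node combinations would be exact, but a greedy choice is usually enough to avoid the worst hot spots.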

Trivial Remarks
- The performance of the storage devices is huge.
- The performance of one OSS is on the order of the IB link bandwidth.
- If the inter-switch links are overcommitted, we quickly hit their limit.
- The sane way of integrating is more expensive and more complex: Lustre/LNET routers
  - additional hardware
  - each set of routers responsible for its island
  - can also be used to limit the I/O bandwidth within an island

Integrated in the Datacenter: The Right Way
[Diagram: the same topology, with a group of LNET routers inserted between the top-level switch and each high-bandwidth island; 2 x FDR per Lustre server, 8 x 6 GB/s, > 20 GB/s aggregated bandwidth.]
- at least 30+ additional servers, IB cables and free IB switch ports
- separate configuration for each island's routers in order to keep the scope local (see the sketch below)
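As a hedged sketch of what "separate config per island" can look like, the snippet below generates lnet module options for the clients and routers of each island from one topology dict. The network names (o2ib0/o2ib1/...), interfaces and router NIDs are placeholders; the routes=/forwarding= syntax follows the usual LNET router setup described in the Lustre manual.

```python
# Illustrative generation of per-island LNET modprobe options.
islands = {
    "island1": {"net": "o2ib1", "routers": "10.10.1.[1-4]@o2ib1"},
    "island2": {"net": "o2ib2", "routers": "10.10.2.[1-4]@o2ib2"},
}
storage_net = "o2ib0"   # the network the Lustre servers live on (placeholder)

def client_conf(island):
    """modprobe.d snippet for compute nodes inside one island."""
    i = islands[island]
    return (f'options lnet networks="{i["net"]}(ib0)" '
            f'routes="{storage_net} {i["routers"]}"')

def router_conf(island):
    """modprobe.d snippet for the LNET routers of one island."""
    i = islands[island]
    return (f'options lnet networks="{storage_net}(ib0),{i["net"]}(ib1)" '
            f'forwarding="enabled"')

for name in islands:
    print(f"# {name} clients:\n{client_conf(name)}")
    print(f"# {name} routers:\n{router_conf(name)}\n")
```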

IBviz
- Development inspired by the benchmarking experience described before.
- Visualize the detected IB fabric, with a focus on inter-switch links (ISLs).
- Mistakes in the network become visible (e.g. missing links).
- Compute the number of static routes (read/write) going over each ISL (see the sketch below):
  - Which links are actually used in an MPI job, and how much bandwidth is really available?
  - Which links are used for I/O, and what bandwidth is available?
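The route counting itself can be done offline from the static forwarding tables. Below is a self-contained Python sketch on a toy fabric (in practice the topology and forwarding tables would be parsed from ibnetdiscover / dump_lfts output, which is not shown here): walk each client-to-OSS route through the per-switch tables and count how often every ISL is crossed.

```python
from collections import Counter

# Toy fabric description. Each switch has: links (port -> neighbor name) and
# an LFT-like table (destination node -> output port). Hosts hang off one port.
fabric = {
    "leaf1": {"links": {1: "cn01", 2: "cn02", 10: "spine"},
              "lft":   {"oss1": 10, "oss2": 10, "cn01": 1, "cn02": 2}},
    "leaf2": {"links": {1: "oss1", 2: "oss2", 10: "spine"},
              "lft":   {"oss1": 1, "oss2": 2, "cn01": 10, "cn02": 10}},
    "spine": {"links": {1: "leaf1", 2: "leaf2"},
              "lft":   {"oss1": 2, "oss2": 2, "cn01": 1, "cn02": 1}},
}
host_switch = {"cn01": "leaf1", "cn02": "leaf1", "oss1": "leaf2", "oss2": "leaf2"}

def route_links(src, dst):
    """Follow the static forwarding tables from src to dst and return the
    inter-switch links (as (switch, neighbor) pairs) crossed on the way."""
    links, here = [], host_switch[src]
    while True:
        port = fabric[here]["lft"][dst]
        nxt = fabric[here]["links"][port]
        if nxt == dst:               # reached the destination host
            return links
        links.append((here, nxt))    # this hop crosses an ISL
        here = nxt

# Count how many client->OSS routes (the "write" direction) cross each ISL.
isl_load = Counter()
for client in ("cn01", "cn02"):
    for oss in ("oss1", "oss2"):
        isl_load.update(route_links(client, oss))

for (sw, nbr), n in isl_load.most_common():
    print(f"{sw} -> {nbr}: {n} routes")
```

Counting the OSS-to-client direction separately makes the read/write asymmetry of the previous slide visible.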

Routes Between Compute Nodes: Balanced

Routes to Lustre Servers: Unbalanced

Monitoring Cluster & Lustre Together
- Lustre servers... ok: Ganglia standard metrics, plus own metric generators delivering metrics into Ganglia and Nagios (see the sketch below)
- Cluster compute nodes... ok: see above; Ganglia, Nagios, own metrics
- IB ports... ok: own metrics in Ganglia, attached to hosts
- Needed: more and better stats on IB routes and inter-switch links... routes: ok, ISL stats: doable, but they don't fit well into the host hierarchy... Ganglia won't work any more
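Such "own metric generators" can be as simple as a script that samples a value and hands it to gmetric. A minimal Python sketch, assuming gmetric/gmond are installed and configured on the host; the metric name and value below are placeholders.

```python
import subprocess

def send_to_ganglia(name, value, units, group="lustre", metric_type="double"):
    """Push one custom metric into Ganglia by shelling out to gmetric."""
    subprocess.run(
        ["gmetric",
         "--name", name,
         "--value", str(value),
         "--type", metric_type,
         "--units", units,
         "--group", group],
        check=True,
    )

# Example: report an OST write bandwidth sampled from server-side stats
# (the value here is a placeholder).
send_to_ganglia("ost0000_write_bytes_per_sec", 512.0 * 1024 * 1024, "bytes/sec")
```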

Monitoring Cluster & Lustre Together
Lustre clients:
- plenty of statistics on clients and servers; often there is no access to the clients, so use the per-client metrics on the servers
- scaling problem! At least 4 metrics per OST and client, O(10) or more metrics per MDT and client
- 1000 clients, 50 OSTs: 200k additional metrics in the monitoring system
Jobs:
- ultimately we want to know which jobs are performing poorly on Lustre and why, advise the users, tune their code or the filesystem, use OST pools...
- either map client nodes to jobs (easy, no change on the cluster needed; see the sketch below)
- or use the (newer) tagging mechanism provided by Lustre
- store metrics per job!? Rather unusual in normal monitoring systems
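A minimal sketch of the first variant, mapping client nodes to jobs: the job-to-node list (here a hypothetical dict that would come from the resource manager), a node-to-NID table and per-client byte counters collected on the servers are enough to attribute I/O to jobs.

```python
from collections import defaultdict

jobs = {                       # job id -> compute nodes (hypothetical)
    "1234": ["cn01", "cn02"],
    "1235": ["cn03"],
}
nid_of = {                     # node -> Lustre client NID (hypothetical)
    "cn01": "10.0.0.1@o2ib", "cn02": "10.0.0.2@o2ib", "cn03": "10.0.0.3@o2ib",
}
# per-client counters collected on the OSSes (e.g. from per-export stats)
client_write_bytes = {
    "10.0.0.1@o2ib": 7_000_000_000,
    "10.0.0.2@o2ib": 1_000_000_000,
    "10.0.0.3@o2ib": 250_000_000,
}

job_write_bytes = defaultdict(int)
for jobid, nodes in jobs.items():
    for node in nodes:
        job_write_bytes[jobid] += client_write_bytes.get(nid_of[node], 0)

for jobid, total in job_write_bytes.items():
    print(f"job {jobid}: {total / 2**30:.2f} GiB written")
```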

Monitoring Cluster & Lustre Together
Aggregation! Reduce the number of metrics stored:
- Do we care about each OST's read/write bandwidth for every client node? Rather look at the total bandwidths and transaction sizes of a node (see the sketch below).
- Example: 50 OSTs * 4 metrics -> 4 metrics per node, i.e. 200k metrics -> 4k metrics to store. But we are still processing 200k metrics.
- What about jobs? In order to see whether anything goes wrong, do we need info on each compute node?
The work described on the following pages has been funded by the German Ministry of Education and Research (BMBF) within the FEPA project.
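A sketch of such per-node aggregation on a Lustre client, summing the per-OST (OSC) read/write counters into four node totals. The /proc path and the stats line format are assumptions and vary between Lustre versions; on recent versions `lctl get_param osc.*.stats` returns equivalent data.

```python
import glob
import re

def node_totals(pattern="/proc/fs/lustre/osc/*/stats"):
    """Collapse per-OST client counters into a few per-node totals."""
    totals = {"read_bytes": 0, "write_bytes": 0, "read_ops": 0, "write_ops": 0}
    for path in glob.glob(pattern):          # one stats file per OST (OSC)
        for line in open(path):
            # typical line (assumed format):
            # "write_bytes  1108 samples [bytes] 4096 1048576 1082818560"
            m = re.match(r"(read|write)_bytes\s+(\d+)\s+samples\s+\S+"
                         r"\s+\d+\s+\d+\s+(\d+)", line)
            if m:
                op, count, nbytes = m.group(1), int(m.group(2)), int(m.group(3))
                totals[f"{op}_ops"] += count
                totals[f"{op}_bytes"] += nbytes
    return totals

if __name__ == "__main__":
    print(node_totals())
```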

Aggregate Job Metrics in the FEPA Project: Percentiles
Instead of looking at the metric of each node in a job, look at all nodes at once (see the sketch below).
[Figure: example per-node data; the sorted data as number of observations per value interval (bins); the resulting 10th, 20th, 50th, 90th and 100th percentiles.]
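A minimal percentile computation for one job metric, assuming one sample per compute node (the bandwidth values are made up):

```python
import numpy as np

# hypothetical per-node write bandwidth (MB/s) for one job at one timestep
per_node = np.array([420, 415, 433, 12, 398, 441, 407, 388, 425, 430])

levels = [10, 20, 50, 90, 100]
percentiles = np.percentile(per_node, levels)

for lvl, val in zip(levels, percentiles):
    print(f"{lvl:3d}th percentile: {val:7.1f} MB/s")
# A low 10th percentile compared to the median immediately flags straggler
# nodes (here the node doing only 12 MB/s) without storing all N values.
```

Storing only these few values per timestep instead of one series per node is what makes per-job metrics affordable at scale.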

Architecture: Data Collection / Aggregation
[Diagram: the MDS and OSSes deliver Lustre server metrics and per-OST Lustre client metrics via gmond into a gmetad/RRD hierarchy; an orchestrator triggers the aggregation steps; the aggregated client Lustre metrics are combined with cluster metrics and resource manager data into aggregated job percentile metrics, which are again fed through gmond/gmetad into RRDs.]

Architecture: Visualization
[Diagram: LxGUI queries LxDaemon, which accesses the gmetad/RRD sources holding the Lustre server metrics, the aggregated client Lustre metrics, the aggregated job percentile metrics and the cluster metrics from the resource manager.]
LxDaemon integrates the Ganglia Grid and Cluster hierarchies into a single one, allowing simultaneous access to multiple gmetad data sources through an XPath-like query language.

Metrics Hierarchies

Cluster Metrics: DP MFLOPS Percentiles

MFLOPS avg vs. Lustre Write Percentiles

Job: Lustre Read KB/s Percentiles

Totalpower Percentiles in a Job

Metrics from Different Ganglia Hierarchies

Conclusion
- Computer systems are complex. For the InfiniBand network, even static metrics help to understand system performance.
- There are plenty of metrics to measure and watch; storing all of them is currently not an option.
- Per-job metrics are the most useful ones for users, hence aggregation. But monitoring systems are not really prepared for per-job data.
- Ganglia turned out to be quite flexible and helpful, but gmetad and RRD have their issues...