Correlating Multiple TB of Performance Data to User Jobs
Lustre User Group 2015, Denver, Colorado
Michael Kluge, ZIH (michael.kluge@tu-dresden.de)
Zellescher Weg 12, Willers-Bau A 208, Tel. +49 351-463 34217
About ZIH
- HPC and service provider
- Research institute
- Big Data competence center
- Main research areas:
  - Performance analysis tools
  - Energy efficiency
  - Computer architecture
  - Standardization efforts (OpenACC, OpenMP)
ZIH HPC and BigData Infrastructure
- ZIH backbone with 100 Gbit/s cluster uplink; links to Erlangen, Potsdam, TU Dresden, HTW, Chemnitz, Leipzig, and Freiberg; home and archive storage
- Parallel machine: BULL HPC, Haswell processors, 2 x 612 nodes, ~30,000 cores
- High-throughput system: BULL islands with Westmere, Sandy Bridge, and Haswell processors plus GPU nodes; 700 nodes, ~15,000 cores
- Other clusters: Megware cluster, IBM iDataPlex
- Shared memory: SGI UltraViolet 2000, 512 Sandy Bridge cores, 8 TB RAM (NUMA)
- Shared storage
HRSK-II, Phase 2, currently installed
- GPU isle: 64 nodes, Haswell 2 x 12 cores, 64 GB RAM + 2 x NVIDIA K80 each
- Throughput isle: 232 nodes, Haswell 2 x 12 cores, 64/128/256 GB RAM, 128 GB SSD
- Two HPC isles: 612 nodes each, Haswell 2 x 12 cores, 64 GB RAM, 128 GB SSD
- 2 SMP nodes: 2 TB RAM, 48 cores
- 24 Lustre servers, 2 x export, 4 x login
- FDR InfiniBand
- Scratch: 5 PB NL-SATA, 80 GB/s, 400,000 IOPS; plus 40 TB SSD, 10 GB/s, 1 million IOPS
- To campus: 4 x 10GE uplink, home, backup
- ~1.2 PF peak in CPUs, ~35,000 cores total; ~300 TF peak in GPUs, 128 cards
Architecture of our storage concept (FASS)
[Diagram] Flexible Storage System (FASS): login, batch system, and export nodes of the HPC and throughput components reach SCRATCH through two switches; file servers 1..N serve transaction, checkpointing, and scratch file systems backed by SSD, SAS, and SATA storage; access statistics feed an analysis component for optimization and control, tied into the ZIH infrastructure.
Performance Analysis for I/O Activities (1)
- What is of interest:
  - Application requests, interface types
  - Access sizes/patterns
  - Network/server/storage utilization
- Level of detail: record everything, every 5 to 10 seconds (see the collector sketch below)
- Challenges: How to analyze this? How to deal with anonymous data?
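To make "record everything every 5 to 10 seconds" concrete, here is a minimal collector sketch in Python. The /proc/fs/lustre path and the read_bytes/write_bytes counter format follow the usual obdfilter stats layout; send_sample() is a hypothetical stand-in for the Dataheap agent interface, not its real API.

```python
import glob
import time

STATS_GLOB = "/proc/fs/lustre/obdfilter/*/stats"  # per-OST counters on an OSS
INTERVAL = 5  # seconds, matching the 5-10 s sampling above

def send_sample(ts, ost, metric, value):
    """Hypothetical stand-in for the Dataheap agent interface."""
    print(ts, ost, metric, value)

def read_counters(path):
    """Parse a Lustre stats file; keep the byte sums of the read/write lines."""
    counters = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            # line format: <name> <count> samples [bytes] <min> <max> <sum>
            if fields and fields[0] in ("read_bytes", "write_bytes"):
                counters[fields[0]] = int(fields[-1])
    return counters

last = {}
while True:
    now = time.time()
    for path in glob.glob(STATS_GLOB):
        ost = path.split("/")[-2]
        cur = read_counters(path)
        for metric, value in cur.items():
            if ost in last and metric in last[ost]:
                # counters are cumulative; send the per-interval delta
                send_sample(now, ost, metric, value - last[ost][metric])
        last[ost] = cur
    time.sleep(INTERVAL)
```

Sending per-interval deltas instead of raw cumulative counters keeps the database side simple: every stored sample is directly a throughput value for its interval.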
I/O and Job Correlation: Architecture/Software
- Data sources, collected through agent interfaces:
  - Compute nodes (SLURM batch system): application events, node-local events
  - Lustre servers (MDS, OSS) and I/O network: utilization, errors
  - SAN backend (cache, RAIDs): throughput, IOPS
- Dataheap data collection system: database storage, data processing, data requests (https://fusionforge.zih.tu-dresden.de/projects/dataheap/)
- Global I/O data correlation: batch system log files -> cluster formation -> job clusters -> job clusters with high/low I/O demands
Visualization of Live Data for Phase 1
Visualization of Live Data for Phase 2
Performance Analysis for I/O Activities (2)
- Amount of collected data: 240 OSTs, 20 OSS servers
- Looking at stats and brw_stats (how do you store a histogram in a database? see the sketch below)
- ~75,000 tables, about 1 GB of data per hour
- The analysis is supposed to cover 6 months: 4 TB of data
- No way to analyze this with any serial approach
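One plausible answer to the histogram question, assuming a normalized relational layout: store each brw_stats histogram as one row per (timestamp, OST, bucket), so buckets can be re-aggregated over arbitrary time windows with plain SQL. Schema, names, and values below are illustrative only, not the layout our database actually uses.

```python
import sqlite3

con = sqlite3.connect("iostats.db")  # illustrative; the real backend need not be SQLite
con.execute("""
    CREATE TABLE IF NOT EXISTS brw_hist (
        ts        INTEGER,  -- sample timestamp (epoch seconds)
        ost       TEXT,     -- OST name
        bucket_kb INTEGER,  -- I/O size bucket from brw_stats (4, 8, ..., 1024)
        reads     INTEGER,  -- requests in this bucket since the last sample
        writes    INTEGER,
        PRIMARY KEY (ts, ost, bucket_kb)
    )""")

# One row per bucket; a full histogram is the set of rows sharing (ts, ost).
sample = [(1425000000, "OST0001", 4, 120, 30),
          (1425000000, "OST0001", 1024, 15, 900)]
con.executemany("INSERT OR REPLACE INTO brw_hist VALUES (?,?,?,?,?)", sample)

# Re-aggregating histograms over a time window is then a single GROUP BY:
for row in con.execute("""
        SELECT bucket_kb, SUM(reads), SUM(writes)
        FROM brw_hist WHERE ts BETWEEN ? AND ? GROUP BY bucket_kb""",
        (1425000000, 1425003600)):
    print(row)
```

The GROUP BY at the end is why this layout helps: merging histograms over a 6-month window stays one query instead of a custom merge pass.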
Analysis Process
- Map operations to a set of tables
- Modify all tables (align data, scale, aggregate over time)
- Create a new table from an input set of tables (aggregation)
- Plot (time series, histograms)
- Currently a single-threaded Haskell program; a Swift/T port is work in progress
- 4 TB means 50 seconds on an 80 GB/s file system for the initial read
- Swift/T has nice ways to describe data dependencies and to parallelize this kind of workflow (open for suggestions!); a dataflow sketch follows below
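For illustration, a minimal Python sketch of the pipeline's dependency structure. The stage functions (load_table, align_and_scale, aggregate) are hypothetical placeholders, and the thread pool merely stands in for the scheduling Swift/T would derive from the dataflow; the production tool is the Haskell program mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor

def load_table(name):
    """Initial read of one metric table (the 4 TB / 80 GB/s step)."""
    return {"name": name, "rows": []}

def align_and_scale(table):
    """Per-table transform: align timestamps, scale, aggregate over time."""
    return table

def aggregate(tables):
    """Create a new table from an input set of tables."""
    return {"name": "aggregate", "inputs": [t["name"] for t in tables]}

ost_tables = ["OST%04d_write_bytes" % i for i in range(240)]

# The per-table steps are independent of each other, so they can run in
# parallel; only the final aggregation is a synchronization point. This is
# exactly the dependency structure a dataflow language can exploit.
with ThreadPoolExecutor(max_workers=16) as pool:
    loaded = pool.map(load_table, ost_tables)
    transformed = list(pool.map(align_and_scale, loaded))

print(aggregate(transformed)["name"], len(transformed))
```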
Approach to look at the Raw Data
- Analysis of the raw performance data:
  - Take the original data and create derived metrics
  - Look at plots and histograms
- Correlate metrics with user jobs (sketched below):
  - Take the job log and split jobs into classes
  - Check for correlations between job classes and performance metrics
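A sketch of the correlation step, on synthetic data: build one activity time series per job class from the (anonymous) batch log and correlate it against a file system metric. The Pearson coefficient used here is one reasonable choice of correlation factor, not necessarily the exact measure behind the plots that follow.

```python
import numpy as np

# Toy time axis: one sample every 60 s over one day.
t = np.arange(0, 86400, 60)

# activity[c, i] = number of jobs of class c running at time t[i],
# reconstructed from the batch log; synthetic here.
rng = np.random.default_rng(0)
activity = (rng.random((5, t.size)) > 0.7).astype(float)

# bandwidth[i] = aggregate OST write bandwidth at time t[i]; synthetic,
# with class 2 driving most of the I/O in this toy example.
bandwidth = 3.0 * activity[2] + 0.2 * rng.random(t.size)

# Pearson correlation between each job class and the bandwidth metric:
# the class that actually does the I/O stands out with r close to 1.
for c in range(activity.shape[0]):
    r = np.corrcoef(activity[c], bandwidth)[0, 1]
    print(f"class {c}: r = {r:+.2f}")
```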
Metric Example: NetApp vs. OST Bandwidth (20 s)
Metric Example: NetApp vs. OST Bandwidth (60 s)
Metric Example: NetApp vs. OST Bandwidth (10 minutes)
Metric Example: 3 Months of Bandwidth vs. IOPS
I/O and Job Correlation: Starting Point (Bandwidth)
Building Job Classes
- A class is defined by:
  - User ID and core count
  - Similar runtime
  - Similar start time (time isolation)
- Sort all jobs by runtime and split the list at gaps
- Split these groups again if they do not overlap in time (long gap between starts)
- Challenges: runtime fluctuation, large runtime variations
A minimal sketch of this splitting follows below.
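The sketch below follows the procedure described above; the gap thresholds (900 s in runtime, 7200 s in start time) are illustrative values, not the ones used in production.

```python
def split_at_gaps(jobs, key, max_gap):
    """Sort jobs by `key`, cut wherever neighbours differ by more than max_gap."""
    jobs = sorted(jobs, key=key)
    groups, current = [], [jobs[0]]
    for job in jobs[1:]:
        if key(job) - key(current[-1]) > max_gap:
            groups.append(current)
            current = []
        current.append(job)
    groups.append(current)
    return groups

# jobs: (user_id, cores, start_time, runtime) tuples from the batch log
jobs = [("u1", 24, 0,     3600), ("u1", 24, 4000,  3590),
        ("u1", 24, 90000, 3620), ("u1", 24, 500,   120)]

# First group by the exact attributes (user id, core count) ...
by_user_cores = {}
for j in jobs:
    by_user_cores.setdefault((j[0], j[1]), []).append(j)

classes = []
for group in by_user_cores.values():
    # ... then split at gaps in runtime (here: >900 s between neighbours) ...
    for g in split_at_gaps(group, key=lambda j: j[3], max_gap=900):
        # ... and finally at gaps in start time (time isolation).
        classes.extend(split_at_gaps(g, key=lambda j: j[2], max_gap=7200))

for i, c in enumerate(classes):
    print(f"class {i}: {len(c)} job(s)")
```

Splitting at gaps avoids having to choose a number of classes up front, but it is exactly where the runtime-fluctuation challenge shows up: a noisy runtime can bridge a gap and merge two classes that should stay separate.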
I/O and Job Correlation: Preliminary Result (1)
I/O and Job Correlation: Preliminary Result (2)
I/O and Job Correlation: Correlation Factors
Open Topics
- How to store a histogram in a database?
- Approaches other than Swift/T
- How to deal with the zeros
- Caching intermediate results
Conclusion / Questions
- It is possible to map anonymous performance data back to user jobs
- Next step: advance towards storage controller analysis (caches, backend bandwidth, ...)