TACC Stats: I/O Performance Monitoring for the Intransigent



John Hammond, TACC (jhammond@tacc.utexas.edu), IASDS11

Collaborators
- Bill Barth, TACC: consumer, first user
- John McCalpin, TACC: performance guru

Motivation
- Thousands of different users
- Hundreds of codes
- Users don't (in general) profile their codes
- NUMA understanding is low
- Parallel I/O is hard

Goals
- Monitor all jobs
- Lightweight
- No user intervention required
- Send up red flags
- Rank jobs by severity of the flag
- Increase overall code performance

Approach
- Systematically collect data
- Tag it with the current job id
- Data-mine for better site understanding
- Send report cards to users
- Contact users with problematic jobs

Approach
- Tried to modify sar (sysstat) first
- Fixed interval, not invokable on demand
- Serializes to a custom format
- Not job oriented
- Unaware of Lustre, IB, and newer Linux counters

tacc_stats is a single executable. It runs:
- at job startup (batch scheduler prolog)
- every 10 minutes (cron)
- at job shutdown (batch scheduler epilog)
It collects data on each node; data are stored locally and collected nightly to a central location.

Data Collected
- current time, job id, host uptime
- core and socket performance counters
- IB performance counters
- Lustre (llite, mdc, osc) statistics
- OS memory and process statistics
- network and block device counters

Performance counters
- AMD64 Opteron, per core: FLOPS, DCSF, UC, DRAM, HT
- Intel Nehalem, per core
- Intel Nehalem uncore, per socket
- Program /dev/cpu/*/msr in the prolog
- Don't interfere with the user

InfiniBand
- ib: first attempt, HCA counters. 32-bit saturating. WTF?
- ib_sw: query extended counters on the switch port connected to the HCA, swapping rx and tx. Used on Ranger.
- ib_ext: 64-bit extended counters. Used on Lonestar.
- All use libibmad.

Lustre llite (VFS)
- File operation counters: open, close, read, write, seek, flock, ...
- read() and write() tallied by bytes requested, not bytes returned (LU-333, landed)
- Directory operations missing: create, lookup, (un)link, rename, {mk,rm,read}dir, ... (LU-673)

Lustre mdc
- RPC opcode oriented: close, getattr, setattr, ...
- Missing operations:
  - MDS_REINT: REINT_{CREATE,UNLINK,...}
  - LDLM_ENQUEUE: LDLM_{IBITS,FLOCK,...}

Lustre osc
- RPC opcode based, plus I/O counters
- Overcounting on error
- Must read 300 (one per OST) files to get bytes on the wire for /scratch: 1800 system calls (stdio), 0.037 s
- LU-334: add OSC I/O counters to llite

Linux memory
- Per NUMA node (socket): Total, Free, Used, Cached, Mapped, Anon, Page Tables, Slab, ...
- Not totally clear what's what

Linux process/scheduler
- Number of processes, context switches, forks
- 1, 5, 15 minute load averages
- Ticks (per core) in user, nice, system, sleep, iowait, irq, softirq, idle

Linux counters, continued
- Block devices (boring)
- Network interfaces
- SysV shared memory
- General VFS
- VM/paging stats
- Weird NUMA allocation stuff

TACC Stats design
- KISS: stripped down, optimized for small /sys and /proc files
- Collect raw counter values
- Intermediate stats files: flat text, awk-able
- Self-describing: schema with names, types, units
- Highly modular: disable types at build or run time

Source line counts per module:

230 main.c
283 dict.c
184 collect.c
119 schema.c
166 stats.c
253 stats_file.c
130 obd_to_mnt.c
147 llite.c
129 mdc.c
132 osc.c
53 lnet.c
177 ib_ext.c
75 block.c
93 cpu.c
134 mem.c
120 net.c
69 numa.c
113 ps.c
84 sysv_shm.c
62 tmpfs.c
78 vfs.c
46 vm.c
238 amd64_pmc.c
322 intel_pmc3.c
331 intel_uncore.c

#define KEYS \
  X(rd_ios, "E", "read requests processed"), \
  X(rd_merges, "E", "read requests merged with in queue requests"), \
  X(rd_sectors, "E,U=512B", "sectors read"), \
  X(rd_ticks, "E,U=ms", "wait time for read requests"), \
  X(wr_ios, "E", "write requests processed"), \
  X(wr_merges, "E", "write requests merged with in queue requests"), \
  X(wr_sectors, "E,U=512B", "sectors written"), \
  X(wr_ticks, "E,U=ms", "wait time for write requests"), \
  X(in_flight, "", "requests in flight"), \
  X(io_ticks, "E,U=ms", "time active"), \
  X(time_in_queue, "E,U=ms", "wait time for all requests")

static void collect_block_dev(struct stats_type *type, const char *dev)
{
  struct stats *stats = get_current_stats(type, dev);
  if (stats == NULL)
    return;

  char path[80];
  snprintf(path, sizeof(path), "/sys/block/%s/stat", dev);

#define X(k,r...) #k
  collect_key_list(stats, path, KEYS, NULL);
#undef X
}

static void collect_block(struct stats_type *type)
{
  const char *path = "/sys/block";
  DIR *dir = opendir(path);
  if (dir == NULL) {
    ERROR("cannot open `%s': %m\n", path);
    return;
  }

  struct dirent *ent;
  while ((ent = readdir(dir)) != NULL) {
    if (ent->d_name[0] == '.')
      continue;
    collect_block_dev(type, ent->d_name);
  }

  if (dir != NULL)
    closedir(dir);
}

struct stats_type block_stats_type = {
  .st_name = "block",
  .st_collect = &collect_block,
#define X SCHEMA_DEF
  .st_schema_def = JOIN(KEYS),
#undef X
};

$tacc_stats 1.0.2
$hostname i101-101.ranger.tacc.utexas.edu
$uname Linux x86_64 2.6.18-194.32.1.el5_TACC #18 SMP Mon Mar 14 22:24:19 CDT 2011
$uptime 2689049
!block rd_ios,E rd_merges,E rd_sectors,E,U=512B rd_ticks,E,U=ms wr_ios,E wr_merges,E wr_sectors,E,U=512B wr_ticks,E,U=ms in_flight io_ticks,E,U=ms time_in_queue,E,U=ms
...
1316840401 2129327
block hda 30477 2055 1826743 36044 53333 3680 803120 1083072 0 425308 1119012
...

Data Sizes
- 1 record: 3.7 KB, ~545 values
- 1 file (1 node-day): ~160 records = 0.5+ MB, 10K lines; 180 KB compressed
- 1 node-month: 16+ MB
- 1 Ranger day: 2 GB (≈4,000 nodes × 0.5+ MB)
- 1 Ranger month: 60 GB

Thanks! https://github.com/jhammond/tacc_stats