Copyright 2014 Splunk Inc. BENCHMARKING VISUALIZATION TOOL. J. Green, Computer Scientist, High Performance Computing Systems, Los Alamos National Laboratory
Disclaimer: During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only, and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release.
Introduction: High Performance Computing @ LANL
High Performance Computing (HPC)! A.k.a. supercomputing: providing super-sized computers (distributed systems) for numerically intensive / data-intensive computations.
Our Nation, Our Lab, Our Mission! Ensure our goals align with the Lab's mission, which aligns with the National Nuclear Security Administration's goals. Provide state-of-the-art platforms that satisfy stakeholders' requirements. National Security Mission: Nuclear Non-Proliferation; National Safety and Security. LANL's Mission: Apply Scientific Excellence to National Security Missions. HPC's Mission: Enable Scientific Discovery via World-Class High Performance Computing Resources.
How Fast Is Fast? Petascale: 96 cabinets, ~9,000 nodes, 100,000s of cores. Looking to exascale!
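For a sense of how those counts become a petascale rating, peak FLOP/s is just the product of the hardware factors. A back-of-the-envelope sketch, with assumed per-node figures (the core count, clock, and FLOPs/cycle below are illustrative, not this machine's actual specs):

```python
nodes = 9_000            # ~9,000 nodes, from the slide
cores_per_node = 16      # assumed for illustration
clock_hz = 2.6e9         # assumed 2.6 GHz
flops_per_cycle = 8      # assumed (e.g., 4-wide double-precision FMA)

peak = nodes * cores_per_node * clock_hz * flops_per_cycle
print(f"Theoretical peak: {peak / 1e15:.2f} PFLOP/s")  # ~3.00 PFLOP/s
```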
Presentation Overview
Sections Covered: Base-Lining for Rapid Intervention via Continual Testing! Systems Monitoring and Test Data Correlation! Test Results Analysis!
Introduction: Drivers to Automate Testing
Why Test? Ensure that delivered components match performance specifications. Test for: validation of computational accuracy; sustained performance. [ Computer Life Cycle ]
LANL's High Performance Computing Testing Strategy: Proactive Testing to Improve Reliability. Acceptance testing, integration testing, correctness testing, regression testing, performance testing, software testing, fault tolerance testing, resilience testing, parameter studies. [ omg, that's a lot of testing ]
COTS Solution Decision Tree: Decide on a solution. Options: status quo (continue to hack on run scripts); write our own tool tailored to our needs; a COTS solution (doesn't exist w/o severe modification). Questions: Do I want to rely on someone else when this thing breaks? Do I really want to be responsible when this thing breaks? Requirements: extensibility, ease of use, ease of deployment, labor intensity, sustainability, manpower requirements, standard data output, etc.
Data Flow Diagram for New Test Harness. [ DB CONNECT ] [ SPLUNK APP INTERFACE ] Initial design plan for developing a more robust test harness, presented at the Salishan Conference on High-Speed Computing, 2011.
Base-Lining for Rapid Intervention via Continual Testing
Categories of Performance Tests: Memory Bandwidth Tests, IO Bandwidth Tests, CPU Speed Tests, Accelerator Speed Tests, InfiniBand (IB) Tests, Mini Applications (Total System Tests)
Memory Bandwidth Testing: STREAM Memory Bandwidth Test (McCalpin, et al.). Performs 4 computations; measures main memory bandwidth per processor. Triad is the money computation: it indicates the performance expected with typical scientific computations. Expect tight performance; variances from the baseline indicate a problem.
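Since triad is the number the baseline is built on, a minimal sketch of the kernel and its bandwidth arithmetic may help. This is a numpy stand-in for the real C/OpenMP benchmark; the array size, baseline value, and tolerance are illustrative assumptions, not the harness's actual settings:

```python
import time
import numpy as np

# Stand-in for the STREAM triad kernel: a[i] = b[i] + scalar * c[i].
n = 20_000_000
b, c = np.random.rand(n), np.random.rand(n)
a = np.empty(n)
scalar = 3.0

t0 = time.perf_counter()
a[:] = b + scalar * c
elapsed = time.perf_counter() - t0

# Triad moves three arrays of 8-byte doubles: read b, read c, write a.
gbps = 3 * n * 8 / elapsed / 1e9
baseline_gbps = 40.0                 # hypothetical stored baseline
if gbps < 0.95 * baseline_gbps:      # tight variance from baseline = problem
    print(f"FLAG: {gbps:.1f} GB/s is below baseline {baseline_gbps} GB/s")
```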
CPU/GPU Performance Testing: Floating Point Operations Per Second (FLOP/s) is the typical measure of computational performance. HPL: High Performance Linpack (Dongarra, et al.). "FLOPs are free," as per the theme of SC'09. Enter HPCG. Scalable Heterogeneous Cluster Benchmark (Spafford).
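For intuition about where a FLOP/s figure comes from, here is a minimal sketch that times a dense double-precision matrix multiply and counts roughly 2n^3 operations. HPL proper solves a dense linear system via LU factorization; this only borrows the counting idea, and the matrix size is arbitrary:

```python
import time
import numpy as np

n = 2048
A = np.random.rand(n, n)
B = np.random.rand(n, n)

t0 = time.perf_counter()
C = A @ B                         # dense double-precision matmul
elapsed = time.perf_counter() - t0

flops = 2 * n**3                  # n^2 outputs, ~2n ops (mul+add) each
print(f"~{flops / elapsed / 1e9:.1f} GFLOP/s")
```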
I/O in HPC: Potentially the Biggest Bottleneck! Bursty file-system performance; baseline represented by the yellow line [chart]. Parallel I/O follows patterns of [ n to n ] or [ n to 1 ] writes and reads. Hidden in these system calls are file open, file close, and stat operations, which can add unknown overhead to the operation. They can create a burdensome load on file systems and overhead for the application if not programmed optimally (i.e., open file handles, metadata overhead if too many files are simultaneously opened, etc.). File-system testing helps to identify potential failures and load impacts on running jobs.
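A minimal sketch of the [ n to n ] idea: each process writes its own file and reports write bandwidth, with the flush-to-disk cost included so hidden syscall overhead is not ignored. A real harness run would use MPI ranks on a parallel file system; the path, size, and function name here are illustrative assumptions:

```python
import os
import time

def write_bandwidth(path: str, size_mb: int = 256) -> float:
    """Write size_mb MiB to path and return MiB/s, including fsync cost."""
    block = b"\0" * (1 << 20)      # 1 MiB buffer
    t0 = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())       # count the flush-to-disk, not just caching
    elapsed = time.perf_counter() - t0
    os.remove(path)
    return size_mb / elapsed

print(f"{write_bandwidth('/tmp/io_test.bin'):.1f} MiB/s")
```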
Whole Machine Performance Overview: "The supercomputer operates at 197 teraflops/sec. Collectively, it houses 9,856 compute cores and 19.7 terabytes of memory. It will give users working on unclassified projects access to 86.3 million central processing unit core hours/yr. Wolf will initially be working on modeling the climate, materials, and astrophysical bodies and systems." [1] "Wolf, a New Supercomputer, Up and Running at Los Alamos National Lab," http://machinedesign.com/news/wolf-new-supercomputer-and-running-los-alamos-national-lab
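The core-hours figure is just the core count times the hours in a year, a useful sanity check when quoting machine capacity:

```python
cores = 9_856
hours_per_year = 24 * 365              # 8,760
core_hours = cores * hours_per_year    # 86,338,560
print(f"{core_hours / 1e6:.1f} million core-hours/yr")  # 86.3
```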
Systems Monitoring and Test Data Correlation
Monitoring the Test Harness
Consistent Testing [ Credit for this view: Dominic Manno ]
Utilization Statistics Per Machine!
Test Results Analysis
Raw Data Parser: Post-Process Raw Test Data! DateStamp=$Date TestName=$TestName OS=$OS-Version MachineName=$MachineName NumNodes=$NumNodes TestMetric=$Measurement etc. Must differentiate data by: test name/version, system name, resources used, software versions. Or valid results comparison is impossible! (See the parsing sketch below.)
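A minimal sketch of parsing those key=value records and bucketing results so comparisons stay like-for-like. The field names mirror the slide; the record format and sample values are assumptions, not the harness's actual output:

```python
def parse_record(line: str) -> dict:
    """Split a space-separated key=value record into a dict."""
    return dict(tok.split("=", 1) for tok in line.split() if "=" in tok)

record = parse_record(
    "DateStamp=2014-10-07 TestName=stream OS=chaos-5.1 "
    "MachineName=wolf NumNodes=32 TestMetric=41852.3"
)

# Comparison key: only results in the same bucket (same test, system,
# resources, software version) are valid to compare against a baseline.
bucket = (record["TestName"], record["MachineName"],
          record["NumNodes"], record["OS"])
print(bucket, "->", float(record["TestMetric"]))
```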
Other Important Monitoring Panels!
Prototype for New Test Views! Utilizing a weighted radial line graph to visualize inter-nodal communication speeds. MPI BW Communication Visualization Tool Prototype: Prabhu Singh Khalsa, Scientist, Los Alamos National Laboratory.
Future Plans
Going forward...! Fully integrate the new test harness database collection into Splunk visualization. Fully develop custom test visualizations to suit specific teams' needs. Use monitoring (system / user) data to enhance information in team-specific test dashboards. Fully implement monitoring infrastructure changes to leverage scalability enhancements.
Acknowledgements! Testing is crucial; test development is iterative / evolving. Thanks for patience from the administrative teams. Thanks for resources from management / oversight. Thanks to the monitoring team for infrastructure improvements. Thanks to Dominic Manno / Ben Turrubiates, New Mexico Tech: excellent work, diligence, valuable contributions. Craig Idler, Scientist: enhancements to the Gazebo + Pavilion test harnesses. Mike Mason, Scientist: admin assistance, Splunk guidance.
Questions?
Special Offer: Try Splunk MINT Express for Free! Splunk MINT offers a fast path to mobile intelligence. How fast? Find out with a 6-month trial*. Register for your free trial: http://mint.splunk.com/conf2014offer. Download the Splunk MINT SDKs. Add the Splunk MINT line of SDK code and publish**. Start getting digital intelligence at your fingertips! *Offer valid for .conf2014 attendees and coworkers of attendees only. **Trial allows monitoring of up to 750,000 monthly active users (MAUs).
THANK YOU