Performance monitoring at CERN openlab. July 20 th 2012 Andrzej Nowak, CERN openlab

Similar documents
Performance Monitoring of the Software Frameworks for LHC Experiments

CERN openlab III. Major Review Platform CC. Sverre Jarp Alfio Lazzaro Julien Leduc Andrzej Nowak

Performance monitoring of the software frameworks for LHC experiments

Next Generation Operating Systems

Perfmon2: A leap forward in Performance Monitoring

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Full and Para Virtualization

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

Xen and the Art of. Virtualization. Ian Pratt

Computing at the HL-LHC

Virtuoso and Database Scalability

High Performance or Cycle Accuracy?

HP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief

SIDN Server Measurements

White Paper. Intel Sandy Bridge Brings Many Benefits to the PC/104 Form Factor

Mark Bennett. Search and the Virtual Machine

Performance of Software Switching

High Performance Computing in CST STUDIO SUITE

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance.

New Design and Layout Tips For Processing Multiple Tasks

2972 Linux Options and Best Practices for Scaleup Virtualization

Petascale Software Challenges. Piyush Chaudhary High Performance Computing


Datacenter Operating Systems

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

Development of Monitoring and Analysis Tools for the Huawei Cloud Storage

Big Data Visualization on the MIC

ELEC 377. Operating Systems. Week 1 Class 3

12. Introduction to Virtual Machines

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

How To Build A Supermicro Computer With A 32 Core Power Core (Powerpc) And A 32-Core (Powerpc) (Powerpowerpter) (I386) (Amd) (Microcore) (Supermicro) (

IMAGE SIGNAL PROCESSING PERFORMANCE ON 2 ND GENERATION INTEL CORE MICROARCHITECTURE PRESENTATION PETER CARLSTON, EMBEDDED & COMMUNICATIONS GROUP

FLOW-3D Performance Benchmark and Profiling. September 2012

Running Windows on a Mac. Why?

Benchmarking FreeBSD. Ivan Voras

Proposal for Virtual Private Server Provisioning

Digitale Signalverarbeitung mit FPGA (DSF) Soft Core Prozessor NIOS II Stand Mai Jens Onno Krah

Itanium 2 Platform and Technologies. Alexander Grudinski Business Solution Specialist Intel Corporation

IT of SPIM Data Storage and Compression. EMBO Course - August 27th! Jeff Oegema, Peter Steinbach, Oscar Gonzalez

Experiences on using GPU accelerators for data analysis in ROOT/RooFit

Finding Performance and Power Issues on Android Systems. By Eric W Moore

Best Practises for LabVIEW FPGA Design Flow. uk.ni.com ireland.ni.com

The PHI solution. Fujitsu Industry Ready Intel XEON-PHI based solution. SC Denver

A Survey on ARM Cortex A Processors. Wei Wang Tanima Dey

Benchmarking Large Scale Cloud Computing in Asia Pacific

Operating System Impact on SMT Architecture

Binary search tree with SIMD bandwidth optimization using SSE

Data Centric Systems (DCS)

Performance Counter. Non-Uniform Memory Access Seminar Karsten Tausche

PostgreSQL Performance Characteristics on Joyent and Amazon EC2

I/O. Input/Output. Types of devices. Interface. Computer hardware

IOS110. Virtualization 5/27/2014 1

Business white paper. HP Process Automation. Version 7.0. Server performance

Introduction 1 Performance on Hosted Server 1. Benchmarks 2. System Requirements 7 Load Balancing 7

DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION

x64 Servers: Do you want 64 or 32 bit apps with that server?

New Technology Introduction: Android Studio with PushBot

Amazon EC2 XenApp Scalability Analysis

JUROPA Linux Cluster An Overview. 19 May 2014 Ulrich Detert

Configuration Maximums VMware vsphere 4.0

VMware and CPU Virtualization Technology. Jack Lo Sr. Director, R&D

Audit & Tune Deliverables

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o

HP reference configuration for entry-level SAS Grid Manager solutions

Cloud Computing for Control Systems CERN Openlab Summer Student Program 9/9/2011 ARSALAAN AHMED SHAIKH

The new frontier of the DATA acquisition using 1 and 10 Gb/s Ethernet links. Filippo Costa on behalf of the ALICE DAQ group

Enterprise-Class Virtualization with Open Source Technologies

Virtualization benefits Introduction to XenSource How Xen is changing virtualization The Xen hypervisor architecture Xen paravirtualization

Software Performance and Scalability

64-Bit versus 32-Bit CPUs in Scientific Computing

Improved metrics collection and correlation for the CERN cloud storage test framework

MEASURING WORKLOAD PERFORMANCE IS THE INFRASTRUCTURE A PROBLEM?

Introduction to the NI Real-Time Hypervisor

Configuration Maximums VMware Infrastructure 3

Host Power Management in VMware vsphere 5.5

IBM Software Group. Lotus Domino 6.5 Server Enablement

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance

SAS Business Analytics. Base SAS for SAS 9.2

ABB Technology Days Fall 2013 System 800xA Server and Client Virtualization. ABB Inc 3BSE en. October 29, 2013 Slide 1

LAN/Server Free* Backup Experiences

How Solace Message Routers Reduce the Cost of IT Infrastructure

Rackspace Cloud Databases and Container-based Virtualization

Delivering Quality in Software Performance and Scalability Testing

Self service for software development tools

Chapter 6. Inside the System Unit. What You Will Learn... Computers Are Your Future. What You Will Learn... Describing Hardware Performance

Intel Xeon +FPGA Platform for the Data Center

Enabling Technologies for Distributed Computing

Scaling in a Hypervisor Environment

Practical Performance Understanding the Performance of Your Application

How To Create A Cloud Based System For Aaas (Networking)

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics

RESOLVING SERVER PROBLEMS WITH DELL PROSUPPORT PLUS AND SUPPORTASSIST AUTOMATED MONITORING AND RESPONSE

MOSIX: High performance Linux farm

WHITE PAPER. ClusterWorX 2.1 from Linux NetworX. Cluster Management Solution C ONTENTS INTRODUCTION

Transcription:

Performance monitoring at CERN openlab July 20 th 2012 Andrzej Nowak, CERN openlab

Data flow Reconstruction Selection and reconstruction Online triggering and filtering in detectors Raw Data (100%) Event reprocessing Event simulation Event summary data (10%) Analysis Analysis objects (1%) Batch physics analysis Processed data 7/20/2012 Andrzej Nowak - Performance monitoring at CERN openlab 2

Characteristics of CERN code (1) 2 major toolkits and 4 major frameworks built on top one per LHC experiment Large C++ frameworks with millions of lines of code Thousands of shared libraries in a distribution, gigabytes of binaries Low number of key players but high number of brief contributors Large regions of memory read only or accessed infrequently Characteristics: Significant portion of double precision floating point (10%+) Loads/stores up to 60% of instructions Unfavorable for the x86 microarchitecture (even worse for others) Low number of instructions between jumps (<10) Low number of instructions between calls (several dozen) For the most part, code not fit for accelerators at all in its current shape 7/20/2012 Andrzej Nowak - Performance monitoring at CERN openlab 3

Characteristics of CERN code (2) Memory footprint: 2-4GB per process Most of that read only Very sparse writes KSM/forking claimed to save 50% of memory (even though these are crude schemes) More advanced schemes thread-private variables save even more, >95% (per thread) Cache footprint key code fits in L2 Key code and data fit in L3 Heavy C++ virtualization and other C++ mechanism footprint War on TLB misses successfully waged in 2008, but still lots of references CPU intensive workloads have very sparse IO 1 average disk per dozen processes is enough + 1 gbit ethernet Again: For the most part, code not fit for accelerators at all in its current shape 7/20/2012 Andrzej Nowak - Performance monitoring at CERN openlab 4

Hardware landscape 10000000 1000000 100000 10000 Clock Speed (MHz) Max TDP (W) Cache (MB) Transistors Transistors (fit) Intel Processor features unavoidable PARALLELISM 1000 100 10 1 0.1 Andrzej Nowak, CERN openlab 2012 1971 1974 1977 1980 1983 1986 1989 1992 1995 1998 2001 2004 2007 2010 7/20/2012 Andrzej Nowak - Performance monitoring at CERN openlab 5

Inside a modern PC SOCKETS CORES THREADS + PORTS (SUPERSCALAR) VECTORS PIPELINING Andrzej Nowak CERN openlab 2012 7/20/2012 Andrzej Nowak - Performance monitoring at CERN openlab 6 6

Omnipresent parallelism - where are we now? SIMD ILP HW THREADS CORES SOCKETS MAX 4 4 1.35 8 4 OPT 2.5 1.43 1.25 8 2 US 1 0.80 1 6 2 SIMD ILP HW THREADS CORES SOCKETS MAX 4 16 21.6 172.8 691.2 OPT 2.5 3.57 4.46 35.71 71.43 US 1 0.80 0.80 4.80 9.60 7/20/2012 Andrzej Nowak - Performance monitoring at CERN openlab 7

Legacy applications use a low single digit percentage of raw machine power available today _% Write your percentage here 7/20/2012 Andrzej Nowak - Performance monitoring at CERN openlab 8

Techniques Event Counting Black-box studies and regression EBS IP Sampling Wide range of tuning activities Low precision on our code Time based sampling of counts Phase monitoring Instrumentation 7/20/2012 Andrzej Nowak - Performance monitoring at CERN openlab 9

Tools for performance monitoring PMU based perfmon2 perf Badly designed, painful to use De facto standard Gooda coming up from Google Intel tools (Amplifier, SEP, PTU) Instrumentation PIN (slow) Intel Amplifier Intel Inspector (low success rate) Own tools Scripts, analyzers parsing raw data 7/20/2012 Andrzej Nowak - Performance monitoring at CERN openlab 10

Intel Software tool usage at CERN Pool of licenses within the openlab agreement Parallel Studio for Linux and Windows C++ and Fortran compilers for Mac Cluster Toolkit Alpha and Beta testing Non-standard software Newly purchased licenses for CERN-wide usage Suggestion, expertise and setup came out of openlab Prompted by growing demand Linux, Windows, Mac covered 7/20/2012 Andrzej Nowak - Performance monitoring at CERN openlab 11

Internal activities focused on performance monitoring Building home-grown tools for analysis and batch reporting Two separate non-openlab collaborations on PTU frontends (CMS experiment) Numerous other smaller performance monitoring tools Recent efforts focused on most recent CPU features (some of the original concepts and support came from Intel, HP and Google) Looking forward to performance tuning API/SDK toolkits Planning to employ performance monitoring in parallel to OSS Linux solutions 7/20/2012 Andrzej Nowak - Performance monitoring at CERN openlab 12

Case 1: test40 Simple electromagnetic test in the Geant4 framework an electron traveling through a detector Similar workload to a real physics app but with a tiny footprint Control flow driven to a large extent Collection: 1. Sampled with PDIR on Sandy Bridge 2. Instrumented with PIN Expecting: The sampling profile to match the instrumented profile as closely as possible 7/20/2012 Andrzej Nowak - Performance monitoring at CERN openlab 13

Case 1: EBS view (sampling) PDIR measurement Samples Percentage RSD [.] ieee754_log 25453 21.09% 0.51% [.] RanecuEngine::flat() 9014 7.47% 0.88% [.] G4SteppingManager::DefinePhysicalStepLength() 6254 5.18% 0.78% [.] G4Tubs::Inside(Hep3Vector const&) const 6095 5.05% 1.72% [.] ieee754_exp 4783 3.96% 1.61% [.] G4SteppingManager::InvokePSDIP(unsigned long) 4599 3.81% 1.18% [.] G4SteppingManager::Stepping() 4191 3.47% 1.22% [.] _int_malloc 3579 2.97% 1.26% [.] _int_free 3547 2.94% 1.13% [.] G4SteppingManager::InvokeAlongStepDoItProcs() 3196 2.65% 1.54% [.] log 3235 2.68% 1.58% [.] G4Track::GetVelocity() const 2673 2.21% 1.72% 7/20/2012 Andrzej Nowak - Performance monitoring at CERN openlab 14

Case 1: Reference (real) counts Instrumented reference measurement Samples Percentage G4Navigator::LocateGlobalPointAndSetup( ) 39928 17.43% ieee754_log 25736 11.23% RanecuEngine::flat() 8593 3.75% G4SteppingManager::DefinePhysicalStepLength() 6229 2.72% G4Tubs::Inside(Hep3Vector const&) const 5997 2.62% ieee754_exp 4874 2.13% G4ClassicalRK4::DumbStepper( ) 4796 2.09% G4SteppingManager::InvokePSDIP(unsigned long) 4565 1.99% G4ReplicaNavigation::ComputeStep( ) 4366 1.91% G4Transportation::AlongStepGetPhysicalInteractionLength( ) 4259 1.86% G4SteppingManager::Stepping() 4121 1.80% G4Navigator::ComputeStep( ) 3718 1.62% _int_malloc 3576 1.56% _int_free 3518 1.54% 7/20/2012 Andrzej Nowak - Performance monitoring at CERN openlab 15

Case 2: DTB Microbenchmark with deep call stack emulation Representative of OO apps, especially C++ F1 calls F2, F2 calls F3, etc until F10 Each function multiplies a var by a constant Expecting 1. Uniform # instructions per function within one experiment 2. Uniform # instructions per function across all 10 experiments run 7/20/2012 Andrzej Nowak - Performance monitoring at CERN openlab 16

360 320 Average samples RSD 18.00% 16.00% 280 14.00% 240 12.00% 200 10.00% 160 8.00% 120 6.00% 80 4.00% 40 2.00% 0 [.] f1 [.] f10 [.] f2 [.] f3 [.] f4 [.] f5 [.] f6 [.] f7 [.] f8 [.] f9 [.] main 0.00% 7/20/2012 Andrzej Nowak - Performance monitoring at CERN openlab 17

Deep call stack profiling (PDIR) 6000 Samples 5000 4000 3000 2000 1000 Series1 Series2 Series3 Series4 Series5 Series6 Series7 Series8 Series9 Series10 0 Functions - main (shallow) on the left, f10 (deepest) on the right Continuous lines added for illustration 7/20/2012 Andrzej Nowak - Performance monitoring at CERN openlab 18

ROOT minimization (there are some success stories) 7/20/2012 Andrzej Nowak - Performance monitoring at CERN openlab 19

Work with the community Regular consultancy with physicists on a range of applications Assistance with porting, vectorization etc Pathfinding Work with the IT department on platform tuning and debugging Entering the online domain, joint EU projects 7/20/2012 Andrzej Nowak - Performance monitoring at CERN openlab 20

Teaching efforts 2 dedicated workshops on performance monitoring per year (in addition to 2 other workshops on multi-threading) Advanced performance tuning sessions w/ Intel experts 1-2 per year New activity: floating point workshops Very successful, lots of interest International computing schools, conferences 7/20/2012 Andrzej Nowak - Performance monitoring at CERN openlab 21

Our MIC experience One of the first Intel customers to be engaged started with ISA reviews in 2008 In-depth feedback on the OS, drivers and Xeon/KNF/KNC toolchains Ported and optimized 3 large representative benchmarks Ongoing activities Looking forward to dissemination of KNC results 7/20/2012 Andrzej Nowak - Performance monitoring at CERN openlab 22

THANK YOU Q & A

BACKUP 7/20/2012 Andrzej Nowak - Performance monitoring at CERN openlab 24

About openlab CERN openlab is a framework for evaluating and integrating cutting-edge IT technologies or services in partnership with industry: http://cern.ch/openlab We are the Platform Competence Center (PCC) of the CERN openlab, working closely with Intel since 11 years ago and addressing: many-core scalability performance tuning and optimization benchmarking and thermal optimization teaching 7/20/2012 Andrzej Nowak - Performance monitoring at CERN openlab 25