Performance Debugging: Methods and Tools. David Skinner
1 Performance Debugging: Methods and Tools David Skinner
2 Performance Debugging: Methods and Tools
Principles: topics in performance scalability; examples of areas where tools can help.
Practice: where to find tools; specifics to NERSC and Hopper.
Scope & audience: budding simulation scientists and application developers; compiler/middleware developers, YMMV.
3 One Slide about NERSC
Serving all of DOE Office of Science: domain breadth, range of scales.
Science-driven sustained performance.
Lots of users: ~4K active, ~500 logged in, ~300 projects.
Architecture-aware procurements driven by workload needs.
4 Big Picture of Performance and Scalability
5 What is performance? Big fast computers alone are not performance: no output is not performance, and incorrect output is not performance. One misconfiguration in July 2012 cost roughly $400M in 30 minutes.
6 App Performance Dimensions: code, input deck, computer, concurrency, workload, person.
7 Performance is Relative
To your goals: time to solution (T_queue + T_wall), your research agenda, efficient use of your allocation.
To the application: code, input deck, machine type/state.
Suggestion: focus on specific use cases rather than making everything perform well. Bottlenecks can shift.
8 Performance, more than a single number
Plan where to put effort. [Workflow cartoon: Formulate research problem -> Coding -> Debug -> Queue wait -> jobs, jobs, jobs -> Perf debug -> UQ/VV -> Data? -> Understand & publish!]
Optimization in one area can de-optimize another.
Timings come from timers, and also from your calendar: time spent coding.
Sometimes a slower algorithm is simpler to verify for correctness.
9 Different Facets of Performance
Serial: leverage ILP on the processor, feed the pipelines, exploit data locality, reuse data in cache.
Parallel: expose concurrency, minimize latency effects, maximize work vs. communication.
10 Performance is Hierarchical
Registers: instructions & operands. Caches: lines. Local memory: pages. Remote memory: messages. Disk / filesystem: blocks, files.
11 On to specifics about HPC tools, mostly at NERSC but fairly general.
12 Tools are Hierarchical
Registers and caches: PAPI. Local memory: valgrind. Remote memory (messages): PMPI, CrayPat, IPM, TAU. Disk / filesystem: SAR/LMT.
13 HPC Perf Tool Mechanisms
Sampling: regularly interrupt the program and record where it is; build up a statistical profile.
Tracing / instrumenting: insert hooks into the program to record and time events; document everything.
Hardware event counters: special registers count events on the processor, e.g. floating point instructions; many possible events, but only a few (~4) counters at a time.
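For the hardware-counter mechanism, a minimal hedged PAPI sketch (the event choice and omitted error handling are assumptions, not from the slides) might look like:

    #include <stdio.h>
    #include <papi.h>

    /* Count floating-point operations around a kernel using one preset event. */
    int main(void) {
        int eventset = PAPI_NULL;
        long long count = 0;

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
        PAPI_create_eventset(&eventset);
        PAPI_add_event(eventset, PAPI_FP_OPS);   /* preset: floating-point operations */

        PAPI_start(eventset);
        /* ... kernel to be measured goes here ... */
        PAPI_stop(eventset, &count);

        printf("FP ops: %lld\n", count);
        return 0;
    }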
14 Typical Tool Use Requirements
(Sometimes) modify your code with macros, API calls, timers. Compile your code. Transform your binary for profiling/tracing with a tool. Run the transformed binary; a data file is produced. Interpret the results with a tool.
15 Performance Tools at NERSC
Vendor tools: CrayPat.
Community tools: TAU (U. Oregon via ACTS), PAPI (Performance Application Programming Interface), gprof, IPM (Integrated Performance Monitoring).
16 What can HPC tools tell us?
CPU and memory usage: FLOP rate, memory high-water mark.
OpenMP: OMP overhead, OMP scalability (finding the right # threads).
MPI: % wall time in communication, detecting load imbalance, analyzing message sizes.
17 Using the right tool
Tools can add overhead to code execution: what level can you tolerate?
Tools can add overhead to scientists: what level can you tolerate?
Scenarios: debugging a code that is slow; detailed performance debugging; performance monitoring in production.
18 Introduction to CrayPat
A suite of tools providing a wide range of performance-related information.
Can be used for both sampling and tracing user codes, with or without hardware or network performance counters. Built on PAPI.
Supports Fortran, C, C++, UPC, MPI, Coarray Fortran, OpenMP, Pthreads, SHMEM.
Man pages: intro_craypat(1), intro_app2(1), intro_papi(1).
19 Using Hopper
1. Access the tools: module load perftools
2. Build your application; keep the .o files: make clean; make
3. Instrument the application: pat_build ... a.out  (result is a new file, a.out+pat)
4. Run the instrumented application to get the top time-consuming routines: aprun ... a.out+pat  (result is a new file XXXXX.xf, or a directory containing .xf files)
5. Run pat_report on that new file and view the results: pat_report XXXXX.xf > my_profile; vi my_profile  (this also produces a new file, XXXXX.ap2)
20 Guidelines for Optimization (* suggested by Cray)
Derived metric                               | Optimization needed when* | PAT_RT_HWPC group
Computational intensity                      | < 0.5 ops/ref             | 0, 1
L1 cache hit ratio                           | < 90%                     | 0, 1, 2
L1 cache utilization (misses)                | < 1 avg hit               | 0, 1, 2
L1+L2 cache hit ratio                        | < 92%                     | 2
L1+L2 cache utilization (misses)             | < 1 avg hit               | 2
TLB utilization                              | < 0.9 avg use             | 1
(FP Multiply / FP Ops) or (FP Add / FP Ops)  | < 25%                     | 5
Vectorization                                | < 1.5 for dp; 3 for sp    | 12 (13, 14)
21 Perf Debug and Production Tools
IPM, Integrated Performance Monitoring: MPI profiling, hardware counter metrics, POSIX I/O profiling.
IPM requires no code modification and no instrumented binary: on systems that support dynamic libraries, only a module load ipm before running your program; otherwise link with the IPM library.
IPM uses hooks already in the MPI library to intercept your MPI calls and wrap them with timers and counters.
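Those hooks are the standard MPI profiling (PMPI) interface; a minimal hedged sketch of the wrapping idea (not IPM's actual source) looks like this:

    #include <mpi.h>
    #include <stdio.h>

    static double send_time  = 0.0;   /* accumulated time in MPI_Send */
    static long   send_calls = 0;

    /* Intercept MPI_Send: time it, then forward to the real implementation. */
    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm) {
        double t0 = MPI_Wtime();
        int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
        send_time += MPI_Wtime() - t0;
        send_calls++;
        return rc;
    }

    /* Report at shutdown by also intercepting MPI_Finalize. */
    int MPI_Finalize(void) {
        printf("MPI_Send: %ld calls, %.3f s\n", send_calls, send_time);
        return PMPI_Finalize();
    }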
22 IPM: Let's See
1) Do module load ipm, link with $IPM, then run normally.
2) Upon completion you get a banner like:
##IPM2v0.xx##################################################
# command   : ./fish -n
# start     : Tue Feb 10 11:05:     host      : nid06027
# stop      : Tue Feb 10 11:08:     wallclock :
# mpi_tasks : 25 on 2 nodes         %comm     : 1.62
# mem [GB]  : 0.24                  gflop/sec : 5.06
Maybe that's enough. If so, you're done. Have a nice day!
23 IPM: IPM_PROFILE=full
[Sample full IPM report; numeric values elided]
Header: host, mpi_tasks (32 on 2 nodes), start/stop (11/30/04), wallclock, %comm (27.72), gbytes total, gflop/sec total.
Per-task rows ([total] <avg> min max): wallclock, user, system, mpi, %comm, gflop/sec, gbytes.
Hardware counters: PM_FPU0_CMPL, PM_FPU1_CMPL, PM_FPU_FMA, PM_INST_CMPL, PM_LD_CMPL, PM_ST_CMPL, PM_TLB_MISS, PM_CYC.
Per-MPI-call rows ([time] [calls] <%mpi> <%wall>): MPI_Send, MPI_Wait, MPI_Irecv, MPI_Barrier, MPI_Reduce, MPI_Comm_rank, MPI_Comm_size.
24 Analyzing IPM Data
Communication time per type of MPI call. CDF of time per MPI call over message sizes. Pairwise communication volume (communication topology).
25 Tracing: Hard but Thorough
[Trace timeline plot: MPI rank vs. time, colored by Flops, Exchange, and Sync]
26 Advice: develop (some) portable approaches to app optimization
There is a tradeoff between vendor-specific and vendor-neutral tools; each has its role, and vendor tools can often dive deeper.
Portable approaches allow apples-to-apples comparisons; events, counters, and metrics may be incomparable across vendors.
You can find printf most places. Put a few timers in your code?
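A hedged sketch of such hand-rolled timers, built only on MPI_Wtime so it ports anywhere MPI does (the names and fixed slot count are illustrative, not from the slides):

    #include <stdio.h>
    #include <mpi.h>

    /* Minimal region timers: start/stop accumulate into numbered slots. */
    static double t_start[16], t_total[16];

    static void timer_start(int id) { t_start[id] = MPI_Wtime(); }
    static void timer_stop(int id)  { t_total[id] += MPI_Wtime() - t_start[id]; }
    static void timer_report(int id, const char *name) {
        printf("%-16s %.6f s\n", name, t_total[id]);
    }

    /* Usage (solver_step is a placeholder for your own routine):
     *   timer_start(0);  solver_step();  timer_stop(0);
     *   ...
     *   timer_report(0, "solver_step");
     */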
27 So you run your code with a perf tool and get some numbers... what do they mean?
28 Performance: Definitions
How do we measure performance? Speedup = T_s / T_p(n); Efficiency = T_s / (n * T_p(n)), where n = cores, s = serial, p = parallel.
Be aware there are multiple definitions for these terms.
Isoefficiency: contours of constant efficiency amongst all problem sizes and concurrencies.
29 Parallel speedups for x#x: how much? [Plot: speedup vs. # CPUs (threads via OpenMP)]
30 Systematic Perf Measurement
Scaling studies involve changing the degree of parallelism. Will we be changing the problem also?
Strong scaling: fixed problem size. Weak scaling: problem size grows with additional resources.
Optimization: is the concurrency choosing the problem size, or vice versa?
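A hedged sketch of how the two modes differ when setting up a run (the names are illustrative, not from the slides):

    #include <mpi.h>

    /* Pick the per-rank problem size for a scaling study.
     * Strong scaling: total work is fixed, so each rank's share shrinks as ranks grow.
     * Weak scaling:   per-rank work is fixed, so total work grows with the rank count. */
    long local_size(long n_total_strong, long n_per_rank_weak, int weak, MPI_Comm comm) {
        int nranks;
        MPI_Comm_size(comm, &nranks);
        if (weak)
            return n_per_rank_weak;          /* total = n_per_rank_weak * nranks */
        return n_total_strong / nranks;      /* total stays n_total_strong */
    }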
31 Sharks and Fish: Cartoon
Data: n_fish is global, my_fish is local; fish_i = {x, y, ...}.
Dynamics: V_ij ~ 1/r_ij, F = m*a, H = K + V, dq/dt = dH/dp, dp/dt = -dH/dq.
MPI_Allgatherv(myfish_buf, len[rank], ..
for (i = 0; i < my_fish; ++i) {
  for (j = 0; j < n_fish; ++j) {   /* i != j */
    a_i += g * mass_j * (fish_i - fish_j) / r_ij;
  }
}
Move fish. See a glimpse here:
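A hedged, self-contained C sketch of this cartoon kernel (the Fish layout, MPI_BYTE counts, and the softening term are illustrative assumptions, not the original fish_sim code):

    #include <math.h>
    #include <mpi.h>

    typedef struct { double x, y, mass; } Fish;

    /* Gather every rank's fish, then accumulate accelerations on the locally
     * owned fish from all fish in the ocean (O(n^2) cartoon force law). */
    void force_step(Fish *my_fish, int n_local, Fish *all_fish,
                    int *counts, int *displs, int n_fish,
                    double *ax, double *ay, double g, MPI_Comm comm) {
        /* counts/displs are per-rank sizes and offsets in bytes (MPI_BYTE). */
        MPI_Allgatherv(my_fish, n_local * (int)sizeof(Fish), MPI_BYTE,
                       all_fish, counts, displs, MPI_BYTE, comm);

        for (int i = 0; i < n_local; ++i) {
            ax[i] = ay[i] = 0.0;
            for (int j = 0; j < n_fish; ++j) {
                double dx = all_fish[j].x - my_fish[i].x;
                double dy = all_fish[j].y - my_fish[i].y;
                /* softening stands in for the slide's i != j test */
                double r = sqrt(dx * dx + dy * dy + 1e-9);
                ax[i] += g * all_fish[j].mass * dx / r;
                ay[i] += g * all_fish[j].mass * dy / r;
            }
        }
    }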
32 Sharks and Fish Results
Running on a NERSC machine:
100 fish can move 1000 steps: 1 task -> 5.459 s; 32 tasks -> 2.756 s (1.98x speedup).
1000 fish can move 1000 steps: 1 task -> ... s; 32 tasks -> ... s (24.6x speedup).
What's the best way to run? How many fish do we really have? How large a computer do we have? How much computer time, i.e. allocation, do we have? How quickly, in real wall time, do we need the answer?
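As a worked check against the definitions on slide 28 (arithmetic added here, not on the slide), the 100-fish case gives:

    Speedup    S(32) = T_s / T_p(32) = 5.459 / 2.756 ≈ 1.98
    Efficiency E(32) = S(32) / 32    ≈ 0.06

That is only about 6% parallel efficiency, suggesting 32 tasks is far too many for just 100 fish.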
33 Good 1st Step: Do runtimes make sense? Running fish_sim for ... fish on 1-32 CPUs we see: [runtime plot, 1 task through 32 tasks]
34 Isoefficiencies
35 Too much communication
36 Scaling studies are not always so simple
37 How many perf measurements?
With a particular goal in mind, we systematically vary concurrency and/or problem size.
Example: how large a 3D (n^3) FFT can I efficiently run on 1024 CPUs? [scaling plot] Looks good?
38 The scalability landscape
Whoa! Why so bumpy? Algorithm complexity or switching, communication protocol switching, inter-job contention, ~bugs in vendor software.
39 Not always so tricky: main loop in jacobi_omp.f90; ngrid=6144 and maxiter=20.
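The slide's code is not reproduced in this transcription; a rough illustrative OpenMP sweep in C (not the Fortran jacobi_omp.f90 loop itself) would look like:

    #include <omp.h>

    /* One Jacobi sweep over an ngrid x ngrid interior, parallelized over rows. */
    void jacobi_sweep(int ngrid, double **u, double **unew) {
        #pragma omp parallel for
        for (int i = 1; i < ngrid - 1; ++i)
            for (int j = 1; j < ngrid - 1; ++j)
                unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j]
                                   + u[i][j-1] + u[i][j+1]);
    }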
40 Scaling Studies at Scale
Gyrokinetic Toroidal Code (fusion simulation), OpenMP enabled with 4/6 threads, scaling up to ... cores on 3 machines (Kraken, Carver, Ranger).
[Plots vs. number of cores, each with an ideal-scaling line: wallclock scaling, time in MPI calls, time in OpenMP parallel regions]
41 Let's look at scaling performance in more depth. A key impediment is often load imbalance.
42 Load Imbalance: Pitfall 101
Communication time: 64 tasks show 200 s, 960 tasks show 230 s. [Plot: MPI ranks sorted by total communication time]
43 Load imbalance is pernicious: keep checking it.
44 Load Balance: cartoon
[Unbalanced vs. balanced timelines: the universal app; time saved by load balance]
45 Load Balance: Summary
Imbalance is often a byproduct of 1) data decomposition or 2) multi-core concurrency quirks, and it must be addressed before further MPI tuning can happen.
For regular grids, consider padding or contracting. Good software exists for graph partitioning / remeshing. Dynamic load balance may be required for adaptive codes.
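One hedged way to quantify imbalance before tuning (a sketch, not from the slides): time each rank's compute phase and compare the maximum to the average across ranks.

    #include <stdio.h>
    #include <mpi.h>

    /* Imbalance of a timed phase: max/avg across ranks (1.0 means perfectly balanced). */
    void report_imbalance(double my_time, MPI_Comm comm) {
        int rank, nranks;
        double tmax, tsum;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &nranks);
        MPI_Reduce(&my_time, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, comm);
        MPI_Reduce(&my_time, &tsum, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
        if (rank == 0) {
            double tavg = tsum / nranks;
            printf("phase time: max %.3f s, avg %.3f s, imbalance %.2f\n",
                   tmax, tavg, tmax / tavg);
        }
    }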
46 Other performance scenarios
47 Simple stuff: what's wrong here? [IPM profile] This is why we need perf tools that are easy to use.
48 Not so simple: communication topology
[Pairwise communication-volume maps for MILC, MAESTRO, GTC, PARATEC, IMPACT-T, CAM]
49 Application Topology
50 Performance in Batch Queue Space
51 A few notes on queue optimization
Consider your schedule and the queue constraints.
Charge factor: regular vs. low, scavenger queues, xfer queues, downshifting concurrency.
Run limit, queue limit, wall limit (soft: can you checkpoint?). Jobs can submit other jobs.
52 Marshalling your own workflow
Lots of choices in general: Hadoop, CondorG, MySGE. On Hopper it's easy:
#PBS -l mppwidth=4096
aprun -n 512 ./cmd &
aprun -n 512 ./cmd &
aprun -n 512 ./cmd &
wait
or, in pseudocode:
#PBS -l mppwidth=4096
while (work_left) {
  if (nodes_avail) {
    aprun -n X next_job &
  }
  wait
}
53 Thanks!
[Workflow cartoon again: Formulate research problem -> Coding -> Debug -> Queue wait -> jobs -> Perf debug -> UQ/VV -> Data? -> Understand & publish!]
Contacts:
