1 Performance Measurement Tools and Techniques
Pekka Manninen, CSC, the Finnish IT Center for Science
PRACE Petascale Summer School, Stockholm, Sweden, Aug

2 Part I: Introduction
Motivation
Traditional and petascale optimization
Performance data and optimization

3 Motivation
It is all about software: there is no point in building petascale machines without suitable software.
Single-core efficiency is no longer the key issue; scalability is.
Large-scale parallel application development is still frontier work with novel challenges: how to utilize thousands of cores, and how to deal with I/O.
Good performance analysis tools are mandatory for parallel program development.

4 Traditional optimization process overview
Application development stage: choose algorithms, choose data structures, develop or port the application.
Serial optimization stage: perform serial optimization.
Parallel optimization stage: apply parallelization, perform parallel optimization.

5 Petascale optimization flowchart
Choose the algorithms, data structures and parallelization strategy; develop the code and apply parallelization, yielding unoptimized but correct parallel code.
Measure parallel performance and assess scalability; if it is not sufficient, reduce the overhead from communication and load imbalance.
Measure single-core performance; if it is not sufficient, apply compiler optimization, tune for the processor, link optimized libraries, optimize I/O and identify the performance bottlenecks.
Assess scalability again and iterate until converged: the result is the optimized code.

6 Optimization considerations
1. Load balance
2. Minimal dedicated time for communication: minimize communication, overlap computation and communication (see the sketch below)
3. CPU utilization: optimal memory access (cache utilization), pipeline performance (branch prediction, prefetching), SIMD operations
Efficient I/O
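To make the overlap idea concrete, here is a minimal sketch (not from the original slides; the compute_* routines are hypothetical placeholders for application work) of posting nonblocking MPI calls, computing independently, and synchronizing only when the data is needed:

#include <mpi.h>

static void compute_interior(void) { /* work that needs no halo data (placeholder) */ }
static void compute_boundary(void) { /* work that needs the received halo (placeholder) */ }

void exchange_halo_overlapped(double *sendbuf, double *recvbuf, int n,
                              int left, int right, MPI_Comm comm)
{
    MPI_Request reqs[2];

    /* Post the communication early... */
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, right, 0, comm, &reqs[1]);

    /* ...overlap it with computation that does not depend on the incoming data... */
    compute_interior();

    /* ...and wait only when the halo data is actually required. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    compute_boundary();
}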

7 Examples of relevant measurements
Execution time across CPUs: in order for an application to scale, all tasks should be kept equally loaded.
MPI trace: how and when the communication is carried out, hinting at how to optimize communication; communication bandwidth.
Function call tree and execution-time profile: pinpoint the execution hotspots, i.e. where to spend most of the effort in serial optimization.
Hardware counters (e.g. cache utilization ratio, instruction usage, computational intensity, flop rate): provide insight into the potential inefficiencies of a given routine.
I/O statistics.

8 Performance data collection
Two dimensions:
When collection is triggered: asynchronously (sampling) or synchronously (code instrumentation).
How data is recorded: as a profile or as a trace file.
In other words, acquisition is either sampling or instrumentation, and presentation is either a profile or a timeline.

9 Things to keep in mind
The objective of performance analysis is to understand the behavior of the whole system and to use that understanding to improve performance.
Instrumentation always causes overhead; there are artifacts in all measurements.
A performance analyst should:
Have an understanding of the different levels of the system architecture
Be able to communicate with users as well as developers
Be patient enough to explore a broad range of hypotheses and double-check them
Be open-minded as to where the performance bottleneck could be

10 Part II: Cray performance analysis tools as an example
Overview
Usage

11 Cray performance analysis infrastructure
CrayPat:
pat_build - a utility for application instrumentation without any need for source-code modification
Transparent run-time library for measurements
pat_report - generates performance reports and the visualization file
pat_help - interactive help utility
Cray Apprentice2:
An advanced graphical performance analysis and visualization tool

12 Instrumentation with pat_build
No source or makefile modification needed; link-time instrumentation.
Requires object files; instruments compiler-optimized code.
Generates a stand-alone instrumented executable and preserves the original binary.
Automatic instrumentation at group level.
Supports both asynchronous and synchronous instrumentation.
Basic usage:
% module load xt-craypat
% make clean; make
% pat_build -g <trace groups> a.out
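For example, tracing both MPI and I/O (an illustrative combination, not prescribed by the slides):
% pat_build -g mpi,io a.out
This produces a.out+pat next to the preserved original a.out.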

13 Instrumentation with pat_build
Trace groups:
biolibs - Cray bioinformatics library routines
blas - BLAS subroutines
heap - dynamic heap
io - stdio + sysio trace groups
lapack - LAPACK subroutines
math - ANSI math
mpi - MPI statistics
omp - OpenMP API
omp-rtl - OpenMP runtime library
pthreads - POSIX threads
shmem - SHMEM API
stdio - all functions that accept or return the FILE* construct
sysio - I/O system calls
system - system calls

14 Instrumentation with pat_build
Automatic profiling analysis (APA):
% module load xt-craypat/4.2
% make clean; make
% pat_build -O apa a.out
Execution of the instrumented executable will produce a report for pat_report and an .apa file that allows fine-tuning the analysis.
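The complete APA cycle then looks roughly as follows (a sketch; the launcher call is an assumption and the generated file names are placeholders):
% aprun -n 512 ./a.out+pat          (sampling run, writes an .xf file)
% pat_report <xf-file>.xf           (writes a text profile and an .apa file)
% pat_build -O <apa-file>.apa       (reinstruments the executable to trace the suggested functions)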

15 Fine-grained instrumentation
Fortran:
  include 'pat_apif.h'
  ...
  call PAT_region_begin(id, label, ierr)
  <code segment>
  call PAT_region_end(id, ierr)
C:
  #include <pat_api.h>
  ...
  ierr = PAT_region_begin(id, label);
  <code segment>
  ierr = PAT_region_end(id);
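A self-contained sketch of the C interface in use (the region id, label and loop are hypothetical; pat_api.h becomes available once the CrayPat module is loaded):

#include <pat_api.h>
#include <stdio.h>

int main(void)
{
    double sum = 0.0;
    int ierr;

    /* Everything between begin and end is reported as its own region. */
    ierr = PAT_region_begin(1, "accumulate_loop");
    for (int i = 0; i < 1000000; ++i)
        sum += (double)i * 1.0e-6;
    ierr = PAT_region_end(1);

    printf("sum = %f (ierr = %d)\n", sum, ierr);
    return 0;
}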

16 Collecting data
Run the instrumented application (a.out+pat) as usual; the measurement data (an .xf file) is then created.
Must be run on a Lustre file system.
Optional runtime variables:
Get an optional timeline view of the program by setting PAT_RT_SUMMARY=0
Request hardware performance counter information by setting PAT_RT_HWPC=<HWPC group id>
The number of files used to store the raw data can be customized with PAT_RT_EXPFILE_MAX
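For example (illustrative values; aprun is assumed to be the launcher on the Cray XT systems discussed here):
% export PAT_RT_HWPC=2          (collect the L1 and L2 cache metric group)
% export PAT_RT_SUMMARY=0       (optional: keep full data for a timeline view)
% aprun -n 512 ./a.out+pat      (writes the .xf measurement file)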

17 Collecting data
Hardware performance counter groups:
1 - Summary with translation lookaside buffer metrics
2 - L1 and L2 cache metrics
3 - Bandwidth information
4 - HyperTransport information
5 - Floating-point instruction (including SSE) information
6 - Cycles stalled and resources empty
7 - Cycles stalled and resources full
8 - Instructions and branches
9 - Instruction cache values

18 Analysis with pat_report
Combines the information from the binary with the raw performance measurement data.
Generates a text report of the performance results and/or formats the data to be visualized with Apprentice2.
Basic usage:
pat_report -O <keywords> data_file.xf
Useful keywords:
profile - subroutine-level data
callers - function callers
calltree - call tree
heap - heap information, instrument with -g heap
mpi - MPI statistics, instrument with -g mpi
load_balance - load balance information
help - available options
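For example (the data file name is a placeholder):
% pat_report -O profile,callers,load_balance <data_file>.xf > report.txt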

19 Analysis with pat_report
-O keywords and the tables they produce:
ct - shorthand for -O calltree
defaults - the tables that would appear by default
heap - shorthand for -O heap_program,heap_hiwater,heap_leaks
io - shorthand for -O read_stats,write_stats
lb - shorthand for -O load_balance
load_balance - shorthand for -O lb_program,lb_group,lb_function
mpi - shorthand for -O mpi_callers
callers - Profile by Function and Callers
callers+src - Profile by Function and Callers, with Line Numbers
calltree - Function Calltree View
calltree+src - Calltree View with Callsite Line Numbers
heap_hiwater - Heap Stats during Main Program
heap_leaks - Heap Leaks during Main Program
heap_program - Heap Usage at Start and End of Main Program
hwpc - HW Performance Counter Data
load_balance_function - Load Balance across PE's by Function
load_balance_group - Load Balance across PE's by FunctionGroup
load_balance_program - Load Balance across PE's

20 Analysis with pat_report
load_balance_sm - Load Balance with MPI Sent Message Stats
loops - Loop Stats from -hprofile_generate
mpi_callers - MPI Sent Message Stats by Caller
mpi_dest_bytes - MPI Sent Message Stats by Destination PE
mpi_dest_counts - MPI Sent Message Stats by Destination PE
mpi_rank_order - Suggested MPI Rank Order
mpi_sm_rank_order - Sent Message Stats and Suggested MPI Rank Order
pgo_details - Loop Stats detail from -hprofile_generate
profile - Profile by Function Group and Function
profile+src - Profile by Group, Function, and Line
profile_pe.th - Profile by Function Group and Function
profile_pe_th - Profile by Function Group and Function
profile_th_pe - Profile by Function Group and Function
program_time - Program Wall Clock Time
read_stats - File Input Stats by Filename
samp_profile - Profile by Function
samp_profile+src - Profile by Group, Function, and Line
thread_times - Program Wall Clock Time
write_stats - File Output Stats by Filename

21 Part III: Case study
Workplan
Demonstration of tools
Analysis

22 Case study: CP2K performance analysis
A code for ab initio molecular dynamics simulations, written in Fortran 95; uses MPI for parallelization.
Test job: a DFT simulation of 512 water molecules over a few time steps.
We wish to know the following:
Function profile - where are the hotspots?
Single-core efficiency
Load balance and communication analysis - what are the scalability bottlenecks?
Peak performance around 19 TFlop/s - still quite far away from petascale.
Runtimes (s) on Cray XT4 by number of CPUs, with the FFTSG and FFTW FFT libraries.
At the largest core count there is not enough to calculate per task, and so much communication overhead, that the execution is slower than with 1024 cores!

23 Case study: CP2K performance analysis
Workplan:
Instrument the code (built with the CrayPat module loaded): pat_build -O apa cp2k.pat (gives cp2k.pat+pat)
Execute cp2k.pat+pat with 512 cores (gives an .xf file)
Obtain a sampling profile: pat_report -O samp_profile+src xf-file.xf > samp_profile. This produces the profile as well as an .apa file.
Run a more exhaustive analysis by editing the .apa file and rebuilding the executable: pat_build -O apa-file.apa. Taking this longer route reduces the instrumentation overhead.
Executing this will produce another .xf file; visualize the analysis with Apprentice2.
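Collected as a command sequence (a sketch; the launcher invocation, input file and generated file names are assumptions):
% pat_build -O apa cp2k.pat                        (gives cp2k.pat+pat)
% aprun -n 512 ./cp2k.pat+pat <input>              (gives an .xf file)
% pat_report -O samp_profile+src <xf-file>.xf > samp_profile
% pat_build -O <apa-file>.apa                      (reinstrument for tracing)
% aprun -n 512 ./cp2k.pat+pat <input>              (gives the second .xf file for Apprentice2)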

24 Sampling profile output
Table 1: Profile by Group, Function, and Line (Samp %, Samp, Imb. Samp and Imb. Samp % per Group / Function / Source / Line, PE='HIDE'), 100.0% Total:
MPI group - the largest share, dominated by MPI_Bcast, MPI_Recv, mpi_allreduce_, mpi_alltoallv_, MPI_Reduce, mpi_waitall_, MPI_Send and mpi_alltoall_, with individual shares roughly between 1% and 15% of all samples
ETC group - 23.4%, dominated by dgemm_kernel, dgemm_otcopy and dgemm_oncopy
USER group - 13.5%, dominated by UPDATE_COST_CPU_ROW in distribution_optimize (around source line 358)

25 Sampling profile
The corresponding MPI / ETC / USER fractions with other processor counts (a table of #cores versus group shares).

26 APA file
# You can edit this file, if desired, and use it
# to reinstrument the program for tracing like this:
#
#   pat_build -O cp2k.pat+pat sdt.apa
#
# These suggested trace options are based on data from:
#
#   /wrk/pmannin/prace/cp2k/cp2k/tests/scala/512/sg/cp2k.pat+pat sdt.ap2,
#   /wrk/pmannin/prace/cp2k/cp2k/tests/scala/512/sg/cp2k.pat+pat sdt.xf
#
# HWPC group to collect by default.
#   -Drtenv=PAT_RT_HWPC=0 # Summary with instructions metrics.
-Drtenv=PAT_RT_HWPC=2
#
# Libraries to trace.
-g mpi
Annotations on the slide: insert the desired trace groups here (the -g line); select the hardware counter group here, or give the PAPI counters explicitly - here group 2 is chosen.

27 APA file continued
# User-defined functions to trace, sorted by % of samples.
# Limited to top 200. A function is commented out if it has < 1%
# of samples, or if a cumulative threshold of 90% has been reached,
# or if it has size < 200 bytes.
-w # Enable tracing of user-defined functions.
# Note: -u should NOT be specified as an additional option.
# 3.21% 756 bytes
-T UPDATE_COST_CPU_ROW.in.DISTRIBUTION_OPTIMIZE
# 0.74% 5757 bytes
-T PW_NN_COMPOSE_R_WORK.in.PW_SPLINE_UTILS
# 0.66% bytes
-T RS_PW_TRANSFER_DISTRIBUTED.in.REALSPACE_GRID_TYPES
# 0.64% 9426 bytes
-T RS_PW_TRANSFER_REPLICATED.in.REALSPACE_GRID_TYPES
# 0.56% 2172 bytes
-T PW_COMPOSE_STRIPE.in.PW_SPLINE_UTILS
# 0.49% bytes
# -T CP_SM_FM_MULTIPLY_2D.in.CP_SM_FM_INTERACTIONS
We may now select the functions that we want to trace, and reduce the instrumentation overhead by ignoring functions with little significance for the execution time; here a few of them have been uncommented.

28 Performance analysis
Let us get the most out of the analysis data:
pat_report -O defaults,mpi,io,lb,heap,mpi_dest_bytes,hwpc,mpi_sm_rank_order xffile.xf

29 Table 1: Profile by Function Group and Function (Experiment=1 / Group / Function / PE='HIDE'), totals for the program: Time% 100.0%, with hardware counter derived metrics including a utilization rate of 86.6%, a D1 cache hit ratio of 99.2%, a D2 cache hit ratio of 72.9%, a combined D1+D2 hit ratio of 99.8% and an effective D1+D2 reuse of 7.57 refs/byte, plus the REQUESTS_TO_L2:DATA, DATA_CACHE_REFILLS and PAPI_L1_DCA rates and the System-to-D1 and L2-to-Dcache bandwidths. Note how the instrumentation affects the performance.

30 MPI_SYNC group: Time% 51.9%. This is all load imbalance! Hardware counter metrics for the group: utilization rate 99.4%, D1 cache hit ratio 100.0%, D2 cache hit ratio 93.8%, and very little memory traffic (System-to-D1 bandwidth 1.497 MB/sec).

31 USER group: Time% 31.1%. Cache statistics: there is room to improve both L1 and L2 utilization. Utilization rate 65.0%, D1 cache hit ratio 97.7%, D2 cache hit ratio 74.7%, combined D1+D2 hit ratio 99.4%, effective D1+D2 reuse 2.74 refs/byte.
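As a generic illustration of the kind of restructuring that improves L1/L2 reuse (not taken from CP2K), a blocked loop nest revisits data while it still resides in cache:

#include <stddef.h>

#define BLOCK 64   /* tuning parameter, chosen so a tile fits in L1 */

/* Transpose-accumulate a += b^T, traversed tile by tile so that the
 * strided accesses to b stay within a cache-sized block. */
void transpose_add_blocked(double *a, const double *b, size_t n)
{
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t jj = 0; jj < n; jj += BLOCK)
            for (size_t i = ii; i < ii + BLOCK && i < n; ++i)
                for (size_t j = jj; j < jj + BLOCK && j < n; ++j)
                    a[i * n + j] += b[j * n + i];
}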

32 Table 2: Load Balance with MPI Sent Message Stats (Time %, Time, Sent Msg Count, Sent Msg Total Bytes, Avg Sent Msg Size, per Group and PE[mmm]): of the 100.0% total, MPI_SYNC accounts for the largest share, USER for 31.1% and MPI for 15.3%, with the minimum, median and maximum PEs listed for each group. MPI data transfer without load imbalance.

33 Table 3: MPI Sent Message Stats by Caller - the message passing profile. Message counts are binned by size (<16B, 16B-256B, 256B-4KB, 4KB-64KB, 64KB-1MB) per function and caller. The dominant chain is mpi_isend_ called from MP_ISENDRECV_RM2 (MESSAGE_PASSING), reached via CP_SM_FM_MULTIPLY_2D and CP_SM_FM_MULTIPLY (CP_SM_FM_INTERACTIONS), QS_KS_BUILD_KOHN_SHAM_MATRIX and QS_KS_UPDATE_QS_ENV (QS_KS_METHODS), SCF_ENV_DO_SCF and SCF (QS_SCF), QS_ENERGIES and QS_FORCES, and FORCE_ENV_CALC_ENERGY_FORCE (FORCE_ENV_METHODS). We could go into a routine and see if we could aggregate small messages.
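If such a routine turned out to send many tiny messages, one common remedy is to pack them into a single buffer and send once; a minimal sketch (hypothetical code, not from CP2K):

#include <mpi.h>
#include <string.h>

/* Replace nmsg separate sends of len doubles each with a single send of the
 * packed buffer, amortizing the per-message latency and MPI overhead. */
void send_aggregated(double **pieces, int nmsg, int len,
                     double *packbuf, int dest, MPI_Comm comm)
{
    for (int m = 0; m < nmsg; ++m)
        memcpy(packbuf + (size_t)m * len, pieces[m], (size_t)len * sizeof(double));

    MPI_Send(packbuf, nmsg * len, MPI_DOUBLE, dest, 0, comm);
}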

34 Table 5: Heap Stats during Main Program (tracked heap high-water MBytes, total allocations, allocations not tracked, total frees, frees not tracked, tracked objects not freed, tracked MBytes not freed, per PE[mmm]).

35 Table 7: File Input Stats by Filename (read time, read MB, read rate MB/sec, number of reads, bytes per call, per file name, PE[mmm] and file descriptor).

36 Table 8: File Output Stats by Filename (write time, write MB, write rate MB/sec, number of writes, bytes per call, per file name, PE[mmm] and file descriptor), for out-w512-restart.wfn, <NA> and stdout. Rather modest I/O, but check the I/O bandwidth!

37 Table 13: Load Balance across PE's, entry for pe.491: Time% 0.2%, utilization rate 96.9%, D1 cache hit ratio 99.5%, D2 cache hit ratio 73.5%, D1+D2 cache hit ratio 99.9%. Compare the number of calls and the time spent.

38 Table 13: Load Balance across PE's, entry for pe.2: Time% 0.2%, utilization rate 72.0%, D1 cache hit ratio 98.8%, D2 cache hit ratio 74.1%, D1+D2 cache hit ratio 99.7%, effective D1+D2 reuse 5.03 refs/byte. Compare the number of calls and the time spent.

39 Table 15: Load Balance across PE's by Function, USER / PW_NN_COMPOSE_R_WORK.in.PW_SPLINE_UTILS on two different PEs: both show Time% 0.0% and 576 calls, but on one of the PEs the hardware counter values (requests to L2, cache refills, L1 data accesses, user time in cycles) are essentially zero. Why doesn't this routine employ all of the processes?

40 Table 15: Load Balance across PE's by Function, USER / PW_COMPOSE_STRIPE.in.PW_SPLINE_UTILS on several PEs: again Time% 0.0%, with the hardware counter activity concentrated on only some of the PEs. Or this one?

41 Table 20: HW Performance Counter Data (Experiment=1 / PE='HIDE'), totals for the program: utilization rate 86.6%, D1 cache hit ratio 99.2%, D2 cache hit ratio 72.9%, D1+D2 cache hit ratio 99.8%, effective D1+D2 reuse 7.57 refs/byte, together with the REQUESTS_TO_L2:DATA, DATA_CACHE_REFILLS and PAPI_L1_DCA rates and the System-to-D1 and L2-to-Dcache bandwidths.

42 Table 21: Sent Message Stats and Suggested MPI Rank Order - sent message total bytes per MPI rank (max, avg, min) and per node for the rank orders d and u and for a suggested custom ordering, given separately for dual-core and quad-core nodes. According to CrayPat, the default SMP-like placement of ranks is the worst choice; however, this custom placement of ranks did not do much in practice.

43 Visualization with Apprentice2
The previous statistics can be viewed graphically with Apprentice2.
pat_report version 4.2 produces the .ap2 file; with older versions you get it with the -f ap2 switch to pat_report.
Just launch it with the app2 command.
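For example (the data file name is a placeholder):
% app2 <data_file>.ap2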

44 You will get more info by holding the mouse cursor over a slice. Clicking will show the load balance of the function

45 Smallest, average and largest individual time

46 Same profile as a list; clicking a function will show its HW counter data

47 HW counter overview. It would be useful if cache-miss or cycle-stall counters had been recorded.

48 Routine call flow window (who calls who) and how the execution is divided

49 The largest execution time on the left, smallest on the right

50 Final remarks on CP2K analysis
Load balance is a major concern, and the largest optimization efforts should be put there.
Thanks to the performance analysis, we know in which routines the most severe imbalances occur and can start looking into the issue from there.
Messages are usually not too small, so there is no need for tedious aggregation.
I/O seems to be written efficiently.
Single-core efficiency should be investigated in the most intense routines; L1 and L2 access could be improved.
For true optimization work we would also need other HW counter data, e.g. SSE instruction utilization and memory stalls.
