
An Experimental Model to Analyze OpenMP Applications for System Utilization
Mark Woodyard, Principal Software Engineer

The following is an overview of a research project. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle. The opinions expressed here are my own and do not express the views of Oracle Corporation.

The Importance of System Utilization
- No program is 100% parallel
- Utilization is unimportant for one small, cheap server... but critically important for a large data center
- No OS can provide 100% of its cycles to applications
- Focus on single-application performance: lean on processor sets and processor binding; result: reduced overall utilization
- Core counts per socket are increasing, and systems are getting larger (more sockets & cores): more than many applications can efficiently use

OpenMP Performance in a Throughput Context
[Diagram: the operating system scheduler maps the software threads of several processes (a 17-thread process, a single-thread process, a 5-thread process) onto hardware threads, cores, and sockets via context switches. Scheduling & work-distribution interactions impact utilization. Process = executing program.]

Need: Enable High Parallel Utilization
Philosophy: make efficient use of any available cycles. If my program doesn't, how do I find the cause?
Key scalability questions:
- How well does the entire program scale?
- Which parallel region is a bottleneck?
- Do I have enough parallel work vs. overhead?
- Is the serial time significant vs. the parallel work?
- Are the work chunks large enough vs. the time quantum?

Comparison of Scaling Analysis Models

Dedicated Analysis (time to solution):
- Run at 1, 2, ..., N threads, one thread per core
- Compare speedup at each thread count
- Scaling reference: traditional 1 vs. N cores
- Scaling bottleneck: ambiguous (OpenMP or hardware)

Non-Dedicated* Analysis (usage of available cycles):
- Run only at N threads, on up to N cores
- Model speedup at <= N cores
- Scaling reference: loaded system
- Scaling bottleneck: unambiguous (OpenMP only)

*Non-Dedicated is also called Throughput.

Tool Overview

Experimental Tool
- The experimental tool implements the model
- Runs on Solaris (x64 & SPARC); the model applies to any UNIX system
- The tool generates a shell/DTrace script to run the target program
- The trace data is processed to generate a performance model:
  - Reconstruct the OpenMP program structure as run
  - Simulate execution on a range of core counts (round-robin; sketched below)
  - Generate a text report and, optionally, a sequence of graphs
- Speedup analysis is aggregated to:
  - each worksharing construct
  - each parallel region
  - the entire program
- Complementary to existing tools
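As a rough illustration of the round-robin simulation step (a sketch of the idea only, not the tool's actual simulator; the per-chunk work times below are made up), the following deals measured work times out over a given number of cores and takes the busiest core as the estimated elapsed time:

   ! Sketch: estimate elapsed time at 1..4 cores from illustrative per-chunk work times.
   program rr_estimate
     implicit none
     double precision :: work(4) = (/ 14.0d0, 14.0d0, 15.0d0, 14.0d0 /)  ! ms, made up
     double precision :: core_busy(4)
     integer :: ncores, i, c
     do ncores = 1, 4
        core_busy = 0.0d0
        do i = 1, size(work)
           c = mod(i - 1, ncores) + 1          ! round-robin chunk-to-core assignment
           core_busy(c) = core_busy(c) + work(i)
        end do
        ! Cores run in parallel, so the busiest core bounds the elapsed time
        print '(a,i0,a,f6.1,a)', 'cores=', ncores, '  estimated elapsed = ', &
              maxval(core_busy(1:ncores)), ' ms'
     end do
   end program rr_estimate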

Terms Used: What is Speedup?
- Speedup: the ratio of work time to elapsed time
- Work Time: any time that is not OpenMP overhead, measured as run with N threads
- Ideal Speedup: the speedup without OpenMP overhead
- Elapsed Time: measured (as run), presuming all cores are available, or estimated via simulation (1 through N cores, using the times as run)
- Preceding Serial: the single-thread time before a parallel region
- Stand-Alone: loop speedup without considering the preceding serial time
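One way to write the two central ratios (my reading of the definitions above, with N the number of threads as run):

\[
\text{Speedup} = \frac{\sum_{t=1}^{N} \text{work time}_t}{\text{elapsed time}},
\qquad
\text{Ideal Speedup} = \frac{\sum_{t=1}^{N} \text{work time}_t}{\text{elapsed time with OpenMP overhead removed}}
\]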

Example Tool Speedup Analysis: Entire Program
[Chart: estimated whole-program speedup vs. core count; the only measured value is at 4 cores.]
- Serial region: 1.29% of program time
- A serial-region time of 1.29% agrees with the Amdahl's Law prediction of 3.8x speedup on 4 cores
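For reference, Amdahl's Law with a serial fraction of s = 0.0129 on four cores gives roughly the quoted figure:

\[
S(4) = \frac{1}{s + (1-s)/4} = \frac{1}{0.0129 + 0.9871/4} \approx 3.85
\]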

About the Charts
Each chart point shows three values, measured or estimated: the maximum, the average (marked with an X), and the minimum.

Example Tool Detail for a Worksharing Loop with Load Imbalance
[Chart: maximum, average, and minimum estimated speedup vs. core count; the only measured value is at 4 cores.]
- Load imbalance; no significant overhead compared to compute time

How to Measure Utilization on Solaris
- Use the cputimes script from the DTrace Toolkit (http://opensolaris.org, Community Group DTrace)
- It measures all cycles for the duration of the script: user threads, daemons, kernel, and idle time
- Utilization: the percentage of non-idle cycles during the run
- Sanity check: daemons & kernel are a small fraction
Measurement system:
- Sun Ultra 27, Intel quad-core Xeon W3570 processor (3.2 GHz)
- Processor threading turned off (one thread per core)
- Solaris 10
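In these terms, utilization over the measurement interval is simply the fraction of cycles that are not idle:

\[
\text{Utilization} = \frac{\text{total cycles} - \text{idle cycles}}{\text{total cycles}} \times 100\%
\]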

Example Program: Time Share Scheduling Class

Example Program with Parallelized Fill Loop, n=5000
Both regions should scale very well. The schedule(runtime) clause means the actual schedule (static, dynamic, or guided) is chosen at run time via the OMP_SCHEDULE environment variable. Source line numbers are shown because later slides refer to the loops at lines 38 and 91.

   22 !$omp parallel shared(a,b,c,n,p) private(i,j,k,z)
      ...
   38 !$omp do schedule(runtime)
   39    do k = 1, n
   40      do i = 1, n
   41        do j = 1, n
   42          c(j,k) = c(j,k) + a(j,i) * b(i,k) * z + p
   43        end do
   44      end do
   45    end do
   46 !$omp end do
      ...
   71 !$omp end parallel

   91 !$omp parallel do shared(a,pi,n) private(j,i) schedule(runtime)
   92    do j = 1, n
   93      do i = 1, n
   94        a(i,j) = sqrt(2.0d0/(n+1))*sin(i*j*pi/(n+1))
   95      end do
   96    end do
   97 !$omp end parallel do

Estimated Speedups: Static Work Distribution
[Charts: estimated speedup vs. core count for the parallel loops at lines 38 and 91.]
Why the poor scalability on the second region?

Four-Thread & Single-Thread Processes: Time-Share Scheduler View
[Diagram: snapshot of the per-core run queues by Solaris scheduling priority (0-60); the four threads of process P1 and the single thread of process P0 are spread across the four cores, so one core holds both a P1 thread and P0's thread.]
- Process 0: single-threaded (infinite loop)
- Process 1: 4 threads, static schedule
- The two processes share one core
Why the poor scalability on the second loop?

OpenMP Static Work Distribution & the TS Scheduler
Four threads sharing four cores with one other process, static schedule
- Solaris time slices: 20 msec for high priority, 200 msec for low priority
- Loop 38 work: >16 sec; Loop 91 work: ~57 msec
- Time Share forces compute-intensive threads to low priority
- Result: load imbalance for work chunks smaller than the time slice
[Timeline diagram (x-axis: time slices 0-6) of the four threads and the other process: Loop 38 dedicated reaches 4x speedup; Loop 38 shared reaches 3.2x; Loop 91 dedicated reaches 4x; Loop 91 shared reaches only 2x, with a gap of poor utilization.]
Does this really happen?
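A rough back-of-the-envelope with the numbers above shows why loop 91 is the problem case (assuming its ~57 msec of work is split evenly across the four threads by the static distribution):

\[
\frac{57\ \text{msec}}{4\ \text{threads}} \approx 14\ \text{msec per chunk} \ll 200\ \text{msec low-priority time slice}
\]

So a thread that loses its core for even part of one time slice stalls the loop's closing barrier long after the other three threads have finished their small chunks.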

TS Experiment: Fill Loop of Line 91
The loop is isolated and run with 4 threads alongside a single-threaded process. Expect 3.2x speedup (4/5ths of 4 cores).
[Chart: measured speedups for the dynamic & guided schedules and for the static schedule, with the 20 ms and 200 ms time-slice cases marked.]
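The 3.2x expectation is just fair sharing: five compute-bound threads (the four OpenMP threads plus the single-threaded process) share four cores, so each gets 4/5 of a core and the four-thread job gets

\[
4 \times \frac{4}{5} = 3.2\ \text{cores' worth of cycles.}
\]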

First Insight
Any statically scheduled work-sharing construct can suffer from load imbalance if the time to execute each thread's work is small compared to the time quantum.
- How would a programmer identify this situation without measurement?
- What other problematic scenarios occur?
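For comparison, the mitigation the conclusions later point to is a dynamic schedule with a well-chosen chunk size. A minimal sketch based on the fill loop of line 91 (the chunk size of 64 is illustrative, not a value from the original experiments):

   ! Sketch only: the same fill loop with an explicit dynamic schedule and an
   ! illustrative chunk size of 64 iterations of the outer loop.
   !$omp parallel do shared(a,pi,n) private(j,i) schedule(dynamic,64)
         do j = 1, n
            do i = 1, n
               a(i,j) = sqrt(2.0d0/(n+1))*sin(i*j*pi/(n+1))
            end do
         end do
   !$omp end parallel do

Because the original loops use schedule(runtime), the same schedules can also be selected without recompiling by setting the OMP_SCHEDULE environment variable (for example, OMP_SCHEDULE="dynamic,64").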

Time Share System Utilization Results

TS System Utilization: Isolated Loop 91
- Utilization measured over 60 seconds
- Dynamic & guided utilization: 99.9%
- Static utilization: 86%. That's a 14% loss in utilization!
- Measured overhead: DTrace command, 0.0038% of processor cycles; kernel & daemons, 0.04%

Example Program: Fixed Priority Scheduling Class

Solaris Fixed-Priority (FX) Scheduling Class: The Scheduler Never Adjusts Thread Priority
[Diagram: per-core run queue snapshot by Solaris scheduling priority; daemons such as nfs4cbd, sshd, and iscsid sit near the top of the TS range, while high-, medium-, and low-FX-priority processes sit at priorities 2, 1, and 0.]
- Daemons run in TS, as needed
- The highest-priority FX threads run dedicated
- Lower-priority threads get the unused cycles
- Example: 2 threads at priority 2; 4 threads at priorities 0 and 1

Four-Thread & Single-Thread Processes: Fixed-Priority Class Scheduler View
[Diagram: per-core run queue snapshot; P0 runs at FX priority 2 on one core, while P1's four threads T0-T3 run at FX priority 0 across the four cores.]
- Process P0: Fixed Priority 2 (infinite loop)
- Process P1: four threads at Fixed Priority 0
- Two P1 threads share one core

Four-Thread & Single-Thread Processes: Fixed-Priority Class Scheduler View (continued)
[Diagram: the same snapshot, but with one of P1's threads blocked, so each runnable P1 thread has a core to itself.]
- Process P0: Fixed Priority 2 (infinite loop)
- Process P1: four threads at Fixed Priority 0
- Two P1 threads share one core, until one thread is sometimes blocked

OpenMP Dynamic/Guided Schedules & the Fixed-Priority Scheduler
Four low-priority threads & one high-priority process sharing 4 cores
- The high-priority single-threaded process consumes one core
- The four threads share the other three cores until one is blocked
- Static, guided: potentially large remaining work
- Dynamic: controlled, smaller work chunks
[Timeline diagram (x-axis: time slices 0-6) of the four threads and the other process: Loop 38 dedicated reaches 4x speedup; Loop 38 static reaches 2x speedup at 74.6% utilization; Loop 38 guided reaches 88% utilization, with speedup depending on the size of the early chunks; Loop 38 dynamic reaches 99.9% utilization, with speedup depending on the chunk size.]

FX Priority Class Experiment: Static Schedule
Four threads run on three cores; the high-priority process runs on one core.
[Chart: measured speedup; the 'static' prediction assumes infinite time slices, i.e. no round-robin scheduling.]

FX Priority Class Experiment: Guided & Dynamic Schedules
Four threads run on three cores; the high-priority process runs on one core.
[Chart: measured speedups; the 'guided' prediction is shown without round-robin scheduling.]

Conclusions: OpenMP Throughput
- The OpenMP throughput philosophy: efficiently use any available processor cycles
- The tool helps the OpenMP programmer improve scalability
- The tool has uncovered subtle utilization bottlenecks
- The tool can help the programmer improve utilization in:
  - the Time Share scheduling class
  - the Fixed Priority scheduling class, with prioritized control
- A dynamic schedule (or tasking) is best for utilization, if you choose a good chunk size; the tool's measurements enable a good choice

Current Limitations & Future Work
- Nested parallel regions: it is not clear how to interpret speedup; tasking is probably the superior approach
- Collecting traces:
  - Dropped trace data: a high rate of calls means the application probably won't scale anyway
  - Very large traces: add an option to restrict repeated regions; compress
- Threaded & NUMA systems:
  - Hardware characteristics: leverage existing tools
  - The tool enables differentiation between OpenMP and hardware bottlenecks
  - Add dedicated scaling analysis
- Future work: larger configurations (in progress); testing on a broad range of applications


Mark Woodyard
Principal Software Engineer
mark.woodyard@oracle.com