TDDD56 Multicore and GPU computing Lab 1: Load balancing



Similar documents
Get an Easy Performance Boost Even with Unthreaded Apps. with Intel Parallel Studio XE for Windows*

Hierarchical Clustering Analysis

Debugging with TotalView

Visualization of 2D Domains

INTEL PARALLEL STUDIO EVALUATION GUIDE. Intel Cilk Plus: A Simple Path to Parallelism

Using MATLAB to Measure the Diameter of an Object within an Image

Podium View TM 2.0 Visual Presenter Image Software User Manual - English (WINDOWS)

INTEL PARALLEL STUDIO XE EVALUATION GUIDE

DeCyder Extended Data Analysis module Version 1.0

STATGRAPHICS Online. Statistical Analysis and Data Visualization System. Revised 6/21/2012. Copyright 2012 by StatPoint Technologies, Inc.

MONITORING PERFORMANCE IN WINDOWS 7

Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach

Abstract. For notes detailing the changes in each release, see the MySQL for Excel Release Notes. For legal information, see the Legal Notices.

The following is an overview of lessons included in the tutorial.

Real-time processing the basis for PC Control

ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS

SuperViz: An Interactive Visualization of Super-Peer P2P Network

Performance test report

AXIS Camera Station Quick Installation Guide

CSE Case Study: Optimising the CFD code DG-DES

GAMBIT Demo Tutorial

Building an Advanced Invariant Real-Time Human Tracking System

Wireless Device Low-Level API Specification

Designing a Schematic and Layout in PCB Artist

NetBeans Profiler is an

Assignment 2: Animated Transitions Due: Oct 12 Mon, 11:59pm, 2015 (midnight)

Model Simulation in Rational Software Architect: Business Process Simulation

Testing Database Performance with HelperCore on Multi-Core Processors

Lab 0: Preparing your laptop for the course OS X

Inside the Erlang VM

HTML Code Generator V 1.0 For Simatic IT Modules CP IT, IT, IT

Excel 2007 Basic knowledge

RingCentral for Desktop. UK User Guide

PowerWorld Simulator

Visualizing Data: Scalable Interactivity

Smoke Density Monitor application documentation

3D Viewer. user's manual _2

Autodesk 3ds Max 2010 Boot Camp FAQ

ActiView. Visual Presenter Image Software User Manual - English

User Manual CROMLAWATCH. Data Logging Software. Version 1.6. For use with Color Sensors CROMLAVIEW CR100 CROMLAVIEW CR200 CROMLAVIEW CR210

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Speed Performance Improvement of Vehicle Blob Tracking System

ROBOTRACKER A SYSTEM FOR TRACKING MULTIPLE ROBOTS IN REAL TIME. by Alex Sirota, alex@elbrus.com

10 STEPS TO YOUR FIRST QNX PROGRAM. QUICKSTART GUIDE Second Edition

Best Practices for Dashboard Design with SAP BusinessObjects Design Studio

Parallel Programming for Multi-Core, Distributed Systems, and GPUs Exercises

2020 Design Update Release Notes November 10, 2015

EE8205: Embedded Computer System Electrical and Computer Engineering, Ryerson University. Multitasking ARM-Applications with uvision and RTX

Visualizing molecular simulations

WESTMORELAND COUNTY PUBLIC SCHOOLS Integrated Instructional Pacing Guide and Checklist Computer Math

Distributed Image Processing using Hadoop MapReduce framework. Binoy A Fernandez ( ) Sameer Kumar ( )

Revit products will use multiple cores for many tasks, using up to 16 cores for nearphotorealistic

CMS-DH CENTRAL MANAGEMENT SOFTWARE

3. Programming the STM32F4-Discovery

Applications. Network Application Performance Analysis. Laboratory. Objective. Overview

QAD Usability Customization Demo

MultiDSLA v4.8.1: Release Notes at 07 October 2013

TABKAM X1 Tablet camera system evolution of video microscope systems

i -CEN S USER S Manual

Scientific Visualization with ParaView

Network Licensing. White Paper 0-15Apr014ks(WP02_Network) Network Licensing with the CRYPTO-BOX. White Paper

Topics in Computer System Performance and Reliability: Storage Systems!

Summary of important mathematical operations and formulas (from first tutorial):

IGSS. Interactive Graphical SCADA System. Quick Start Guide

Lab #8: Introduction to ENVI (Environment for Visualizing Images) Image Processing

Ulf Hermann. October 8, 2014 / Qt Developer Days The Qt Company. Using The QML Profiler. Ulf Hermann. Why? What? How To. Examples.

Regintech Skype video conference box

DbSchema Tutorial with Introduction in SQL Databases

Some of the Choices. If you want to work on your own PC with a C++ compiler, rather than being logged in remotely to the PSU systems

System requirements for Autodesk Building Design Suite 2017

Microsoft Excel 2010 Part 3: Advanced Excel

Visualization with OpenDX

OpenDaylight Performance Stress Tests Report

Q N X S O F T W A R E D E V E L O P M E N T P L A T F O R M v Steps to Developing a QNX Program Quickstart Guide

Unit 4 i5/os Work Management

<User s Guide> Plus Viewer. monitoring. Web

PERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0

The VB development environment

Law Conferencing uses the Webinterpoint 8.2 web conferencing platform. This service is completely reservationless and available 24/7.

Development of Monitoring and Analysis Tools for the Huawei Cloud Storage

Microsoft SQL Server Decision Support (DSS) Load Testing

Tutorial for Tracker and Supporting Software By David Chandler

stream.jw.org - Viewer User Guide Table of Contents

Voronoi Treemaps in D3

Education Solutions Development, Inc. APECS Navigation: Business Systems Getting Started Reference Guide

IV3Dm provides global settings which can be set prior to launching the application and are available through the device settings menu.

(Refer Slide Time: 00:01:16 min)

Using In-Memory Computing to Simplify Big Data Analytics

Student Manual. for Virtual Classroom (Big Blue Button)

Operating Systems Concepts: Chapter 7: Scheduling Strategies

Mikogo User Guide Linux Version

Web Based 3D Visualization for COMSOL Multiphysics

Pro/E Design Animation Tutorial*

Mathematics. What to expect Resources Study Strategies Helpful Preparation Tips Problem Solving Strategies and Hints Test taking strategies

Virtuozzo Virtualization SDK

Contents. 2. cttctx Performance Test Utility Server Side Plug-In Index All Rights Reserved.

AquaExcel at Wageningen University

Data Mining. SPSS Clementine Clementine Overview. Spring 2010 Instructor: Dr. Masoud Yaghini. Clementine

Florence School District #1

Transcription:

TDDD56 Multicore and GPU computing Lab 1: Load balancing Nicolas Melot nicolas.melot@liu.se November 20, 2015 1 Introduction This laboratory aims to implement a parallel algorithm to generates a visual representation of the Mandelbrot set similar to Fig. 1. The algorithm we use in this lab work to compute such picture computes each pixel independently. Since this would allow to process all pixels in one parallel step, it is called an embarassingly parallel algorithm. However, a processor has typically much less computing units (cores) available than pixels. Thus, the entire picture to be computed must be shared among all cores and each core can sequentially compute its part, while other cores also compute theirs at the same time. Although the algorithm is embarassingly parallel, one must take great care when distributing the work, as bad work partitioning can produce parts harder to compute than others. If all cores receive one part and one part is significantly longer to compute, then one core will takes more time to run, delaying the completion of the overall algorithm and leaving other unused. In contrast, if all cores work for the same amount of time on their part, then available cores are exploited optimally and the runtime of the algorithm is further reduced. The work in this lab consists in three parts. First we study the algorithm we use in order to implement a first parallel version with a naive work partition. Second, we measure and analyse the performance of this first implementation, and investigate where and why the computation can be slowed down. The last step consists in programming a smarter partitionning strategy in order to overcome the difficulties identified in the second step, and measure the performance improvements. Figure 1: A representation of the Mandelbrot set (black area) in the range 2 C re 0.6 and 1 C im 1. The pixels in the black area require the full iteration number MAXIT ER+1. 1

2 Computing 2D representations of the Mandelbrot set The Mandelbrot set is defined as the set of complex values c, for which the norm of the Julia sequence a n shown in Eq. 1, starting with z = (0,0), is bounded to b R + with b > 0. We can represent the 2D space of a subset of C with a 2D picture. Let us define the bounds of this subset with Cmin im and Cim max respectively for the minimum and maximum multiples of i and Cmin re and Cre max for the real counterpart. The set we want to represent is then C ([Cmin im,cim max] [Cmin re,cre max]); we call it P. We can attribute each pixels of a 2 dimensional picture, to exactly one point p of P. a 0 a n+1 = z = a 2 n+ c, n x N:n 0} (1) In order to give each pixel a color, we need to determine if its corresponding point p is included in the Mandelbrot set. The algorithm depict in Fig. 2 computes the Julia sequence starting at z=(0,0) and returns the number of iterations n it ran before detecting the sequence diverges. If n is below a maximum constant MAXIT ER, then we consider p is not included in the Mandelbrot set and we give it a color depending on p; otherwise, the pixel is colored in black. Note that this algorithm is an approximation: if the Julia sequence with c= p diverges very slowly, then a constant number of iteration may not be enough to generate a divergence greater than MAXDIV. In this case, p is assumed to be part of the Mandelbrot set, whereas it is not. On the other hand, if MAXIT ER is too high or if MAXDIV is too low, then the algorithm may detect a divergence where the sequence only varies toward a convergence. Then p would not be accounted as part of the Mandelbrot set even if it should be. for a given value of MAXDIV, a high value for MAXIT ER makes the decision more reliable, but it also makes the cores to run more iterations before giving-up and consider p as part of the Mandelbrot set. 3 Getting started 3.1 Installation Fetch the lab 1 skeleton source files from the CPU lab page 1 and extract them to your personal storage space. Students in group A (Southfork): install the R packages ggplot2 and plyr by downloading and running this script 2. Load modules to set up your environment. In Southfork (group A): type the following in a terminal after each login nicme26@astmatix:~$ setenv MODULEPATH /edu/nicme26/tddd56:$modulepath; module add tddd56 In Konrad Zuse (groups B and C): type the following in a terminal nicme26@astmatix:~$ module add ~TDDD56/module nicme26@astmatix:~$ module initadd ~TDDD56/module 3.2 Source file skeleton This section describes how to use and compile the skeleton source fiels provided and introduce a few key points you need to know. 1 http://www.ida.liu.se/~nicme26/tddd56.en.shtml 2 http://www.ida.liu.se/~nicme26/tddd56/southfork_setup 2

i n t i s _ i n _ M a n d e l b r o t ( f l o a t Cre, f l o a t Cim ) i n t i t e r ; f l o a t x = 0. 0, y = 0. 0, xto2 = 0. 0, yto2 = 0. 0, d i s t 2 ; f o r ( i t e r = 0 ; i t e r <= MAXITER; i t e r ++) y = x * y ; y = y + y + Cim ; x = xto2 yto2 + Cre ; xto2 = x * x ; yto2 = y * y ; d i s t 2 = xto2 + yto2 ; } i f ( ( i n t ) d i s t 2 >= MAXDIV) break ; / / c o n v e r g e s t o i n f i n i t y } } return i t e r ; Figure 2: A C function to decide whether the complex number belongs to the Mandelbrot set, returning the number of iterations necessary to take the decision. 3.2.1 Compiling and running The skeleton can be compiled in three different modes you can enable or disable by passing variables when calling make: Debugging: This mode makes the executable to compute one picture and store it in mandelbrot.ppm. You can use this mode and tweak other variables to make sure the algorithm runs, terminates and produces the expected result. Just call make without passing any variable, or only variables to tweak the algorithm. Measuring: This makes the executable to compute one picture and quit without saving it. It also makes the program to check time (in seconds and nanoseconds) before and after computating, and display these values on the terminal after completion and before exiting. This mode is used by batch and plotting scripts to generate many values, analyse and plot them. You can also use these numbers if you want to interpret them yourself. Call make with the variable MEASURE set to any value (make MEASURE=1. Showing off: The executable runs an OpenGl anymation where a camera zooms and unzoom, jumping randomly from a pre-defined point to another. You can press h in the main window to get some help on how to control it. Press b to stop jumping and use the mouse to browse it yourself: left-click to slide in any direction, right-click plus moving the mouse up to zoom in and right-click plus move down to unzoom. Call make with the variable GLUT set to 1 (make GLUT=1); note that the variable MEASURE preempts GLUT. 3

Make can also take other useful options through other variables passing: NB_THREADS: Call make with NB_T HREADS=n make... NB_THREADS=n where n is a non-negative integer. It instructs make to compile a multithreaded version running n threads. If n=0 then it compiles a sequential version. If n=1, then it compile a parallel version in which only one thread runs. LOADBALANCE: Call make with LOADBALANCE = n make... LOADBALANCE=n where n [0,1,2]. It selects and compile one load-balancing method among no loadbalancing (0), your load-balancing method (1) and an optional additional load-balancing method (2). Browse the file Makefile to find out more variables you can use to tweak your algorithm. You can modify the maximum amount of iterations, you can move P inc, you can change the size of the picture to generate and you can provide another color to represent points in the Mandelbrot set (black by default). 3.2.2 Structure The skeleton provides an sequential implementation of the algorithm described in sec.2. It is divided in the four source files mandelbrot_main.c, mandelbrot.c, ppm.c and gl_mandelbrot.c with their associated header files. mandelbrot_main.c The program starts here. Depending on the options you passed to make, it computes and maybe stores a picture, or it starts the opengl engine. mandelbrot.c This is the only file you need to modify. At initialization time, it spawns by itself all threads you instructed to use at compile time. Whatever running mode, whatever amount of threads you use, computation is always started by a call to compute_mandelbrot(...). Depending on the number of threads, this runs directly sequential_mandelbrot(...) or it releases all threads which, each of them individually, run parallel_mandelbrot(...). The implementation uses function is_in_mandelbrot(...) to check if a complex number is in the mandelbrot set and compute_chunk() to compute the whole picture or a region of it, depending on the values of parameters parameters > begin_h and parameters > end_h for height (respectively begin and end) and parameters > begin_w and parameters > end_w. By default, one call to parallel_mandelbrot(...) computes nothing. The function admits as parameters args and param. args id gives the thread id, from 0 to NB_T HREADS 1. The parameter param points to a structure from which you can get the dimension of the picture to compute (height, width), the maximum amount of iterations (maxiter), the color for the Mandelbrot set (mandelbrot_color, black by default), the bounds defining the subset P ofcmatching the picture (lower_r, upper_r, lower_i, upper_i, respectivey for C re C re max, Cmin im and Cim max) and a pointer to the ppm data structure holding all pixels (see ppm.h). Note that the function init_round(...) is garanteed to run by each thread and return before any thread begin to run parallel_mandelbrot(...). It is a preferred place to implement initializations for any shared resource your threads may use. ppm.c This file holds all functions required to manipulate ppm files. You need to store pixels in the args picture, using ppm_write(args picture, x, y, color) to give a color to pixel of coordinates (x,y) and where color is an instance of the structure color, a triplet of three values for red, blue and green intensities (see ppm.h). You can pick a color in the global variable color and extract its RGB components using shifts (<< 0, 8or16) and bitwise AND. All this work is already implmented in compute_chunk(...) of mandelbrot.c. min, 4

gl_mandelbrot.c This file includes all necessary code to run the OpenGl animation. You don t need to browse this code for the lab work. 4 Before the lab session Before coming to the lab, we recommend you do the following preparatory work Write a detailed explanation why computation load can be imbalanced and how it affects the global performance. Hint: What is necessary to compute a black pixel, as opposed to a colored pixel? Describe a load-balancing method that would help reducing the performance loss due to load-imbalance. Hint: Observe that the load-balancing method must be valid for āny picture computed, not only the default picture. Implement all the algorithms required in sections 5 and 6 ahead of the lab session scheduled, so you can measure their performance during lab sessions. Use the preprocessor symbol LOADBALANCE to take decision to use either no load-balancing or one or several load-balancing methods (the helper scripts assume a value of 0 for no load-balacing and 1 and 2 for two different load-balancing methods). This value is known at compile time and can be handled using preprocessor instructions such as #if-#then-#else; see the skeleton for example. 5 During the lab session Take profit of the exclusive access you have to the computer you use in the lab session to perform the following tasks Measure the performance of the naive parallel implementation (naive partitioning shown in Fig. 4 and generate a graph showing execution time as a function of number of threads involved. Hint: You will observe more easily the load-imbalance effects if you generate only pictures of the Mandelbrot set in the range C re [ 2;+0.6] and C im [ 1;+1]. We suggest to generate a picture of 500 375 pixels, using a maximum of 256 iterations. You are encouraged to change these parameters if this helps you to have a better understanding of the problem or to find a solution. Measure the performance of the load-balanced variant and produce a graph featuring both load-balanced and load-imbalanced global execution time curves. 6 Lab demo Demonstrate to your lab assistant the following elements: 1. Show the performance (execution time) as a function of number of threads measured on the naive parallel algorithm, through a clear diagram. 2. Explain the reason why some threads get more work than others 3. Explain the load-balancing strategy you implemented and argue why it helps improving performance 4. Show the performance of your load-balanced version and compare it to the naive parallel algorithm. Explain the reason of performance differences. 5

400 Global computation time for unbalanced threads 1400 Unbalanced Time in milliseconds 1200 1000 800 600 0 1 2 3 4 5 6 7 8 Number of threads employed Figure 3: Performance of an unbalanced implementation. The computation time does not lower accordingly with the increasing number of threads. Figure 4: A naive partition for the computation of the representation of a Mandelbrot subset. Every thread receives an equally big subarea of the global picture to compute. 7 Helpers The given source files are instrumented to measure the global and per thread execution time. Install Freja (see Sec. 3) and Run freja compile to compile several suggested variants, then freja run <name for experiment> to run them and measure their performance; note that you must give a name to your experiment. Finally, run the script drawplots.r (./drawplots.r) to generate graphs showing the behavior of your implementation. Observe that during performance measurement, the compile scripts define the symbol LOADBALANCE with values 0, 1 or 2. These values are meaningless if you don t use them is your source code, but they are intended to trigger alternative load-balancing methods (0 stands for the naive parallel implementation with no load-balancing). The file variables defines experiments where the number of threads varies from 0 to 6 and where 3 different load-balancing methods are tried: no load-balancing, your loadbalancing and an optional extra load-balancing method. Modify the files compile, run, variables and Makefile at will to fit further experiments you may need. You can read the documentation about Freja and at the page http://www.ida.liu.se/~nicme26/pub/freja/. Alternatively, feel free to use any other mean of measuring performance but in any case, make sure you can explain the measurement process and the numbers you show. 8 Investigate further You can investigate further the load-imbalance issue when computing the Mandelbrot set. We suggest the following tasks Training for exam: Provide a performance analysis of the sequential Mandelbrot generator, using the big O notation. Analyze the naive parallel version and give speedup and efficiency. Training for exam: Analyze the load-balanced parallel variant using the big O notation. Compare with the naive parallel version analysis. 6

Just for fun: Update the function update_colors (mandelbrot.c, line 210) to generate other fancy colors for the fractal. 9 Lab rooms During the laboratory sessions, you have priority in rooms Southfork and Konrad Suze in the B building. See below for information about these rooms: 9.1 Konrad Suze (IDA) Intel Xeon X5660 3 6 cores 2.80GHz 6 GiB RAM OS: Debian Squeeze During lab sessions, you are guaranteed to be alone using one computer at a time. Note that Konrad Suze is not accessible outside the lab sessions; you can nevertheless connect to one computer through ssh at ssh <ida_student_id>@li21-<1..8>.ida.liu.se using your IDA student id. 9.2 Southfork (ISY) Intel Core 2 Quad CPU Q9550 4 4 cores 2.80GHz 4 GiB RAM OS: CentOS 6 i386 Southfork is open and accessible from 8:00 to 17:00 every day, except when other courses are taught. You can also remotely connect to ssh <isy_student_id>ixtab.edu.isy.liu.se, using your ISY student id. 3 http://ark.intel.com/products/47921/intel-xeon-processor-x5660-(12m-cache-2_80-ghz-6_40-gts-intel- QPI) 4 http://ark.intel.com/products/33924/intel-core2-quad-processor-q9550-(12m-cache-2_83-ghz-1333- MHz-FSB) 7