Parallel Programming Multicore systems


FYS3240 PC-based instrumentation and microcontrollers
Parallel Programming Multicore systems
Spring 2013 Lecture #9
Bekkeng, 17.3.2013

Introduction Until recently, innovations in processor technology resulted in computers with CPUs that operated at ever higher clock rates. However, as clock rates approach their theoretical physical limits, companies are instead developing new processors with multiple processing cores. The dynamic power consumed by a chip is P = ACV²f, where P is the consumed power, A is the active chip area, C is the switched capacitance, f is the clock frequency and V is the supply voltage. With these multicore processors, the best performance and highest throughput are achieved by using parallel programming techniques. This requires knowledge of, and suitable programming tools for, taking advantage of the multicore processors.
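A rough illustration of why frequency scaling became unattractive (numbers chosen for illustration, not from the lecture): raising f normally also requires raising V, so increasing both by 20% gives

    P_new / P_old = (V_new / V_old)² × (f_new / f_old) = 1.2² × 1.2 ≈ 1.73

i.e. about 73% more power for at most 20% more speed. Adding a second identical core instead roughly doubles A (and hence P) but can potentially double throughput, which is why multiple cores became the more attractive way to spend the power budget.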

Multiprocessors & Multicore Processors Multiprocessor systems contain multiple CPUs that are not on the same chip, whereas multicore processors contain two or more CPU cores on a single chip. A multiprocessor system has divided caches with long interconnects; the cores of a multicore processor share cache over short on-chip interconnects.

Hyper-threading Hyper-threading is a technology introduced by Intel with the primary purpose of improving support for multithreaded code. Under certain workloads, hyper-threading technology provides more efficient use of CPU resources by executing threads in parallel on a single processor. Hyper-threading works by duplicating certain sections of the processor. A hyper-threading-equipped processor core presents itself as two "logical" processors to the host operating system, allowing the operating system to schedule two threads or processes simultaneously. E.g. Pentium 4, Xeon, Core i5 and Core i7 processors implement hyper-threading.
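The split into logical processors is visible from software. A minimal sketch in Python (assuming the third-party psutil package is installed; the exact counts depend on the machine):

    import os
    import psutil  # third-party package, e.g. pip install psutil

    logical = os.cpu_count()                    # logical processors seen by the OS
    physical = psutil.cpu_count(logical=False)  # physical cores
    print(f"{physical} physical cores, {logical} logical processors")
    # On a hyper-threaded CPU, logical is typically 2 x physical.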

FPGAs FPGA = Field Programmable Gate Array. VHDL for programming FPGAs is part of FYS4220! An FPGA contains a huge number of programmable gates that can be configured into many parallel hardware paths. FPGAs are truly parallel in nature, so different processing operations do not have to compete for the same resources (there is no thread prioritization, as is typical in most common operating systems).

How LabVIEW Implements Multithreading Parallel code paths on a block diagram can execute in unique threads. LabVIEW automatically divides each application into multiple execution threads (a capability originally introduced in 1998 with LabVIEW 5.0).

LabVIEW Example: DAQ Two separate tasks that are not dependent on one another for data will run in parallel without the need for any extra programming.
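LabVIEW detects this parallelism automatically; in a textual language the same structure has to be written explicitly. A minimal Python sketch (task names and sleeps are illustrative stand-ins for I/O-bound DAQ work):

    import threading
    import time

    def acquisition_task():
        for i in range(3):
            time.sleep(0.1)                  # stands in for a DAQ read
            print("acquisition, sample", i)

    def logging_task():
        for i in range(3):
            time.sleep(0.1)                  # stands in for an independent logging step
            print("logging, entry", i)

    # Neither task depends on the other's data, so both can run concurrently.
    t1 = threading.Thread(target=acquisition_task)
    t2 = threading.Thread(target=logging_task)
    t1.start(); t2.start()
    t1.join(); t2.join()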

How LabVIEW Implements Multithreading II Automatic Multithreading using LabVIEW Execution System (Implicit Parallelism / Threading)

Multicore Programming Goals
- Increase code execution speed (number of FLOPS)
- Maintain the rate of execution but increase data throughput
- Evenly balance tasks across available CPUs (fair distribution of processing load)
- Dedicate time-critical tasks to a single CPU
In computing, FLOPS (floating-point operations per second) is a measure of a computer's performance.

Data Parallelism

Example: Data Parallelism in LabVIEW (Split, Multiply, Combine) A standard implementation of matrix multiplication in LabVIEW does not use data parallelism. With data parallelism, by splitting one input matrix in half, the operation can be computed simultaneously on two CPU cores and the partial results combined afterwards. The example uses 1000 x 1000 input matrices.
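The same split/multiply/combine idea in textual form, as a sketch in Python (assumes NumPy; because NumPy's BLAS-backed matmul releases the Python GIL, the two half-products can actually run on two cores):

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def parallel_matmul(A, B):
        top, bottom = np.vsplit(A, 2)    # Split: divide A into two row blocks
        with ThreadPoolExecutor(max_workers=2) as pool:
            f_top = pool.submit(np.matmul, top, B)     # Mult: the two half-products
            f_bot = pool.submit(np.matmul, bottom, B)  # are fully independent
            return np.vstack([f_top.result(), f_bot.result()])  # Combine

    A = np.random.rand(1000, 1000)
    B = np.random.rand(1000, 1000)
    C = parallel_matmul(A, B)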

Data Flow Many applications involve sequential, multistep algorithms

LabVIEW: Sequential data flow In the block diagram (from Simulate Signal through signal processing to file Write), the five tasks run in the same thread because they are connected sequentially.

Data Flow Parallelism - Pipelining Applying pipelining can increase performance by increasing throughput (the amount of data processed in a given time period). Pipelining strategy: divide the algorithm into sequential stages and execute the stages concurrently on different cores, each stage handing its result to the next.

Pipelining in LabVIEW

Producer-consumer loops and pipelining using queues Note: pipelined processing does introduce latency between input and output! In general, the CPU and data bus operate most efficiently when processing large blocks of data.
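A minimal textual sketch of the same pattern, with Python's queue.Queue standing in for the LabVIEW queue (the acquire/process bodies are illustrative placeholders):

    import queue
    import threading

    data_q = queue.Queue(maxsize=8)   # bounded: the producer blocks if the consumer falls behind

    def producer(n):
        for i in range(n):
            data_q.put(float(i))      # stands in for acquiring a block of data
        data_q.put(None)              # sentinel: signals end of data

    def consumer():
        while True:
            block = data_q.get()
            if block is None:
                break
            _ = block * 2.0           # stands in for processing the block

    p = threading.Thread(target=producer, args=(1000,))
    c = threading.Thread(target=consumer)
    p.start(); c.start()
    p.join(); c.join()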

Pipelining increases latency Pipelining increases throughput, but it also introduces additional latency. For example, with three stages of 1 ms each, a new result is produced every 1 ms, but each individual result takes 3 ms from input to output.

Producer-Consumer for DAQ

Producer-Consumer LabVIEW example

Multicore Programming Challenges
- Thread synchronization
- Race conditions (see the sketch below)
- Deadlocks
- Shared resources
- Data transfer between processor cores
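To make the race-condition entry concrete, a small Python sketch of an unprotected read-modify-write and its fix with a lock (the counter is an illustrative shared resource):

    import threading

    counter = 0
    lock = threading.Lock()

    def increment(n):
        global counter
        for _ in range(n):
            with lock:        # remove this lock and the read-modify-write races
                counter += 1

    threads = [threading.Thread(target=increment, args=(100000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)  # 400000 with the lock; often less without it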

Synchronization in LabVIEW Synchronization mechanisms in LabVIEW:
- Notifiers
- Queues
- Semaphores
- Rendezvous
- Occurrences
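For readers more at home in textual languages, rough analogues (my mapping, not one given in the lecture): queues map to queue.Queue, semaphores to threading.Semaphore, a rendezvous to threading.Barrier, and notifiers/occurrences loosely to threading.Event. For example, a two-thread rendezvous:

    import threading

    rendezvous = threading.Barrier(2)   # both threads must arrive before either proceeds

    def worker(name):
        print(name, "doing independent setup")
        rendezvous.wait()               # block here until the other thread arrives
        print(name, "passed the rendezvous")

    for name in ("A", "B"):
        threading.Thread(target=worker, args=(name,)).start()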

Data Transfer between cores The physical distance between processors and the quality of the processor interconnects can have a large effect on execution speed.

Unbalanced vs. balanced Parallel Tasks (figure: an unbalanced task distribution compared with a balanced one)

Pipelining and balancing To gain the largest possible performance increase from pipelining, the individual stages must be carefully balanced so that no single stage takes much longer to complete than the others. In addition, any data transfer between pipeline stages should be minimized to avoid decreased performance due to memory access from multiple cores. An uneven split is not optimal for pipelining: move tasks from Stage 1 to Stage 2 until both stages take approximately equal times to execute. For example, stages of 6 ms and 2 ms deliver one result per 6 ms; rebalanced to 4 ms each, the pipeline delivers one result per 4 ms.

Conclusions PC-based instrumentation benefits greatly from advances in multicore processor technology and improved data bus speeds. As new CPUs improve performance by adding multiple processing cores, parallel or pipelined processing structures are necessary to maximize CPU efficiency. You can achieve significant performance improvements by structuring algorithms to take advantage of parallel processing.