An Introduction to Parallel Computing/ Programming
1 An Introduction to Parallel Computing/Programming Vicky Papadopoulou Lesta, Astrophysics and High Performance Computing Research Group, Dept. of Computer Science and Engineering, European University. The presentation is based on material from the book: An Introduction to Parallel Programming, Peter Pacheco 1
2 Topic 1: Why Parallel Computing? 2
3 Why we need ever-increasing performance.. 3
4 Climate modeling 4
5 Protein folding Misfolded proteins are related to diseases like Parkinson's and Alzheimer's. 5
6 Drug discovery 6
7 Energy research 7
8 Big Data processing! 8
9 Past: Serial hardware and software input programs output Computer runs one program at a time. 9
10 Past : Computer architecture: The von Neumann Architecture 10
11 Increasing single processor performance For many years, microprocessor performance increased by an average of 50% per year, driven by the increasing density of transistors. More recently, the increase has dropped to about 20% per year. 11
12 Increasing density of transistors A circuit board with a 4 x 4 array of SyNAPSE-developed chips, each chip using 5.4 billion transistors. 12
13 Heating Problems.. Smaller transistors = faster processors. Faster processors = increased power consumption. Increased power consumption = increased heat. Increased heat = unreliable processors. 13
14 Solution Use multicore processors: place multiple processing units (CPUs), called cores, on a single chip. Pictured: an Intel Core 2 Duo E6750 dual-core processor. Introducing parallelism!!! 14
15 Now it's up to the programmers Adding more processors doesn't help much if programmers aren't aware of them or don't know how to use them. Serial programs don't benefit from this approach (in most cases). 15
16 How to write parallel programs Running multiple instances of a serial program often isn't very useful. Think of running multiple instances of your favorite game. What you really want is for it to run faster. 16
17 Approaches to the serial problem Rewrite serial programs so that they're parallel. Or write translation programs that automatically convert serial programs into parallel programs. This is very difficult to do, and success has been limited. 17
18 Example Compute n values and add them together. Serial solution: 18
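The serial code itself is not reproduced in this transcription; following the book's presentation, it is essentially a single accumulation loop. In this sketch, Compute_next_value stands for whatever computation produces each value, and passing it the index i is just an assumption made for illustration:

sum = 0;
for (i = 0; i < n; i++) {
    x = Compute_next_value(i);  /* produce the next value       */
    sum += x;                   /* accumulate it into the total */
}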
19 Example (cont.) We have p cores, p much smaller than n. Each core performs a partial sum of approximately n/p values. Each core uses its own private variables and executes this block of code (sketched below) independently of the other cores. 19
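A sketch of the block each core might execute, assuming a simple block partition in which n is divisible by p, and that my_rank, n and p are already available to each core (Compute_next_value is the same placeholder as before):

my_first_i = my_rank * (n / p);       /* first index handled by this core */
my_last_i  = my_first_i + (n / p);    /* one past the last index          */
my_sum = 0;
for (my_i = my_first_i; my_i < my_last_i; my_i++) {
    my_x = Compute_next_value(my_i);  /* produce the next value           */
    my_sum += my_x;                   /* accumulate into the private sum  */
}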
20 Example (cont.) After each core completes execution of the code, a private variable my_sum contains the sum of the values computed by its calls to Compute_next_value. Ex., 8 cores, n = 24, then the calls to Compute_next_value return: 1,4,3, 9,2,8, 5,1,1, 6,2,7, 2,5,0, 4,1,8, 6,5,1, 2,3,9 20
21 Example (cont.) Once all the cores are done computing their private my_sum, they form a global sum by sending their results to a designated master core, which adds them to produce the final result. 21
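A sketch of this naive global sum, written with the same generic Send/Receive calls that appear on the message-passing slide later in the talk; MSG_INT, Send and Receive are illustrative placeholders, not a real API:

if (my_rank == 0) {                        /* the designated master core    */
    sum = my_sum;
    for (core = 1; core < p; core++) {
        Receive(&value, MSG_INT, 1, core); /* receive one partial sum       */
        sum += value;                      /* add it to the running total   */
    }
} else {
    Send(&my_sum, MSG_INT, 1, 0);          /* send my partial sum to core 0 */
}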
22 Example (cont.) 22
23 Example (cont.) Core: 0 1 2 3 4 5 6 7 my_sum: 8 19 7 15 7 13 12 14 Global sum = 8 + 19 + 7 + 15 + 7 + 13 + 12 + 14 = 95 23
24 But wait! There's a much better way to compute the global sum. 24
25 Better parallel algorithm Don't make the master core do all the work. Share it among the other cores. Pair the cores so that core 0 adds its result with core 1's result, core 2 adds its result with core 3's result, etc. Work with odd and even numbered pairs of cores. 25
26 Better parallel algorithm (cont.) Repeat the process, now with only the evenly ranked cores. Core 0 adds the result from core 2, core 4 adds the result from core 6, etc. Then the cores whose ranks are divisible by 4 repeat the process, and so forth, until core 0 has the final result. 26
27 Multiple cores forming a global sum 27
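A sketch of this tree-structured global sum, using the same illustrative Send/Receive placeholders as above and assuming the number of cores p is a power of two:

sum = my_sum;
divisor = 2;             /* doubles each round                       */
core_difference = 1;     /* distance to my partner in this round     */
while (divisor <= p) {
    if (my_rank % divisor == 0) {               /* I receive this round   */
        Receive(&value, MSG_INT, 1, my_rank + core_difference);
        sum += value;
    } else if (my_rank % (divisor / 2) == 0) {  /* I send, then drop out  */
        Send(&sum, MSG_INT, 1, my_rank - core_difference);
        break;
    }
    divisor *= 2;
    core_difference *= 2;
}
/* When the loop finishes on core 0, sum holds the global total. */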
28 Analysis In the first example, the master core performs 7 receives and 7 additions. In the second example, the master core performs 3 receives and 3 additions. The improvement is more than a factor of 2! 28
29 Analysis (cont.) The difference is more dramatic with a larger number of cores. If we have 1000 cores: The first example would require the master to perform 999 receives and 999 additions. The second example would only require 10 receives and 10 additions. That's an improvement of almost a factor of 100! 29
30 How do we write parallel programs? Data parallelism: partition the data used in solving the problem among the cores; each core carries out similar operations on its part of the data. Task parallelism: partition the various tasks carried out in solving the problem among the cores. 30
31 Data Parallelism Definition: each process does the same work on unique and independent pieces of data. Examples: 8 farmers mow a lawn; 2 farmers paint a storage area. Usually more scalable than functional (task) parallelism. Can be programmed at a high level with OpenMP. 31
32 Task Parallelism Definition: each processor executes a different process/thread (same or different code) on the same or different data. Usually each processor executes a different process or an independent program. Processes communicate with one another as they work, passing data from one process/thread to the next. More suitable for distributed computation. Examples: independent Monte Carlo simulations, ATM transactions. 32
33 Example follows.. 33
34 Professor P 15 questions 300 exams 34
35 Professor P s grading assistants TA#1 TA#2 TA#3 35
36 Division of work data parallelism TA#1: 100 exams, TA#2: 100 exams, TA#3: 100 exams 36
37 Division of work task parallelism TA#1: Questions 1-5, TA#2: Questions 6-10, TA#3: Questions 11-15 37
38 Division of work data parallelism 38
39 Division of work task parallelism Tasks 1) Receiving 2) Addition 39
40 Terminology (1/3) Serial code is a single thread of execution working on a single data item at any one time. Parallel code has more than one thing happening at a time. This could be: multiple threads of execution in a single executable working on different data; multiple executables (processes) all working on the same problem (in one or more programs); or any combination of the above. A task is a program or a function. Each task has its own virtual address space and may have multiple threads. 40
41 Terminology (2/3) Traditional CPU: a single central processing unit (CPU) on a chip. Multi-core processor/socket: a single chip containing two or more CPUs, called "cores". A node may have multiple cores. 41
42 Terminology (3/3) Node: an actual physical, self-contained computer unit that has its own processors, memory, I/O bus and storage. 42
43 Coordination Cores usually need to coordinate their work. Communication: one or more cores send their current partial sums to another core. Load balancing: share the work evenly among the cores so that no core is left much more heavily loaded than the others. Synchronization: because each core works at its own pace, make sure that some cores do not get too far ahead of the rest. 43
44 Type of parallel systems Shared-memory The cores can share access to the computer's memory. Coordinate the cores by having them examine and update shared memory locations. Distributed-memory Each core has its own, private memory. The cores must communicate explicitly by sending messages across a network. 44
45 Type of parallel systems Shared-memory Distributed-memory 45
46 Terminology Parallel computing: a single program in which multiple tasks cooperate closely to solve a problem. Distributed computing: many programs cooperate with each other to solve a problem. 46
47 Time for a Break? Or for some comments/ discussion up to now?! 47
48 Part 2 Parallel Hardware and Parallel Software 48
49 Past: Serial hardware and software input programs output Computer runs one program at a time. 49
50 The von Neumann Architecture 50
51 von Neumann bottleneck 51
52 MODIFICATIONS TO THE VON NEUMANN MODEL 52
53 A programmer can write code to exploit PARALLEL HARDWARE. 53
54 Flynn's Taxonomy SISD: single instruction stream, single data stream. SIMD: single instruction stream, multiple data streams. MISD: multiple instruction streams, single data stream. MIMD: multiple instruction streams, multiple data streams. 54
55 Flynn's Parallel Architecture Taxonomy (1966) SISD: single instruction, single data - traditional serial processing. MISD: rare - multiple instructions on a single data item, e.g., for fault tolerance. SIMD: single instruction on multiple data - some old architectures, with a resurgence in accelerators; vector processors (pipelining). MIMD: multiple instructions, multiple data - almost all parallel computers. 55
56 Terminology 56
57 Single Instruction, Single Data (SISD) A serial (non-parallel) computer Single Instruction: Only one instruction stream is being acted on by the CPU during any one clock cycle Single Data: Only one data stream is being used as input during any one clock cycle 57
58 SIMD (Single Instruction Multiple Data) Parallelism achieved by dividing data among the processors. Applies the same instruction to multiple data items. Called data parallelism. 58
59 SIMD example control unit n data items n ALUs x[1] x[2] x[n] ALU 1 ALU 2 ALU n for (i = 0; i < n; i++) x[i] += y[i]; 59
60 SIMD What if we don't have as many ALUs as data items? Divide the work and process iteratively. Ex. m = 4 ALUs and n = 15 data items. Round 1: ALUs 1-4 process X[0] X[1] X[2] X[3]; Round 2: X[4] X[5] X[6] X[7]; Round 3: X[8] X[9] X[10] X[11]; Round 4: X[12] X[13] X[14] (ALU 4 idle). 60
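In plain C, this round-by-round processing might be sketched as follows (x, y, n and m are assumed to be declared as in the example; on real SIMD hardware each inner chunk would be executed in lockstep across the ALUs rather than as a loop):

for (i = 0; i < n; i += m) {               /* one iteration per round       */
    chunk = (n - i < m) ? (n - i) : m;     /* the last round may be partial */
    for (j = 0; j < chunk; j++)            /* conceptually done in lockstep */
        x[i + j] += y[i + j];              /* idle ALUs do nothing          */
}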
61 SIMD drawbacks All ALUs are required to execute the same instruction, or remain idle. In classic design, they must also operate synchronously. The ALUs have no instruction storage. Efficient for large data parallel problems, but not other types of more complex parallel problems. 61
62 SIMD example: Vector processors Operate on arrays or vectors of data. Advantages: fast; easy to use; good vectorizing compilers; high memory bandwidth. Disadvantages: don't handle irregular data structures well; limited scalability. 62
63 Not pure SIMD: Graphics Processing Units (GPU) Real-time graphics application programming interfaces, or APIs, use points, lines, and triangles to internally represent the surface of an object. 63
64 Graphics Processing Unit (GPU) A specialized electronic circuit with thousands of cores that are specifically designed to perform data-parallel computation. It processes multiple elements in the graphics stream. 64
65 GPUs: how they work Can rapidly manipulate and alter memory to accelerate the creation of images for output to a display. Each pixel is processed by a short program before it is projected onto the screen. Other applications: GPUs are used as vector processors for non-graphics applications that require repetitive computations. 65
66 Multiple Instruction, Single Data (MISD): Multiple Instruction: Each processing unit operates on the data independently via separate instruction streams. Single Data: A single data stream is fed into multiple processing units. Few (if any) actual examples of this class of parallel computer have ever existed. 66
67 MIMD (Multiple Instruction, Multiple Data) Multiple Instruction: every processor may be executing a different instruction stream. Multiple Data: every processor may be working with a different data stream. MIMD systems typically consist of a collection of fully independent processing units or cores, each of which has its own control unit and its own ALU. MIMD architecture types: shared memory, distributed memory, hybrid. (Pictured: Cray XT3.) 67
68 Shared Memory System (1/2) 68
69 Shared Memory Access (2/2) 69
70 Distributed Memory System 70
71 Interconnection Networks 71
72 PARALLEL SOFTWARE 72
73 The burden is on software In shared memory programs: start a single process and fork threads. Threads carry out tasks and share a common memory space. In distributed memory programs: start multiple processes. Processes carry out tasks; there is no shared memory space, so data is communicated between processes via message passing. 73
74 Approaches to Parallel Programming Shared memory programming: assumes a global address space; data is visible to all processes. Issue: synchronizing updates of shared data. Software tool: OpenMP. Distributed memory programming: assumes distributed address spaces; each process sees only its local data. Issue: communication of data to other processes. Software tool: MPI (Message Passing Interface). 74
75 Writing Parallel Programs 1. Divide the work among the processes/threads (a) so each process/thread gets roughly the same amount of work (b) and communication is minimized. double x[n], y[n]; for (i = 0; i < n; i++) x[i] += y[i]; 2. Arrange for the processes/threads to synchronize. 3. Arrange for communication among processes/threads. 75
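For the vector addition above, one common way to satisfy 1(a) is a block partition of the loop iterations, sketched here under the assumptions that my_rank and thread_count are available and that n is divisible by thread_count:

my_n     = n / thread_count;     /* iterations per thread             */
my_first = my_rank * my_n;       /* first index for this thread       */
my_last  = my_first + my_n;      /* one past this thread's last index */
for (i = my_first; i < my_last; i++)
    x[i] += y[i];                /* each thread updates its own block */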
76 Shared Memory: Nondeterminism... printf("Thread %d > my_val = %d\n", my_rank, my_val); ... (my_val is a private variable) Depending on how the threads are scheduled, the output may be: Thread 1 > my_val = 19 Thread 0 > my_val = 7 or: Thread 0 > my_val = 7 Thread 1 > my_val = 19 76
77 Shared Memory: Nondeterminism my_val = Compute_val(my_rank); x += my_val; When two threads execute this unprotected update of the shared variable x at about the same time: WRONG RESULT! 77
78 Shared Memory: Nondeterminism Race condition Critical section Mutually exclusive Mutual exclusion lock (mutex, or simply lock) my_val = Compute_val(my_rank); Lock(&add_my_val_lock); x += my_val; Unlock(&add_my_val_lock); 78
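For reference, the same pattern written with a real threading API (POSIX threads); this is an illustration added here, not the slide's own code:

#include <pthread.h>

int x = 0;                                    /* shared variable            */
pthread_mutex_t add_my_val_lock = PTHREAD_MUTEX_INITIALIZER;

void add_my_val(int my_val) {
    pthread_mutex_lock(&add_my_val_lock);     /* enter the critical section */
    x += my_val;                              /* protected update of x      */
    pthread_mutex_unlock(&add_my_val_lock);   /* leave the critical section */
}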
79 Shared Memory: busy-waiting my_val = Compute_val(my_rank); if (my_rank == 1) while (!ok_for_1); /* Busy wait loop (ok_for_1 is a shared variable) */ x += my_val; /* Critical section */ if (my_rank == 0) ok_for_1 = true; /* Let thread 1 update x */ 79
80 Distributed Memory: message-passing char message[100]; /* local variable */ ... my_rank = Get_rank(); if (my_rank == 1) { sprintf(message, "Greetings from process 1"); Send(message, MSG_CHAR, 100, 0); } else if (my_rank == 0) { Receive(message, MSG_CHAR, 100, 1); printf("Process 0 > Received: %s\n", message); } 80
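The Send/Receive calls above are generic placeholders. With real MPI, the same exchange could be written roughly as follows; the message tag (0) and the buffer size are choices made for this sketch:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    char message[100];
    int my_rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    if (my_rank == 1) {
        sprintf(message, "Greetings from process 1");
        MPI_Send(message, 100, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else if (my_rank == 0) {
        MPI_Recv(message, 100, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 0 > Received: %s\n", message);
    }

    MPI_Finalize();
    return 0;
}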
81 Input and Output In distributed memory programs, only process 0 will access stdin. In shared memory programs, only the master thread or thread 0 will access stdin. In both distributed memory and shared memory programs all the processes/threads can access stdout and stderr. 81
82 PERFORMANCE 82
83 Speedup Number of cores = p Serial run-time = T_serial Parallel run-time = T_parallel In the ideal case (linear speedup): T_parallel = T_serial / p 83
84 Speedup of a parallel program S = T_serial / T_parallel Speedup is at most p. 84
85 Efficiency of a parallel program E = S / p = (T_serial / T_parallel) / p = T_serial / (p x T_parallel) Efficiency is at most 1. 85
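As a quick numerical illustration of the two formulas (the timings here are invented for the example, not measured): with T_serial = 20 s, p = 4 and T_parallel = 6 s, the speedup is S = 20 / 6, about 3.3, and the efficiency is E = 3.3 / 4, about 0.83. The same computation in C:

double speedup(double t_serial, double t_parallel) {
    return t_serial / t_parallel;          /* S = T_serial / T_parallel */
}

double efficiency(double t_serial, double t_parallel, int p) {
    return t_serial / (p * t_parallel);    /* E = S / p                 */
}

/* e.g. speedup(20.0, 6.0) is about 3.33, efficiency(20.0, 6.0, 4) about 0.83 */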
86 Speedups and efficiencies of a parallel program Efficiency decreases as the number of processors increases. 86
87 Speedups and efficiencies of a parallel program on different problem sizes For a fixed p, efficiency increases as the size of the problem increases. 87
88 Speedup Speedup increases as the size of the problem increases. The gain in speedup from adding processors diminishes as the number of processors grows. 88
89 Efficiency Efficiency decreases as the number of processors increases, and it falls off faster for smaller problem sizes. 89
90 Effect of overhead T_parallel = T_serial / p + T_overhead 90
91 Amdahl's Law Amdahl's Law shows a theoretical upper limit for speedup: if a fraction of the program cannot be parallelized, that fraction bounds the speedup no matter how many cores are used. In reality, the situation is even worse than predicted by Amdahl's Law, due to: load balancing (waiting), scheduling (shared processors or memory), communications, I/O. 91
92 Example We can parallelize 90% of a serial program. Parallelization is perfect regardless of the number of cores p we use. T_serial = 20 seconds. Runtime of the parallelizable part is 0.9 x T_serial / p = 18 / p 92
93 Example (cont.) Runtime of the unparallelizable part is 0.1 x T_serial = 2 Overall parallel run-time is T_parallel = 0.9 x T_serial / p + 0.1 x T_serial = 18 / p + 2 93
94 Example (cont.) Speedup S = T_serial / (0.9 x T_serial / p + 0.1 x T_serial) = 20 / (18 / p + 2) As p grows, 18 / p shrinks toward 0, so the speedup can never exceed 20 / 2 = 10, no matter how many cores we use. 94
95 Scalability In general, a parallel program is scalable if it can handle ever-increasing problem sizes. If we can increase the number of processes/threads and keep the efficiency fixed without increasing the problem size, the program is strongly scalable. If we can keep the efficiency fixed by increasing the problem size at the same rate as we increase the number of processes/threads, the program is weakly scalable. 95
96 Shared Memory: OpenMP 96
97 Example: Sum of Squares 97
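The slide's code is not reproduced in the transcription; a minimal OpenMP sketch of a sum-of-squares computation might look like the following (the value of n and the use of a reduction clause are assumptions of this sketch, not necessarily the slide's exact code):

#include <stdio.h>

int main(void) {
    const long n = 1000000;
    double sum = 0.0;
    long i;

    /* Each thread accumulates a private partial sum; the reduction
       clause combines the partial sums safely at the end of the loop. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 1; i <= n; i++)
        sum += (double)i * (double)i;

    printf("Sum of squares 1..%ld = %.0f\n", n, sum);
    return 0;
}

Compiled with OpenMP support (e.g. -fopenmp), the loop runs across all available threads; without it, the pragma is ignored and the same code runs serially.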
98 OpenMP comments Plus: very simple to use; high-level parallel directives; no need to get into low-level (thread) programming. Drawbacks: you cannot use more than one node with OpenMP (because it relies on memory shared between threads); to do so you need to use hybrid (OpenMP/MPI) parallel programming. 98
99 Distributed Memory: MPI 99
100 MPI Features 100
101 MPI code: sum of squares 101
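Again the slide's code is not shown in the transcription; a sketch of the same sum-of-squares computation in MPI, using a block partition and MPI_Reduce (my formulation, not necessarily the slide's), might look like this:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    const long n = 1000000;
    int rank, size;
    double local = 0.0, total = 0.0;
    long chunk, first, last, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process sums the squares of its own block of 1..n;
       the last process also takes any leftover values. */
    chunk = n / size;
    first = (long)rank * chunk + 1;
    last  = (rank == size - 1) ? n : first + chunk - 1;
    for (i = first; i <= last; i++)
        local += (double)i * (double)i;

    /* Combine the partial sums on process 0. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Sum of squares 1..%ld = %.0f\n", n, total);

    MPI_Finalize();
    return 0;
}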
102 Example: Pi Calculation with OpenMP Serial code: Parallel code: 102
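The serial and parallel code for this slide are not reproduced in the transcription. A common formulation (assumed here, not necessarily the slide's) approximates pi by numerically integrating 4 / (1 + x^2) over [0, 1] with the midpoint rule; the parallel OpenMP version differs from the serial one only by the pragma:

#include <stdio.h>

int main(void) {
    const long num_steps = 10000000;
    const double step = 1.0 / (double)num_steps;
    double sum = 0.0, x;
    long i;

    /* Serial version: identical except that the pragma below is removed. */
    #pragma omp parallel for private(x) reduction(+:sum)
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;          /* midpoint of the i-th strip */
        sum += 4.0 / (1.0 + x * x);
    }

    printf("pi ~= %.10f\n", step * sum);
    return 0;
}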
103 High Performance Computing at EUC Astrophysics and High Performance Computing Research Group. Applying HPC for developing efficient solutions. Currently: in Astrophysics (HST image). Future plans: in galaxy evolution, in Health Sciences (medical imaging), and more. 103
104 AHPC Research Group Head: Prof. Andreas Efstathiou Members: Ass. Prof. Vicky Papadopoulou Researchers: Natalie Christopher, Andreas Papadopoulos PhD students: Elena Stylianou Undergraduate students: Michalis Kyprianou (Bachelor CS, EUC), Andreas Howes (Bachelor CS, EUC) 104
105 Parallel Processing in Python In another Colloquium's talk! 105