A Comparison of Shared Memory Parallel Programming Models
Jace A. Mogill and David Haglin
2. Parallel Programming Gap
- Not many innovations: memory semantics have been unchanged for over 50 years
- The 2010 multi-core x86 programming model is identical to the 1982 symmetric multiprocessor programming model
- Unwillingness to adopt new languages: users want to leverage existing investments in code and prefer to migrate to parallelism incrementally
- OpenMP/microtasking: data parallelism
- Pthreads: task parallelism
- Wishful thinking: mixed task and data parallelism
3. Expressing Synchronization and Parallelism
- Parallelism and synchronization are orthogonal axes:

                         Explicit parallelism    Implicit parallelism
    Code synchronization Pthreads / OpenMP       Vectorization
    Data synchronization MTA                     Dataflow

- Synchronization and parallelism primitives can be mixed:
  - OpenMP mixed with atomic memory operations
  - MTA synchronization mixed with OpenMP parallel loops
  - OpenMP mixed with Pthread mutex locks
  - OpenMP or Pthreads mixed with vectorization
  - All of the above mixed with MPI, UPC, CAF, etc.
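As a concrete example of the first mixture, here is a minimal sketch (assuming a C11 compiler with OpenMP support, e.g. gcc -fopenmp; all names are illustrative) of an OpenMP parallel loop whose conflicting updates are synchronized with a C11 atomic memory operation rather than an OpenMP construct:

    #include <stdatomic.h>
    #include <stdio.h>

    int main(void) {
        int data[1000];
        for (int i = 0; i < 1000; i++) data[i] = i % 7;

        atomic_int hits = 0;
        #pragma omp parallel for               /* OpenMP supplies the parallelism */
        for (int i = 0; i < 1000; i++) {
            if (data[i] == 0)
                atomic_fetch_add(&hits, 1);    /* the AMO supplies the synchronization */
        }
        printf("hits = %d\n", atomic_load(&hits));
        return 0;
    }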
4. Thread-Centric versus Data-Centric
TASK: map millions of degrees of parallelism onto tens to thousands of processors.
- Thread-centric: manage threads
  - Optimizes for a specific machine organization
  - Requires careful scheduling of data movement to and from each thread
  - Difficult to load-balance dynamic and nested parallel regions
- Data-centric: manage data dependencies
  - The compiler is already doing this for ILP, loop optimization, and vectorization
  - Optimizes for concurrency, which is performance portable
  - Moving the task to the data is a natural option for load balancing
5. Lock-Free and Wait-Free Algorithms
- Lock-free and wait-free algorithms... don't really exist
- Only embarrassingly parallel algorithms use no synchronization
- Compare-and-swap...
  - is not lock-free or wait-free
  - has no concurrency
  - is a synchronization primitive that corresponds to mutex try-lock in the Pthreads programming model
6. Compare-And-Swap
- LOCK# CMPXCHG: the x86 locked compare-and-exchange instruction
- Programming idiom: similar to mutex try-lock (mutex locks can spin on try-lock or yield to the OS/runtime)
- So-called "lock-free" algorithms:
  - a manually coded secondary lock handler
  - a manually coded tertiary lock handler...
  - All this try-lock handler work is not algorithmically efficient; it's lock-free turtles all the way down
- Implementation:
  - An instruction in all i386 and later processors
  - Efficient for processors sharing caches and memory controllers
  - Not efficient or fair for non-uniform machine organizations
- Atomic memory operations do not scale: it is not possible to go 10,000-way parallel on one piece of data.
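To make the try-lock correspondence concrete, here is a minimal C11 sketch (all names are illustrative) of a CAS-based lock; the spin loop is exactly the manually coded retry handler described above:

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct { atomic_int locked; } caslock_t;   /* 0 = free, 1 = held */

    static bool caslock_trylock(caslock_t *l) {
        int expected = 0;
        /* One LOCK CMPXCHG on x86: succeeds only if the lock was free,
         * i.e. the same semantics as pthread_mutex_trylock(). */
        return atomic_compare_exchange_strong(&l->locked, &expected, 1);
    }

    static void caslock_lock(caslock_t *l) {
        /* The "secondary lock handler": spin retrying the CAS. Works well
         * when cores share a cache; neither fast nor fair on NUMA machines. */
        while (!caslock_trylock(l))
            ;  /* spin */
    }

    static void caslock_unlock(caslock_t *l) {
        atomic_store(&l->locked, 0);
    }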
7. Thread-Centric Parallel Regions
OpenMP:
- Implied fork/join scaffolding
- Parallel regions are separate from loops
  - Unannotated loops: every thread executes all iterations
  - Annotated loops: iterations are decomposed among the existing threads
- Joining threads: exit the parallel region, or barriers
Pthreads:
- Fully explicit scaffolding
- Forking threads: one new thread per pthread_create(); loops or trees are required to start multiple threads
- Flow control: pthread_barrier_wait(), mutex locks
- Joining threads: pthread_join(), return()
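A minimal illustration of the OpenMP scaffolding just described (a sketch, not taken from the slides):

    #include <stdio.h>

    int main(void) {
        double a[100];

        #pragma omp parallel        /* fork: the implied scaffolding */
        {
            /* Unannotated loop: every thread executes all 100 iterations. */
            for (int i = 0; i < 100; i++)
                ;  /* redundant per-thread work */

            /* Annotated loop: iterations are decomposed among the
             * already-running threads; an implied barrier follows it. */
            #pragma omp for
            for (int i = 0; i < 100; i++)
                a[i] = i * 0.5;
        }                           /* join: exit of the parallel region */

        printf("a[99] = %f\n", a[99]);
        return 0;
    }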
8. Data-Centric Parallel Regions on XMT
Multiple loops with different trip counts, after restructuring a reduction, all in a single parallel region. The compiler's loop report:

    Parallel region 1 in foo
      Multiple processor implementation
      Requesting at least 50 streams
    Loop 2 in foo in region 1
      In parallel phase 1
      Dynamically scheduled, variable chunks, min size = 26
      Compiler generated
    Loop 3 in foo at line 7 in loop 2
      Loop summary: 1 loads, 0 stores, 1 floating point operations,
      1 instructions; needs 50 streams for full utilization; pipelined
    Loop 4 in foo in region 1
      In parallel phase 2
      Dynamically scheduled, variable chunks, min size = 8
      Compiler generated
    Loop 5 in foo at line 10 in loop 4
      Loop summary: 2 loads, 1 stores, 2 floating point operations,
      3 instructions; needs 44 streams for full utilization; pipelined

The annotated source:

    void foo(int n, double* restrict a, double* restrict b,
             double* restrict c, double* restrict d)
    {
        int i, j;
        double sum = 0.0;
        for (i = 0; i < n; i++)      /* 3 P:$  reduction moved out of loop */
            sum += a[i];
        for (j = 0; j < n/2; j++)    /* 5 P */
            b[j] = c[j] + d[j] * sum;
    }
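For comparison, here is a minimal OpenMP rendering of the same two loops (an illustrative sketch, not the authors' code): OpenMP needs explicit work-sharing and reduction clauses inside a single region, where the XMT compiler derived the phases itself.

    void foo_omp(int n, double *restrict a, double *restrict b,
                 double *restrict c, double *restrict d) {
        double sum = 0.0;
        #pragma omp parallel
        {
            #pragma omp for reduction(+:sum)    /* parallel phase 1 */
            for (int i = 0; i < n; i++)
                sum += a[i];
            /* implied barrier: sum is complete before phase 2 reads it */
            #pragma omp for                     /* parallel phase 2 */
            for (int j = 0; j < n/2; j++)
                b[j] = c[j] + d[j] * sum;
        }
    }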
9. Parallel Histogram
Thread-centric (time = Nelements):

    PARALLEL-DO i = 0 .. Nelements-1
        j = 0
        while (j < Nbins && elem[i] < binmax[j]) j++
        BEGIN CRITICAL-REGION          // only 1 thread at a time
            counts[j]++
        END CRITICAL-REGION

- The critical region around the update to the counts array is a serial bottleneck that wastes potential concurrency

Data-centric (time = Nelements / Nbins):

    PARALLEL-DO i = 0 .. Nelements-1
        j = 0
        while (j < Nbins && elem[i] < binmax[j]) j++
        INT_FETCH_ADD(counts[j], 1)    // update is atomic

- Atomic updates to the count table require abundant fine-grained synchronization
- All available concurrency can be exploited: every bin can be updated simultaneously, so the maximum concurrency is limited only by the number of bins
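A runnable OpenMP rendering of both variants (an illustrative sketch; the bin search follows the slide's pseudocode, with the last bin catching out-of-range elements):

    #include <omp.h>

    void hist_thread_centric(int Nelements, int Nbins, const double *elem,
                             const double *binmax, long *counts) {
        #pragma omp parallel for
        for (int i = 0; i < Nelements; i++) {
            int j = 0;
            while (j < Nbins - 1 && elem[i] < binmax[j]) j++;
            #pragma omp critical    /* one thread at a time: serial bottleneck */
            counts[j]++;
        }
    }

    void hist_data_centric(int Nelements, int Nbins, const double *elem,
                           const double *binmax, long *counts) {
        #pragma omp parallel for
        for (int i = 0; i < Nelements; i++) {
            int j = 0;
            while (j < Nbins - 1 && elem[i] < binmax[j]) j++;
            #pragma omp atomic      /* fine-grained AMO: distinct bins update concurrently */
            counts[j]++;
        }
    }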
10. Linked List Manipulation
Insertion sort:
- Time = N*N/2 when inserting from the same side every time
- Time = N*N/4 when inserting at the head or tail, whichever is nearer
Concurrency:
- Unlimited concurrency during the search
- Concurrency during manipulations:
  - Thread parallelism: one update to the list at a time
  - Data parallelism: an insertion between each pair of elements, growing the list length by 50% on every step
- Two-phase insertion: search first, then lock and modify
11. Two-Phase List Insertion (OpenMP)
1. Find the site to insert at
   - Do not lock the list
   - Traverse the list serially
2. Perform the list update
   - Enter a critical region
   - Re-confirm the site is unchanged
   - Update the list pointers
   - End the critical region
- Only one insertion at a time, but an unlimited number of search threads
- More threads means more failed inter-node locks and more repeated (wasted) searches
- Wallclock time = N: the search is parallel, but the N updates are serial
12. Two-Phase List Insertion (MTA)
1. Find the site to insert at
   - Do not lock the list
   - Traverse the list serially
2. Perform the list update
   - Lock the two elements being inserted between, acquiring the locks in lexicographical order to avoid deadlock
   - Confirm the link between the nodes is unchanged
   - Update the link pointers of the nodes
   - Unlock the two elements in the reverse order of locking
- Up to N/4 concurrent insertions: inserting between every pair of nodes
- More insertion points means fewer failed inter-node locks
- Total time = log N
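On commodity hardware the MTA's fine-grained word locks can be approximated with one mutex per node. A minimal sketch of the two-phase insertion under that assumption (node_t, insert_after, and the retry-on-false contract are illustrative, not the authors' code):

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdlib.h>

    typedef struct node {
        long key;
        struct node *next;
        pthread_mutex_t lock;                 /* stands in for the XMT tag bit */
    } node_t;

    bool insert_after(node_t *head, long key) {
        /* Phase 1: traverse with no locks held. */
        node_t *prev = head;
        while (prev->next && prev->next->key < key)
            prev = prev->next;
        node_t *succ = prev->next;            /* remembered link */

        /* Phase 2: lock the two nodes in list order, re-confirm, splice. */
        pthread_mutex_lock(&prev->lock);
        if (succ) pthread_mutex_lock(&succ->lock);
        if (prev->next != succ) {             /* link changed under us */
            if (succ) pthread_mutex_unlock(&succ->lock);
            pthread_mutex_unlock(&prev->lock);
            return false;                     /* caller retries the search */
        }
        node_t *n = malloc(sizeof *n);
        n->key = key;
        n->next = succ;
        pthread_mutex_init(&n->lock, NULL);
        prev->next = n;                       /* splice in the new node */

        /* Unlock in the reverse order of locking. */
        if (succ) pthread_mutex_unlock(&succ->lock);
        pthread_mutex_unlock(&prev->lock);
        return true;
    }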
13. Global HACHAR: Initial Data Structure
[Diagram: region head and non-full-region pointers referencing Region 0 and Region 1; chain lengths initialized to 0; each region holds a table of chunk size with its own next-free-slot counter]
- Use a two-step acquire on the chain lengths, the region linked-list pointers, and the chain pointers
- Use int_fetch_add on the next-free-slot counter to allocate a list node
14. Global HACHAR: Two Items Inserted
[Diagram: two words hashed into the table; the affected chain lengths are now 1; Region 0's next-free-slot counter has advanced; each list node holds a word and a word id]
- The inserting thread locked the chain length and inserted at the head of the list
- Potential contention only on the length field
- The list node shown is an example for a bag-of-words application
15. Global HACHAR: Collisions
[Diagram: a chain of length 3 whose word/word-id nodes span Region 0 and Region 1]
- Lookup: walk the chain with no locking
- Malloc and free are limited to the few region buffers
- Growing a chain requires locking only the last pointer (int_fetch_add on the length)
16. Global HACHAR: Region Overflow
[Diagram: when a region's next-free-slot counter reaches the table/chunk size, a new Region 2 is linked onto the region list and chains continue across regions]
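A minimal C11 sketch of the node-allocation idiom from these slides (REGION_SLOTS, entry_t, and region_t are illustrative assumptions, not the authors' definitions):

    #include <stdatomic.h>
    #include <stddef.h>

    #define REGION_SLOTS 4096                 /* the "Table / Chunk size" */

    typedef struct entry { long key, value; struct entry *chain; } entry_t;

    typedef struct region {
        atomic_int next_free;                 /* the "Next Free Slot" counter */
        struct region *next_region;           /* regions form a linked list */
        entry_t slots[REGION_SLOTS];
    } region_t;

    /* Returns a fresh list node, or NULL when this region has overflowed
     * and a new region must be allocated and linked in. */
    entry_t *region_alloc(region_t *r) {
        int slot = atomic_fetch_add(&r->next_free, 1);   /* int_fetch_add */
        return (slot < REGION_SLOTS) ? &r->slots[slot] : NULL;
    }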
17. Once Per Thread vs. Once Per Iteration
Allocating storage once per thread:

    PARALLEL-DO i = 0 .. Nthreads-1              // parallel loop, one iteration per thread
        float *p = malloc(...)                   // malloc hoisted out of the inner loop
        int n_iters = Nelements / Nthreads
        DO j = i*n_iters .. min(Nelements, (i+1)*n_iters)   // block of serial iterations
            p[j] = ... x[j] ...
            x[j] = ... p[j] ...
        ENDDO
        free(p)
    END-PARALLEL-DO

- OpenMP: parallel regions are separate from loops, and this loop-decomposition idiom is already captured in the conventional pragma syntax; malloc is hoisted out of the inner loop or fused into the outer loop
- XMT: an abominable kludge with non-portable pragma semantics, requiring separate loops and possibly a separate parallel region
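In conventional OpenMP the same effect falls out of the standard idiom: hoisting malloc into the parallel region gives one allocation per thread, while omp for still decomposes the loop. A minimal sketch (scratch_size and the updates to x[] are illustrative placeholders):

    #include <stdlib.h>

    void process(int Nelements, double *x, size_t scratch_size) {
        #pragma omp parallel
        {
            double *p = malloc(scratch_size);   /* once per thread, not per iteration */
            #pragma omp for
            for (int j = 0; j < Nelements; j++) {
                p[0] = x[j] * 2.0;              /* placeholder use of the scratch buffer */
                x[j] = p[0] + 1.0;
            }
            free(p);                            /* freed once per thread */
        }
    }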
18. Synchronization is a Natural Part of Parallelism
Synchronization cannot be avoided; it must be made efficient and easy to use.
- "Lock-free" algorithms aren't:
  - Sequential execution of atomic ops (compare-and-swap, fetch-and-add)
  - Hidden lock semantics in the compiler or hardware (change either and you have a bug)
- Communication-free loop-level data parallelism (i.e., vectorization) is a limited kind of parallelism
- Synchronization is needed for many purposes: atomicity, enforcing order dependencies, and managing threads
- Synchronization must be abundant: programmers shouldn't have to worry about allocating or rationing synchronization variables
"I'm a lock-free scalable parallel algorithm!"
19. Conclusions
- Synchronization is a natural part of parallel programs; it must be abundant, efficient, and easy to use
- Some latencies can be minimized; others must be tolerated
  - Parallelism can mask latency: enough parallelism makes latency irrelevant
- Fine-grained synchronization improves utilization
  - More opportunities for exploiting parallelism
  - Proactively share resources
- Parallelism is performance portable
  - The same programming model works from desktop to supercomputer
  - Quality of service is determined by the amount of concurrency in the hardware, not by optimizations in the software
- AMOs versus tag bits
  - Tag bits make fine-grained parallelism possible: more data means more concurrency
  - AMOs do not scale