CS 575 Parallel Processing


1 CS 575 Parallel Processing
Lecture one: Introduction
Wim Bohm, Colorado State University
Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 license.
2 Course Topics
- Introduction, Background
  - Orders of magnitude, Recurrences
- Models of Parallel Computing, communication
- Performance, Speedup, Efficiency
- Parallel Algorithms
  - Dense Linear Algebra
  - Sorting
  - Graphs
  - Search
  - Fast Fourier Transform
CS575 lecture 1
3 Course Organization
- Course reorganization
  - Unite 575, 575dl
  - Modernize: more parallel (//) algorithms, GPUs
  - We have separate course streams in networking and distributed systems
- Check the web page regularly
  - Course organization is described on the web; let's go look...
- Project
  - Changes regularly to stay fresh
  - Second half of the course
  - GPUs/CUDA
4 Cost-effective Parallel Computing
- Off-the-shelf, commodity processors are very fast
- Memory is very cheap
- Building a processor that is a small factor faster costs an order of magnitude more
- Clusters: cheapest way to get more performance: multiprocessor
  - NoW: Networks of Workstations
  - Datacenters employ O(100K) simple processors with cheap interconnects
- A workstation can be an SMP
  - Shared memory, bus or crossbar (e.g. Cray)
5 Wile E. Coyote's Parallel Computer
- Get a lot of the fastest processors
- Get a lot of memory per processor
- Get the fastest network
- Hook it all together
- And then what???
11 Now you gotta program it!
Parallel programming introduces:
- Task partitioning, task scheduling
- Data partitioning, distribution
- Synchronization
- Load balancing
- Latency issues: hiding, tolerance
16 Problem with Wile E. Coyote Architecture
- For high speed, processors have lots of state
  - Cache, stack, global memory
- To tolerate latency, we need fast context switch. WHY?
- No free lunch: can't have both
  - Certainly not if the processor was not designed for both
- Memory wall: memory gets slower and slower
  - In terms of the number of cycles it takes to access
- Memory hierarchy gets more complex
17 Sequential Algorithms
Efficient Sequential Algorithms
- Minimize time, space
- Maximize state (avoiding recomputation)
- Efficiency is portable
  - Efficient program on Pentium ~ Efficient program on Opteron
18 Parallel Algorithms
Efficient Parallel Algorithms
- Use efficient sequential algorithms
- Maximize parallelism
  - Recomputation is sometimes better than communication
- Minimize overhead
  - Synchronization, remote accesses
- Parallel efficiency is architecture dependent
19 Speedup
- Ideal: n processors -> n-fold speedup
- Ideal not always possible. WHY?
  - Tasks are data dependent
  - Not all processors are always busy
  - Remote data needs communication
  - Memory wall PLUS communication wall
- Linear speedup: α n speedup (α <= 1)
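To make the α in the speedup slide concrete, here is a minimal Python sketch. It assumes the usual definitions, which the slide does not spell out: speedup is serial time divided by parallel time, and α is speedup divided by the number of processors. The function names and the timings are hypothetical.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup = serial run time / parallel run time (assumed definition)."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, n_procs: int) -> float:
    """alpha = speedup / n, so that speedup = alpha * n as on the slide."""
    return speedup(t_serial, t_parallel) / n_procs


if __name__ == "__main__":
    # Hypothetical timings: 100 s serially, 16 s on 8 processors.
    s = speedup(100.0, 16.0)
    a = efficiency(100.0, 16.0, 8)
    print(f"speedup = {s:.2f}, alpha = {a:.2f}")
```

With these made-up numbers α comes out below 1, the ordinary case; α above 1 would be the super linear case.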
20 Super linear speedup
- Super linear speedup: α > 1
- Discuss... is it possible?
- Nonsense! Because we can execute the faster parallel program sequentially
- No nonsense!! Because parallel computers do not just have more processors, they have more local memory / caches
23 Parallel Programming Paradigms
Implicit parallel programming: Super Compilers
- Compiler extracts parallelism from sequential code
  - Distributes data, creates and schedules tasks
- Complication: side effects
  - The sequential order of reads and writes to a memory location determines the program outcome
  - A parallelizing compiler must obey the sequential order of side-effecting statements and still create parallelism (//ism)
  - Pointers, aliases, and indirect array references make analyzing which statements access which locations hard or impossible
- 40 years of compiler research for general purpose parallel computing has not brought much result.
24 Paradigms cont.
Implicit parallel programming cont.
- Simple, clean case: Functional Programming (FP)
  - Functions: no side effects, order of execution less constrained
  - F( P(x,y), Q(y,z) ): P and Q can be executed in parallel
  - Simple single-assignment memory model: no pointers, no write-after-read or write-after-write hazards (dataflow semantics)
- FP was long doomed: too high level, too inefficient, because the simple memory model causes lots of copies
- FP is coming back: the MapReduce approach in data centers (Google) is a data parallel functional paradigm
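The F( P(x,y), Q(y,z) ) case can be sketched in Python. The bodies of P, Q and F below are placeholders invented for illustration; the only property that matters is that P and Q have no side effects, so they may run in either order or at the same time.

```python
from concurrent.futures import ThreadPoolExecutor


# Placeholder pure functions (hypothetical bodies): no side effects,
# so their order of evaluation cannot change the result.
def P(x, y):
    return x + y


def Q(y, z):
    return y * z


def F(a, b):
    return a - b


def eval_parallel(x, y, z):
    """Evaluate F(P(x, y), Q(y, z)) with P and Q submitted concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        fp = pool.submit(P, x, y)  # independent of Q
        fq = pool.submit(Q, y, z)  # independent of P
        return F(fp.result(), fq.result())


if __name__ == "__main__":
    # Same answer as the sequential evaluation.
    print(eval_parallel(1, 2, 3), F(P(1, 2), Q(2, 3)))
```

In standard CPython the global interpreter lock means threads illustrate the independence of P and Q rather than deliver a CPU-bound speedup; the point is only that no ordering between the two calls is required.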
25 Explicit parallel programming
- Multithreading: OpenMP
- Message Passing: MPI
- Data parallel programming (important niche): CUDA
- Explicit parallelism complicates programming
  - Creation, allocation, scheduling of processes
  - Data partitioning
  - Synchronization (semaphores, locks, messages)
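As a small illustration of the synchronization burden, here is a sketch using Python threads and a lock. It stands in for the OpenMP/MPI/CUDA tools named on the slide rather than showing any of them, and the function name and parameters are hypothetical.

```python
import threading


def run_counter(n_threads: int = 4, n_increments: int = 10_000) -> int:
    """Increment one shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(n_increments):
            # The read-modify-write on the shared counter is the critical
            # section; the lock makes sure only one thread is in it at a time.
            with lock:
                counter += 1

    # Explicit creation and scheduling of the workers is the programmer's job.
    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print(run_counter())
```

Creating the threads, joining them, and protecting shared state are all written by hand here, which is the extra work the slide lists.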
26 Example 1: Weather Prediction
- Area, segments
  - 3000*3000*11 cubic miles
  - 0.1*0.1*0.1 cubic mile segments: ~ $10^{11}$ segments
- Two-day prediction
  - Half-hour time steps: ~ 100 time steps
- Computation per segment
  - Temp, Pressure, Humidity, Wind speed, Wind direction for each time step in each segment
  - Assume ~ 100 FLOPs per time step per segment
27 Performance: Weather Prediction
- Computational requirement: $10^{15}$ FLOPs
- Assume one FLOP per clock cycle
- 1 core: 4 GHz
- Total serial time: $25 \times 10^4$ sec ~ 70 hours
- Not too good for 48 hour weather prediction
28 Parallel Weather Prediction
- 1 K workstations, grid connected
  - $10^8$ segment computations per processor
  - $10^8$ instructions per second
  - 100 instructions per segment computation
  - 100 time steps: $10^4$ seconds = ~3 hours
- Much more acceptable
- Assumption: communication is not a problem here
  - Why is this assumption reasonable?
- More workstations: finer grid, better accuracy
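The back-of-the-envelope numbers for the weather example can be checked with a few lines of Python. The sketch below assumes the segment count is rounded up to 10^11 (the exact volume ratio is 9.9 * 10^10), which is the value consistent with the 25*10^4 sec serial time; the function and variable names are made up for illustration.

```python
def weather_estimates():
    """Redo the slide arithmetic for the weather prediction example."""
    # 3000*3000*11 cubic miles cut into 0.1*0.1*0.1 cubic mile segments,
    # i.e. 10**3 segments per cubic mile.
    exact_segments = 3000 * 3000 * 11 * 10**3   # 9.9e10
    segments = 10**11                           # rounded, as assumed above

    time_steps = 100          # two days in half-hour steps
    flops_per_step = 100      # per segment, per time step
    total_flops = segments * time_steps * flops_per_step

    # Serial: one FLOP per cycle on a 4 GHz core.
    serial_seconds = total_flops / 4e9

    # Parallel: 1 K workstations, each doing 10^8 instructions per second.
    parallel_seconds = total_flops / (1000 * 1e8)

    return exact_segments, total_flops, serial_seconds, parallel_seconds


if __name__ == "__main__":
    exact, flops, serial_s, parallel_s = weather_estimates()
    print(f"segments (exact): {exact:.2e}")
    print(f"total FLOPs:      {flops:.0e}")
    print(f"serial:   {serial_s:.0f} s = {serial_s / 3600:.1f} h")
    print(f"parallel: {parallel_s:.0f} s = {parallel_s / 3600:.1f} h")
```

Note that the two estimates use different per-processor rates (4 * 10^9 versus 10^8 operations per second), exactly as the slides do, so the ratio between them is not a speedup figure.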
29 Example 2: N-body problem
- Astronomy: bodies in space
  - Attract each other: gravitational force (Newton's law)
  - $O(n^2)$ calculations per snapshot
- Galaxy: ~ $10^{11}$ bodies -> ~ $10^{22}$ calculations/snapshot
  - Calculation: 1 microsecond
  - Snapshot: $10^{16}$ secs = ~ $10^{11}$ days = ~ $3 \times 10^8$ years
- Is parallelism going to help us? NO
- What does help? A better algorithm: Barnes-Hut
  - Divides the space in a quad tree (or oct tree)
  - Treats far-away quads as one body: $O(n \log n)$
  - How much time per snapshot now?
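One way to approach the closing question is to plug numbers in. The sketch below assumes about 10^11 bodies (the count that matches the slide's 3*10^8 years at 1 microsecond per calculation), a base-2 logarithm, and a constant factor of 1 for Barnes-Hut; all three are assumptions, so treat the result as an order-of-magnitude estimate only.

```python
import math


def snapshot_seconds(n_bodies: int, calc_seconds: float = 1e-6):
    """Return (naive, barnes_hut) time per snapshot in seconds."""
    naive = n_bodies**2 * calc_seconds                          # O(n^2)
    barnes_hut = n_bodies * math.log2(n_bodies) * calc_seconds  # O(n log n)
    return naive, barnes_hut


if __name__ == "__main__":
    naive, bh = snapshot_seconds(10**11)
    seconds_per_day = 86400
    print(f"naive:      {naive / seconds_per_day / 365:.1e} years")
    print(f"Barnes-Hut: {bh / seconds_per_day:.0f} days")
```

Under these assumptions the snapshot drops from hundreds of millions of years to a matter of weeks, which is the range where adding processors starts to be worthwhile.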
30 Other Challenging Applications
- Satellite data acquisition: billions of bits / sec
- Pollution levels, remote sensing of materials
- Image recognition
- Discrete optimization problems
  - Planning, scheduling, VLSI design
- Bioinformatics, computational chemistry
- Airplane/Satellite/Vehicle design
- Internet (Google search)
31 Application Specific Architectures
- ASICs: Application Specific Integrated Circuits
- Levels of specificity
  - Full custom ASICs
  - Standard cell ASICs
  - Field programmable gate arrays
- Computational models
  - Dataflow graphs
  - Systolic arrays
- Promising orders of magnitude better performance, lower power
32 ASICs cont.
- How much faster than general purpose?
  - Example: 1D 1024 FFT
  - General purpose machine (G4): 25 microseconds
  - ASIC device (MIT Lincoln Labs): 32 nanoseconds
  - ASIC device uses 20 milliwatts (100 * less power)
- Other applications
  - Finite Impulse Response (FIR) filters
  - Matrix multiply
  - QR decomposition
- What do these all have in common?
33 Background
- If you do not have the necessary background in analysis of algorithms
  - See the book Introduction to Algorithms by Cormen, Leiserson, Rivest and Stein
  - Or go online
- Topics to study
  - Introduction
  - Growth of functions
  - Summations
  - Recurrences
34 Background: Orders of Magnitude (O, Ω, Θ)
- $f(n) = O(g(n))$ iff $\exists\, c, n_0 : f(n) < c \cdot g(n) \;\; \forall n > n_0$
  - Used for upper bound of algorithm complexity: this particular algorithm takes at most $c \cdot g(n)$ time
- $f(n) = \Omega(g(n))$ iff $\exists\, c, n_0 : f(n) > c \cdot g(n) \;\; \forall n > n_0$
  - Used for lower bound of problem complexity: any algorithm for solving this problem takes at least $c \cdot g(n)$ time
- $f(n) = \Theta(g(n))$ iff $f(n) = O(g(n))$ and $f(n) = \Omega(g(n))$
  - Tight bound
35 Background: Closed problems
- Closed problem P: $\exists$ algorithm X with $O(X) = \Omega(P)$
  - e.g. Sort has tight bound: $\Theta(n \log n)$
- Problem P has an algorithmic gap: P is not closed
  - e.g. all NP-complete problems (problems with polynomial lower bound but currently exponential upper bound, such as TSP)
36 Recurrence Relations
- Algorithmic complexity is often described using recurrence relations: $f(n) = R(f(1), \ldots, f(n-1))$
- Two important types of recurrence relations
  - Linear
  - Divide and Conquer
- cs420(dl) covers these
37 Repeated substitution
- Simple recurrence relations (one recurrent term in the rhs) can sometimes be solved using repeated substitution
- Two types: Linear and DivCo
  - Linear: $F(n) = aF(n-d) + g(n)$, base: $F(1) = v_1$
  - DivCo: $F(n) = aF(n/d) + g(n)$, base: $F(1) = v_1$
- Two questions:
  - What is the pattern?
  - How often is it applied until we hit the base case?
38 Linear Example
$M(n) = 2M(n-1) + 1$, $M(1) = 1$ (recognize this recurrence?)
$M(n) = 2M(n-1) + 1$
$= 2(2M(n-2) + 1) + 1 = 4M(n-2) + 2 + 1$
$= 4(2M(n-3) + 1) + 2 + 1 = 8M(n-3) + 4 + 2 + 1$
$= \ldots$ (inductive step) $= 2^k M(n-k) + 2^{k-1} + 2^{k-2} + \ldots + 2 + 1$
Hit base for $k = n-1$: $= 2^{n-1} M(1) + 2^{n-2} + 2^{n-3} + \ldots + 2 + 1 = 2^n - 1$
For more on linear recurrence relations, see 420dl
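The closed form that the substitution arrives at, 2^n - 1, is easy to sanity-check numerically. A minimal sketch (the function name is hypothetical) that evaluates the recurrence directly:

```python
def M(n: int) -> int:
    """Evaluate M(n) = 2*M(n-1) + 1 with M(1) = 1, iteratively."""
    value = 1                 # base case M(1)
    for _ in range(n - 1):    # apply the recurrence n-1 times
        value = 2 * value + 1
    return value


if __name__ == "__main__":
    for n in range(1, 11):
        # Each line should show M(n) equal to 2^n - 1.
        print(n, M(n), 2**n - 1)
```

The loop applies the recurrence exactly n-1 times, matching the "hit base for k = n-1" step.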
39 DivCo example
Merge sort: $T(n) = 2T(n/2) + n$, $T(1) = 1$, $n = 2^k$
$T(n) = 2(2T(n/4) + n/2) + n = 4T(n/4) + 2n$
$= 8T(n/8) + 3n = \ldots$ (inductive step) $= 2^k T(n/2^k) + kn$
Hit base for $k = \log n$: $2^k T(n/2^k) + kn = n + kn = n + n \log n = O(n \log n)$
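For powers of two the substitution gives the exact value n + n log n, not only the O(n log n) bound. A short check, assuming n = 2^k as on the slide (the function name is hypothetical):

```python
def T(n: int) -> int:
    """Evaluate T(n) = 2*T(n/2) + n with T(1) = 1, for n a power of two."""
    if n == 1:
        return 1
    return 2 * T(n // 2) + n


if __name__ == "__main__":
    for k in range(0, 11):
        n = 2**k
        # Closed form from repeated substitution: n + k*n with k = log2(n).
        print(n, T(n), n + k * n)
```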
40 Another one: binary search
$f(n) = f(n/2) + c$, $f(1) = 1$, let $n = 2^k$
$f(n) = f(n/2) + c = f(n/4) + 2c = f(n/8) + 3c = \ldots = f(n/2^k) + kc$
Hit base for $k = \log n$: $f(1) + c \log n = O(\log n)$
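The recurrence counts one constant-cost probe per halving. A sketch of an iterative binary search that also reports how many probes it made, so the logarithmic bound can be observed; the probe counter is added for illustration and is not part of the usual algorithm.

```python
def binary_search(a, x):
    """Search sorted list a for x; return (index or -1, number of probes)."""
    lo, hi = 0, len(a) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1                 # one constant-cost step per halving
        if a[mid] == x:
            return mid, probes
        if a[mid] < x:
            lo = mid + 1            # discard the left half
        else:
            hi = mid - 1            # discard the right half
    return -1, probes


if __name__ == "__main__":
    data = list(range(1024))
    worst = max(binary_search(data, x)[1] for x in data)
    # The probe count stays within floor(log2 n) + 1.
    print(worst, len(data).bit_length())
```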
41 Master Method
Cookbook approach to solution, based on repeated substitution (Cormen et al. or Rosen)
$A_n = C A_{n/d} + k n^p$
- $A_n = O(n^p)$ if $C < d^p$, e.g. $A_n = 3A_{n/2} + n^2$
- $A_n = O(n^p \log n)$ if $C = d^p$, e.g. $A_n = 2A_{n/2} + n$
- $A_n = O(n^{\log_d C})$ if $C > d^p$, e.g. $A_n = 3A_{n/2} + n$
Do binary search and merge sort with this method
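The three cases translate directly into a small lookup. The sketch below is one way to mechanize the cookbook for integer C, d and p; the function name and the string output format are invented for illustration, and it is applied only to the three examples given on the slide.

```python
import math


def master_bound(C: int, d: int, p: int) -> str:
    """Classify A_n = C*A_{n/d} + k*n^p by comparing C with d^p."""
    if C < d**p:
        return f"O(n^{p})"                    # the n^p term dominates
    if C == d**p:
        return f"O(n^{p} log n)"              # balanced: extra log factor
    return f"O(n^{math.log(C, d):.3f})"       # recursion dominates: n^(log_d C)


if __name__ == "__main__":
    print(master_bound(3, 2, 2))   # A_n = 3 A_{n/2} + n^2
    print(master_bound(2, 2, 1))   # A_n = 2 A_{n/2} + n
    print(master_bound(3, 2, 1))   # A_n = 3 A_{n/2} + n
```

The comparison uses exact integer arithmetic; for non-integer p a tolerance would be needed in the equality test.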
42 Examples
- Merge sort: $T(n) = 2T(n/2) + n$, $T(1) = 1$
  - C = ? d = ? p = ? $d^p$ = ? T(n) = O(???)
- Binary search: $f(n) = f(n/2) + c$, $f(1) = 1$
  - C = ? d = ? p = ? $d^p$ = ? f(n) = O(???)
CS473  Algorithms I Lecture 4 The DivideandConquer Design Paradigm View in slideshow mode 1 Reminder: Merge Sort Input array A sort this half sort this half Divide Conquer merge two sorted halves Combine
More informationDARPA, NSFNGS/ITR,ACR,CPA,
Spiral Automating Library Development Markus Püschel and the Spiral team (only part shown) With: Srinivas Chellappa Frédéric de Mesmay Franz Franchetti Daniel McFarlin Yevgen Voronenko Electrical and Computer
More informationScheduling Task Parallelism" on MultiSocket Multicore Systems"
Scheduling Task Parallelism" on MultiSocket Multicore Systems" Stephen Olivier, UNC Chapel Hill Allan Porterfield, RENCI Kyle Wheeler, Sandia National Labs Jan Prins, UNC Chapel Hill Outline" Introduction
More informationOpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA
OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization
More informationMassive Streaming Data Analytics: A Case Study with Clustering Coefficients. David Ediger, Karl Jiang, Jason Riedy and David A.
Massive Streaming Data Analytics: A Case Study with Clustering Coefficients David Ediger, Karl Jiang, Jason Riedy and David A. Bader Overview Motivation A Framework for Massive Streaming hello Data Analytics
More informationPetascale Software Challenges. William Gropp www.cs.illinois.edu/~wgropp
Petascale Software Challenges William Gropp www.cs.illinois.edu/~wgropp Petascale Software Challenges Why should you care? What are they? Which are different from nonpetascale? What has changed since
More informationUTS: An Unbalanced Tree Search Benchmark
UTS: An Unbalanced Tree Search Benchmark LCPC 2006 1 Coauthors Stephen Olivier, UNC Jun Huan, UNC/Kansas Jinze Liu, UNC Jan Prins, UNC James Dinan, OSU P. Sadayappan, OSU ChauWen Tseng, UMD Also, thanks
More informationCHAPTER 5 FINITE STATE MACHINE FOR LOOKUP ENGINE
CHAPTER 5 71 FINITE STATE MACHINE FOR LOOKUP ENGINE 5.1 INTRODUCTION Finite State Machines (FSMs) are important components of digital systems. Therefore, techniques for area efficiency and fast implementation
More informationSystem Design and Methodology/ Embedded Systems Design (Modeling and Design of Embedded Systems)
System Design&Methodologies Fö 1&21 System Design&Methodologies Fö 1&22 Course Information System Design and Methodology/ Embedded Systems Design (Modeling and Design of Embedded Systems) TDTS30/TDDI08
More informationIMCM: A Flexible FineGrained Adaptive Framework for Parallel Mobile Hybrid Cloud Applications
Open System Laboratory of University of Illinois at Urbana Champaign presents: Outline: IMCM: A Flexible FineGrained Adaptive Framework for Parallel Mobile Hybrid Cloud Applications A FineGrained Adaptive
More informationControl 2004, University of Bath, UK, September 2004
Control, University of Bath, UK, September ID IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REALTIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of
More informationApplications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61
F# Applications to Computational Financial and GPU Computing May 16th Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 Today! Why care about F#? Just another fashion?! Three success stories! How Alea.cuBase
More informationChapter 12: Multiprocessor Architectures. Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup
Chapter 12: Multiprocessor Architectures Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup Objective Be familiar with basic multiprocessor architectures and be able to
More informationClass Overview. CSE 326: Data Structures. Goals. Goals. Data Structures. Goals. Introduction
Class Overview CSE 326: Data Structures Introduction Introduction to many of the basic data structures used in computer software Understand the data structures Analyze the algorithms that use them Know
More informationLecture 2: Universality
CS 710: Complexity Theory 1/21/2010 Lecture 2: Universality Instructor: Dieter van Melkebeek Scribe: Tyson Williams In this lecture, we introduce the notion of a universal machine, develop efficient universal
More informationModule: Software Instruction Scheduling Part I
Module: Software Instruction Scheduling Part I Sudhakar Yalamanchili, Georgia Institute of Technology Reading for this Module Loop Unrolling and Instruction Scheduling Section 2.2 Dependence Analysis Section
More informationA3 Computer Architecture
A3 Computer Architecture Engineering Science 3rd year A3 Lectures Prof David Murray david.murray@eng.ox.ac.uk www.robots.ox.ac.uk/ dwm/courses/3co Michaelmas 2000 1 / 1 6. Stacks, Subroutines, and Memory
More informationChapter 18: Database System Architectures. Centralized Systems
Chapter 18: Database System Architectures! Centralized Systems! ClientServer Systems! Parallel Systems! Distributed Systems! Network Types 18.1 Centralized Systems! Run on a single computer system and
More informationLoad Imbalance Analysis
With CrayPat Load Imbalance Analysis Imbalance time is a metric based on execution time and is dependent on the type of activity: User functions Imbalance time = Maximum time Average time Synchronization
More informationHow to make the computer understand? Lecture 15: Putting it all together. Example (Output assembly code) Example (input program) Anatomy of a Computer
How to make the computer understand? Fall 2005 Lecture 15: Putting it all together From parsing to code generation Write a program using a programming language Microprocessors talk in assembly language
More informationGRID SEARCHING Novel way of Searching 2D Array
GRID SEARCHING Novel way of Searching 2D Array Rehan Guha Institute of Engineering & Management Kolkata, India Abstract: Linear/Sequential searching is the basic search algorithm used in data structures.
More informationOperation Count; Numerical Linear Algebra
10 Operation Count; Numerical Linear Algebra 10.1 Introduction Many computations are limited simply by the sheer number of required additions, multiplications, or function evaluations. If floatingpoint
More information