CS 575 Parallel Processing
- Dortha Watson
- 8 years ago
Transcription
1 CS 575 Parallel Processing Lecture one: Introduction Wim Bohm Colorado State University Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 license.
2 Course Topics: introduction and background (orders of magnitude, recurrences); models of parallel computing, communication; performance, speedup, efficiency; parallel algorithms: dense linear algebra, sorting, graphs, search, Fast Fourier Transform. CS575 lecture 1 2
3 Course Organization. Course reorganization: unite 575 and 575dl; modernize with more parallel algorithms and GPUs. We have separate course streams in networking and distributed systems. Check the web page regularly: course organization is described on the web (let's go look). The project changes regularly to stay fresh. Second half of the course: GPUs/CUDA.
4 Cost-Effective Parallel Computing. Off-the-shelf commodity processors are very fast, and memory is very cheap. Building a processor that is a small factor faster costs an order of magnitude more. Clusters: the cheapest way to get more performance is a multiprocessor. NoW: networks of workstations. Datacenters employ O(100K) simple processors with cheap interconnects. A workstation can be an SMP: shared memory, with a bus or crossbar (e.g. Cray).
5 Wile E. Coyote's Parallel Computer: get a lot of the fastest processors; get a lot of memory per processor; get the fastest network; hook it all together. And then what???
11 Now you gotta program it! Parallel programming introduces: task partitioning and task scheduling; data partitioning and distribution; synchronization; load balancing; latency issues (hiding, tolerance).
16 Problem with Wile E. Coyote Architecture. For high speed, processors have lots of state: cache, stack, global memory. To tolerate latency, we need a fast context switch. WHY? No free lunch: we can't have both, certainly not if the processor was not designed for both. Memory wall: memory gets slower and slower in terms of the number of cycles it takes to access it. The memory hierarchy gets more complex.
17 Sequential Algorithms. Efficient sequential algorithms: minimize time and space; maximize state (avoiding re-computation). Efficiency is portable: an efficient program on a Pentium ~ an efficient program on an Opteron.
18 Parallel Algorithms. Efficient parallel algorithms: use efficient sequential algorithms; maximize parallelism (re-computation is sometimes better than communication); minimize overhead (synchronization, remote accesses). Parallel efficiency is architecture dependent.
19 Speedup. Ideal: n processors → n-fold speedup. The ideal is not always possible. WHY? Tasks are data dependent; not all processors are always busy; remote data needs communication (memory wall PLUS communication wall). Linear speedup: $\alpha n$ speedup ($\alpha \le 1$).
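The speedup terms on this slide can be made concrete with a small calculation. This is only a sketch: it assumes the usual definitions (speedup = serial time / parallel time, and efficiency = speedup / n, which plays the role of α here). The function names and the timings are hypothetical and not taken from the lecture.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup S = T_1 / T_n (assumed definition)."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, n: int) -> float:
    """Efficiency = S / n; corresponds to alpha in 'alpha * n speedup'."""
    return speedup(t_serial, t_parallel) / n


if __name__ == "__main__":
    # Hypothetical timings: 100 s on one processor, 16 s on 8 processors.
    s = speedup(100.0, 16.0)
    a = efficiency(100.0, 16.0, 8)
    print(f"speedup = {s:.2f}, alpha = {a:.3f}")
```

With these made-up numbers α is below 1, which is the linear (sub-ideal) case described above.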
20 Super linear speedup: $\alpha > 1$. Discuss... is it possible?
21 Super linear speedup: $\alpha > 1$. Nonsense! Because we can execute the faster parallel program sequentially.
22 Super linear speedup: $\alpha > 1$. No nonsense!! Because parallel computers do not just have more processors, they have more local memory / caches.
23 Parallel Programming Paradigms. Implicit parallel programming (super compilers): the compiler extracts parallelism from sequential code, distributes data, and creates and schedules tasks. Complication: side effects. The sequential order of reads and writes to a memory location determines the program outcome; a parallelizing compiler must obey the sequential order of side-effecting statements and still create parallelism; pointers, aliases, and indirect array references make analyzing which statements access which locations hard or impossible. 40 years of compiler research for general-purpose parallel computing has not brought much result.
24 Paradigms cont. Implicit parallel programming, cont. Simple, clean case: Functional Programming (FP). Functions have no side effects, so the order of execution is less constrained: in F( P(x,y), Q(y,z) ), P and Q can be executed in parallel. Simple single-assignment memory model: no pointers, no write-after-read or write-after-write hazards (dataflow semantics). FP was long doomed: too high level, too inefficient, because the simple memory model causes lots of copies. FP is coming back: the MapReduce approach in data centers (Google) is a data-parallel functional paradigm.
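The F( P(x,y), Q(y,z) ) case can be sketched in a few lines. The bodies of P, Q and F below are placeholders invented for illustration; the only point is that P and Q are side-effect free, so they may be submitted concurrently, while F has to wait for both results. Note that in standard CPython builds threads do not run CPU-bound code truly in parallel, so this shows the dependency structure rather than a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def P(x, y):
    # Pure function: the result depends only on the arguments.
    return x + y


def Q(y, z):
    # Also pure, and independent of P, so the two calls can overlap.
    return y * z


def F(a, b):
    return a - b


def evaluate_parallel(x, y, z):
    """Evaluate F(P(x, y), Q(y, z)) with P and Q submitted concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        p_future = pool.submit(P, x, y)
        q_future = pool.submit(Q, y, z)
        # F needs both values: a dataflow dependency on the two futures.
        return F(p_future.result(), q_future.result())


if __name__ == "__main__":
    print(evaluate_parallel(1, 2, 3))
```

Because neither function writes shared state, the concurrent result equals the sequential one regardless of which call finishes first.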
25 Explicit parallel programming. Multithreading: OpenMP. Message passing: MPI. Data parallel programming (important niche): CUDA. Explicit parallelism complicates programming: creation, allocation, and scheduling of processes; data partitioning; synchronization (semaphores, locks, messages).
26 Example 1: Weather Prediction. Area and segments: $3000 \times 3000 \times 11$ cubic miles, divided into $0.1 \times 0.1 \times 0.1$ cubic mile segments: ~$10^{11}$ segments. Two-day prediction in half-hour time steps: ~100 time steps. Computation per segment: temperature, pressure, humidity, wind speed, wind direction, for each time step in each segment. Assume ~100 FLOPs per time step per segment.
27 Performance: Weather Prediction. Computational requirement: $10^{15}$ FLOPs ($10^{11}$ segments $\times$ 100 time steps $\times$ 100 FLOPs). Assume one FLOP per clock cycle on 1 core at 4 GHz. Total serial time: $25 \times 10^{4}$ sec ~ 70 hours. Not too good for a 48-hour weather prediction.
28 Parallel Weather Prediction. 1K workstations, grid connected: $10^{8}$ segment computations per processor; $10^{8}$ instructions per second; 100 instructions per segment computation; 100 time steps: $10^{4}$ seconds = ~3 hours. Much more acceptable. Assumption: communication is not a problem here. Why is this assumption reasonable? More workstations: finer grid, better accuracy.
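The serial and parallel estimates for the weather example can be re-derived with a short script. This is a back-of-the-envelope sketch using the slide figures (4 GHz and one FLOP per cycle for the serial case; 1000 workstations at $10^{8}$ instructions per second for the parallel case). The names are hypothetical, and communication cost is ignored, as the slide assumes.

```python
# Slide figures; the segment count is computed exactly rather than rounded.
SEGMENTS = 3000 * 3000 * 11 / (0.1 ** 3)  # about 1e11 segments
TIME_STEPS = 100                          # two days in half-hour steps
OPS_PER_SEGMENT_STEP = 100                # FLOPs (or instructions) per segment


def serial_hours(clock_hz: float = 4e9) -> float:
    """One core, one FLOP per clock cycle."""
    total_ops = SEGMENTS * TIME_STEPS * OPS_PER_SEGMENT_STEP
    return total_ops / clock_hz / 3600


def parallel_hours(workers: int = 1000, instr_per_sec: float = 1e8) -> float:
    """Segments split evenly over the workers; communication ignored."""
    ops_per_worker = SEGMENTS / workers * TIME_STEPS * OPS_PER_SEGMENT_STEP
    return ops_per_worker / instr_per_sec / 3600


if __name__ == "__main__":
    print(f"serial:   {serial_hours():.1f} hours")
    print(f"parallel: {parallel_hours():.1f} hours")
```

The two results land close to the rounded "~70 hours" and "~3 hours" quoted on the slides.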
29 Example 2: N-body problem. Astronomy: bodies in space attract each other through gravitational force (Newton's law): $O(n^2)$ calculations per snapshot. Galaxy: ~$10^{11}$ bodies -> ~$10^{22}$ calculations/snapshot. At 1 microsecond per calculation, one snapshot takes $10^{16}$ secs = ~$10^{11}$ days = ~$3 \times 10^{8}$ years. Is parallelism going to help us? NO. What does help? A better algorithm: Barnes-Hut. It divides the space in a quad tree (or oct tree) and treats far-away quads as one body: $O(n \log n)$. How much time per snapshot now?
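One way to approach the closing question is to plug the slide figures into both cost models. The sketch below makes strong simplifying assumptions that are not in the lecture: it treats $O(n^2)$ as exactly $n^2$ calculations and $O(n \log n)$ as exactly $n \log_2 n$, with 1 microsecond per calculation and no constant factors. Function names are hypothetical.

```python
import math

CALC_TIME = 1e-6                      # seconds per calculation (slide figure)
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def direct_seconds(n: float) -> float:
    """All-pairs cost model: n^2 calculations per snapshot."""
    return n * n * CALC_TIME


def barnes_hut_seconds(n: float) -> float:
    """Tree-based cost model: n * log2(n) calculations (constant factor assumed 1)."""
    return n * math.log2(n) * CALC_TIME


if __name__ == "__main__":
    n = 1e11  # bodies in the galaxy example
    print(f"direct:     {direct_seconds(n) / SECONDS_PER_YEAR:.2e} years")
    print(f"Barnes-Hut: {barnes_hut_seconds(n) / SECONDS_PER_DAY:.1f} days")
```

Under these assumptions the algorithmic change brings one snapshot from hundreds of millions of years down to a matter of weeks, before any parallelism is applied.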
30 Other Challenging Applications: satellite data acquisition (billions of bits / sec); pollution levels, remote sensing of materials; image recognition; discrete optimization problems (planning, scheduling, VLSI design); bio-informatics, computational chemistry; airplane/satellite/vehicle design; Internet (Google search).
31 Application Specific Architectures. ASICs: Application Specific Integrated Circuits. Levels of specificity: full custom ASICs; standard cell ASICs; field programmable gate arrays. Computational models: dataflow graphs; systolic arrays. Promise: orders of magnitude better performance, lower power.
32 ASICs cont. How much faster than general purpose? Example: 1D 1024 FFT. General purpose machine (G4): 25 micro secs. ASIC device (MIT Lincoln Labs): 32 nano secs (roughly 780 times faster). The ASIC device uses 20 milliwatts (100 * less power). Other applications: Finite Impulse Response (FIR) filters; matrix multiply; QR decomposition. What do these all have in common?
33 Background. If you do not have the necessary background in analysis of algorithms, see the book Introduction to Algorithms by Cormen, Leiserson, Rivest and Stein, or go online. Topics to study: introduction; growth of functions; summations; recurrences.
34 Background: Orders of Magnitude (O, Ω, Θ). $f(x) = O(g(x))$ iff $\exists c, n_0 : f(x) < c \cdot g(x) \;\; \forall x > n_0$. Used for the upper bound of algorithm complexity: this particular algorithm takes at most $c \cdot g(n)$ time. $f(x) = \Omega(g(x))$ iff $\exists c, n_0 : f(x) > c \cdot g(x) \;\; \forall x > n_0$. Used for the lower bound of problem complexity: any algorithm for solving this problem takes at least $c \cdot g(n)$ time. $f(x) = \Theta(g(x))$ iff $f(x) = O(g(x))$ and $f(x) = \Omega(g(x))$: tight bound.
35 Background: Closed problems. Closed problem P: there is an algorithm X with $O(X) = \Omega(P)$, e.g. sorting has the tight bound $\Theta(n \log n)$. Problem P has an algorithmic gap: P is not closed, e.g. all NP-complete problems (problems with a polynomial lower bound but currently an exponential upper bound, such as TSP).
36 Recurrence Relations. Algorithmic complexity is often described using recurrence relations: $f(n) = R( f(1) .. f(n-1) )$. Two important types of recurrence relations: linear; divide and conquer. cs420(dl) covers these.
37 Repeated substitution. Simple recurrence relations (one recurrent term in the rhs) can sometimes be solved using repeated substitution. Two types: Linear and DivCo. Linear: $F(n) = aF(n-d) + g(n)$, base: $F(1) = v_1$. DivCo: $F(n) = aF(n/d) + g(n)$, base: $F(1) = v_1$. Two questions: what is the pattern, and how often is it applied until we hit the base case?
38 Linear Example. $M(n) = 2M(n-1) + 1$, $M(1) = 1$ (recognize this recurrence?). Repeated substitution: $M(n) = 2M(n-1) + 1 = 2(2M(n-2) + 1) + 1 = 4M(n-2) + 2 + 1 = 4(2M(n-3) + 1) + 2 + 1 = 8M(n-3) + 4 + 2 + 1$. Inductive step: $M(n) = 2^k M(n-k) + 2^{k-1} + 2^{k-2} + \dots + 2 + 1$. Hit base for $k = n-1$: $M(n) = 2^{n-1} M(1) + 2^{n-2} + 2^{n-3} + \dots + 2 + 1 = 2^n - 1$. For more on linear recurrence relations, see 420dl.
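The closed form can be checked numerically against the recurrence. A minimal sketch (the function name simply mirrors the slide's M; evaluating it iteratively is a choice made here, not something from the lecture):

```python
def M(n: int) -> int:
    """M(n) = 2*M(n-1) + 1 with M(1) = 1, evaluated iteratively."""
    value = 1                      # base case M(1)
    for _ in range(n - 1):         # apply the recurrence n-1 times
        value = 2 * value + 1
    return value


if __name__ == "__main__":
    for n in range(1, 11):
        # Compare against the closed form obtained by repeated substitution.
        assert M(n) == 2 ** n - 1
    print([M(n) for n in range(1, 6)])
```

The loop runs n-1 times, which matches the "hit base for k = n-1" step of the derivation.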
39 DivCo example. Merge sort: $T(n) = 2T(n/2) + n$, $T(1) = 1$, $n = 2^k$. $T(n) = 2(2T(n/4) + n/2) + n = 4T(n/4) + 2n = 8T(n/8) + 3n = \dots$ Inductive step: $T(n) = 2^k T(n/2^k) + kn$. Hit base for $k = \log n$: $T(n) = 2^k T(n/2^k) + kn = n + kn = O(n \log n)$.
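The same kind of check works for the merge sort recurrence. The sketch below evaluates T(n) directly for powers of two and compares it with $n + kn$ where $n = 2^k$; the memoization is only a convenience and is not part of the lecture.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def T(n: int) -> int:
    """T(n) = 2*T(n/2) + n with T(1) = 1, for n a power of two."""
    if n == 1:
        return 1
    return 2 * T(n // 2) + n


if __name__ == "__main__":
    for k in range(0, 11):
        n = 2 ** k
        # Closed form from the derivation: n + k*n with k = log2(n).
        assert T(n) == n + k * n
    print([(2 ** k, T(2 ** k)) for k in range(5)])
```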
40 Another one: binary search. $f(n) = f(n/2) + c$, $f(1) = 1$; let $n = 2^k$. $f(n) = f(n/2) + c = f(n/4) + 2c = f(n/8) + 3c = \dots = f(n/2^k) + kc$. Hit base for $k = \log n$: $f(n) = f(1) + c \log n = O(\log n)$.
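To see the logarithmic bound in practice, one can count the loop iterations of a binary search. This is an illustrative implementation written for this note (not code from the course); it returns the number of halving steps alongside the index.

```python
def binary_search(items, target):
    """Search a sorted list; return (index or -1, number of loop iterations)."""
    lo, hi = 0, len(items) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid, steps
        if items[mid] < target:
            lo = mid + 1           # discard the lower half
        else:
            hi = mid - 1           # discard the upper half
    return -1, steps


if __name__ == "__main__":
    data = list(range(1024))
    worst = max(binary_search(data, t)[1] for t in data)
    # For n elements the loop runs at most floor(log2 n) + 1 times.
    print(worst, len(data).bit_length())
```

Each iteration halves the remaining range, which is the $f(n) = f(n/2) + c$ step of the recurrence.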
41 Master Method. Cookbook approach to solution, based on repeated substitution (Cormen et al. or Rosen). $A_n = C A_{n/d} + k n^p$. $A_n = O(n^p)$ if $C < d^p$, e.g. $A_n = 3A_{n/2} + n^2$. $A_n = O(n^p \log n)$ if $C = d^p$, e.g. $A_n = 2A_{n/2} + n$. $A_n = O(n^{\log_d C})$ if $C > d^p$, e.g. $A_n = 3A_{n/2} + n$. Do binary search and merge sort with this method.
42 Examples. Merge sort: $T(n) = 2T(n/2) + n$, $T(1) = 1$: C = ? d = ? p = ? $d^p$ = ? T(n) = O(???). Binary search: $f(n) = f(n/2) + c$, $f(1) = 1$: C = ? d = ? p = ? $d^p$ = ? f(n) = O(???).
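The three cases of the cookbook can be written as a small classifier, which is one way to check answers to the exercise. This is a sketch under the simplified form $A_n = C A_{n/d} + k n^p$ used above; the function name and the output formatting are invented here, and the parameter choices for merge sort and binary search in the usage lines are this note's reading of the two recurrences.

```python
import math


def master_method(C: float, d: float, p: float) -> str:
    """Classify A_n = C*A_{n/d} + k*n^p by comparing C with d^p."""
    dp = d ** p
    if math.isclose(C, dp):
        return f"O(n^{p} log n)"
    if C < dp:
        return f"O(n^{p})"
    # C > d^p: the exponent is log_d(C).
    return f"O(n^{math.log(C, d):.3f})"


if __name__ == "__main__":
    print(master_method(2, 2, 1))  # merge sort: T(n) = 2T(n/2) + n
    print(master_method(1, 2, 0))  # binary search: f(n) = f(n/2) + c
    print(master_method(3, 2, 2))  # slide example with C < d^p
    print(master_method(3, 2, 1))  # slide example with C > d^p
```

For binary search the classifier reports $n^0 \log n$, that is, $O(\log n)$, in agreement with the repeated-substitution result.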
More informationApplications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61
F# Applications to Computational Financial and GPU Computing May 16th Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 Today! Why care about F#? Just another fashion?! Three success stories! How Alea.cuBase
More informationChapter 12: Multiprocessor Architectures. Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup
Chapter 12: Multiprocessor Architectures Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup Objective Be familiar with basic multiprocessor architectures and be able to
More informationLecture 2: Universality
CS 710: Complexity Theory 1/21/2010 Lecture 2: Universality Instructor: Dieter van Melkebeek Scribe: Tyson Williams In this lecture, we introduce the notion of a universal machine, develop efficient universal
More informationCOSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters
COSC 6374 Parallel I/O (I) I/O basics Fall 2012 Concept of a clusters Processor 1 local disks Compute node message passing network administrative network Memory Processor 2 Network card 1 Network card
More informationChapter 18: Database System Architectures. Centralized Systems
Chapter 18: Database System Architectures! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types 18.1 Centralized Systems! Run on a single computer system and
More informationGPU Parallel Computing Architecture and CUDA Programming Model
GPU Parallel Computing Architecture and CUDA Programming Model John Nickolls Outline Why GPU Computing? GPU Computing Architecture Multithreading and Arrays Data Parallel Problem Decomposition Parallel
More informationIntroduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 37 Course outline Introduction to GPU hardware
More informationLoad Imbalance Analysis
With CrayPat Load Imbalance Analysis Imbalance time is a metric based on execution time and is dependent on the type of activity: User functions Imbalance time = Maximum time Average time Synchronization
More informationParallel Computing: Strategies and Implications. Dori Exterman CTO IncrediBuild.
Parallel Computing: Strategies and Implications Dori Exterman CTO IncrediBuild. In this session we will discuss Multi-threaded vs. Multi-Process Choosing between Multi-Core or Multi- Threaded development
More informationPhysical Data Organization
Physical Data Organization Database design using logical model of the database - appropriate level for users to focus on - user independence from implementation details Performance - other major factor
More informationGRID SEARCHING Novel way of Searching 2D Array
GRID SEARCHING Novel way of Searching 2D Array Rehan Guha Institute of Engineering & Management Kolkata, India Abstract: Linear/Sequential searching is the basic search algorithm used in data structures.
More informationPerformance Metrics and Scalability Analysis. Performance Metrics and Scalability Analysis
Performance Metrics and Scalability Analysis 1 Performance Metrics and Scalability Analysis Lecture Outline Following Topics will be discussed Requirements in performance and cost Performance metrics Work
More informationSupercomputing and Big Data: Where are the Real Boundaries and Opportunities for Synergy?
HPC2012 Workshop Cetraro, Italy Supercomputing and Big Data: Where are the Real Boundaries and Opportunities for Synergy? Bill Blake CTO Cray, Inc. The Big Data Challenge Supercomputing minimizes data
More informationLecture 1 Introduction to Parallel Programming
Lecture 1 Introduction to Parallel Programming EN 600.320/420 Instructor: Randal Burns 4 September 2008 Department of Computer Science, Johns Hopkins University Pipelined Processor From http://arstechnica.com/articles/paedia/cpu/pipelining-2.ars
More informationInterconnection Networks. Interconnection Networks. Interconnection networks are used everywhere!
Interconnection Networks Interconnection Networks Interconnection networks are used everywhere! Supercomputers connecting the processors Routers connecting the ports can consider a router as a parallel
More informationOperation Count; Numerical Linear Algebra
10 Operation Count; Numerical Linear Algebra 10.1 Introduction Many computations are limited simply by the sheer number of required additions, multiplications, or function evaluations. If floating-point
More informationUTS: An Unbalanced Tree Search Benchmark
UTS: An Unbalanced Tree Search Benchmark LCPC 2006 1 Coauthors Stephen Olivier, UNC Jun Huan, UNC/Kansas Jinze Liu, UNC Jan Prins, UNC James Dinan, OSU P. Sadayappan, OSU Chau-Wen Tseng, UMD Also, thanks
More informationInformation Processing, Big Data, and the Cloud
Information Processing, Big Data, and the Cloud James Horey Computational Sciences & Engineering Oak Ridge National Laboratory Fall Creek Falls 2010 Information Processing Systems Model Parameters Data-intensive
More informationWe r e going to play Final (exam) Jeopardy! "Answers:" "Questions:" - 1 -
. (0 pts) We re going to play Final (exam) Jeopardy! Associate the following answers with the appropriate question. (You are given the "answers": Pick the "question" that goes best with each "answer".)
More informationTrends in High-Performance Computing for Power Grid Applications
Trends in High-Performance Computing for Power Grid Applications Franz Franchetti ECE, Carnegie Mellon University www.spiral.net Co-Founder, SpiralGen www.spiralgen.com This talk presents my personal views
More informationMiddleware and Distributed Systems. Introduction. Dr. Martin v. Löwis
Middleware and Distributed Systems Introduction Dr. Martin v. Löwis 14 3. Software Engineering What is Middleware? Bauer et al. Software Engineering, Report on a conference sponsored by the NATO SCIENCE
More informationPerformance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries
Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Shin Morishima 1 and Hiroki Matsutani 1,2,3 1Keio University, 3 14 1 Hiyoshi, Kohoku ku, Yokohama, Japan 2National Institute
More information