# Chapter 2 Parallel Architecture, Software And Performance


1 Chapter 2 Parallel Architecture, Software And Performance. UCSB CS140, T. Yang, 2014. Modified from textbook slides.

2 Roadmap: parallel hardware, parallel software, input and output, performance, parallel program design.

3 Flynn's Taxonomy: SISD (single instruction stream, single data stream), SIMD (single instruction stream, multiple data streams), MISD (multiple instruction streams, single data stream), MIMD (multiple instruction streams, multiple data streams).

4 SIMD: Parallelism achieved by dividing data among the processors and applying the same instruction to multiple data items; this is called data parallelism. A control unit dispatches one instruction to n ALUs, each operating on its own data item (x[1] on ALU 1, x[2] on ALU 2, ..., x[n] on ALU n). Example: for (i = 0; i < n; i++) x[i] += y[i];

5 SIMD: What if we don't have as many ALUs (arithmetic logic units) as data items? Divide the work and process iteratively. Example: m = 4 ALUs and n = 15 data items.

| Round | ALU 1 | ALU 2 | ALU 3 | ALU 4 |
|-------|-------|-------|-------|-------|
| 1 | X[0] | X[1] | X[2] | X[3] |
| 2 | X[4] | X[5] | X[6] | X[7] |
| 3 | X[8] | X[9] | X[10] | X[11] |
| 4 | X[12] | X[13] | X[14] | (idle) |
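The round-by-round schedule above can be sketched directly in C. This is an illustrative emulation (the function names `simd_rounds_add` and `simd_round_count` are not from the slides): each outer iteration is one round, each inner iteration is one ALU lane, and lanes past the end of the data stay idle in the last round.

```c
#include <stddef.h>

/* Emulate SIMD execution with m ALUs on n data items: the work
 * proceeds in ceil(n/m) rounds, and in the final round any lane
 * whose index runs past n simply stays idle. */
void simd_rounds_add(double *x, const double *y, size_t n, size_t m) {
    for (size_t start = 0; start < n; start += m) {   /* one round */
        for (size_t lane = 0; lane < m; lane++) {     /* one ALU per lane */
            size_t i = start + lane;
            if (i < n)                                /* idle lane in last round */
                x[i] += y[i];
        }
    }
}

/* Number of rounds needed: ceil(n/m). For n = 15, m = 4 this is 4,
 * matching the table above. */
size_t simd_round_count(size_t n, size_t m) {
    return (n + m - 1) / m;
}
```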

6 SIMD drawbacks: All ALUs are required to execute the same instruction or remain idle, and in the classic design they must also operate synchronously. The ALUs have no instruction storage. SIMD is efficient for large data-parallel problems, but not flexible enough for more complex parallel problems.

7 Vector Processors: Operate on vectors (arrays) with vector instructions, whereas conventional CPUs operate on individual data elements (scalars). They use vectorized and pipelined functional units, with vector registers to store the data. Example: A[1:10] = B[1:10] + C[1:10]. Instruction execution: read the instruction and decode it, fetch the 10 B values and the 10 C values, add them, and save the results into A.

8 Vector processors Pros/Cons. Pros: fast and easy to use; vectorizing compilers are good at identifying code to exploit, and can also report which code cannot be vectorized, helping the programmer re-evaluate it; high memory bandwidth, since every item in a cache line is used. Cons: don't handle irregular data structures well; limited ability to scale to larger problems.

9 Graphics Processing Units (GPU): Real-time graphics application programming interfaces (APIs) use points, lines, and triangles to internally represent the surface of an object. A graphics processing pipeline converts this internal representation into an array of pixels that can be sent to a computer screen. Several stages of this pipeline (called shader functions) are programmable, typically with just a few lines of C code.

10 GPUs: Shader functions are also implicitly parallel, since they can be applied to multiple elements in the graphics stream. GPUs can often optimize performance by using SIMD parallelism; the current generation of GPUs uses SIMD parallelism, although they are not pure SIMD systems. Market shares: Intel 62%, AMD 21%, NVIDIA 17%.

11 MIMD: Supports multiple simultaneous instruction streams operating on multiple data streams. Typically consists of a collection of fully independent processing units or cores, each with its own control unit and its own ALU. Types of MIMD systems: shared-memory systems (the most popular use multicore processors, with multiple CPUs or cores on a single chip) and distributed-memory systems (computer clusters are the most popular).

12 Shared Memory System: Each processor can access each memory location. The processors usually communicate implicitly by accessing shared data structures. Two designs: UMA (uniform memory access) and NUMA (non-uniform memory access).

13 UMA Multicore Systems Time to access all the memory locations will be the same for all the cores. Figure 2.5

14 AMD 8-core CPU Bulldozer

15 NUMA Multicore System A memory location a core is directly connected to can be accessed faster than a memory location that must be accessed through another chip. Figure 2.6

16 Distributed Memory System Clusters (most popular) A collection of commodity systems. Connected by a commodity interconnection network.

17 Clustered Machines at a Lab and a Datacenter

18 Interconnection networks: affect the performance of both distributed- and shared-memory systems. Two categories: shared-memory interconnects and distributed-memory interconnects.

19 Shared memory interconnects: Bus Parallel communication wires together with some hardware that controls access to the bus. As the number of devices connected to the bus increases, contention for shared bus use increases, and performance decreases.

20 Shared memory interconnects: Switched Interconnect Uses switches to control the routing of data among the connected devices. Crossbar Allows simultaneous communication among different devices. Faster than buses. But higher cost.

21 Distributed memory interconnects: two groups. Direct interconnect: each switch is directly connected to a processor-memory pair, and the switches are connected to each other. Indirect interconnect: switches may not be directly connected to a processor.

22 Direct interconnects: ring, 2D torus (toroidal mesh).

23 Direct interconnect: 2D Mesh vs 2D Torus

24 How to measure network quality? Bandwidth: the rate at which a link can transmit data, usually given in megabits or megabytes per second. Bisection width: a measure of the number of simultaneous communications possible between two subnetworks within a network; the minimum number of links that must be removed to partition the network into two equal halves (2 for a ring). The network is typically divided by a line or plane (the bisection cut).

25 More definitions on network performance: any time data is transmitted, we're interested in how long it takes to reach its destination. Latency: the time that elapses between the source beginning to transmit the data and the destination starting to receive the first byte; sometimes called startup cost. Bandwidth: the rate at which the destination receives data after it has started to receive the first byte.

26 Network transmission cost: message transmission time = α + m·β, where α is the latency or startup cost (seconds), m is the length of the message (bytes), and β = 1/bandwidth (seconds per byte). Typical latency/startup cost: 2 microseconds ~ 1 millisecond. Typical bandwidth: 100 MB ~ 1 GB per second.
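The cost model on this slide is simple enough to encode directly. A minimal sketch in C (the function name and the sample values below are illustrative, not from the slides):

```c
/* Message transmission time model:
 *   T(m) = alpha + m * beta
 * alpha = latency / startup cost (seconds)
 * m     = message length (bytes)
 * beta  = 1 / bandwidth (seconds per byte) */
double transmission_time(double alpha_s, double beta_s_per_byte, double m_bytes) {
    return alpha_s + m_bytes * beta_s_per_byte;
}
```

For example, with α = 2 µs and a bandwidth of 100 MB/s (β = 10⁻⁸ s/byte), a 1 MB message costs about 0.010002 s; the bandwidth term dominates for large messages, the latency term for small ones.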

27 Bisection width vs bisection bandwidth: the example figure contrasts a bisection cut (splitting the network into two equal halves) with a cut that is not a bisection. Bisection bandwidth: sum the bandwidths of the links cut by each bisection, and take the minimum over all bisections.

28 Bisection width, more examples: the bisection width of a 2D torus with p nodes is 2·sqrt(p).

29 Fully connected network: each switch is directly connected to every other switch. Bisection width = p²/4. Figure 2.11

30 Hypercube: built inductively. A one-dimensional hypercube is a fully connected system with two processors. A two-dimensional hypercube is built from two one-dimensional hypercubes by joining corresponding switches. Similarly, a three-dimensional hypercube is built from two two-dimensional hypercubes.

31 Hypercubes: one-, two-, and three-dimensional. Figure 2.12

32 Indirect interconnects Simple examples of indirect networks: Crossbar Omega network A generic indirect network

33 Crossbar indirect interconnect. Figure 2.14

34 An omega network Figure 2.15

35 Commodity Computing Clusters: use already available computing components (commodity servers, interconnection network, and storage). Less expensive, yet upgradable thanks to standardization. Great computing power at low cost.

36 Typical network for a cluster: aggregation switch and rack switches. 40 nodes/rack, nodes in cluster. 1 Gbps bandwidth within a rack, 8 Gbps out of the rack. Node specs: 8-16 cores, 32 GB RAM, TB disks.

37 Layered Network in Clustered Machines: a layered example from Cisco: core, aggregation, and edge (top-of-rack) switches.

38 Hybrid Clusters with GPUs: each node in a CPU/GPU cluster pairs a host CPU with GPUs. A Maryland cluster couples CPUs, GPUs, displays, and storage. Applications in visual and scientific computing.

39 Cloud Computing with Amazon EC2: on-demand elastic computing. Allocate a Linux or Windows cluster only when you need it; pay based on usage time of computing instances/storage; expandable or shrinkable.

40 Usage Examples with Amazon EC2: a 32-node, 64-GPU cluster with 8 TB storage. Each node is an AWS computing instance extended with 2 NVIDIA M2050 GPUs, 22 GB of memory, and a 10 Gbps Ethernet interconnect. \$82/hour to operate (based on the Cycle Computing blog). Annual cost example (8 hours per week, 52 weeks): 82 × 8 × 52 = \$34,112. Otherwise: ~\$150+K to purchase, plus annual datacenter cost. Another example from Cycle Computing Inc: running a 205,000-molecule simulation on 156,000 Amazon cores for 18 hours cost \$33,000.

41 Cache coherence: programmers have no control over caches and when they get updated. The hardware keeps the caches coherent, using either a snooping bus or a directory-based protocol.

42 Cache coherence issue: x = 2; /* shared variable */. Core 0 reads x (so y0 eventually ends up = 2) and then sets x = 7; core 1 computes y1 = 3*x (so y1 eventually ends up = 6) and later z1 = 4*x. Does z1 become 4*7 or 4*2? It depends on whether core 1's cached copy of x has been updated.

43 Snooping Cache Coherence: all cores share a bus. When core 0 updates the copy of x stored in its cache (e.g., setting x = 7), it also broadcasts this information across the bus. If core 1 is snooping the bus, it sees that x has been updated and can mark its own copy of x as invalid.

44 PARALLEL SOFTWARE

45 The burden is on software: hardware and compilers alone cannot keep up the pace needed. From now on: in shared-memory programs, start a single process and fork threads; the threads carry out tasks. In distributed-memory programs, start multiple processes; the processes carry out tasks.

46 SPMD (single program, multiple data): an SPMD program consists of a single executable that can behave as if it were multiple different programs through the use of conditional branches: if (I'm thread/process i) do this; else do that;. On shared-memory machines: threads (or processes). On distributed-memory machines: processes.
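On a shared-memory machine, the SPMD branch above can be sketched with POSIX threads. This is an illustrative sketch, not from the slides: the worker logic, the values written per rank, and the helper `run_spmd` are all made up for the example; the point is that every thread runs the same function and diverges only on its rank.

```c
#include <pthread.h>

#define NTHREADS 4

static int results[NTHREADS];

/* Every thread runs this same function (single program) but
 * branches on its rank, so the threads behave differently. */
static void *worker(void *arg) {
    int rank = *(int *)arg;
    if (rank == 0)
        results[rank] = -1;        /* "do this": rank 0 takes a special role */
    else
        results[rank] = rank * 10; /* "do that": other ranks do the bulk work */
    return NULL;
}

/* Fork NTHREADS threads, join them all, and return the per-rank results. */
int *run_spmd(void) {
    pthread_t tids[NTHREADS];
    int ranks[NTHREADS];
    for (int r = 0; r < NTHREADS; r++) {
        ranks[r] = r;
        pthread_create(&tids[r], NULL, worker, &ranks[r]);
    }
    for (int r = 0; r < NTHREADS; r++)
        pthread_join(tids[r], NULL);
    return results;
}
```

A distributed-memory version would look the same in spirit, with MPI ranks in place of thread ids.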

47 Challenges: nondeterminism of execution order. With printf("Thread %d > my_val = %d\n", my_rank, my_x); the output order can differ between runs. Execution 1: Thread 0 > my_val = 7, then Thread 1 > my_val = 19. Execution 2: Thread 1 > my_val = 19, then Thread 0 > my_val = 7.

48 Input and Output, two conventions for I/O. Option 1: in distributed-memory programs, only process 0 accesses stdin; in shared-memory programs, only the master thread (thread 0) accesses stdin. Option 2: all processes/threads can access stdout and stderr; however, because the order of output to stdout is indeterminate, in most cases only a single process/thread is used for all output to stdout other than debugging output.

49 Input and Output, practical strategies: debug output should always include the rank or id of the process/thread generating it. Only a single process/thread should attempt to access any file other than stdin, stdout, or stderr; for example, each process/thread can open its own private file for reading or writing, but no two processes/threads should open the same file. fflush(stdout) may be necessary to ensure output is not delayed when order is important: printf("hello\n"); fflush(stdout);
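The first strategy, tagging every debug line with the rank, can be packaged as a tiny helper. The function name `format_debug_line` is illustrative; it reproduces the message format used in the nondeterminism example above.

```c
#include <stdio.h>

/* Format a debug line that always carries the rank/id of the
 * process or thread producing it. Returns the line length,
 * as snprintf does. */
int format_debug_line(char *buf, size_t bufsize, int my_rank, int my_val) {
    return snprintf(buf, bufsize, "Thread %d > my_val = %d\n", my_rank, my_val);
}
```

After printing such a line with printf, call fflush(stdout) so the output is not held back in a buffer.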

50 PERFORMANCE

51 Speedup: number of cores = p; serial run-time = T_serial; parallel run-time = T_parallel. In the ideal case, T_parallel = T_serial / p.

52 Speedup of a parallel program: S = T_serial / T_parallel. The graph contrasts perfect speedup with actual speedup.

53 Speedup Graph Interpretation. Linear speedup: speedup increases proportionally as p increases. Perfect linear speedup: speedup = p. Superlinear speedup: speedup > p; not possible in theory, but possible in practice, e.g., when the data in the sequential code does not fit into memory while the parallel code divides the data among many machines so that each partition does fit.

54 Efficiency of a parallel program: E = Speedup / p = (T_serial / T_parallel) / p = T_serial / (p · T_parallel). Efficiency measures how well-utilized the processors are, compared to effort wasted in communication and synchronization. Example: T_serial = 24 s and T_parallel = 4 s on p = 8 cores gives S = 6 and E = 0.75.
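The two formulas are easy to wrap as helpers for timing experiments. A minimal sketch (the function names are illustrative):

```c
/* Speedup S = T_serial / T_parallel; efficiency E = S / p,
 * equivalently E = T_serial / (p * T_parallel). */
double speedup(double t_serial, double t_parallel) {
    return t_serial / t_parallel;
}

double efficiency(double t_serial, double t_parallel, int p) {
    return speedup(t_serial, t_parallel) / p;
}
```

With T_serial = 24 s, T_parallel = 4 s, and p = 8, these return S = 6 and E = 0.75.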

55 Typical speedup and efficiency of parallel code

56 Impact of Problem Sizes on Speedups and efficiencies

57 Problem Size Impact on Speedup and Efficiency

58 Strongly scalable: if the efficiency stays fixed as we increase the number of processors (processes/threads) without increasing the problem size, the solution is strongly scalable. Example: T_serial = n²; T_parallel = (n + n²)/p gives efficiency = n²/(n² + n), independent of p, so it is strongly scalable. In contrast, T_parallel = n + n²/p gives efficiency = n²/(n² + np), which falls as p grows, so it is not strongly scalable.

59 Weakly scalable: if the efficiency stays fixed when the problem size is increased at the same rate as the number of processors, the solution is weakly scalable. Example: T_serial = n²; T_parallel = n + n²/p gives efficiency = n²/(n² + np). Increasing both the problem size and the number of processors by a factor k: (kn)²/((kn)² + (kn)(kp)) = n²/(n² + np), unchanged.
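The weak-scaling algebra above can be checked numerically. This sketch encodes the slide's example efficiency n²/(n² + np) (the function name is illustrative) and confirms that scaling n and p by the same factor leaves it unchanged.

```c
/* Efficiency of the example parallel time PT = n + n^2/p:
 *   E = n^2 / (n^2 + n*p)
 * Scaling n and p by the same factor k cancels out,
 * so the solution is weakly scalable. */
double weak_efficiency(double n, double p) {
    return n * n / (n * n + n * p);
}
```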

60 Amdahl's Law, a limit on parallel performance: unless virtually all of a serial program is parallelized, the possible speedup is going to be limited, regardless of the number of cores available.

61 Example of Amdahl's Law: we can parallelize 90% of a serial program, and the parallelization is perfect regardless of the number of cores p. T_serial = 20 seconds. Runtime of the parallelizable part is 0.9 × T_serial / p = 18/p; runtime of the unparallelizable part is 0.1 × T_serial = 2. Overall parallel run-time: T_parallel = 0.9 × T_serial / p + 0.1 × T_serial = 18/p + 2.

62 Example (cont.): S = T_serial / (0.9 × T_serial / p + 0.1 × T_serial) = 20 / (18/p + 2). As p grows, S approaches but never reaches 20/2 = 10.
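In the usual general form, with parallelizable fraction r, Amdahl's law is S(p) = 1 / ((1 − r) + r/p), bounded above by 1/(1 − r); the slide's example is r = 0.9, bound 10. A sketch (the function name is illustrative):

```c
/* Amdahl's law: speedup on p cores when a fraction r of the
 * serial run-time is perfectly parallelizable.
 *   S(p) = 1 / ((1 - r) + r/p),  bounded above by 1/(1 - r). */
double amdahl_speedup(double r, double p) {
    return 1.0 / ((1.0 - r) + r / p);
}
```

For r = 0.9 and p = 9 this gives S = 5; even with a billion cores the speedup stays below 10.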

63 How to measure sequential and parallel time? Which time: CPU time vs. wall-clock time? Which program segment is of interest? Exclude setup/startup time and measure from the start to the finish of the segment.

64 Taking Timings for a Code Segment: example functions are MPI_Wtime() in MPI and gettimeofday() on Linux.
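A wall-clock timer built on gettimeofday() can be sketched as below; the wrapper name `wall_time` is illustrative (MPI programs would call MPI_Wtime() instead).

```c
#include <stddef.h>
#include <sys/time.h>

/* Wall-clock timestamp in seconds, built on Linux's gettimeofday(). */
double wall_time(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}
```

Typical use brackets only the segment of interest: `double start = wall_time(); /* work */ double elapsed = wall_time() - start;`.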

65 Taking Timings

66 Measure parallel time with a barrier

67 PARALLEL PROGRAM DESIGN

68 Foster's methodology: 4-stage design. 1. Partitioning: divide the computation to be performed, and the data operated on by the computation, into small tasks. The focus here should be on identifying tasks that can be executed in parallel.

69 Foster's methodology. 2. Communication: identify the dependences among tasks and determine the inter-task communication.

70 Foster's methodology. 3. Agglomeration or aggregation: combine tasks and communications identified in the first step into larger tasks, to reduce communication overhead and produce coarser-grained tasks; this may sometimes reduce parallelism.

71 Foster's methodology. 4. Mapping: assign the composite tasks identified in the previous step to processes/threads. This should be done so that communication is minimized and each process/thread gets roughly the same amount of work.
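One common mapping for equal-cost tasks is a block distribution: n tasks are split into p contiguous blocks, each process owning one block. The helper below is an illustrative sketch, not part of Foster's text.

```c
/* Block mapping: n tasks are divided into p contiguous blocks of
 * size ceil(n/p); task i is assigned to process i / blocksize.
 * This balances the load when all tasks cost roughly the same. */
int block_owner(int task, int n, int p) {
    int blocksize = (n + p - 1) / p;   /* ceil(n/p) */
    return task / blocksize;
}
```

For n = 10 tasks on p = 3 processes, tasks 0-3 go to process 0, 4-7 to process 1, and 8-9 to process 2.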

72 Concluding Remarks (1). Parallel hardware: shared-memory and distributed-memory architectures; network topologies for the interconnect. Parallel software: we focus on software for homogeneous MIMD systems, consisting of a single program that obtains parallelism by branching (SPMD programs).

73 Concluding Remarks (2). Input and output: one process or thread can access stdin, and all processes can access stdout and stderr; however, because of nondeterminism, except for debug output we'll usually have a single process or thread access stdout. Performance: speedup/efficiency, Amdahl's law, scalability. Parallel program design: Foster's methodology.


Distributed Systems LEEC (2005/06 2º Sem.) Introduction João Paulo Carvalho Universidade Técnica de Lisboa / Instituto Superior Técnico Outline Definition of a Distributed System Goals Connecting Users

### David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems

David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems About me David Rioja Redondo Telecommunication Engineer - Universidad de Alcalá >2 years building and managing clusters UPM

### Chapter 6. Inside the System Unit. What You Will Learn... Computers Are Your Future. What You Will Learn... Describing Hardware Performance

What You Will Learn... Computers Are Your Future Chapter 6 Understand how computers represent data Understand the measurements used to describe data transfer rates and data storage capacity List the components

### Big Graph Processing: Some Background

Big Graph Processing: Some Background Bo Wu Colorado School of Mines Part of slides from: Paul Burkhardt (National Security Agency) and Carlos Guestrin (Washington University) Mines CSCI-580, Bo Wu Graphs

### Parallel and Distributed Computing Programming Assignment 1

Parallel and Distributed Computing Programming Assignment 1 Due Monday, February 7 For programming assignment 1, you should write two C programs. One should provide an estimate of the performance of ping-pong

### NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

NVIDIA CUDA Software and GPU Parallel Computing Architecture David B. Kirk, Chief Scientist Outline Applications of GPU Computing CUDA Programming Model Overview Programming in CUDA The Basics How to Get

### SOC architecture and design

SOC architecture and design system-on-chip (SOC) processors: become components in a system SOC covers many topics processor: pipelined, superscalar, VLIW, array, vector storage: cache, embedded and external

### FPGA-based Multithreading for In-Memory Hash Joins

FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded

### Chapter 19 Cloud Computing for Multimedia Services

Chapter 19 Cloud Computing for Multimedia Services 19.1 Cloud Computing Overview 19.2 Multimedia Cloud Computing 19.3 Cloud-Assisted Media Sharing 19.4 Computation Offloading for Multimedia Services 19.5

### Overlapping Data Transfer With Application Execution on Clusters

Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm reid@cs.toronto.edu stumm@eecg.toronto.edu Department of Computer Science Department of Electrical and Computer

### AMD WHITE PAPER GETTING STARTED WITH SEQUENCEL. AMD Embedded Solutions 1

AMD WHITE PAPER GETTING STARTED WITH SEQUENCEL AMD Embedded Solutions 1 Optimizing Parallel Processing Performance and Coding Efficiency with AMD APUs and Texas Multicore Technologies SequenceL Auto-parallelizing

### Parallel Computing: Strategies and Implications. Dori Exterman CTO IncrediBuild.

Parallel Computing: Strategies and Implications Dori Exterman CTO IncrediBuild. In this session we will discuss Multi-threaded vs. Multi-Process Choosing between Multi-Core or Multi- Threaded development

### Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

Zadara Storage Cloud A whitepaper @ZadaraStorage Zadara delivers two solutions to its customers: On- premises storage arrays Storage as a service from 31 locations globally (and counting) Some Zadara customers

### Middleware and Distributed Systems. Introduction. Dr. Martin v. Löwis

Middleware and Distributed Systems Introduction Dr. Martin v. Löwis 14 3. Software Engineering What is Middleware? Bauer et al. Software Engineering, Report on a conference sponsored by the NATO SCIENCE

### OpenMP Programming on ScaleMP

OpenMP Programming on ScaleMP Dirk Schmidl schmidl@rz.rwth-aachen.de Rechen- und Kommunikationszentrum (RZ) MPI vs. OpenMP MPI distributed address space explicit message passing typically code redesign

### Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

### Turbomachinery CFD on many-core platforms experiences and strategies

Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29

### numascale White Paper The Numascale Solution: Extreme BIG DATA Computing Hardware Accellerated Data Intensive Computing By: Einar Rustad ABSTRACT

numascale Hardware Accellerated Data Intensive Computing White Paper The Numascale Solution: Extreme BIG DATA Computing By: Einar Rustad www.numascale.com Supemicro delivers 108 node system with Numascale

### Intro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1

Intro to GPU computing Spring 2015 Mark Silberstein, 048661, Technion 1 Serial vs. parallel program One instruction at a time Multiple instructions in parallel Spring 2015 Mark Silberstein, 048661, Technion

### Lecture 1. Course Introduction

Lecture 1 Course Introduction Welcome to CSE 262! Your instructor is Scott B. Baden Office hours (week 1) Tues/Thurs 3.30 to 4.30 Room 3244 EBU3B 2010 Scott B. Baden / CSE 262 /Spring 2011 2 Content Our

### A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures

11 th International LS-DYNA Users Conference Computing Technology A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures Yih-Yih Lin Hewlett-Packard Company Abstract In this paper, the

### Flash Memory Arrays Enabling the Virtualized Data Center. July 2010

Flash Memory Arrays Enabling the Virtualized Data Center July 2010 2 Flash Memory Arrays Enabling the Virtualized Data Center This White Paper describes a new product category, the flash Memory Array,

### Performance Analysis and Optimization Tool

Performance Analysis and Optimization Tool Andres S. CHARIF-RUBIAL andres.charif@uvsq.fr Performance Analysis Team, University of Versailles http://www.maqao.org Introduction Performance Analysis Develop

### Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

### Recommended hardware system configurations for ANSYS users

Recommended hardware system configurations for ANSYS users The purpose of this document is to recommend system configurations that will deliver high performance for ANSYS users across the entire range

### GPGPU accelerated Computational Fluid Dynamics

t e c h n i s c h e u n i v e r s i t ä t b r a u n s c h w e i g Carl-Friedrich Gauß Faculty GPGPU accelerated Computational Fluid Dynamics 5th GACM Colloquium on Computational Mechanics Hamburg Institute

### IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 IBM CELL. Politecnico di Milano Como Campus

Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 CELL INTRODUCTION 2 1 CELL SYNERGY Cell is not a collection of different processors, but a synergistic whole Operation paradigms,

### Lecture 3: Evaluating Computer Architectures. Software & Hardware: The Virtuous Cycle?

Lecture 3: Evaluating Computer Architectures Announcements - Reminder: Homework 1 due Thursday 2/2 Last Time technology back ground Computer elements Circuits and timing Virtuous cycle of the past and

### Communicating with devices

Introduction to I/O Where does the data for our CPU and memory come from or go to? Computers communicate with the outside world via I/O devices. Input devices supply computers with data to operate on.