Modellierung, Simulation, Optimierung Computer Architektur




Modellierung, Simulation, Optimierung: Computer Architektur
Prof. Michael Resch; Dr. Martin Bernreuther, Dr. Natalia Currle-Linde, Dr. Martin Hecht, Uwe Küster, Dr. Oliver Mangold, Melanie Mochmann, Christoph Niethammer, Ralf Schneider
HLRS, IHR, 28. Juni 2012
1/55 :: Modellierung, Simulation, Optimierung, Computer Architektur :: 28. Juni 2012 ::

Outline
- A simple computer
- Computer and programs
- Modern computers in detail
  - principal parallel architectures
  - examples of large machines
  - parameters describing hardware
- Types of Parallelism
- performance modeling

A simple computer

von Neumann Machine (!)
- 4 units:
  - Arithmetic Logical Unit
  - Control Unit
  - Memory for instructions and data
  - Input and Output Unit
- the instructions are executed in order by incrementing an instruction pointer
- branches make jumps in the instruction sequence by changing the instruction pointer
- no parallelism
- memory is linearly addressed
- typical for computer architecture
http://de.wikipedia.org/wiki/John_von_Neumann
http://de.wikipedia.org/wiki/Von-Neumann-Architektur

design of a simple computer (!)
- processor does all operations
- memory is a fast but volatile storage
- bus connecting processor, memory and IO
- input and output devices:
  - disk, storing data permanently
  - keyboard
  - display
  - network interface for communication
(diagram: processor, memory, disk, keyboard, display, mouse, network and other interface devices, connected by a bus)

Computer and programs

executing a program on a computer (!)
- all the work a computer can do is specified by executable programs
- a program consists of a large sequence of instructions combined with data
- the processor or core loads an instruction from memory
- the instruction is analysed and executed by the hardware
- the instruction may contain data, memory addresses, and the specification of an operation telling what to do with the data
- the instruction pointer points to the current instruction to be loaded and executed
  - after issuing the instruction it is incremented and points to the next instruction; in this way the machine operates on a sequence of instructions
  - it can be changed by special branch and jump instructions to point to a different location in the instruction stream; in this way loops and alternatives are formulated
  - the same instruction sequence may be executed on different data
- typical instruction classes are
  - load, store
  - integer add, multiply
  - floating point add, multiply
  - branch and jump (also calling procedures)
  - compare
  - stack instructions (push, pop)

purpose of the operating system (!)
the operating system is a collection of executable programs for
- the administration of all devices such as disk, keyboard, display, mouse, hardware interfaces, ...
- scheduling and starting executable programs, assigning resources to them, finishing them
- handling data, files and the file system
- connecting to the outer world, e.g. other computing nodes, the internet
it hides the complexity of the machine from the user
examples: Microsoft Windows, Unix/Linux, NEC SuperUX, Android, ...

from the source code to the start of the executable (!)
- the source code of a program is written in a high level programming language, because machine code is too complicated to read, understand and handle, and machine code is architecture dependent
- the source code is translated to the executable program by the following steps:
  - compilation of the files by the compiler generates objects in machine code; relations between the objects are still lacking
  - the linker binds the objects into the executable program, which is written to disk
  - to run the executable, the loader loads the executable from disk to memory and starts it by setting the instruction pointer to the first instruction
- high level languages are for example
  - C, C++
  - Fortran
  - C#, Visual Basic, Delphi
  - Java works differently: the source code is translated to an intermediate byte code which is independent of the machine; the intermediate code is executed by a run time system on the computer

Pseudo Assembler
Assembler is near to the machine code but better readable; only for special purposes, e.g. drivers. Here we are using some kind of pseudo assembler.

Fortran:
  do i=1,imax
    a(i)=b(i)+c(i)
  enddo

Java:
  for (i = 1; i <= imax; i++)
    a[i] = b[i] + c[i];

pseudo assembler:
        i=0
  start:
        i=i+1
        if i > imax goto end
        load b(i) to register 0
        load c(i) to register 1
        add register 0 to register 1 and write to register 2
        store register 2 to memory at location a(i)
        goto start
  end:

in X86 Assembler
Block 7:
  movsdq  (%rsi,%rcx,8), %xmm0        ; c(i  ) -> xmm0[63..0]
  movsdq  0x10(%rsi,%rcx,8), %xmm1    ; c(i+2) -> xmm1[63..0]
  movsdq  0x20(%rsi,%rcx,8), %xmm2    ; c(i+4) -> xmm2[63..0]
  movsdq  0x30(%rsi,%rcx,8), %xmm3    ; c(i+6) -> xmm3[63..0]
  movhpdq 0x8(%rsi,%rcx,8), %xmm0     ; c(i+1) -> xmm0[127..64]
  movhpdq 0x18(%rsi,%rcx,8), %xmm1    ; c(i+3) -> xmm1[127..64]
  movhpdq 0x28(%rsi,%rcx,8), %xmm2    ; c(i+5) -> xmm2[127..64]
  movhpdq 0x38(%rsi,%rcx,8), %xmm3    ; c(i+7) -> xmm3[127..64]
  addpdx  (%rdx,%rcx,8), %xmm0        ; b(i  ),b(i+1) + xmm0 -> xmm0
  addpdx  0x10(%rdx,%rcx,8), %xmm1    ; b(i+2),b(i+3) + xmm1 -> xmm1
  addpdx  0x20(%rdx,%rcx,8), %xmm2    ; b(i+4),b(i+5) + xmm2 -> xmm2
  addpdx  0x30(%rdx,%rcx,8), %xmm3    ; b(i+6),b(i+7) + xmm3 -> xmm3
  movapsx %xmm0, (%rdi,%rcx,8)        ; xmm0 -> a(i  ),a(i+1)
  movapsx %xmm1, 0x10(%rdi,%rcx,8)    ; xmm1 -> a(i+2),a(i+3)
  movapsx %xmm2, 0x20(%rdi,%rcx,8)    ; xmm2 -> a(i+4),a(i+5)
  movapsx %xmm3, 0x30(%rdi,%rcx,8)    ; xmm3 -> a(i+6),a(i+7)
  add     $0x8, %rcx                  ; i = i+8
  cmp     %rax, %rcx                  ; imax > i ?
  jb      0x303f <Block 7>            ; if true -> goto <Block 7>

some instructions of the X86 Assembler
  Block 7:                            begin of a block
  movsdq  (%rsi,%rcx,8), %xmm0        load from address rsi+8*rcx into register xmm0[63:0]
  movhpdq 0x8(%rsi,%rcx,8), %xmm0     load from address 0x8+rsi+8*rcx into xmm0[127:64]
  addpdx  (%rdx,%rcx,8), %xmm0        add from rdx+8*rcx to register xmm0, writing to xmm0
  addpdx  0x10(%rdx,%rcx,8), %xmm1    add from 0x10+rdx+8*rcx to register xmm1
  movapsx %xmm0, (%rdi,%rcx,8)        store xmm0 to address rdi+8*rcx
  movapsx %xmm1, 0x10(%rdi,%rcx,8)    store xmm1 to address 0x10+rdi+8*rcx
  add     $0x8, %rcx                  add 0x8 to rcx, result in rcx
  cmp     %rax, %rcx                  compare rax and rcx
  jb      0x303f <Block 7>            jump to 0x303f if rax > rcx

Test program in Fortran; part 1
  module test_module
  contains
    subroutine func(a,b,c,imax)
      integer :: imax, i
      real(kind=8), dimension(imax) :: a,b,c
      do i=1,imax
        a(i)=b(i)+c(i)
      enddo
    end subroutine func
  end module test_module

Test program in Fortran; part 2
  program test_prog
    use test_module
    implicit none
    integer :: i, j = 0, k = 200
    integer, parameter :: imax = 100000
    real(kind=8), allocatable, dimension(:) :: a,b,c  ! a,b,c are dynamic arrays
    allocate(a(imax))                                 ! reserving space for array a
    allocate(b(imax))
    allocate(c(imax))
    do i=1,imax                                       ! definition of the array elements
      b(i) = i + j
      c(i) = i*5
    enddo
    call func(a,b,c,imax)                             ! call the procedure
    write(*,*) a(1)
  end program test_prog

Test program in Java; part 1
  package test.module;
  public class TestModule
  {
    public static void func(double a[], double b[], double c[], int imax)
    {
      for (int i = 0; i < imax; i++)
        a[i] = b[i] + c[i];
    }
  }

Test program in Java; part 2
  package test.prog;
  import test.module.TestModule;
  public class TestProg
  {
    public static void main(String[] args)
    {
      int i, j = 0, k = 200;
      final int imax = 100000;
      double a[], b[], c[];
      a = new double[imax];  // reserving space for array a
      b = new double[imax];
      c = new double[imax];
      for (i = 0; i < imax; i++)
      {
        b[i] = i + j;
        c[i] = i * 5;
      }
      TestModule.func(a, b, c, imax);  // call the procedure
      System.out.println(a[0]);
    }
  }

Modern computers in detail

Core with Caches and Memory (!)
- data are stored in a connected memory hierarchy:
  - L1 cache (Level 1 cache)
  - L2 cache
  - L3 cache (not on all architectures)
  - memory
- a cache is an intermediate memory with the same addressing scheme, replicating the data in memory; it has a smaller access latency and provides higher bandwidth
- a load/store first looks in the (low level) cache; if the data is not there, the next level is tested
- all cached data are organized in cache lines (of 64 B for X86, X86-64)
- load/store moves cache lines, not single data items
- old data will be overwritten, but saved to higher level caches or to memory if changed
- writing to a cache line assumes that the cache line is present in the cache
- lower level caches are smaller than higher level caches
- nearby caches have higher bandwidth and smaller access latency
- load/store instructions can reach all memory locations
(diagram: C - L1 - L2 - L3 - M)

Processor with Cores and Memory (!)
here: Intel Sandy Bridge (2012)
- multi core processor with 8 cores
- L3 cache is shared but partitioned
- communication ring with high bandwidth: 4 rings for data, requests, acknowledgments, snooping
- memory controller in the ring
- QPI (Quick Path Interconnect) interface in the ring
- snooping enables all cores to hear which shared cache lines are changed by other cores (cache coherence protocol)
(diagram: 8 cores, each with L1, L2 and an L3 slice, connected by the ring with memory controller MC and QPI interface; 4 rings: data, acknowledgments, requests, snoop)

Intel Sandy Bridge architectural data 1
- 8 cores
- AVX vector unit per core with 4 mult-plus operations/cp
- core with L1, L2 cache
- shared L3-cache
- 4 memory channels with 1600 MHz DDR3
- 2 QPI links, up to 8 GT/s
- 51.2 GB/s nominal bandwidth
- >2.0 GHz frequency
- daughter card on PCIe 3.0, up to 16 GB/s
What is PCI? PCIe (Peripheral Component Interconnect Express) is an interface standard for connecting external devices such as graphics cards, network interface controllers, sound cards, ...

Intel Sandy Bridge architectural data 2
http://www.computerbase.de/news/2010-05/details-zum-sandy-bridge-high-end-portfolio/

AMD Interlagos Processor
- 16 cores in 2 dies with 4 core pairs each
- AVX vector unit per core pair(!) with 4 mult-plus operations/cp
- core with L1 cache
- 2 MB L2-cache per core pair
- 8 MB shared L3-cache per die, in total 16 MB
- 2 memory channels with 1600 MHz DDR3 per die
- one Hyper-Transport link between the dies
- one Hyper-Transport link per die going outwards
- 51.2 GB/s nominal bandwidth for 4 channels
- >2.0 GHz frequency
picture from AMD

Intel Knights Ferry
- 30 cores
- 4 vector units per core
- Pentium II based
- 100 GB/s nominal bandwidth
- 1.05 GHz frequency
- daughter card on PCIe 2.0
- prototype; will be followed by Knights Corner
http://www.computerbase.de/bildstrecke/30921/3/

Nvidia Graphics Card (no longer current)
- 448-512 cores
- 1 8-Byte FP-unit per core
- 14-16 multiprocessors
- 144-177 GB/s nominal bandwidth
- 1.15-1.3 GHz frequency
- 515-665 GFLOP/s double precision peak performance
- daughter card on PCIe 2.0
Oliver Mangold

principal parallel architectures

Shared memory system (!)
- all processors/cores are using the same memory
- the address space is identical for all cores
- memory load/store instructions can reach all locations
- each thread may share its memory part with others
- shared memory may consist of different banks on different channels
- memory can be local to a processor but accessible by others
  - Intel: Quick Path Interconnect (QPI)
  - AMD: Hyper-Transport (HT)
- the bus/interconnect is a bottleneck
- memory bandwidth is insufficient
(diagram: two processors P, each with local memory banks M, connected by QPI/HT, with IO)

Distributed memory system (!)
- nodes are using different memories
- the address space is different for all cores
- memory load/store instructions cannot reach locations of other nodes
- the memory bus/interconnect is independent per node
- memory bandwidth scales with the number of nodes
- network access via a Network Interface Controller (NIC)
- network links connected via Routers (R)
(diagram: four nodes, each M - P - NIC - R, with the routers connected by the network)

Hierarchical Parallel Computer System (!)
the different levels of a parallel system: cores, processors, nodes, supernodes, system

Hierarchical Parallel Computer System (!)
- vector floating point units (FPU) (2-8 per core)
- cores (C) (4-64 per CPU, ~500 per GPU)
- processors (CPU) (2-8 per node)
- nodes (N) (1-8 per supernode)
- supernodes, blades, cages, ... (2-8 per rack)
- racks (R) (10-1000 per system)
- distributed system
by multiplying small numbers we get a very large number

addressing of the total memory (!)
- shared memory
  - Uniform Memory Access (UMA)
  - Non-Uniform Memory Access (NUMA)
- distributed memory
  - locally addressable; nodes may have shared memory
  - Partitioned Global Address Space (PGAS)
    - globally addressable: every processor can address the memory in all other processors or nodes
    - Remote Direct Memory Access (RDMA), done by the Network Interface Controller (NIC)
    - enables simpler handling of global data structures

examples of large machines

HLRS machine Cray XE6
- 3500 nodes with two AMD Interlagos processors each
- Gemini interconnect
- working since November 2011
- 16 C x 2 CPU x 96 N x 38 R = 116736 cores (HLRS 2011)
- 32 GB x 96 N x 38 R = 116 TB memory
- 1.8 PB disk
picture: Cray XE6 at HLRS

K-computer at Riken
- Fujitsu
- 10-13 MW
- finished in 2012
- 4 N x 24 SN x 864 R = 82944 CPUs; 8 C per CPU = 663552 cores
- 10.6 PetaFlop/s peak performance
pictures: K-computer at Riken near Kobe/Japan; rack of the K-computer

Blue Waters System
- > 235 Cray XE6 cabinets
- > 30 Cray XK6 cabinets
- > 25000 compute nodes
- > 1 TB/s usable storage bandwidth
- > 1.5 PB aggregate system memory
- 4 GB memory per core
- > 9000 Gemini network cables
- 3D torus interconnect topology
- > 17000 disks
- > 190000 memory DIMMs
- > 25 PB usable storage
- > 11.5 PF peak performance
- > 49000 AMD processors
- > 380000 cores
- > 3000 GPUs
- 100 GB/s bandwidth to near-line storage
picture: Blue Waters machine under construction
http://timelapse.ncsa.illinois.edu/pcf/inside2/index.php
http://www.ncsa.illinois.edu/BlueWaters/system.html

parameters describing hardware

Moore's Law 1 (!)

Moore's Law 2 (!)
- Moore's Law from 1965 states that the number of components on an integrated circuit doubles every 18 (24) months
- this is the reason that computers are inexpensive despite their complexity
- for some time it was understood as doubling the frequency every 18 months
http://de.wikipedia.org/wiki/Mooresches_Gesetz

stagnating frequency (!)
- the power consumption depends heavily on the frequency of the processor: power ~ frequency^3
- the number of cores can be increased if the frequency is reduced
List of Intel processors: http://www.intel.com/pressroom/kits/quickreffam.htm
data from Gerhard Wellein / Erlangen

Floating Point Performance Parameters (!)
- clock rate of the core/processor: the frequency for scheduling the instructions (its inverse is the cycle time); typical 1-4 GHz
- number of floating point operations per clock tick: instruction parallelism in the core/processor [FLOPs/FP unit]; typical 2-8
- peak performance of a processor: maximum number of floating point operations per second [MFLOP/s, GFLOP/s, TFLOP/s]; typical 4-100 GFLOP/s
- total peak performance of the machine; the number of nodes defines the size of the cluster; typical 1-25000 nodes
total performance = frequency x #FLOPs per clock tick x #cores x #processors x #nodes
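The product formula above can be sketched as a small calculation. The numbers below are illustrative only (roughly the size of a Cray-XE6-class machine), not taken from the slides, and the class and method names are my own:

```java
public class PeakPerformance {
    // peak FLOP/s = frequency x FLOPs per clock tick x cores x processors x nodes
    static double peakFlops(double freqHz, double flopsPerTick,
                            int coresPerCpu, int cpusPerNode, int nodes) {
        return freqHz * flopsPerTick * coresPerCpu * cpusPerNode * nodes;
    }

    public static void main(String[] args) {
        // illustrative: 2.3 GHz, 4 FLOPs/tick, 16 cores/CPU, 2 CPUs/node, 3500 nodes
        double peak = peakFlops(2.3e9, 4, 16, 2, 3500);
        System.out.printf("peak = %.2f PFLOP/s%n", peak / 1e15);  // prints: peak = 1.03 PFLOP/s
    }
}
```

Multiplying five small factors already yields petaflops, which is the point of the slide.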

Communication Parameters: Bandwidth (!)
bandwidth: amount of data transferred per time through a transportation path; a typical parameter for all transportation systems
for computers:
- L1 cache bandwidth (~4x8 B/CP)
- L2 cache bandwidth (~4x8 B/CP)
- L3 cache bandwidth (~4x8 B/CP)
- memory bandwidth
  - (1, 2, 3, 4) channels x 8 B x bus frequency (at 1066, 1333, 1600 MHz), up to 51.2 GB/s
- interconnect bandwidth
  - InfiniBand (SDR 1 GB/s, DDR 2 GB/s, QDR 4 GB/s, EDR 12.5 GB/s)
  - Cray Gemini (5 GB/s x 2 directions)
  - Ethernet (1 GBit, 10 GBit, 40 GBit)
- IO bandwidth
  - SATA disk (~60 MB/s)
  - Solid State Device SSD (~120-700 MB/s)
  - RAID storage subsystem (~1.5-10 GB/s)
  - parallel Lustre filesystem (> 100 GB/s)

Communication Parameters: Latency (!)
latency: time to get the first data after sending the request; connected to the length of the pipeline
all system parts show a typical latency:
- L1 cache (2-4 CP)
- L2 cache (10-20 CP)
- L3 cache (25-70 CP)
- memory (100-300 CP)
- interconnect (1-50 µs)
- disk (1-5 ms)
the number of open transfers is limited by the number of open requests
the achievable bandwidth is limited by
  achievable bandwidth <= #open requests x transferred buffer size / latency
the latency is the startup of the transfer pipeline

Effective Memory Bandwidth (!)
The nominal memory bandwidth of a modern PC system can be calculated as
  BW_nom = path width x number of channels x bus frequency
with path width = 8 B, number of channels = 1, 2, 3, 4, bus frequency = 1066, 1333, 1600, 1866, 2133 MHz.
The effective memory bandwidth is smaller than the nominal bandwidth:
  BW_eff = size of data transferred / (latency + transfer time)
The transfer time is related to the nominal bandwidth by
  transfer time = size of data transferred / BW_nom
So the effective bandwidth is influenced by the memory or cache latency:
  BW_eff = size of data transferred / (latency + size of data transferred / BW_nom)

Example 1
Intel's Sandy Bridge processor has 4 channels running at a bus frequency of 1600 MHz. The channels are 8 B wide.
  BW_nom = 8 B x 4 x 1600 MHz = 51.2 GB/s
The memory latency is 70 nsec. The resulting effective bandwidth for a single cache line of 64 B is
  BW_eff = 64 B / (70 nsec + 64 B / 51.2 GB/s) = 64 B / (70 nsec + 1.25 nsec) = 0.898 GB/s (!)
We are far from the nominal bandwidth!
The estimations are not completely correct: the cache line uses a single channel; the cache line is transported with bus clock periods.

Example 2
The resulting effective bandwidth for 100 cache lines of 64 B is
  BW_eff = 100 x 64 B / (70 nsec + 100 x 64 B / 51.2 GB/s) = 6400 B / (70 nsec + 125 nsec) = 32.82 GB/s (!)
How to overcome this problem?
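Both examples can be reproduced with a minimal sketch of the BW_eff formula (class and method names are my own; units are bytes, seconds and bytes/s):

```java
public class EffectiveBandwidth {
    // BW_eff = size / (latency + size / BW_nom)
    static double effectiveBandwidth(double bytes, double latencySec, double bwNom) {
        return bytes / (latencySec + bytes / bwNom);
    }

    public static void main(String[] args) {
        double bwNom = 8.0 * 4 * 1600e6;  // 51.2 GB/s nominal bandwidth
        double lat = 70e-9;               // 70 nsec memory latency
        // one cache line of 64 B: prints 0.898 GB/s
        System.out.printf("1 line:    %.3f GB/s%n", effectiveBandwidth(64, lat, bwNom) / 1e9);
        // 100 cache lines: prints 32.82 GB/s
        System.out.printf("100 lines: %.2f GB/s%n", effectiveBandwidth(100 * 64, lat, bwNom) / 1e9);
    }
}
```

Varying the transfer size shows how the fixed latency dominates small transfers and amortizes over large ones.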

Prefetching (!)
- the data access has to be started before consumption
- prefetching initiates the data access independently of the use of the data
- difficult to handle
- may be counterproductive for data already in caches

Types of Parallelism (!)
- single instruction single data (SISD): simple or traditional processor
- single instruction multiple data (SIMD): vector model, e.g. in a multiprocessor of a graphics card
- multiple instructions multiple data (MIMD): multicore processors as shared memory processors; different multiprocessors of graphics cards
- multiple programs multiple data (MPMD): multicore processors for different processes; distributed memory processors
- multiple instructions single data (MISD)??
SISD, SIMD, MIMD, MISD: Flynn's taxonomy
also: pipeline parallelism

pipeline parallelism (!)
- phase shifted parallelism
- a very general principle in production, transport, flow of goods
  - assembly line in production systems
  - important model in bureaucracy (first come first serve)
- different parts are in operation at different stages at the same time
- dominant in modern computer architectures; in computers:
  - accessing caches or memory within a loop
  - accessing data from a remote processor
  - input and output
description model of the pipeline:
- the pipeline is filled after its startup steps; this is the length of the pipeline and at the same time the degree of its parallelism; it can also be understood as latency
- after the pipeline is filled, it produces one result per step

performance modeling


performance behaviour of the pipeline (!)
The pipeline mechanism is characterized by
- the fixed startup (or latency) cost startup, the time for filling the pipeline
- the time per unit T
- the operations per unit op
The resulting performance is
  perf(n) = n x op / (startup + n x T)
The peak performance is the limit for increasing n:
  perf_inf = lim_{n -> inf} perf(n) = op / T
We get half of the peak performance for n_1/2 = startup / T:
  perf(n_1/2) = (startup/T) x op / (startup + (startup/T) x T) = 1/2 x op/T = 1/2 x perf_inf
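The pipeline model above can be sketched numerically. The parameters below are illustrative (a 10-step pipeline producing one result per step), and the class and method names are my own:

```java
public class PipelinePerf {
    // perf(n) = n * op / (startup + n * T)
    static double perf(double n, double op, double startup, double T) {
        return n * op / (startup + n * T);
    }

    public static void main(String[] args) {
        double op = 1.0, startup = 10.0, T = 1.0;  // 10 startup steps, 1 result per step
        double nHalf = startup / T;                 // n_1/2 = 10
        System.out.println(perf(nHalf, op, startup, T));  // prints 0.5 = half of peak op/T = 1
        System.out.println(perf(1e6, op, startup, T));    // close to the peak 1.0
    }
}
```

The run shows n_1/2 in action: at n = startup/T the startup cost and the streaming cost are equal, so exactly half of the peak performance is reached.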

Amdahl's Law: limits of parallel speedup (!)
Assume a parallelizable program running a fixed size test case, having
- a parallel part taking a total time t_par
- a serial (non parallelizable) part taking a total time t_ser
- total time t_tot = t_par + t_ser
We will run the program with p parallel processes. The speedup is
  speedup(p) = t_tot / (t_par/p + t_ser) <= t_tot / t_ser
and the efficiency
  efficiency(p) = speedup(p) / p

Amdahl's Law: limits of parallel speedup (!)
These assumptions hold for any type of parallel work!
The pictures show the speedup in dependence on t_ser/(t_par + t_ser) in percent. The behaviour is not encouraging! Why can large scale parallel computing nevertheless be successful?
The speedup behaviour of a parallel program for a fixed size test case is named strong scaling.
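The discouraging behaviour is easy to see numerically. A minimal sketch of Amdahl's speedup formula, with an illustrative 1% serial fraction (names are my own):

```java
public class Amdahl {
    // speedup(p) = t_tot / (t_par/p + t_ser), bounded above by t_tot / t_ser
    static double speedup(double p, double tPar, double tSer) {
        return (tPar + tSer) / (tPar / p + tSer);
    }

    public static void main(String[] args) {
        // illustrative fixed-size case: t_par = 99, t_ser = 1 (1% serial part)
        System.out.println(speedup(100, 99.0, 1.0));  // ~50.25: half of 100 processes is wasted
        System.out.println(speedup(1e9, 99.0, 1.0));  // approaches the limit t_tot/t_ser = 100
    }
}
```

Even a 1% serial part caps the speedup at 100, no matter how many processes are used.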

Amdahl's Law with communication
Even worse: assume that the communication time increases with the number of processes.
- parallel part taking a total time t_par
- communication time com x p, assumed proportional to the number of processes; other dependencies to be discussed
- serial part taking a total time t_ser
- total time t_tot = t_par + t_ser
We will run the program with p parallel processes. The speedup is
  speedup(p) = t_tot / (t_par/p + com x p + t_ser) <= t_tot / (com x p + t_ser)

Gustafson: increase the work 1 (!)
Amdahl assumes a fixed amount of work. But what can we do in a fixed amount of time?
- parallel part taking a total time t_par = t_proc x p, proportional to the number of processors
- serial part taking a total time t_ser
The speedup is
  speedup(p) = (t_proc x p + t_ser) / (t_proc + t_ser) = p x t_proc/(t_proc + t_ser) + t_ser/(t_proc + t_ser)
and the efficiency
  efficiency(p) = speedup(p)/p = t_proc/(t_proc + t_ser) + (1/p) x t_ser/(t_proc + t_ser)
The efficiency decreases to a lower limit.

Gustafson: increase the work 2 (!)
There are a lot of simulation problems of this type! But we are neglecting any necessary communication.
The speedup behaviour of a parallel program for a test case which grows with the number of active processors/cores is named weak scaling.
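The contrast with Amdahl can be sketched with Gustafson's scaled speedup, reusing the same illustrative times t_proc = 99, t_ser = 1 (names are my own):

```java
public class Gustafson {
    // scaled speedup: speedup(p) = (t_proc * p + t_ser) / (t_proc + t_ser)
    static double speedup(double p, double tProc, double tSer) {
        return (tProc * p + tSer) / (tProc + tSer);
    }

    static double efficiency(double p, double tProc, double tSer) {
        return speedup(p, tProc, tSer) / p;
    }

    public static void main(String[] args) {
        // illustrative: each process works t_proc = 99 on its own share, t_ser = 1 serial
        System.out.println(speedup(100, 99.0, 1.0));     // 99.01: near-linear scaled speedup
        System.out.println(efficiency(100, 99.0, 1.0));  // 0.9901, limit t_proc/(t_proc+t_ser) = 0.99
    }
}
```

With the same 1% serial share, scaling the problem with p gives a speedup of 99.01 at p = 100, where Amdahl's fixed-size speedup was stuck near 50.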

Thank you for your attention!
Kuester[at]hlrs.de