Modellierung, Simulation, Optimierung Computer Architektur
1 Modellierung, Simulation, Optimierung, Computer Architektur
Prof. Michael Resch
Dr. Martin Bernreuther, Dr. Natalia Currle-Linde, Dr. Martin Hecht, Uwe Küster, Dr. Oliver Mangold, Melanie Mochmann, Christoph Niethammer, Ralf Schneider
HLRS, IHR
28. Juni 2012
2 Outline
- A simple computer
- Computer and programs
- Modern computers in detail
  - principal parallel architectures
  - examples of large machines
  - parameters describing hardware
- Types of Parallelism
- performance modeling
3 A simple computer
4 von Neumann Machine (!)
- 4 units:
  - Arithmetic Logical Unit
  - Control Unit
  - Memory for instructions and data
  - Input and Output Unit
- the instructions are executed in order by incrementing an instruction pointer
- branches make jumps in the instruction sequence by changing the instruction pointer
- no parallelism
- memory is linearly addressed
- typical for computer architecture
http://de.wikipedia.org/wiki/John_von_Neumann
Von-Neumann-Architektur
5 design of a simple computer (!)
- processor does all operations
- memory is a fast but volatile storage
- bus connecting processor, memory and IO
- input and output devices:
  - disk storing data permanently
  - keyboard
  - display
  - network interface for communication
[Diagram: processor, memory, disk and the interface devices keyboard, display, mouse and network, connected by a bus]
6 Computer and programs
7 executing a program on a computer (!)
- all the work a computer can do is specified by executable programs
- a program consists of a large sequence of instructions combined with data
- the processor or core loads an instruction from memory
- the instruction is analysed and executed by the hardware
- the instruction may contain data, memory addresses, and the specification of an operation telling what to do with the data
- the instruction pointer points to the current instruction to be loaded and executed
  - after issuing the instruction it is incremented and points to the next instruction; in this way the machine operates on a sequence of instructions
  - it can be changed by special branch and jump instructions to point to a different location in the instruction stream; in this way loops and alternatives are formulated
- the same instruction sequence may be executed on different data
- typical instruction classes are
  - load, store
  - integer add, multiply
  - floating point add, multiply
  - branch and jump (also calling procedures)
  - compare
  - stack instructions (push, pop)
8 purpose of the operating system (!)
the operating system is a collection of executable programs for
- the administration of all devices such as disk, keyboard, display, mouse, hardware interfaces, ...
- scheduling and starting executable programs, assigning resources to them, finishing them
- handling data, files and the file system
- connecting to the outer world, e.g. other computing nodes, the internet
it hides the complexity of the machine from the user
examples: Microsoft Windows, Unix/Linux, NEC SuperUX, Android, ...
9 from the source code to the start of the executable (!)
- the source code of a program is written in a high level programming language, because machine code is too complicated to read, understand and handle, and machine code is architecture dependent
- the source code is translated to the executable program by the following steps:
  - compilation of the files by the compiler generates objects in machine code; relations between the objects are still lacking
  - the linker binds the objects into the executable program, which is written to disk
- to run the executable, the loader loads the executable from disk to memory and starts it by setting the instruction pointer to the first instruction
- high level languages are for example
  - C, C++
  - Fortran
  - C#, Visual Basic, Delphi
  - Java works differently: the source code is translated to an intermediate byte code which is independent of the machine; this intermediate code is executed by a run time system on the computer
10 Pseudo Assembler
Assembler is near to the machine code but better readable; it is used only for special purposes, e.g. drivers. Here we use some kind of pseudo assembler.
Fortran:
  do i=1,imax
    a(i)=b(i)+c(i)
  enddo
Java:
  for (i = 1; i <= imax; i++)
    a[i] = b[i] + c[i];
Pseudo assembler:
        i=0
  start:
        i=i+1
        if i > imax goto end
        load b(i) to register 0
        load c(i) to register 1
        add register 0 to register 1 and write to register 2
        store register 2 to memory at location a(i)
        goto start
  end:
11 in X86 Assembler
  Block 7:
  movsdq  (%rsi,%rcx,8), %xmm0      ; c(i  ) -> xmm0[63..0]
  movsdq  0x10(%rsi,%rcx,8), %xmm1  ; c(i+2) -> xmm1[63..0]
  movsdq  0x20(%rsi,%rcx,8), %xmm2  ; c(i+4) -> xmm2[63..0]
  movsdq  0x30(%rsi,%rcx,8), %xmm3  ; c(i+6) -> xmm3[63..0]
  movhpdq 0x8(%rsi,%rcx,8), %xmm0   ; c(i+1) -> xmm0[127..64]
  movhpdq 0x18(%rsi,%rcx,8), %xmm1  ; c(i+3) -> xmm1[127..64]
  movhpdq 0x28(%rsi,%rcx,8), %xmm2  ; c(i+5) -> xmm2[127..64]
  movhpdq 0x38(%rsi,%rcx,8), %xmm3  ; c(i+7) -> xmm3[127..64]
  addpdx  (%rdx,%rcx,8), %xmm0      ; b(i  ),b(i+1) + xmm0 -> xmm0
  addpdx  0x10(%rdx,%rcx,8), %xmm1  ; b(i+2),b(i+3) + xmm1 -> xmm1
  addpdx  0x20(%rdx,%rcx,8), %xmm2  ; b(i+4),b(i+5) + xmm2 -> xmm2
  addpdx  0x30(%rdx,%rcx,8), %xmm3  ; b(i+6),b(i+7) + xmm3 -> xmm3
  movapsx %xmm0, (%rdi,%rcx,8)      ; xmm0 -> a(i  ),a(i+1)
  movapsx %xmm1, 0x10(%rdi,%rcx,8)  ; xmm1 -> a(i+2),a(i+3)
  movapsx %xmm2, 0x20(%rdi,%rcx,8)  ; xmm2 -> a(i+4),a(i+5)
  movapsx %xmm3, 0x30(%rdi,%rcx,8)  ; xmm3 -> a(i+6),a(i+7)
  add $0x8, %rcx                    ; i=i+8
  cmp %rax, %rcx                    ; imax > i?
  jb 0x303f <Block 7>               ; if true -> goto <Block 7>
12 some instructions of the X86 Assembler
  Block 7:                          begin of a block
  movsdq (%rsi,%rcx,8), %xmm0       load from address rsi+8*rcx into register xmm0 [63:0]
  movhpdq 0x8(%rsi,%rcx,8), %xmm0   load from address 0x8+rsi+8*rcx into xmm0 [127:64]
  addpdx (%rdx,%rcx,8), %xmm0       add from rdx+8*rcx to register xmm0, writing to xmm0
  addpdx 0x10(%rdx,%rcx,8), %xmm1   add from 0x10+rdx+8*rcx to register xmm1
  movapsx %xmm0, (%rdi,%rcx,8)      store xmm0 to address rdi+8*rcx
  movapsx %xmm1, 0x10(%rdi,%rcx,8)  store xmm1 to address 0x10+rdi+8*rcx
  add $0x8, %rcx                    add 0x8 to rcx with target rcx
  cmp %rax, %rcx                    compare rax and rcx
  jb 0x303f <Block 7>               jump to 0x303f if rax > rcx
13 Test program in Fortran; part 1
  module test_module
  contains
    subroutine func(a,b,c,imax)
      integer :: imax, i
      real(kind=8), dimension(imax) :: a,b,c
      do i=1,imax
        a(i)=b(i)+c(i)
      enddo
    end subroutine func
  end module test_module
14 Test program in Fortran; part 2
  program test_prog
    use test_module
    implicit none
    integer :: i,j=0,k=200                           ! initial values lost in transcription; restored from the Java version below
    integer,parameter :: imax=1000000                ! original value lost in transcription; assumed
    real(kind=8),allocatable, dimension(:) :: a,b,c  ! are dynamic arrays
    allocate(a(imax))                                ! reserving space for array a
    allocate(b(imax))
    allocate(c(imax))
    do i=1,imax                                      ! definition of array elements
      b(i)=i+j
      c(i)=i*5
    enddo
    call func(a,b,c,imax)                            ! call the procedure
    write(*,*)a(1)
  end program test_prog
15 Test program in Java; part 1
  package test.module;
  public class TestModule
  {
    public static void func(double a[], double b[], double c[], int imax)
    {
      for (int i = 0; i < imax; i++)
        a[i] = b[i] + c[i];
    }
  }
16 Test program in Java; part 2
  package test.prog;
  import test.module.TestModule;
  public class TestProg
  {
    public static void main(String[] args)
    {
      int i, j = 0, k = 200;
      final int imax = 1000000;   // original value lost in transcription; assumed
      double a[], b[], c[];
      a = new double[imax];       // reserving space for array a
      b = new double[imax];
      c = new double[imax];
      for (i = 0; i < imax; i++)
      {
        b[i] = i + j;
        c[i] = i * 5;
      }
      TestModule.func(a, b, c, imax);   // call the procedure
      System.out.println(a[0]);
    }
  }
17 Modern computers in detail
18 Core with Caches and Memory (!)
- data are stored in a connected memory hierarchy:
  - L1 cache (Level 1 cache)
  - L2 cache
  - L3 cache (not on all architectures)
  - memory
- the cache is an intermediate memory with the same addressing scheme, replicating the data in memory; it has a smaller access latency and provides higher bandwidth
- a load/store first looks in the (low level) cache; if the data is not there, the next level will be tested
- all cached data are organized in cache lines (of 64 B for X86, X86 64)
- load/store moves cache lines, not single data
- old data will be overwritten, but saved to higher level caches or to memory if changed
- writing to a cache line assumes that the cache line is present in cache
- lower level caches are smaller than higher level caches
- nearby caches have higher bandwidth and smaller access latency
- load/store instructions can reach all memory locations
[Diagram: C - L1 - L2 - L3 - M]
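The cache line granularity explains why access patterns matter: a contiguous loop consumes all 8 doubles of every 64 B line it loads, while a strided loop touches only one double per line and moves roughly eight times the data for the same arithmetic. A minimal sketch (not from the slides; array size and stride are arbitrary choices):

  program cacheline_demo
    implicit none
    integer, parameter :: n = 1000000, stride = 8
    real(kind=8), allocatable, dimension(:) :: x
    real(kind=8) :: s
    integer :: i
    allocate(x(n*stride))
    x = 1.0d0
    s = 0.0d0
    do i = 1, n                    ! contiguous: all 8 doubles of each 64 B cache line are used
       s = s + x(i)
    enddo
    do i = 1, n*stride, stride     ! strided: only 1 double per 64 B cache line is used,
       s = s + x(i)                ! so roughly 8x the memory traffic for the same number of adds
    enddo
    write(*,*) s
  end program cacheline_demo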
19 Processor with Cores and Memory (!)
here: Intel Sandy Bridge (2012)
- multi core processor with 8 cores
- L3 cache is shared but partitioned
- communication ring with high bandwidth: 4 rings for data, requests, acknowledgments, snooping
- memory controller in ring
- QPI (Quick Path Interconnect) interface in ring
- snooping enables all cores to hear which shared cache lines are changed by other cores (cache coherence protocol)
[Diagram: 8 cores, each with L1, L2 and an L3 slice, connected by the 4 rings (data, acknowledgments, requests, snoop) to the memory controller (MC), memory (M) and QPI]
20 Intel Sandy Bridge architectural data 1
- 8 cores
- AVX vector unit per core with 4 mult-plus operations/cp
- core with L1, L2 cache
- shared L3-cache
- 4 memory channels with 1600 MHz DDR3
- 2 QPI links with up to 8 GT/s
- 51.2 GB/s nominal bandwidth
- >2.0 GHz frequency
- daughter card on PCIe 3.0 up to 16 GB/s
(What is PCIe?) PCIe (Peripheral Component Interconnect Express) is an interface standard for connecting external devices such as graphics cards, network interface controllers, sound cards, ...
21 Intel Sandy Bridge architectural data 2
[figure only]
22 AMD Interlagos Processor
- 16 cores in 2 dies with 4 core pairs each
- AVX vector unit per core pair(!) with 4 mult-plus operations/cp
- core with L1 cache
- 2 MB L2-cache per core pair
- 8 MB shared L3-cache per die, in total 16 MB
- 2 memory channels with 1600 MHz DDR3 per die
- one Hyper-Transport link between the dies
- one Hyper-Transport link per die going outwards
- 51.2 GB/s nominal bandwidth for 4 channels
- >2.0 GHz frequency
(picture from AMD)
23 Intel Knights Ferry
- 30 cores
- 4 vector units per core
- Pentium II based
- 100 GB/s nominal bandwidth
- 1.05 GHz frequency
- daughter card on PCIe 2.0
- prototype; will be followed by Knights Corner
24 Nvidia Graphics Card (no longer up to date)
- ... cores
- 1 8-Byte FP-unit per core
- ... multiprocessors
- ... GB/s nominal bandwidth
- ... GHz frequency
- ... DP peak performance
- daughter card on PCIe 2.0
(Oliver Mangold)
25 principal parallel architectures
26 Shared memory system (!)
- all processors/cores are using the same memory
- the address space is identical for all cores
- memory load/store instructions can reach all locations
- each thread may share its memory part with others
- shared memory may consist of different banks on different channels
- memory can be local to a processor but accessible by the others
  - Intel: Quick Path Interconnect (QPI)
  - AMD: Hyper-Transport (HT)
- bus/interconnect is a bottleneck
- memory bandwidth is insufficient
[Diagram: two processors (P) with local memory banks (M), coupled by QPI/HT, with shared IO]
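The shared address space is what makes thread programming possible: all threads operate on the same arrays without any data transfer. A hedged sketch of the loop from slide 10 using OpenMP (OpenMP is not named on this slide; compile e.g. with gfortran -fopenmp):

  subroutine func_omp(a, b, c, imax)
    implicit none
    integer :: imax, i
    real(kind=8), dimension(imax) :: a, b, c
    !$omp parallel do          ! all threads see the same a, b, c in shared memory
    do i = 1, imax
       a(i) = b(i) + c(i)
    enddo
    !$omp end parallel do
  end subroutine func_omp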
27 Distributed memory system (!)
- nodes are using different memories
- the address space is different for all cores
- memory load/store instructions cannot reach locations on other nodes
- memory bus/interconnect is independent per node
- memory bandwidth scales with the number of nodes
- network access via a Network Interface Controller (NIC)
- network links connected via Routers (R)
[Diagram: four nodes, each memory (M) - processor (P) - NIC, connected through routers (R)]
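On a distributed memory system data must be moved by explicit messages, typically with MPI (MPI is not named on this slide; a minimal sketch assuming an MPI installation with the Fortran mpi module):

  program dm_demo
    use mpi
    implicit none
    integer :: ierr, rank, nprocs
    real(kind=8) :: x
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
    x = rank                             ! x lives in the private memory of each process
    if (rank == 0 .and. nprocs > 1) then ! rank 0 cannot load the x of rank 1 directly;
       call MPI_Recv(x, 1, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, &
                     MPI_STATUS_IGNORE, ierr)
    else if (rank == 1) then             ! the value has to be sent over the network
       call MPI_Send(x, 1, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, ierr)
    endif
    call MPI_Finalize(ierr)
  end program dm_demo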
28 Hierarchical Parallel Computer System (!)
the different levels of a parallel system: cores, processors, nodes, supernodes, system
29 Hierarchical Parallel Computer System (!)
- vector floating point units (FPU) (2-8 per core)
- cores (C) (4-64 per CPU, 500 per GPU)
- processors (CPU) (2-8 per node)
- nodes (N) (1-8 per supernode)
- supernodes, blades, cages, ... (2-8 per rack)
- racks (R) (... per system)
- distributed system
by multiplying small numbers we get a very large number
30 addressing of the total memory (!)
- shared memory
  - Uniform Memory Access (UMA)
  - Non-Uniform Memory Access (NUMA)
- distributed memory
  - locally addressable; nodes may have shared memory
  - Partitioned Global Address Space (PGAS)
    - globally addressable: every processor can address the memory in all other processors or nodes
    - Remote Direct Memory Access (RDMA), done by the Network Interface Controller (NIC)
    - enables simpler handling of global data structures
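Fortran 2008 coarrays are one concrete PGAS realization (coarrays are not mentioned on the slide): every image keeps its own copy of a variable, but remote copies are addressable with square-bracket syntax and are fetched by RDMA-like mechanisms. A minimal sketch (compile e.g. with gfortran -fcoarray=single or a parallel coarray implementation):

  program pgas_demo
    implicit none
    real(kind=8) :: x[*]            ! one copy of x per image, globally addressable
    x = 1.0d0 * this_image()
    sync all                        ! make sure all images have written their x
    if (this_image() == 1 .and. num_images() > 1) then
       write(*,*) 'x on image 2 is', x[2]   ! direct access to remote memory
    endif
  end program pgas_demo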
31 examples of large machines
32 HLRS machine Cray XE6
- Cray XE6: 3648 nodes (96 N x 38 R) with two AMD Interlagos processors each
- Gemini interconnect
- working since November 2011
- 16 C x 2 CPU x 96 N x 38 R = 116,736 cores (HLRS 2011)
- 32 GB x 96 N x 38 R = 116 TB memory
- 1.8 PB disk
(picture: Cray XE6 at HLRS)
33 K-computer at Riken
- Fujitsu
- ... MW
- finished in 2012
- 8 C x 4 N x 24 SN x 864 R = 82,944 CPUs = 663,552 cores
- 10.6 PetaFlop/s peak performance
(pictures: K-computer at Riken near Kobe/Japan; rack of the K-computer)
34 Blue Waters System
- > 235 Cray XE6 cabinets
- > 30 Cray XK6 cabinets
- > ... compute nodes
- > 1 TB/s usable storage bandwidth
- > 1.5 PB aggregate system memory
- 4 GB memory per core
- > 9000 Gemini network cables
- 3D torus interconnect topology
- > ... disks
- > ... memory DIMMs
- > 25 PB usable storage
- > 11.5 PF peak performance
- > ... AMD processors
- > ... cores
- > 3000 GPUs
- 100 GB/s bandwidth to near-line storage
(picture: Blue Waters machine under construction)
35 parameters describing hardware
36 Moore's Law 1 (!)
[figure only]
37 Moore's Law 2 (!)
- Moore's Law from 1965 states that the number of components on an integrated circuit doubles every 18 (24) months
- this is the reason that computers are inexpensive despite their complexity
- for some time it was understood as a doubling of the frequency every 18 months
38 stagnating frequency (!)
- the power consumption depends heavily on the frequency of the processor: power ~ frequency^3
- the number of cores can be increased if the frequency is reduced
(List of Intel processors: pressroom/kits/quickreffam.htm; data from Gerhard Wellein / Erlangen)
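A toy evaluation of the power ~ frequency^3 rule (the conclusion drawn in the comments is an illustration, not from the slide): lowering the clock by 20 % roughly halves the power, so two slower cores fit into the power budget of one fast core.

  program power_model
    implicit none
    real(kind=8) :: f_rel
    integer :: i
    do i = 10, 5, -1
       f_rel = i / 10.0d0
       ! relative power under the assumption power ~ frequency**3;
       ! e.g. frequency x0.8 -> power x0.512
       write(*,*) 'frequency x', f_rel, ' -> power x', f_rel**3
    enddo
  end program power_model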
39 Floating Point Performance Parameters (!)
- clock rate of the core/processor: its inverse is the cycle time for scheduling the instructions; typically 1-4 GHz
- number of floating point operations per clock tick: instruction parallelism in the core/processor [FLOPs/FP unit]; typically 2-8
- peak performance of a processor: maximum number of floating point operations per second [MFLOPs, GFLOPs, TFLOPs]; typically 4-100 GFLOPs
- total peak performance of the machine: the number of nodes defines the size of the cluster
- total performance = frequency x #FP-units x #cores x #processors x #nodes
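A small program evaluating the formula above for a made-up cluster (all parameter values are assumptions for illustration):

  program peak_perf
    implicit none
    real(kind=8) :: freq_ghz, peak_gflops
    integer :: fp_per_cp, cores, cpus, nodes
    freq_ghz  = 2.0d0   ! clock rate in GHz (assumed)
    fp_per_cp = 8       ! floating point operations per clock tick (assumed)
    cores     = 16      ! cores per processor (assumed)
    cpus      = 2       ! processors per node (assumed)
    nodes     = 1000    ! nodes in the machine (assumed)
    peak_gflops = freq_ghz * fp_per_cp * cores * cpus * nodes
    write(*,*) 'total peak performance:', peak_gflops, 'GFLOPs'  ! 512000 GFLOPs = 0.512 PFLOPs
  end program peak_perf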
40 Communication Parameters: Bandwidth (!)
bandwidth: amount of data transferred per time through a transportation path; a typical parameter for all transportation systems
for computers:
- L1 cache bandwidth (4x8 B/CP)
- L2 cache bandwidth (4x8 B/CP)
- L3 cache bandwidth (4x8 B/CP)
- memory bandwidth
  o (1,2,3,4) channels x 8 B x bus frequency (at 1066, 1333, 1600 MHz), up to 51.2 GB/s
- interconnect bandwidth
  o InfiniBand (SDR 1 GB/s, DDR 2 GB/s, QDR 4 GB/s, EDR 12.5 GB/s)
  o Cray Gemini (5 GB/s x 2 directions)
  o Ethernet (1 GBit, 10 GBit, 40 GBit)
- IO bandwidth
  o SATA disk (60 MB/s)
  o Solid State Device SSD (... MB/s)
  o RAID storage subsystem (... GB/s)
  o parallel Lustre filesystem (> 100 GB/s)
41 Communication Parameters: Latency (!)
latency: time to get the first data after sending the request; connected to the length of the pipeline
all system parts show a typical latency:
- L1 cache (2-4 CP)
- L2 cache (... CP)
- L3 cache (25-70 CP)
- memory (... CP)
- interconnect (1-50 µs)
- disk (1-5 ms)
the number of open transfers is limited by the number of open requests; the achievable bandwidth is limited by
  achievable bandwidth <= number of open requests x transferred buffer size / latency
the latency is the start up time of the transfer pipeline
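The request-limit formula can be evaluated directly; with an assumed 70 ns memory latency and 64 B cache lines, a single outstanding request sustains less than 1 GB/s, so many open requests are needed to approach the nominal bandwidth (values are assumptions for illustration):

  program latency_bw
    implicit none
    real(kind=8) :: latency_ns, line_bytes, bw_gbs
    integer :: open_requests
    latency_ns = 70.0d0   ! memory latency (assumed)
    line_bytes = 64.0d0   ! transferred buffer size = one cache line
    do open_requests = 1, 10
       ! achievable bandwidth <= open requests * buffer size / latency
       bw_gbs = open_requests * line_bytes / latency_ns   ! B/ns = GB/s
       write(*,*) open_requests, 'open requests ->', bw_gbs, 'GB/s'
    enddo
  end program latency_bw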
42 Effective Memory Bandwidth (!)
The nominal memory bandwidth of a modern PC-system can be calculated by
  BW_nom = path width x number of channels x bus frequency
with path width = 8 B, number of channels = 1, 2, 3, 4 and bus frequency = 1066, 1333, 1600, 1866, 2133 MHz.
The effective memory bandwidth is smaller than the nominal bandwidth:
  BW_eff = size of data transferred / (latency + transfer time)
The transfer time is related to the nominal bandwidth by
  transfer time = size of data transferred / BW_nom
So the effective bandwidth is influenced by the memory or cache latency:
  BW_eff = size of data transferred / (latency + size of data transferred / BW_nom)
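A short program evaluating this formula; it anticipates the two worked examples on the next slides (51.2 GB/s nominal bandwidth and 70 ns latency as given there):

  program eff_bw
    implicit none
    real(kind=8) :: bw_nom, latency_ns, size_b, bw_eff
    bw_nom     = 51.2d0   ! GB/s: 8 B x 4 channels x 1600 MHz
    latency_ns = 70.0d0   ! memory latency in ns
    size_b = 64.0d0                                    ! one cache line
    bw_eff = size_b / (latency_ns + size_b / bw_nom)   ! B/ns = GB/s
    write(*,*) '  1 cache line :', bw_eff, 'GB/s'      ! ~0.898 GB/s
    size_b = 6400.0d0                                  ! 100 cache lines
    bw_eff = size_b / (latency_ns + size_b / bw_nom)
    write(*,*) '100 cache lines:', bw_eff, 'GB/s'      ! ~32.8 GB/s
  end program eff_bw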
43 Example 1
Intel's Sandy Bridge processor has 4 channels running at a bus frequency of 1600 MHz. The channels are 8 B wide.
  BW_nom = 8 B x 4 x 1600 MHz = 51.2 GB/s
The memory latency is 70 nsec. The resulting effective bandwidth for a single cache line of 64 B is
  BW_eff = 64 B / (70 nsec + 64 B / 51.2 GB/s) = 64 B / (70 nsec + 1.25 nsec) = 0.898 GB/s (!)
We are far from the nominal bandwidth! The estimations are not completely correct: the cache line uses a single channel, and the cache line is transported within a few bus clock periods.
44 Example 2
The resulting effective bandwidth for 100 cache lines of 64 B is
  BW_eff = 6400 B / (70 nsec + 6400 B / 51.2 GB/s) = 6400 B / (70 nsec + 125 nsec) = 32.82 GB/s (!)
How to overcome this problem?
45 Prefetching (!)
- the data access has to be started before consumption
- prefetching initiates the data access independently of the use of the data
- difficult to handle
- may be counterproductive for data already in the caches
46 Types of Parallelism (!)
- single instruction single data (SISD): simple or traditional processor
- single instruction multiple data (SIMD): vector model; in a multiprocessor of a graphics card
- multiple instructions multiple data (MIMD): multicore processors as shared memory processors; different multiprocessors of graphics cards
- multiple programs multiple data (MPMD): multicore processors for different processes; distributed memory processors
- multiple instructions single data (MISD)??
SISD, SIMD, MIMD, MISD: Flynn's taxonomy
- pipeline parallelism
47 pipeline parallelism (!)
- phase shifted parallelism
- a very general principle in production, transport, flow of goods
  - assembly line in production systems
  - important model in bureaucracy (first come first serve)
- different parts are in operation at different stages at the same time
- dominant in modern computer architectures; in computers:
  - accessing caches or memory within a loop
  - accessing data from a remote processor
  - input and output system
- description model of the pipeline:
  - the pipeline is filled after startup steps; this is the length of the pipeline and at the same time the degree of its parallelism; it can also be understood as latency
  - after the pipeline is filled it produces one result per step
48 performance modeling
49 performance modeling
50 performance behaviour of the pipeline (!)
The pipeline mechanism is characterized by
- the fixed startup or latency cost (time for filling the pipeline): startup
- the time per unit: T
- the operations per unit: op
- the performance: perf
The resulting performance is
  perf(n) = n op / (startup + n T)
Peak performance is the limit for increasing n:
  perf_peak = lim_{n -> infinity} perf(n) = op / T
We get half of the peak performance for n_1/2 = startup / T:
  perf(n_1/2) = (startup/T) op / (startup + (startup/T) T) = (1/2) op / T = (1/2) perf_peak
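Evaluating the model with assumed parameters (startup = 10 steps, one operation per step of length T = 1) shows the characteristic behaviour: half of peak at n = n_1/2, peak performance only asymptotically:

  program pipeline_model
    implicit none
    real(kind=8), parameter :: startup = 10.0d0, t = 1.0d0, op = 1.0d0  ! assumed
    real(kind=8) :: perf
    integer :: n
    do n = 1, 100
       perf = n * op / (startup + n * t)              ! perf(n)
       if (n == nint(startup / t)) then               ! n_1/2 = startup/T
          write(*,*) 'n_1/2 =', n, ': perf =', perf   ! = 0.5 = half of op/T
       endif
    enddo
    write(*,*) 'peak performance op/T =', op / t
  end program pipeline_model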
51 Amdahl's Law: limits of parallel speedup (!)
Assume a parallelizable program running a fixed size test case having
- a parallel part taking a total time t_par
- a serial (non parallelizable) part taking a total time t_ser
- total time t_tot = t_par + t_ser
We will run the program with p parallel processes. The speedup is
  speedup(p) = t_tot / (t_par/p + t_ser) <= t_tot / t_ser
and the efficiency
  efficiency(p) = speedup(p) / p
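Evaluating the speedup formula for an assumed 99 % parallel / 1 % serial split shows the hard ceiling t_tot/t_ser = 100:

  program amdahl
    implicit none
    real(kind=8) :: t_par, t_ser, t_tot, speedup
    integer :: p
    t_par = 0.99d0   ! parallel share (assumed)
    t_ser = 0.01d0   ! serial share (assumed)
    t_tot = t_par + t_ser
    p = 1
    do while (p <= 65536)
       speedup = t_tot / (t_par / p + t_ser)
       write(*,*) p, speedup, speedup / p   ! p, speedup(p), efficiency(p)
       p = p * 2
    enddo
    write(*,*) 'upper limit t_tot / t_ser =', t_tot / t_ser   ! 100 here
  end program amdahl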
52 Amdahl's Law: limits of parallel speedup (!)
- these assumptions hold for any type of parallel work!
- the pictures show the speedup in dependence on t_ser/(t_par + t_ser) in percent
- the behaviour is not encouraging! Why can large scale parallel computing still be successful?
- the speedup behaviour of a parallel program for a fixed size test case is named strong scaling
53 Amdahl's Law with communication
Even worse: assume that communication times increase with the number of processes.
- a parallel part taking a total time t_par
- a communication time com*p, assumed proportional to the number of processes p; other dependencies to be discussed
- a serial part taking a total time t_ser
- total time t_tot = t_par + t_ser
We will run the program with p parallel processes. The speedup is
  speedup(p) = t_tot / (t_par/p + com*p + t_ser) <= t_tot / (com*p + t_ser)
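With the communication term the speedup even has a maximum and then decays; a sketch with an assumed com = 1.0e-4 per process (the maximum lies near p = sqrt(t_par/com), about 100 here):

  program amdahl_comm
    implicit none
    real(kind=8) :: t_par, t_ser, com, speedup
    integer :: p
    t_par = 0.99d0   ! parallel share (assumed)
    t_ser = 0.01d0   ! serial share (assumed)
    com   = 1.0d-4   ! communication cost per process (assumed)
    p = 1
    do while (p <= 65536)
       speedup = (t_par + t_ser) / (t_par / p + com * p + t_ser)
       write(*,*) p, speedup   ! grows, peaks near p ~ 100, then decreases
       p = p * 2
    enddo
  end program amdahl_comm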
54 Gustafson: increase the work 1 (!)
Amdahl assumes a fixed amount of work. But what can we do in a fixed amount of time?
- a parallel part taking a total time t_par = t_proc * p, proportional to the number of processors
- a serial part taking a total time t_ser
The speedup is
  speedup(p) = (t_proc p + t_ser) / (t_proc p / p + t_ser) = p t_proc / (t_proc + t_ser) + t_ser / (t_proc + t_ser)
and the efficiency
  efficiency(p) = speedup(p) / p = t_proc / (t_proc + t_ser) + t_ser / ((t_proc + t_ser) p)
The efficiency decreases to a lower limit.
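The same kind of sketch for Gustafson's scaled problem (again a 99 % / 1 % split assumed): the speedup grows without bound while the efficiency settles at t_proc/(t_proc + t_ser) = 0.99:

  program gustafson
    implicit none
    real(kind=8) :: t_proc, t_ser, speedup
    integer :: p
    t_proc = 0.99d0   ! per-process parallel work (assumed)
    t_ser  = 0.01d0   ! serial part (assumed)
    p = 1
    do while (p <= 65536)
       speedup = (t_proc * p + t_ser) / (t_proc + t_ser)
       write(*,*) p, speedup, speedup / p   ! efficiency -> t_proc/(t_proc+t_ser)
       p = p * 2
    enddo
  end program gustafson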
55 Gustafson: increase the work 2 (!)
- there are a lot of simulation problems of this type!
- but we are neglecting any necessary communication
- the speedup behaviour of a parallel program for a test case which grows with the number of active processors/cores is named weak scaling
56 Thank you for your attention!
Kuester[at]hlrs.de