Programming the Cell Multiprocessor: A Brief Introduction
David McCaughan, HPC Analyst
SHARCNET, University of Guelph
[email protected]

Overview
- Programming for the Cell is non-trivial; many issues must be addressed:
  - explicit management of heterogeneous compute cores
  - explicit management of limited memory bandwidth
  - explicit vectorization of code
- It will be impossible to cover all of this in a meaningful way in 3 hours
- Our goal:
  - ramp up our understanding of the major architectural issues
  - begin the process of getting hands-on experience with the Cell Broadband Engine
  - set the stage for taking our next steps
- I will attempt to follow up on this course with a more in-depth treatment over AccessGrid (possibly as a multi-part offering) at some point over the summer

A brief history of multiprocessing
- In the beginning: the von Neumann architecture
  - one processing unit + one memory + a bus
  - the processor executes one instruction at a time
  - other device elements share the bus for data transfer
  (figure: memory connected to the CPU by a bus)

Performance and the single processor
- Two bottlenecks:
  - compute: increased in agreement with Moore's Law
  - memory bandwidth: the bus is typically slow (a fraction of the CPU clock), memory access is slower still due to physics, and both increase more slowly than compute
- Strategies: pipelining, caching
Example: IBM Power 4
(figure: IBM Power 4 chip)

Multiprocessing: compute power
- Step 1: if it isn't broken, don't fix it
  - the von Neumann architecture is adaptable: add processors sharing the existing data bus
  - issues: cache coherence, memory bandwidth
  (figure: four CPUs sharing a bus to a common cache and memory)

Bus architecture
- Benefits:
  - simple design
  - easy to maintain cache coherence
- Issues:
  - contention for the bus; performance degrades quickly
  - amplifies memory bandwidth issues

Multi-processing: memory
- Step 2: try to keep the compute beast fed
  - limited benefit from having huge compute that cannot be kept busy with data
  - communication with memory is the issue
- Possible solutions:
  - crossbar switch
  - non-uniform memory access (NUMA)
Crossbar switch
- Each piece of hardware has a bus-like communication channel, organized so that they interconnect in a grid
  - some path exists between each piece of hardware through the crossbar
  - attempts to mitigate the bottleneck a single bus produces
- Switches at intersections route connections as necessary
  - must also resolve contention when more than one device wishes to access the same hardware
  (figure: a 4x4 crossbar switch connecting processors P1-P4 to memories 1-4)

Crossbar switch (cont.)
- Benefits:
  - uniform memory access
  - efficient shared memory
  - the programmer's view of the system is simplified
- Issues:
  - cache coherence (requires some means of broadcasting updates)
  - memory access patterns are still significant: processors hammering the same block of memory will still saturate the available bandwidth
  - does not scale well: increasing complexity (particularly the switches) and increasing cost

Non-uniform memory (NUMA)
- Allow models to scale by abandoning uniform memory access
  - processors are organized into nodes containing processors and local (fast) memory
  - nodes are connected using an interconnect
  - local memory will be much faster than accessing memory on another node
- Shared memory systems may make global memory appear as a single memory space
  - the extreme case is message passing in a cluster, where there is no direct remote memory access
  - on some level, this is simply a different way to look at the distributed memory paradigm
Non-uniform memory (cont.)
(figure: a non-uniform architecture constructed from 2-processor nodes, each with local (fast) memory, joined by a memory interconnect)

Multi-processing: the rise of multi-core
- For years, larger SMP machines with increasingly complex crossbar switches were built
  - FLOP/$ was so poor that clusters dominated
  - note: large SMP solutions today are all NUMA; widely used and a fair trade, but not true SMP
- Multi-core: multiple compute cores within a single chip
  - dual-core was cheap, yet nearly as powerful as a true dual-processor box
  - warning: now is when the marketing types typically take over
- Intel/AMD/Sun/Dell touting 64+ cores on a chip: is this realistic?
  - marketing; or "wacky, waving, inflatable, arm flailing tube man!"

Multi-core: the reality sets in
- Have faith that chip designers can design/build a chip with as many cores as they'd like
  - is this somehow circumventing the issues that plagued scaling of multi-processor systems?
  - have we solved the crossbar problem?
  - have we actually introduced yet another level of memory access complexity?
  - consider: sharing a cache among 60 cores (Dell slide from SC'08, minus cartoon)
- Strategies:
  - build more specialized hardware that works given the known limitations rather than pretending they don't exist
  - shift the burden of leveraging the hardware to the developer/compiler
Multi-core: the next generation
- Old way: multiple general-purpose CPU cores on a chip (homogeneous)
- New way: since we can't practically use a large number of general-purpose CPUs on the same chip anyway, specialize the cores to expose what is possible (heterogeneous)
  - homogeneous multi-core: non-uniform threads; highly specialized algorithms; limit bandwidth demands across threads
  - heterogeneous multi-core: GP-GPU (extension of classical SIMD), Cell processor

Cell processor
- PPE: power processor element
  - general-purpose core
  - L1/L2 cache
- SPE: synergistic processor elements
  - DMA unit
  - 256KB local store memory (LS)
  - execution units (SXU)
- On-chip coherent bus (EIB)
  - allows a single address space to be accessed by the PPE and SPEs
  - high bandwidth (200GB/s+)

Cell processor (cont.)
(figure: Cell Broadband Engine block diagram)

Feeding the multi-core beast (IBM)
- Programming strategies:
  - fine-grained control over the flow of data between main memory and the local store
  - DMA hardware gather/scatter of data
  - explicit control of data flow to hide latency (read/write/compute)
  - avoid hardware cache management
- Allows much more efficient use of the available bandwidth
  - the trade-off being that it's all you doing it
  - contrast with the EPIC architecture: only compiler-centric instruction scheduling
- It's now all about the data
  - orchestrate the delivery/removal of data
  - keep as much of the compute resources as possible working at all times
- Constraints:
  - main memory latency
  - size of the local store
  - bandwidth of the coherent bus
- And none of this is automated for you!
  - you should start to see why I'm on about ease of programming: taking steps backwards in order to leverage power
Programming the Cell
- Basic idea:
  - control functions are assigned to the PPE
  - calculation functions are assigned to the SPEs
- Basic programming model:
  - the PPE is used for execution of the main program
  - the SPEs execute sub-programs, making use of DMA transfers to get/put data independent of the PPE
- Maximizing performance involves:
  - operating the SPEs in parallel to maximize instructions per unit time
  - performing SIMD parallelization on each SPE to maximize instructions per cycle

Program control and data flow
- [PPE]:
  1. load the SPE program into LS
  2. instruct the SPE to execute the program
- [SPE]:
  1. transfer data: memory -> LS
  2. process the data
  3. transfer results: LS -> memory
  4. notify the PPE that processing is done

SIMD programming
- Similar to AltiVec/SSE/etc.
- Large registers (128 bit)
- Explicit vector instructions can manipulate register contents as multiple values simultaneously
  - 4 x 32-bit float, 8 x 16-bit integer, 16 x 8-bit integer, etc.
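To make the "4 x 32-bit float" case concrete, here is a minimal sketch (mine, not from the original slides) of adding two float arrays four elements at a time using the VMX/AltiVec extensions we return to later; on an SPE the analogous code would use spu_add() from spu_intrinsics.h. The names add_arrays and N are illustrative only.

    #include <altivec.h>                    /* VMX vector types and intrinsics */

    #define N 1024                          /* assume N is a multiple of 4 */

    float a[N] __attribute__((aligned(16)));
    float b[N] __attribute__((aligned(16)));
    float c[N] __attribute__((aligned(16)));

    void add_arrays(void)
    {
        /* view the 16-byte-aligned float arrays as arrays of 4-float vectors */
        vector float *va = (vector float *)a;
        vector float *vb = (vector float *)b;
        vector float *vc = (vector float *)c;
        int i;

        for (i = 0; i < N / 4; i++)
            vc[i] = vec_add(va[i], vb[i]);  /* four single-precision adds at once */
    }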
Cell programming fundamentals
- SPE Runtime Management Library (libspe v2)
  - controls SPE program execution from the PPE program
  - handles SPEs as virtual objects (SPE contexts)
  - SPE programs are loaded and executed by operating on SPE contexts (note: there are file system issues here)
- Programming basics:
  - #include "libspe2.h"
  - compile using Cell-specific language support
    - IBM: xlc, gcc (requires gcc with support for vector extensions)
  - IBM provides a decent compilation framework we can (and will) use today --- see exercise

The SHARCNET Cell cluster: prickly.sharcnet.ca
- Heterogeneous cluster:
  - dual quad-core [email protected], 8GB RAM x 4 nodes
  - dual PowerXCell [email protected], 16GB RAM x 8 nodes
  - no scheduler currently; OpenMPI
- NOTE: you must be logged in to one of the Cell nodes in order to access the Cell development environments
- See also: Cell_Accelerated_Computing
  - this is the "new" training material; although not linked from the main page yet, you must be logged into the web portal to view it

PPE programming
- PPE program execution basics:
  1. open the SPE program image
  2. create an SPE context
  3. load the SPE program into LS
  4. execute the program on the SPE
  5. destroy the SPE context
  6. close the SPE program image

SPE program image
- The SPE program is compiled and linked as a separate entity from the PPE program
  - there is no OS running on the SPEs
  - the PPE must explicitly copy the SPE image from a file to memory, and then explicitly invoke the DMA controller on an SPE to move the SPE image into LS
- SPE-ELF images are embedded in the PPE binaries
  - declare an external reference to force inclusion, i.e.:

    extern spe_program_handle_t spe_program_handle;
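Steps 1 and 6 above only apply when the SPE image is loaded from the file system rather than embedded; libspe2 provides spe_image_open()/spe_image_close() for that case. A minimal sketch follows (the file name "chello_spu.elf" is hypothetical, and error checking on the later calls is omitted for brevity); the examples in this course use the embedded extern handle instead, in which case these two calls are unnecessary.

    #include <stdio.h>
    #include <stdlib.h>
    #include <libspe2.h>

    int main(void)
    {
        spe_program_handle_t *program;
        spe_context_ptr_t ctx;
        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_stop_info_t stop_info;

        /* 1. open the SPE program image from a stand-alone SPE-ELF file */
        if (!(program = spe_image_open("chello_spu.elf"))) {
            perror("spe_image_open");
            exit(1);
        }

        ctx = spe_context_create(0, NULL);                        /* 2. create context       */
        spe_program_load(ctx, program);                           /* 3. load program into LS */
        spe_context_run(ctx, &entry, 0, NULL, NULL, &stop_info);  /* 4. run (blocks)         */
        spe_context_destroy(ctx);                                 /* 5. destroy context      */
        spe_image_close(program);                                 /* 6. close program image  */
        return 0;
    }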
Creating an SPE context

    spe_context_ptr_t spe_context_create(unsigned int flags,
                                         spe_gang_context_ptr_t *gang);

- Creates a new SPE context; if none is available, it will either block or return an error if you configure the library to do so
- flags: adjust behaviour/policies of the requested SPE
- gang: allows for control of SPE affinity (if not NULL); contexts are associated with physical SPEs randomly by libspe by default

Load SPE program into context

    int spe_program_load(spe_context_ptr_t spe, spe_program_handle_t *program);

- Loads an SPE main program into the provided SPE context (loads into the SPE LS); returns 0 on success, non-zero (-1) otherwise
- spe: SPE context on which to load the program
- program: address of the SPE program (recall the extern reference to the SPE program image)

Execute program on SPE

    int spe_context_run(spe_context_ptr_t spe, unsigned int *entry,
                        unsigned int runflags, void *argp, void *envp,
                        spe_stop_info_t *stopinfo);

- Executes the program loaded in the provided SPE context (NOTE: blocking call!)
- entry: entry point for the program (SPE_DEFAULT_ENTRY = main); updated during the call
- runflags: permit control of execution behaviour (0 = defaults)
- argp, envp: arguments and environment data --- passed to the SPE main
- stopinfo: information about the termination condition of the SPE program

Destroy SPE context

    int spe_context_destroy(spe_context_ptr_t spe);

- Destroys the provided SPE context and frees the associated resources
- spe: SPE context to be destroyed
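As a sketch of how the stopinfo parameter can be used (not from the slides; the field and constant names stop_reason, SPE_EXIT and result.spe_exit_code are as I recall them from the libspe2 documentation, so treat this as an assumption to verify against the SDK headers):

    #include <stdio.h>
    #include <libspe2.h>

    /* run a context that has already been created and loaded, and report how it stopped */
    int run_and_report(spe_context_ptr_t spe)
    {
        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_stop_info_t stop_info;

        if (spe_context_run(spe, &entry, 0, NULL, NULL, &stop_info) < 0) {
            perror("spe_context_run");
            return -1;
        }
        if (stop_info.stop_reason == SPE_EXIT)
            printf("SPE program exited normally with code %d\n",
                   stop_info.result.spe_exit_code);
        else
            printf("SPE program stopped abnormally (reason %u)\n",
                   stop_info.stop_reason);
        return 0;
    }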
SPE program essentials
- SPE programs are particular entities that require specific compiler support
  - we use a different compiler for SPE compilation
  - recall: no OS runs on the SPE
- Special format for main():

    int main(unsigned long long spe,
             unsigned long long argp,
             unsigned long long envp)
    {
        ...
    }

Example: "Hello, world!"
- SPE view: pretty much as expected
- In order to provide more information, we make use of the speid provided to the main() routine in our output

    #include <stdio.h>

    int main(unsigned long long speid,
             unsigned long long argp,
             unsigned long long envp)
    {
        printf("hello, world, from SPE %lld!\n", speid);
        return(0);
    }

Example (cont.)
- PPE controller for "Hello, world!"
- Note: this is just for 1 SPE running the program

    #include <stdlib.h>
    #include <stdio.h>
    #include <libspe2.h>

    extern spe_program_handle_t chello_spu;

    /*
     * NOTE: these need not be global
     */
    spe_context_ptr_t speid;
    unsigned int entry;
    spe_stop_info_t stop_info;
    unsigned int flags = 0;

SPE programming (cont.)

    int main(int argc, char *argv[])
    {
        entry = SPE_DEFAULT_ENTRY;

        if (!(speid = spe_context_create(flags, NULL))) {
            perror("spe_context_create");       /* error creating context */
            exit(1);
        }
        if (spe_program_load(speid, &chello_spu)) {
            perror("spe_program_load");         /* error loading program into context */
            exit(1);
        }
        if (spe_context_run(speid, &entry, 0, NULL, NULL, &stop_info) < 0) {
            perror("spe_context_run");          /* error running program in context */
            exit(1);
        }

        spe_context_destroy(speid);
        return(0);
    }
Exercise
1. log into prickly and shell to one of the Cell nodes (pri5-pri12)
2. in ~dbm/pub/exercises/cell/chello you'll find a working implementation of the "Hello, world!" program we just considered; copy this directory to your own workspace
3. examine the directory structure, source code and Makefile
4. build this program and execute it on the command line
5. modify the program to execute the SPE program a second time

Food for thought
- did the second execution occur in parallel?
- did the second execution occur on the same SPE?
- think about how you would make it execute on the same and on different SPEs

Running SPEs in parallel
- Recall that spe_context_run() is a blocking call
  - the PPE program blocks until the SPE program exits
  - obviously we want things running on the SPEs in parallel --- how are we going to accomplish this?
- Answer: pthreads
  - the Cell BE expects the programmer to use POSIX threads to allow multiple SPEs to execute concurrently
  - i.e. execute spe_context_run() in its own thread
  - each thread can block until the SPE program exits
  - this allows for concurrent execution on the SPEs

Pthreads review
- Include the pthread header file: #include <pthread.h>
- Compile with pthreads support/library: cc -pthread
  - compiler vendors may differ in their usage to support pthreads (link a library, etc.)
  - GNU and xlc use the -pthread argument, so this will suffice for our purposes
  - when in doubt, consult the man page for the compiler in question

Pthreads Programming Basics (cont.)
- Note that all processes have an implicit main thread of control
- We need a means of creating a new thread: pthread_create()
- We need a way of terminating a thread
  - threads are terminated implicitly when the function that was the entry point of the thread returns, or can be explicitly terminated using pthread_exit()
- We may need to distinguish one thread from another at run-time: pthread_self(), pthread_equal()
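Before looking at the signatures in detail, here is a minimal, generic pthreads sketch (mine, not from the slides) tying the create/terminate/join pattern together; the names worker and tid are illustrative.

    #include <stdio.h>
    #include <pthread.h>

    /* entry point: every pthread starts in a function taking and returning void* */
    static void *worker(void *arg)
    {
        printf("worker received argument \"%s\"\n", (const char *)arg);
        pthread_exit((void *)42);      /* explicit termination; returning also ends the thread */
    }

    int main(void)
    {
        pthread_t tid;
        void *status;

        pthread_create(&tid, NULL, worker, "hello");   /* build with: cc -pthread ... */
        pthread_join(tid, &status);                    /* wait and collect the exit value */
        printf("worker finished with status %ld\n", (long)status);
        return 0;
    }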
pthread_create

    int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                       void *(*f_start)(void *), void *arg);

- Creates a new thread with a function as its entry point
- thread: handle (ID) for the created thread
- attr: attributes for the thread (if not NULL)
- f_start: pointer to the function that is to be called first in the new thread
- arg: the argument provided to f_start when called
  - consider: how would you provide multiple arguments to a thread?

pthread_join

    int pthread_join(pthread_t thread, void **status);

- Suspends execution of the current thread until the specified thread is complete
- thread: handle (ID) of the thread we are waiting on to finish
- status: value returned by f_start, or provided to pthread_exit() (if not NULL)

Exercise
1. refer again to the "Hello, world!" code in ~dbm/pub/exercises/cell/chello
   - you may wish to make another copy of the directory (or use the one you modified previously)
2. modify the program so that it loads the chello_spu program into all 8 SPEs and runs them concurrently
- Note: you will need to modify the data structures to allow for the record-keeping associated with multiple SPE contexts, move the code responsible for executing the SPE program into a function that will be the entry point of the threads, and introduce loops to spawn threads to execute the SPE programs and wait for them to finish (a sketch of one possible shape follows below)

Food for thought
- how are you able to verify that the SPE programs are executing on different cores?
- if you did not use threads, how would the SPE programs execute?
- what can happen if you don't wait for the SPE-running threads to exit (i.e. don't use pthread_join)?

Toward more advanced programs
- The PPE is typically responsible for partitioning data
- Consider the constraints for multi-SPE execution:
  1. open the SPE program image
  2. create all SPE contexts
  3. load the SPE program into all LS memories
  4. execute the SPE program on each thread
  5. wait for the threads to terminate
  6. destroy all SPE contexts
  7. close the SPE program image
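One possible shape of the multi-SPE solution, as a sketch rather than the official answer to the exercise: a small struct carries everything a thread needs (which also answers the "multiple arguments" question above), and each thread simply blocks in spe_context_run(). The names spe_job_t, run_spe and NUM_SPES are mine.

    #include <stdio.h>
    #include <stdlib.h>
    #include <pthread.h>
    #include <libspe2.h>

    #define NUM_SPES 8

    extern spe_program_handle_t chello_spu;   /* embedded SPE image, as before */

    /* per-SPE record keeping: a struct lets us hand a thread several values at once */
    typedef struct {
        spe_context_ptr_t ctx;
        int index;
    } spe_job_t;

    /* thread entry point: blocks in spe_context_run() until the SPE program exits */
    static void *run_spe(void *arg)
    {
        spe_job_t *job = (spe_job_t *)arg;
        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_stop_info_t stop_info;

        if (spe_context_run(job->ctx, &entry, 0, NULL, NULL, &stop_info) < 0)
            fprintf(stderr, "spe_context_run failed for SPE %d\n", job->index);
        return NULL;
    }

    int main(int argc, char *argv[])
    {
        spe_job_t job[NUM_SPES];
        pthread_t thread[NUM_SPES];
        int i;

        for (i = 0; i < NUM_SPES; i++) {
            job[i].index = i;
            if (!(job[i].ctx = spe_context_create(0, NULL))) {
                perror("spe_context_create");
                exit(1);
            }
            if (spe_program_load(job[i].ctx, &chello_spu)) {
                perror("spe_program_load");
                exit(1);
            }
            if (pthread_create(&thread[i], NULL, run_spe, &job[i]) != 0) {
                fprintf(stderr, "pthread_create failed for SPE %d\n", i);
                exit(1);
            }
        }

        for (i = 0; i < NUM_SPES; i++) {
            pthread_join(thread[i], NULL);      /* wait for that SPE program to finish */
            spe_context_destroy(job[i].ctx);
        }
        return 0;
    }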
Advancing the SPE: Data Management
- Going bigger: must make use of the DMA engine
  - data transfer between memory and LS is controlled by the SPE program:
    1. issue the DMA transfer command
    2. execute the DMA transfer
    3. wait for completion of the transfer
    (a minimal sketch of this pattern appears at the end of this section)
  - note: fetch/execute/store is typical for most tasks

Advancing the SPE: SIMD Programming
- To this point we have only discussed running programs in parallel on the SPE cores
- Each SPE core is a vector unit, and further speed-up is possible by using language extensions and vector operations
  - vector types corresponding to the scalar types of merit
  - vector operations that operate on the vector types
  - note: vector types map directly onto contiguous scalar types, so you can leave all data in arrays and step through them with pointers to the vector type
- The real power of the Cell is leveraging vectorized code running in parallel on the SPEs

SIMD programming example

    #include <stdio.h>
    #include <altivec.h>   /* necessary to access VMX instructions */

    int a[4] __attribute__((aligned(16))) = {1, 2, 3, 4};
    int b[4] __attribute__((aligned(16))) = {5, 6, 7, 8};
    int c[4] __attribute__((aligned(16)));

    int main(int argc, char *argv[])
    {
        vector signed int *va = (vector signed int *)a;
        vector signed int *vb = (vector signed int *)b;
        vector signed int *vc = (vector signed int *)c;

        *vc = vec_add(*va, *vb);
        printf("c = <%d, %d, %d, %d>\n", c[0], c[1], c[2], c[3]);
        return(0);
    }

Other considerations
- We are only scratching the surface in this brief introduction
  - channels: communication and coordination facilities (e.g. mailboxes), PPE <-> SPE and SPE <-> SPE
  - gangs: affinity of SPEs can be important (consider channels); gang contexts allow us to control the mapping of contexts to physical SPEs
  - dealing with Cell constraints:
    - memory bandwidth issues (consider main memory vs. EIB)
    - 256KB local store (potential for library issues!)
    - physical organization of the EIB rings
    - memory alignment issues
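The promised DMA sketch: a minimal SPE-side illustration (mine, not from the slides) of the issue/wait pattern using the mfc_get/mfc_put macros from spu_mfcio.h in the IBM SDK. It assumes argp carries the effective address of a suitably aligned CHUNK-byte float buffer in main memory; CHUNK and the trivial processing step are illustrative, and each single transfer must be no larger than 16KB.

    #include <spu_mfcio.h>             /* mfc_get/mfc_put and tag-status helpers */

    #define CHUNK 4096                 /* bytes per DMA transfer (<= 16KB) */

    static float buf[CHUNK / sizeof(float)] __attribute__((aligned(128)));

    int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
    {
        unsigned int tag = 1;          /* any DMA tag id in 0..31 */
        unsigned int i;

        /* 1./2. issue and execute the DMA: main memory (effective address in argp) -> LS */
        mfc_get(buf, argp, CHUNK, tag, 0, 0);

        /* 3. wait for completion of all outstanding transfers with this tag */
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();

        /* process the data in the local store (here: trivially scale it) */
        for (i = 0; i < CHUNK / sizeof(float); i++)
            buf[i] *= 2.0f;

        /* write the results back: LS -> main memory, and wait again */
        mfc_put(buf, argp, CHUNK, tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();

        return 0;
    }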
Summary
- The real work with the Cell is in mapping your problem into a decomposition suitable to the Cell processor constraints
  - not trivial; increasing complexity
  - BFS example: 60 lines of code -> 1200 lines of code (optimized); 24 million edges/second -> 538 million edges/second
- Consider: domain decomposition for something of interest to you
  - FFT, neural network, etc.
  - where are the problems? how would you refactor?

Next steps
- Developer package
  - the SDK is free (RHEL, Fedora)
  - GNU tool chain
  - analysis and optimization tools, including a sophisticated software Cell simulator
  - optimized numerical libraries: MASS, BLAS, LAPACK, FFT, etc.
  - documentation and examples
- Note: prickly is set up using the same software