High-Performance Computing: Architecture and APIs
1 High-Performance Computing: Architecture and APIs Douglas Fuller ASU Fulton High Performance Computing
2 Why HPC? Capacity computing Do similar jobs, but lots of them. Capability computing Run programs we could not before! Insufficient resources (memory, usually) Insufficient time
3 Commonly Discussed HPC Terms Supercomputer Cluster (Beowulf Cluster) Shared Memory Machine Grid Scalability
4 Intel Core 2 Duo - ~3 GFLOPS
5 Scaling von Neumann Where are the bottlenecks? (diagram: CPU, memory, I/O)
6 Vector processing SIMD approach (sound familiar?) Special register/execution units Handles large amounts of data at once Good for linear algebra / scientific computing Can be assisted by language support Can be partially leveraged by compilers
7 Cray-1 First successful vector system: 64-bit, 80 MHz 8 MB RAM 250 MFLOPS peak (136 typical) 115 kW
8 Vector processing Resurgence with high-end A/V market MMX, 3DNow!, SSE, SSE2 GPUs Game consoles iPhones Vector processing leading to gaming/HPC convergence (Cell)
9 SMP Systems MIMD approach Commodity processors connect through an interconnect to a single logical memory Demands on the interconnection bus are extremely high Sustained memory bandwidth Fetch latency Cache coherence is a real problem Programs must still have concurrency, but variables are shared
10 Digression: cache coherence Multiple processors using the same data These processors' caches must stay synchronized! This introduces considerable overhead and limits scalability
11 Programming SMPs The same as multithreaded serial programming, right? (example) Well, almost. More locality issues False sharing Toolkits to help OpenMP
12 SMP systems today Just about everything has multiple cores Intel Core, Core 2, Xeon,... AMD Opteron,... Cache strategies vary Transistor count (AMD vs. Intel) Memory bandwidth (and local IC buses)
13 Scaling the SMP Remember the critical system bus Broadcast coherence messages and shared links impose high bandwidth requirements Cores aren't the problem. Solution: serialize memory bus communication Removes the S in SMP
14 NUMA Similar to SMP, but we give up the S Reduces bus bandwidth requirements Requires interconnect design (more later) Introduces a penalty for remote memory access! Cache coherence pops up again What about for remote memory?
15 The directory Tracks which CPUs have each cache line Allows point-to-point messages for cache coherence How do you locate a remote block that's cached by another processor?
16 Programming NUMA It's just like writing for SMPs. (example) Right? Sort of. It looks the same, but there are more factors to consider. Architecture design imposes a performance impact Code still must be architecture-aware!
17 NUMA today Still exists for HPC, but expensive Custom hardware, directory units, interconnects Custom software (single system image) Commodity processors AMD Opteron (DirectConnect brings MMU onboard)
18 MPP Systems Use a large number of weaker processors Most decouple their memory subsystems - Distributed memory Relies on: Smart system, Smart compiler, or Smart programmer
19 MPP Systems Processors interconnected with custom hardware Architectures vary widely
20 Thinking Machines CM-5 Up to 16,384 32 MHz processors Largest ever built was 512 processors, 64 GF peak 16 GB main memory Where was the most famous CM-5?
21 Programming MPPs Program follows architecture (including interconnect) Many MPPs support multiple models More architecture-aware models perform better Less architecture-aware models are more portable What to choose when developing a program?
22 Interconnecting Topology choice critical; considerations include: Performance (latency and bandwidth) Conformity/uniformity Cost Scalability
23 Interconnecting Completely Connected: each processor has a direct communication link to every other processor (fully connected) Star Connected Network: the middle processor is the central processor; every other processor is connected to it (counterpart of the crossbar switch in dynamic interconnects) Linear Array and Ring Mesh Network (e.g. 2D array)
24 Interconnecting Torus: 2-d torus (the 2-d version of the ring) Hypercube Network: a multidimensional mesh of processors with exactly two processors in each dimension; a d-dimensional hypercube consists of p = 2^d processors (shown: 0-D, 1-D, 2-D, and 3-D hypercubes) Tree and Fat Trees: multiple switches; each level has the same number of links in as out; an increasing number of links at each level gives full bandwidth between the links; added latency the higher you go
25 Look familiar? Desktop systems use the same architectures Token Ring SONET FDDI Ethernet
26 Desktop systems Leverage the economics Commodity parts CPUs and memory Circuit City supercomputing Interconnect is now a commodity network.
27 Beowulf Clusters A Beowulf is a parallel computer consisting of a collection of nodes built from commodity parts Each node has its own processors, memory, and I/O Nodes communicate through an interconnection network. One node, designated master or head, is attached to both the public network and the interconnection network (diagram: compute nodes on the interconnection network, master node on the Internet or internal network)
28 Clusters Important Characteristics Commodity Components - Mass Market R&D investment keeps technology moving forward Distributed memory - your old program won't speed up Communication between processors has a cost
29 Programming clusters Multiple system images, therefore there is NO shared memory. Many models try to emulate earlier architectures. Why? By far the most popular is MPI.
30 A 10 minute introduction to MPI
31 What is MPI? The Message Passing Interface, de facto standard for message passing Unified many vendor-specific message-passing libraries in the 1990s Works with C, FORTRAN, F90 (always), C++ (usually) and more exotic things (e.g. Python) occasionally Allows programmer to explicitly send/receive messages among processes in a parallel program Supports the Data Parallel programming model
32 Data Parallel Programming One program, many copies. Each instance of the program (task) does the same instructions on different data. Each task has its own local memory The trick (for the programmer): Remember it's parallel Remember what's in what memory
33 Why the Data Parallel Model? Only one program to worry about. Easier to debug program. Easier to visualize program behavior. Naturally load balances (sometimes).
34 Introduction to MPI MPI is a standard for message passing interfaces MPI-1 covers point-to-point and collective communication Point-to-point: explicit messages (send/receive) Collective: express patterns of communication MPI-2 covers connection-based communication and I/O Typical implementations include MPICH, LAM/MPI, and Open MPI
35 MPI in Six Functions MPI_Init - start using MPI MPI_Comm_size - get the number of tasks MPI_Comm_rank - the unique index of this task MPI_Send - send a message MPI_Recv - receive a message MPI_Finalize - stop using MPI
36 Initialize and Finalize The first MPI call must be to MPI_Init. The last MPI call must be to MPI_Finalize. #include <mpi.h> int main(int argc, char **argv) { MPI_Init(&argc, &argv); /* put program here */ MPI_Finalize(); return 0; }
37 Initialize and Finalize C: int MPI_Init(int *argc, char ***argv); int MPI_Finalize(void); Fortran: MPI_INIT(ierror), MPI_FINALIZE(ierror), integer ierror C++: void MPI::Init(int& argc, char**& argv); void MPI::Finalize();
38 Size and Rank MPI_Comm_size returns the number of tasks in the job int size; MPI_Comm_size(MPI_COMM_WORLD, &size); MPI_Comm_rank returns the number of the current task (0.. size-1) int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
39 MPI Communicators Abstract structure represents a group of MPI tasks that can communicate MPI_COMM_WORLD represents all of the tasks in a given job Programmer can create new communicators to subset MPI_COMM_WORLD RANK or task number is relative to a given communicator Messages from different communicators do not interfere
40 A Simple Example #include "mpi.h" #include <stdio.h> int main(int argc, char *argv[]) { int rank, size; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &rank); MPI_Comm_size(MPI_COMM_WORLD, &size); printf("hello world from process %d of %d\n", rank, size); MPI_Finalize(); return 0; }
41 Send and Recv MPI_Send to send a message char sbuf[COUNT]; MPI_Send(sbuf, COUNT, MPI_CHAR, 1, 99, MPI_COMM_WORLD); MPI_Recv to receive a message char rbuf[COUNT]; MPI_Status status; MPI_Recv(rbuf, COUNT, MPI_CHAR, 1, 99, MPI_COMM_WORLD, &status);
42 Anatomy of MPI_Recv MPI_Recv(rbuf, COUNT, MPI_CHAR, 1, 99, MPI_COMM_WORLD, &status); rbuf : pointer to receive buffer COUNT : items in receive buffer MPI_CHAR : MPI datatype 1 : source task number (rank) 99 : message tag MPI_COMM_WORLD : communicator &status : pointer to status struct
43 MPI Datatypes Encodes type of data sent and received Built-in types MPI_CHAR, MPI_SHORT, MPI_INT, MPI_LONG MPI_FLOAT, MPI_DOUBLE, MPI_LONG_DOUBLE MPI_BYTE, MPI_PACKED User defined types MPI_Type_contiguous, MPI_Type_vector, MPI_Type_indexed, MPI_Type_struct MPI_Pack, MPI_Unpack
44 A Quick Send and Receive Example #include "mpi.h" #include <stdio.h> #include <string.h> int main(int argc, char *argv[]) { int numprocs, myrank, namelen, i; char processor_name[MPI_MAX_PROCESSOR_NAME]; char greeting[MPI_MAX_PROCESSOR_NAME + 80]; MPI_Status status; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &myrank); MPI_Comm_size(MPI_COMM_WORLD, &numprocs); MPI_Get_processor_name(processor_name, &namelen); sprintf(greeting, "hello world from process %d of %d on %s", myrank, numprocs, processor_name);
45 A Quick Send and Receive Example if (myrank == 0) { printf("%s\n", greeting); for (i = 1; i < numprocs; i++) { MPI_Recv(greeting, sizeof(greeting), MPI_CHAR, i, 1, MPI_COMM_WORLD, &status); printf("%s\n", greeting); } } else { MPI_Send(greeting, strlen(greeting) + 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD); } MPI_Finalize(); return 0; }
46 Collective Operations Rather than dealing with individual messages, express common patterns of communication Simpler coding Hide optimization Hide cluster topology details Called at the same time by every task in the communicator (no if/else) - true data parallel
47 Common Collectives Broadcast / Reduce Scatter / Gather Barrier All-to-all
48 Sending to 8 nodes with a for loop and MPI_Send vs. MPI_Bcast
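The comparison above can be sketched as follows (requires an MPI installation; run with e.g. mpirun -np 8 ./a.out). Both approaches deliver the same value to every rank, but MPI_Bcast lets the library use a tree-shaped communication pattern instead of the root sending size-1 messages one at a time.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Naive version: root loops over MPI_Send, everyone else receives. */
    if (rank == 0) {
        value = 42;
        for (int i = 1; i < size; i++)
            MPI_Send(&value, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    /* Collective version: one call, made by every rank in the
       communicator, with rank 0 as the root. */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d has value %d\n", rank, value);
    MPI_Finalize();
    return 0;
}
```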
49 Message Passing Cautions All messages are overhead (the non-parallel program wouldn't have them). Messages take substantial time Use them only when necessary, and group together as many as possible (long blocks of computation between communication raises performance!).
51 Parallelism in Monte Carlo Methods Monte Carlo methods are often amenable to parallelism Find an estimate about p times faster OR Reduce error of estimate by a factor of p^(1/2) The trick to parallelizing MC methods is developing independent random number generators!!!
52 Linear Congruential RNGs X_i = (a X_{i-1} + c) mod M a: multiplier, c: additive constant, M: modulus Sequence depends on choice of seed, X_0
57 Period of Linear Congruential RNG Maximum period is M For 32-bit integers maximum period is 2^32, or about 4 billion This is too small for modern computers Use a generator with at least 48 bits of precision
58 Producing Floating-Point Numbers X_i, a, c, and M are all integers X_i ranges in value from 0 to M-1 To produce floating-point numbers in range [0, 1), divide X_i by M
59 Defects of Linear Congruential RNGs Least significant bits correlated Especially when M is a power of 2 k-tuples of random numbers form a lattice Especially pronounced when k is large
60 Lagged Fibonacci RNGs X_i = X_{i-p} * X_{i-q} p and q are lags, p > q * is any binary arithmetic operation: addition modulo M, subtraction modulo M, multiplication modulo M, or bitwise exclusive or
67 Properties of Lagged Fibonacci RNGs Require p seed values Careful selection of seed values, p, and q can result in very long periods and good randomness For example, suppose M has b bits Maximum period for additive lagged Fibonacci RNG is (2^p - 1) 2^(b-1)
68 Ideal Parallel RNGs All properties of sequential RNGs No correlations among numbers in different sequences Scalability Locality
69 Parallel RNG Designs Manager-worker Leapfrog Sequence splitting Independent sequences
70 Manager-Worker Parallel RNG Manager process generates random numbers Worker processes consume them If algorithm is synchronous, may achieve goal of consistency Not scalable Does not exhibit locality
71 Leapfrog Method Process with rank 1 of 4 processes (diagram: each process takes every 4th value of the sequence, so rank 1 gets X_1, X_5, X_9, ...)
79 Properties of Leapfrog Method Easy to modify linear congruential RNG to support jumping by p Can allow parallel program to generate same tuples as sequential program Does not support dynamic creation of new random number streams
80 Sequence Splitting Process with rank 1 of 4 processes (diagram: the sequence is divided into contiguous blocks, one block per process)
88 Properties of Sequence Splitting Forces each process to move ahead to its starting point Does not support goal of reproducibility May run into long-range correlation problems Can be modified to support dynamic creation of new sequences
89 Independent Sequences Run sequential RNG on each process Start each with different seed(s) or other parameters Example: linear congruential RNGs with different additive constants Works well with lagged Fibonacci RNGs Supports goals of locality and scalability
90 Best Approach - Use an Existing Library SPRNG (Scalable Parallel Random Number Generator) from Florida State is an MPI-based library for generating random numbers independently Linear congruential generator on one node provides seed values for lagged Fibonacci generators on other nodes Ridiculously long period, good statistical properties SPRNG is simple and robust, and is highly recommended.
91 SPRNG Example See code listings: sprng_mpi.c seed_mpi.c 2streams_mpi.c pi-simple_mpi.c
92 Parting Note... If you are doing anything special (massive runs, massive storage, massive memory, meeting deadlines, non-traditional usage), please contact us and let us work with you to meet your needs. Policies are there to keep automated systems running well, they are not locked in stone.
93 How to Get More Help Online If something isn't there, fill out a service request to ask for help (same form as account request) Someone will respond next business day hpc@asu.edu Phone -- Leah Kritzer - The HPCI front desk More lectures would be fun: Short courses offered again soon CSE 494/598 (SP07) - a one-semester course in MPI and HPC
94 Grids Loosely coupled sets of HPC (and other) compute resources No centralized control Middleware moves jobs to resources A way to share resources (diagram: workstation and grid portal connected to clusters, an SMP, and a database server)
White Paper The Numascale Solution: Extreme BIG DATA Computing By: Einar Rustad ABOUT THE AUTHOR Einar Rustad is CTO of Numascale and has a background as CPU, Computer Systems and HPC Systems De-signer
More informationBenchmarking Large Scale Cloud Computing in Asia Pacific
2013 19th IEEE International Conference on Parallel and Distributed Systems ing Large Scale Cloud Computing in Asia Pacific Amalina Mohamad Sabri 1, Suresh Reuben Balakrishnan 1, Sun Veer Moolye 1, Chung
More informationSession 2: MUST. Correctness Checking
Center for Information Services and High Performance Computing (ZIH) Session 2: MUST Correctness Checking Dr. Matthias S. Müller (RWTH Aachen University) Tobias Hilbrich (Technische Universität Dresden)
More informationRevoScaleR Speed and Scalability
EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution
More informationIntroduction. Reading. Today MPI & OpenMP papers Tuesday Commutativity Analysis & HPF. CMSC 818Z - S99 (lect 5)
Introduction Reading Today MPI & OpenMP papers Tuesday Commutativity Analysis & HPF 1 Programming Assignment Notes Assume that memory is limited don t replicate the board on all nodes Need to provide load
More informationBLM 413E - Parallel Programming Lecture 3
BLM 413E - Parallel Programming Lecture 3 FSMVU Bilgisayar Mühendisliği Öğr. Gör. Musa AYDIN 14.10.2015 2015-2016 M.A. 1 Parallel Programming Models Parallel Programming Models Overview There are several
More informationOverlapping Data Transfer With Application Execution on Clusters
Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm reid@cs.toronto.edu stumm@eecg.toronto.edu Department of Computer Science Department of Electrical and Computer
More informationLecture 1: the anatomy of a supercomputer
Where a calculator on the ENIAC is equipped with 18,000 vacuum tubes and weighs 30 tons, computers of the future may have only 1,000 vacuum tubes and perhaps weigh 1½ tons. Popular Mechanics, March 1949
More informationCOMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)
COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) Vivek Sarkar Department of Computer Science Rice University vsarkar@rice.edu COMP
More informationInterconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003
Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003 Josef Pelikán Charles University in Prague, KSVI Department, Josef.Pelikan@mff.cuni.cz Abstract 1 Interconnect quality
More informationCellular Computing on a Linux Cluster
Cellular Computing on a Linux Cluster Alexei Agueev, Bernd Däne, Wolfgang Fengler TU Ilmenau, Department of Computer Architecture Topics 1. Cellular Computing 2. The Experiment 3. Experimental Results
More informationnumascale White Paper The Numascale Solution: Extreme BIG DATA Computing Hardware Accellerated Data Intensive Computing By: Einar Rustad ABSTRACT
numascale Hardware Accellerated Data Intensive Computing White Paper The Numascale Solution: Extreme BIG DATA Computing By: Einar Rustad www.numascale.com Supemicro delivers 108 node system with Numascale
More informationWhy the Network Matters
Week 2, Lecture 2 Copyright 2009 by W. Feng. Based on material from Matthew Sottile. So Far Overview of Multicore Systems Why Memory Matters Memory Architectures Emerging Chip Multiprocessors (CMP) Increasing
More informationwhat operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?
Inside the CPU how does the CPU work? what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored? some short, boring programs to illustrate the
More informationUnderstanding the Benefits of IBM SPSS Statistics Server
IBM SPSS Statistics Server Understanding the Benefits of IBM SPSS Statistics Server Contents: 1 Introduction 2 Performance 101: Understanding the drivers of better performance 3 Why performance is faster
More informationDavid Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems
David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems About me David Rioja Redondo Telecommunication Engineer - Universidad de Alcalá >2 years building and managing clusters UPM
More informationParallel Processing and Software Performance. Lukáš Marek
Parallel Processing and Software Performance Lukáš Marek DISTRIBUTED SYSTEMS RESEARCH GROUP http://dsrg.mff.cuni.cz CHARLES UNIVERSITY PRAGUE Faculty of Mathematics and Physics Benchmarking in parallel
More informationBinary search tree with SIMD bandwidth optimization using SSE
Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous
More informationCMSC 611: Advanced Computer Architecture
CMSC 611: Advanced Computer Architecture Parallel Computation Most slides adapted from David Patterson. Some from Mohomed Younis Parallel Computers Definition: A parallel computer is a collection of processing
More informationAgenda. HPC Software Stack. HPC Post-Processing Visualization. Case Study National Scientific Center. European HPC Benchmark Center Montpellier PSSC
HPC Architecture End to End Alexandre Chauvin Agenda HPC Software Stack Visualization National Scientific Center 2 Agenda HPC Software Stack Alexandre Chauvin Typical HPC Software Stack Externes LAN Typical
More informationPrinciples and characteristics of distributed systems and environments
Principles and characteristics of distributed systems and environments Definition of a distributed system Distributed system is a collection of independent computers that appears to its users as a single
More informationRethinking SIMD Vectorization for In-Memory Databases
SIGMOD 215, Melbourne, Victoria, Australia Rethinking SIMD Vectorization for In-Memory Databases Orestis Polychroniou Columbia University Arun Raghavan Oracle Labs Kenneth A. Ross Columbia University Latest
More informationImproved LS-DYNA Performance on Sun Servers
8 th International LS-DYNA Users Conference Computing / Code Tech (2) Improved LS-DYNA Performance on Sun Servers Youn-Seo Roh, Ph.D. And Henry H. Fong Sun Microsystems, Inc. Abstract Current Sun platforms
More informationBuilding an Inexpensive Parallel Computer
Res. Lett. Inf. Math. Sci., (2000) 1, 113-118 Available online at http://www.massey.ac.nz/~wwiims/rlims/ Building an Inexpensive Parallel Computer Lutz Grosz and Andre Barczak I.I.M.S., Massey University
More informationVirtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies
Virtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies Kurt Klemperer, Principal System Performance Engineer kklemperer@blackboard.com Agenda Session Length:
More informationIntroducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child
Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.
More informationIntroduction to High Performance Cluster Computing. Cluster Training for UCL Part 1
Introduction to High Performance Cluster Computing Cluster Training for UCL Part 1 What is HPC HPC = High Performance Computing Includes Supercomputing HPCC = High Performance Cluster Computing Note: these
More informationCHAPTER 4 MARIE: An Introduction to a Simple Computer
CHAPTER 4 MARIE: An Introduction to a Simple Computer 4.1 Introduction 195 4.2 CPU Basics and Organization 195 4.2.1 The Registers 196 4.2.2 The ALU 197 4.2.3 The Control Unit 197 4.3 The Bus 197 4.4 Clocks
More informationFull and Para Virtualization
Full and Para Virtualization Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF x86 Hardware Virtualization The x86 architecture offers four levels
More informationA Flexible Cluster Infrastructure for Systems Research and Software Development
Award Number: CNS-551555 Title: CRI: Acquisition of an InfiniBand Cluster with SMP Nodes Institution: Florida State University PIs: Xin Yuan, Robert van Engelen, Kartik Gopalan A Flexible Cluster Infrastructure
More informationsupercomputing. simplified.
supercomputing. simplified. INTRODUCING WINDOWS HPC SERVER 2008 R2 SUITE Windows HPC Server 2008 R2, Microsoft s third-generation HPC solution, provides a comprehensive and costeffective solution for harnessing
More informationLattice QCD Performance. on Multi core Linux Servers
Lattice QCD Performance on Multi core Linux Servers Yang Suli * Department of Physics, Peking University, Beijing, 100871 Abstract At the moment, lattice quantum chromodynamics (lattice QCD) is the most
More informationChapter 1 Computer System Overview
Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Eighth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides
More informationComputers. Hardware. The Central Processing Unit (CPU) CMPT 125: Lecture 1: Understanding the Computer
Computers CMPT 125: Lecture 1: Understanding the Computer Tamara Smyth, tamaras@cs.sfu.ca School of Computing Science, Simon Fraser University January 3, 2009 A computer performs 2 basic functions: 1.
More informationLet s put together a Manual Processor
Lecture 14 Let s put together a Manual Processor Hardware Lecture 14 Slide 1 The processor Inside every computer there is at least one processor which can take an instruction, some operands and produce
More informationDepartment of Computer Sciences University of Salzburg. HPC In The Cloud? Seminar aus Informatik SS 2011/2012. July 16, 2012
Department of Computer Sciences University of Salzburg HPC In The Cloud? Seminar aus Informatik SS 2011/2012 July 16, 2012 Michael Kleber, mkleber@cosy.sbg.ac.at Contents 1 Introduction...................................
More informationDesign and Implementation of the Heterogeneous Multikernel Operating System
223 Design and Implementation of the Heterogeneous Multikernel Operating System Yauhen KLIMIANKOU Department of Computer Systems and Networks, Belarusian State University of Informatics and Radioelectronics,
More informationCPU Organisation and Operation
CPU Organisation and Operation The Fetch-Execute Cycle The operation of the CPU 1 is usually described in terms of the Fetch-Execute cycle. 2 Fetch-Execute Cycle Fetch the Instruction Increment the Program
More information