Awareness of MPI Virtual Process Topologies on the Single-Chip Cloud Computer

Steffen Christgau, Bettina Schnor
Potsdam University, Institute of Computer Science, Operating Systems and Distributed Systems
21 May 2012

Outline
  The Single-Chip Cloud Computer (SCC)
  RCKMPI: MPI on the SCC
  Topology-Awareness for RCKMPI: The concept
  Topology-Awareness for RCKMPI: Evaluation
  Summary and Future Work
Bettina Schnor (Potsdam University) MPI Topology Awareness on the SCC Frame 2 of 19

1. The Single-Chip Cloud Computer (SCC)
Many-core Applications Research Community (MARC)
  established by Intel
  joint research by universities and Intel
  world-wide community (dominated by European institutions)
  website, regular symposia (every 6 months, 5 so far)
Our group is a MARC member
  Focus: application scalability
  Experience with parallel ASP and climate simulation

Single-Chip Cloud Computer
[Figure: SCC tile and chip diagram, labelled with Core0 and Core1, an L2 cache of 256 KB per core, Router, MPB 16 KB, the memory controllers MC 0 to MC 3, VRC and SIF.]
  24 tiles, 48 P54C cores, connected via a Network-on-Chip, no cache coherence
  fast 16 KB SRAM on each tile: the Message Passing Buffer (MPB)

2. RCKMPI: MPI on the SCC
  fork of MPICH2
[Figure: MPICH2 software stack, labelled with Application, Message Passing Interface, MPICH2 (MPI implementation), Process Management Interface, ROMIO (MPI-IO implementation), ADIO, Abstract Device Interface (ADI3), the devices Channel-3, BG and Cray, and the channels Sock, Nemesis, SCCMPB, SCCSHM and SCCMulti.]

SCCMPB
  uses the fast Message Passing Buffer of each tile as shared memory
  and divides it into n equal-size Exclusive Write Sections (EWS)
  => remote write, local read
[Figure: MPI_COMM_WORLD with MPI ranks 0 to n running on cores 0 to n; the MPB of core/rank 0 is split into write sections labelled rank 1 to rank n.]
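The equal split into Exclusive Write Sections can be illustrated with a little arithmetic. The following is only a sketch under assumptions that the slides do not state: each core owns 8 KB of its tile's 16 KB MPB, sections are rounded down to whole 32-byte cache lines, and the names ews_size and ews_offset are hypothetical, not RCKMPI identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed constants, not taken from the slides. */
#define MPB_BYTES_PER_CORE 8192u  /* assumed: half of the 16 KB tile MPB */
#define CACHE_LINE_BYTES   32u    /* assumed cache line size */

/* Size of one Exclusive Write Section when the MPB is split into
 * nprocs equal parts, rounded down to whole cache lines. */
size_t ews_size(unsigned nprocs)
{
    assert(nprocs > 0);
    size_t raw = MPB_BYTES_PER_CORE / nprocs;
    return raw - raw % CACHE_LINE_BYTES;
}

/* Byte offset, inside a receiver's MPB, of the section that the
 * sender with the given rank may write to. */
size_t ews_offset(unsigned rank, unsigned nprocs)
{
    assert(rank < nprocs);
    return (size_t)rank * ews_size(nprocs);
}
```

Under these assumptions a section shrinks from 4096 bytes with 2 processes to 160 bytes with 48 processes, which matches the slides' point that the section size depends on the number of started processes.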

Comparison of different CH3 devices at maximum Manhattan distance
[Figure: bandwidth (MByte/s) over message size (Byte, up to 4 Mi) for the RCKMPI sccmulti, sccmpb and sccshm CH3 devices.]

Bandwidths for Manhattan distance 0, 5 and 8 (two processes started)
[Figure: bandwidth (MByte/s) over message size (Byte, up to 4 Mi) for three core pairs: core 00 and 01, core 00 and 10, and core 00 and a third core whose number is lost in the transcription.]
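The Manhattan distance between two cores can be computed from their core numbers. This is a minimal sketch, assuming the 24 tiles form a 6 x 4 mesh, tiles are numbered row by row, and cores 2t and 2t+1 sit on tile t; the function name manhattan_distance is hypothetical. Under these assumptions the pairs named on the slide come out as expected: cores 00 and 01 share a tile (distance 0) and cores 00 and 10 are 5 hops apart.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed mesh geometry, not stated on the slides. */
#define MESH_COLS      6
#define MESH_ROWS      4
#define CORES_PER_TILE 2
#define NUM_CORES      (MESH_COLS * MESH_ROWS * CORES_PER_TILE)

/* Number of mesh hops between the tiles of two cores. */
int manhattan_distance(int core_a, int core_b)
{
    assert(core_a >= 0 && core_a < NUM_CORES);
    assert(core_b >= 0 && core_b < NUM_CORES);

    int tile_a = core_a / CORES_PER_TILE;
    int tile_b = core_b / CORES_PER_TILE;

    /* Row-major tile numbering: column = tile % cols, row = tile / cols. */
    int dx = abs(tile_a % MESH_COLS - tile_b % MESH_COLS);
    int dy = abs(tile_a / MESH_COLS - tile_b / MESH_COLS);
    return dx + dy;
}
```

With this numbering, opposite corners of the mesh (for example cores 0 and 47) give the maximum distance of 8.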

Bandwidths for maximum Manhattan distance 8 and a varied number of MPI processes
[Figure: bandwidth (MByte/s) over message size (Byte, up to 4 Mi), one curve per process count, including 12, 24 and 48 MPI processes.]

Remember: SCCMPB uses the fast Message Passing Buffer of each tile as shared memory (384 KB in total).
[Figure: the MPB of core/rank 0, split into write sections for the ranks 1 to n of MPI_COMM_WORLD.]
The MPB is divided equally into n sections => the section size depends on the number of started MPI processes.

3. Topology Awareness for RCKMPI: The Concept
The bandwidth between two RCKMPI processes depends on the number of started MPI processes, since a fully connected network between all MPI processes is maintained.

Application Behaviour
[Figure: processes 1 to n of an application and the data they exchange (labels Tv, qv, Tu, qu, uu, vv, uv).]
Goal: The bandwidth between communicating processes, the so-called neighbors in the Task Interaction Graph, should be increased.

Requirements:
  1. An improved MPB layout must consider both communication neighbors and group communication (barriers, broadcasts, gather/scatter, ...).
  2. Each MPI process has to know its new offset within all remote MPBs.

Putting things together...
[Figure: three MPB layouts.
  Original layout: one write section per process (proc. 1, proc. 2, ..., proc. n-1, proc. n), each holding a channel header and message payload.
  New layout without topology information: channel headers p 1 to p n, then a payload area with sections for proc. 1, proc. 2, ..., proc. n-1, proc. n.
  New layout with topology information: channel headers p 1 to p n, then a payload area with sections labelled p1, proc. 2 and pn.]
Internal barrier for the recalculation phase of the new MPB addresses.
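The effect of the topology-aware layout can be sketched numerically. The code below is an illustration only, not the RCKMPI implementation. It assumes an 8 KB MPB per core, 32-byte cache lines, a fixed number of header cache lines for every process, and a payload area shared equally among the topology neighbors. How non-neighbors transfer their data is not modelled, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed constants, not taken from the slides. */
#define MPB_BYTES_PER_CORE 8192u
#define CACHE_LINE_BYTES   32u

/* Bytes reserved for the channel headers of all nprocs processes. */
size_t header_area_size(unsigned nprocs, unsigned header_lines)
{
    return (size_t)nprocs * header_lines * CACHE_LINE_BYTES;
}

/* Payload bytes per neighbor when the space behind the headers is
 * shared only among the nneighbors topology neighbors. */
size_t neighbor_payload_size(unsigned nprocs, unsigned header_lines,
                             unsigned nneighbors)
{
    size_t headers = header_area_size(nprocs, header_lines);
    assert(nneighbors > 0 && headers < MPB_BYTES_PER_CORE);

    size_t raw = (MPB_BYTES_PER_CORE - headers) / nneighbors;
    return raw - raw % CACHE_LINE_BYTES;  /* whole cache lines only */
}

/* Offset of the payload section of the k-th neighbor (0-based). */
size_t neighbor_payload_offset(unsigned nprocs, unsigned header_lines,
                               unsigned nneighbors, unsigned k)
{
    assert(k < nneighbors);
    return header_area_size(nprocs, header_lines)
         + (size_t)k * neighbor_payload_size(nprocs, header_lines, nneighbors);
}
```

Under these assumptions, 48 processes with 2 header cache lines and 2 neighbors leave 2560 payload bytes per neighbor, compared with a section of at most 170 bytes when the same 8 KB is split equally among all 48 processes.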

MPI offers an API to specify a virtual process topology:

    #include <mpi.h>
    #include <stdbool.h>   /* for the `true` passed as reorder flag */

    #define NUM_DIMS       /* value lost in transcription */

    int grid_dims[NUM_DIMS];
    int grid_periods[NUM_DIMS];
    MPI_Comm comm_topo;

    /* for a grid, set all items of grid_periods to 0 */
    for (int i = 0; i < NUM_DIMS; i++) {
        grid_periods[i] = 0;
        grid_dims[i] = 0;  /* 0 lets MPI_Dims_create choose this extent */
    }

    MPI_Dims_create(numprocs, NUM_DIMS, grid_dims);
    MPI_Cart_create(MPI_COMM_WORLD, NUM_DIMS, grid_dims,
                    grid_periods, true, &comm_topo);
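Once a Cartesian topology is known, the communication neighbors of a rank follow from its grid coordinates. The sketch below shows that calculation without calling MPI. It relies on the row-major rank numbering that MPI uses for Cartesian topologies and handles only non-periodic grids; the function grid_neighbor and the constant NO_NEIGHBOR (standing in for MPI_PROC_NULL) are hypothetical names introduced for illustration. In an MPI program, MPI_Cart_shift provides this information.

```c
#include <assert.h>

#define NO_NEIGHBOR (-1)  /* stands in for MPI_PROC_NULL */

/* Rank of the process that lies `disp` steps along dimension `dim`
 * from `rank` in a non-periodic grid with row-major rank order.
 * Returns NO_NEIGHBOR if that position is outside the grid. */
int grid_neighbor(int rank, int ndims, const int dims[], int dim, int disp)
{
    assert(ndims > 0 && dim >= 0 && dim < ndims);

    /* Stride of `dim`: product of the extents of all later dimensions. */
    int stride = 1;
    for (int d = ndims - 1; d > dim; d--)
        stride *= dims[d];

    int coord = (rank / stride) % dims[dim];
    int target = coord + disp;
    if (target < 0 || target >= dims[dim])
        return NO_NEIGHBOR;
    return rank + disp * stride;
}
```

For an 8 x 6 grid of 48 processes, rank 0 has the neighbors 1 and 6 and no neighbor in the negative directions, so only those sections would need a large payload area.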

4. Topology Awareness for RCKMPI: Evaluation
[Figure: bandwidth (MByte/s) over message size (Byte, up to 4 Mi) for three configurations with 48 processes: enhanced RCKMPI with 1D topology and 2 cache lines, enhanced RCKMPI with 1D topology and 3 cache lines, and enhanced RCKMPI without topology.]

Results for a 2D CFD application with ring topology:
[Figure: process 1, process 2, ..., process n arranged according to the ring topology.]

[Figure: speedup over number of processes for enhanced RCKMPI with topology information (2 CL) and original RCKMPI.]

Summary and Future Work
  The SCC is equipped with a fast NoC.
  The Message Passing Buffer on each tile is beneficial for fast communication.
  MPI applications gain bandwidth by using MPI's virtual process topologies and rearranging the MPB layout.
Current/Future Work:
  Comparison with I. C. Ureña and M. Gerndt: "Improved RCKMPI's SCCMPB Channel: Scaling and Dynamic Processes Support", ARCS 2012
  Fixed the one-sided communication in RCKMPI => support of applications based on Global Arrays


More information

Technical Computing Suite Job Management Software

Technical Computing Suite Job Management Software Technical Computing Suite Job Management Software Toshiaki Mikamo Fujitsu Limited Supercomputer PRIMEHPC FX10 PRIMERGY x86 cluster Outline System Configuration and Software Stack Features The major functions

More information

Performance Characteristics of a Cost-Effective Medium-Sized Beowulf Cluster Supercomputer

Performance Characteristics of a Cost-Effective Medium-Sized Beowulf Cluster Supercomputer Res. Lett. Inf. Math. Sci., 2003, Vol.5, pp 1-10 Available online at http://iims.massey.ac.nz/research/letters/ 1 Performance Characteristics of a Cost-Effective Medium-Sized Beowulf Cluster Supercomputer

More information

Scalable Source Routing

Scalable Source Routing Scalable Source Routing January 2010 Thomas Fuhrmann Department of Informatics, Self-Organizing Systems Group, Technical University Munich, Germany Routing in Networks You re there. I m here. Scalable

More information

Scientific application deployment on Cloud: A Topology-Aware Method

Scientific application deployment on Cloud: A Topology-Aware Method CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 212; :1 2 Published online in Wiley InterScience (www.interscience.wiley.com). Scientific application deployment

More information

Efficient Built-In NoC Support for Gather Operations in Invalidation-Based Coherence Protocols

Efficient Built-In NoC Support for Gather Operations in Invalidation-Based Coherence Protocols Universitat Politècnica de València Master Thesis Efficient Built-In NoC Support for Gather Operations in Invalidation-Based Coherence Protocols Author: Mario Lodde Advisor: Prof. José Flich Cardo A thesis

More information

A Generic Network Interface Architecture for a Networked Processor Array (NePA)

A Generic Network Interface Architecture for a Networked Processor Array (NePA) A Generic Network Interface Architecture for a Networked Processor Array (NePA) Seung Eun Lee, Jun Ho Bahn, Yoon Seok Yang, and Nader Bagherzadeh EECS @ University of California, Irvine Outline Introduction

More information

Lecture 2 Parallel Programming Platforms

Lecture 2 Parallel Programming Platforms Lecture 2 Parallel Programming Platforms Flynn s Taxonomy In 1966, Michael Flynn classified systems according to numbers of instruction streams and the number of data stream. Data stream Single Multiple

More information

Diablo and VMware TM powering SQL Server TM in Virtual SAN TM. A Diablo Technologies Whitepaper. May 2015

Diablo and VMware TM powering SQL Server TM in Virtual SAN TM. A Diablo Technologies Whitepaper. May 2015 A Diablo Technologies Whitepaper Diablo and VMware TM powering SQL Server TM in Virtual SAN TM May 2015 Ricky Trigalo, Director for Virtualization Solutions Architecture, Diablo Technologies Daniel Beveridge,

More information

Performance Analysis and Optimization Tool

Performance Analysis and Optimization Tool Performance Analysis and Optimization Tool Andres S. CHARIF-RUBIAL andres.charif@uvsq.fr Performance Analysis Team, University of Versailles http://www.maqao.org Introduction Performance Analysis Develop

More information

A Comparison of Three MPI Implementations for Red Storm

A Comparison of Three MPI Implementations for Red Storm A Comparison of Three MPI Implementations for Red Storm Ron Brightwell Scalable Computing Systems Sandia National Laboratories P.O. Box 58 Albuquerque, NM 87185-111 rbbrigh@sandia.gov Abstract. Cray Red

More information

Analysis of Power Management Strategies for a Single-Chip Cloud Computer

Analysis of Power Management Strategies for a Single-Chip Cloud Computer Analysis of Power Management Strategies for a Single-Chip Cloud Computer Robert Eksten, Kameswar Rao Vaddina, Pasi Liljeberg and Juha Plosila Turku Center for Computer Science (TUCS), Joukahaisenkatu 3-5,

More information

ADVANCED NETWORK CONFIGURATION GUIDE

ADVANCED NETWORK CONFIGURATION GUIDE White Paper ADVANCED NETWORK CONFIGURATION GUIDE CONTENTS Introduction 1 Terminology 1 VLAN configuration 2 NIC Bonding configuration 3 Jumbo frame configuration 4 Other I/O high availability options 4

More information

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework

More information

Implementing MPI-IO Shared File Pointers without File System Support

Implementing MPI-IO Shared File Pointers without File System Support Implementing MPI-IO Shared File Pointers without File System Support Robert Latham, Robert Ross, Rajeev Thakur, Brian Toonen Mathematics and Computer Science Division Argonne National Laboratory Argonne,

More information

Load Balancing Support for Grid-enabled Applications

Load Balancing Support for Grid-enabled Applications John von Neumann Institute for Computing Load Balancing Support for Grid-enabled Applications S. Rips published in Parallel Computing: Current & Future Issues of High-End Computing, Proceedings of the

More information

Accelerating the Data Plane With the TILE-Mx Manycore Processor

Accelerating the Data Plane With the TILE-Mx Manycore Processor Accelerating the Data Plane With the TILE-Mx Manycore Processor Bob Doud Director of Marketing EZchip Linley Data Center Conference February 25 26, 2015 1 Announcing the World s First 100-Core A 64-Bit

More information

Cray Gemini Interconnect. Technical University of Munich Parallel Programming Class of SS14 Denys Sobchyshak

Cray Gemini Interconnect. Technical University of Munich Parallel Programming Class of SS14 Denys Sobchyshak Cray Gemini Interconnect Technical University of Munich Parallel Programming Class of SS14 Denys Sobchyshak Outline 1. Introduction 2. Overview 3. Architecture 4. Gemini Blocks 5. FMA & BTA 6. Fault tolerance

More information

QoSIP: A QoS Aware IP Routing Protocol for Multimedia Data

QoSIP: A QoS Aware IP Routing Protocol for Multimedia Data QoSIP: A QoS Aware IP Routing Protocol for Multimedia Data Md. Golam Shagadul Amin Talukder and Al-Mukaddim Khan Pathan* Department of Computer Science and Engineering, Metropolitan University, Sylhet,

More information

Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi- Processors. NoCArc 09

Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi- Processors. NoCArc 09 Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi- Processors NoCArc 09 Jesús Camacho Villanueva, José Flich, José Duato Universidad Politécnica de Valencia December 12,

More information

In-Memory Computing for Iterative CPU-intensive Calculations in Financial Industry In-Memory Computing Summit 2015

In-Memory Computing for Iterative CPU-intensive Calculations in Financial Industry In-Memory Computing Summit 2015 In-Memory Computing for Iterative CPU-intensive Calculations in Financial Industry In-Memory Computing Summit 2015 June 29-30, 2015 Contacts Alexandre Boudnik Senior Solution Architect, EPAM Systems Alexandre_Boudnik@epam.com

More information

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.

More information

PRIMERGY server-based High Performance Computing solutions

PRIMERGY server-based High Performance Computing solutions PRIMERGY server-based High Performance Computing solutions PreSales - May 2010 - HPC Revenue OS & Processor Type Increasing standardization with shift in HPC to x86 with 70% in 2008.. HPC revenue by operating

More information

SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011

SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011 SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications Jürgen Primsch, SAP AG July 2011 Why In-Memory? Information at the Speed of Thought Imagine access to business data,

More information

Interconnection Networks. Interconnection Networks. Interconnection networks are used everywhere!

Interconnection Networks. Interconnection Networks. Interconnection networks are used everywhere! Interconnection Networks Interconnection Networks Interconnection networks are used everywhere! Supercomputers connecting the processors Routers connecting the ports can consider a router as a parallel

More information

Operating System Support for Multiprocessor Systems-on-Chip

Operating System Support for Multiprocessor Systems-on-Chip Operating System Support for Multiprocessor Systems-on-Chip Dr. Gabriel marchesan almeida Agenda. Introduction. Adaptive System + Shop Architecture. Preliminary Results. Perspectives & Conclusions Dr.

More information

Using IPv6 and 6LoWPAN for Home Automation Networks

Using IPv6 and 6LoWPAN for Home Automation Networks Using IPv6 and 6LoWPAN for Home Automation Networks Thomas Scheffler / Bernd Dörge ICCE-Berlin Berlin, 06.09.2011 Overview IPv6 and 6LoWPAN for Home Automation Networks 6LoWPAN Application & Network Architecture

More information

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters COSC 6374 Parallel I/O (I) I/O basics Fall 2012 Concept of a clusters Processor 1 local disks Compute node message passing network administrative network Memory Processor 2 Network card 1 Network card

More information

Performance Evaluation of the XDEM framework on the OpenStack Cloud Computing Middleware

Performance Evaluation of the XDEM framework on the OpenStack Cloud Computing Middleware Performance Evaluation of the XDEM framework on the OpenStack Cloud Computing Middleware 1 / 17 Performance Evaluation of the XDEM framework on the OpenStack Cloud Computing Middleware X. Besseron 1 V.

More information

Microsoft SMB 2.2 - Running Over RDMA in Windows Server 8

Microsoft SMB 2.2 - Running Over RDMA in Windows Server 8 Microsoft SMB 2.2 - Running Over RDMA in Windows Server 8 Tom Talpey, Architect Microsoft March 27, 2012 1 SMB2 Background The primary Windows filesharing protocol Initially shipped in Vista and Server

More information

Load Imbalance Analysis

Load Imbalance Analysis With CrayPat Load Imbalance Analysis Imbalance time is a metric based on execution time and is dependent on the type of activity: User functions Imbalance time = Maximum time Average time Synchronization

More information

Interconnection Network

Interconnection Network Interconnection Network Recap: Generic Parallel Architecture A generic modern multiprocessor Network Mem Communication assist (CA) $ P Node: processor(s), memory system, plus communication assist Network

More information

Performance analysis with Periscope

Performance analysis with Periscope Performance analysis with Periscope M. Gerndt, V. Petkov, Y. Oleynik, S. Benedict Technische Universität München September 2010 Outline Motivation Periscope architecture Periscope performance analysis

More information

D1.2 Network Load Balancing

D1.2 Network Load Balancing D1. Network Load Balancing Ronald van der Pol, Freek Dijkstra, Igor Idziejczak, and Mark Meijerink SARA Computing and Networking Services, Science Park 11, 9 XG Amsterdam, The Netherlands June ronald.vanderpol@sara.nl,freek.dijkstra@sara.nl,

More information

Architecture of distributed network processors: specifics of application in information security systems

Architecture of distributed network processors: specifics of application in information security systems Architecture of distributed network processors: specifics of application in information security systems V.Zaborovsky, Politechnical University, Sait-Petersburg, Russia vlad@neva.ru 1. Introduction Modern

More information

Workshop on Parallel and Distributed Scientific and Engineering Computing, Shanghai, 25 May 2012

Workshop on Parallel and Distributed Scientific and Engineering Computing, Shanghai, 25 May 2012 Scientific Application Performance on HPC, Private and Public Cloud Resources: A Case Study Using Climate, Cardiac Model Codes and the NPB Benchmark Suite Peter Strazdins (Research School of Computer Science),

More information