High Performance Computing in the Multi-core Area

Similar documents

Parallel Programming Survey

Supercomputing Status und Trends (Conference Report) Peter Wegner

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

SOC architecture and design

Parallel Computing. Benson Muite. benson.

Introduction to Cloud Computing

Multi-Threading Performance on Commodity Multi-Core Processors

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)

Cray Gemini Interconnect. Technical University of Munich Parallel Programming Class of SS14 Denys Sobchyshak

Performance Characteristics of Large SMP Machines

OpenMP Programming on ScaleMP

An Introduction to Parallel Computing/ Programming

FPGA-based Multithreading for In-Memory Hash Joins

OBJECTIVE ANALYSIS WHITE PAPER MATCH FLASH. TO THE PROCESSOR Why Multithreading Requires Parallelized Flash ATCHING

Study Plan Masters of Science in Computer Engineering and Networks (Thesis Track)

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Systolic Computing. Fundamentals

Performance Analysis and Optimization Tool

High Performance Computing in CST STUDIO SUITE

Multi-core and Linux* Kernel

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

A New, High-Performance, Low-Power, Floating-Point Embedded Processor for Scientific Computing and DSP Applications

Introduction to GPGPU. Tiziano Diamanti

and RISC Optimization Techniques for the Hitachi SR8000 Architecture

Sun Constellation System: The Open Petascale Computing Architecture

Design and Implementation of the Heterogeneous Multikernel Operating System

Cloud Data Center Acceleration 2015

OC By Arsene Fansi T. POLIMI

Optimizing Code for Accelerators: The Long Road to High Performance

VLIW Processors. VLIW Processors

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

Enterprise Applications

Symmetric Multiprocessing

SGI High Performance Computing

Itanium 2 Platform and Technologies. Alexander Grudinski Business Solution Specialist Intel Corporation

Appro Supercomputer Solutions Best Practices Appro 2012 Deployment Successes. Anthony Kenisky, VP of North America Sales

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

Architecture of Hitachi SR-8000

CFD Implementation with In-Socket FPGA Accelerators

Altix Usage and Application Programming. Welcome and Introduction

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Why the Network Matters

Photonic Networks for Data Centres and High Performance Computing

How System Settings Impact PCIe SSD Performance

White Paper The Numascale Solution: Extreme BIG DATA Computing

numascale White Paper The Numascale Solution: Extreme BIG DATA Computing Hardware Accellerated Data Intensive Computing By: Einar Rustad ABSTRACT

HPC Software Requirements to Support an HPC Cluster Supercomputer

FPGA Acceleration using OpenCL & PCIe Accelerators MEW 25

Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May ILP Execution

Petascale Software Challenges. William Gropp

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage

Topological Properties

Intel Cluster Ready Appro Xtreme-X Computers with Mellanox QDR Infiniband

YALES2 porting on the Xeon- Phi Early results

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

CISC, RISC, and DSP Microprocessors

Experiences With Mobile Processors for Energy Efficient HPC

Driving force. What future software needs. Potential research topics

OpenSoC Fabric: On-Chip Network Generator

Enabling Technologies for Distributed Computing

Chapter 2 Parallel Architecture, Software And Performance

Exascale Challenges and General Purpose Processors. Avinash Sodani, Ph.D. Chief Architect, Knights Landing Processor Intel Corporation

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

Chapter 2 Parallel Computer Architecture

New Dimensions in Configurable Computing at runtime simultaneously allows Big Data and fine Grain HPC

Hadoop on the Gordon Data Intensive Cluster

CS 159 Two Lecture Introduction. Parallel Processing: A Hardware Solution & A Software Challenge

High Performance Computing, an Introduction to

Intel Itanium Architecture

On-Chip Interconnection Networks Low-Power Interconnect

White Paper The Numascale Solution: Affordable BIG DATA Computing

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems

High Performance Computing. Course Notes HPC Fundamentals

Parallel Computing 37 (2011) Contents lists available at ScienceDirect. Parallel Computing. journal homepage:

Agenda. Michele Taliercio, Il circuito Integrato, Novembre 2001

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

CMSC 611: Advanced Computer Architecture

Intel Xeon Processor E5-2600

OPTIMIZING SERVER VIRTUALIZATION

Annotation to the assignments and the solution sheet. Note the following points

Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes

HP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief

EMC XtremSF: Delivering Next Generation Performance for Oracle Database

Petascale Visualization: Approaches and Initial Results

Xeon+FPGA Platform for the Data Center

Architectures and Platforms

SeaMicro SM Server

Trends in High-Performance Computing for Power Grid Applications

Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat

Big Data Functionality for Oracle 11 / 12 Using High Density Computing and Memory Centric DataBase (MCDB) Frequently Asked Questions

Transcription:

High Performance Computing in the Multi-core Area Arndt Bode Technische Universität München

Technology Trends for Petascale Computing Architectures: Multicore Accelerators Special Purpose Reconfigurable Memory and Cache Network Secondary Storage Computing and Software: Programming Models Heterogeneous vs. Homogeneous Compilers and Threading Operating Systems Tools Adaptivity Applications: Scalability 1

Trends in Microprocessor Technology Compute Performance of COTS Microprocessors depends of - Clock Frequency - Computer Organization Clock Frequency: - Exponential Increase impossible (Power, Cooling, 135 W/Chip) - Partly Solutions: Clock- and Power reduction, Sleep Transistors,... - New Goal for Optimization: Energy-Efficiency: MIPS, MFLOPS per W Computer Organization: ILP, Processor-internal Organization fully exploited - Pipelining, Superscalar, VLIW, Wordlength,..., contra-productive to Energy- Efficiency (Speculation) - Future is Processor/Thread level-parallelism (Parallelism of Simpler Units) - Multi-Core and Application-specific Processors, Fault Tolerance 2

Energy - Efficiency P ~ A C V 2 f P: Power A: Activity Factor (Active Transistors on Chip) C: Total Capacity V: Voltage F: Clock Frequency 3

Multi-Core and Multithreading...... P 0 P 1 P 2 P 7 0 1 4 5 8 9 28 29 2 3 6 7 10 11 30 31 FLC 0 FLC 1 FLC 2...... FLC 7 SLC (shared) (gemeinsam) Multicore-Chip 4

Multi-Core Architectures: A Parallel World for Everybody virtual processes VP 0 Compiler: Generation of parallel virtual processes...... VP 1 VP 2 VP k Virtualization at System Level Load Balancing / Fault Tolerance / Placement Multi-Core Server P 0 0 1 2 3 MMX...... DSP P 1 P 2 P 7 4 5 6 7 8 9 10 11 28 29 30 31... faulty MMX DSP P 0 0 1 2 3 P 1 P 2... P 7 4 5 6 7 8 9 10 11... 28 29 30 31 FLC 0 FLC 1 FLC 2 FLC 7 FLC 0 FLC 1 FLC 2 FLC 7 SLC (shared) Multicore-Chip 0 SLC (shared) Multicore-Chip n Memory 5

6

Processor-Design Options for Petaflop Systems Mega-Multicore Systems: General Purpose Applications (Itanium, Xeon, Power-7,... ) Multicore and Attached Accelerators (Cell, ClearSpeed, NVIDIA,...) Special Purpose (BlueGene, QCD, POLARIS,...) Reconfigurable and Adaptive (FPGA: continuous research) General Purpose vs. Special Purpose - Programmability - High Performance - Scalability - Low Power - Applicability History: Special Purpose volatile, Programming Interface not Compatible Microprogrammable Devices : Vertical Migration and Bitslice Early Microprocessors : Coprocessors (FP, IO, DSP,...) Early HPC : Vector Processor Attachments, Array- Processors, Associative Processors 7

Petaflop Processor Options: Bode s Oracle Two Types of Systems will persist: General Purpose Mega-Multicore (Programmability, Compatibility, Cost) Special Purpose Systems for Large Specific Application Classes (Energy Efficiency) The Systems will be Highly Parallel Thomas Sterling: Multicore and Heterogeneous is Disruptive Technology Tsugio Makimoto: Makimoto s Wave predicts Customized for 2007 2017 (Pendulum: 10 14 km) 8

9

10

Consequences for Computer Architecture Memory-Bottleneck Today: P Cache-Hierarchy M Multi-Core: P P P... P Bottleneck intensified M 11

Cache Organizations Distributed P 0 P 1 P 2...... P 7 Shared 0 1 2 3 4 5 6 7 8 9 10 11 28 29 30 31 Groupwise shared FLC 0 FLC 1 FLC 2...... FLC 7 SLC (shared) 12

Communication Bottleneck classical SMP: P P 1 Interface / Proc. IC M SMP based on Multi-Core P P... P P P...? M P n Interfaces / Multi-Core IC (e. g.: n cores 5n Interfaces mesh-structure) 13

HLRB II: SGI ALtix 4700 / 9728 Montecito cores 11 m 22 m 14

Blade Core 1 Itanium2 (Madison) Socket Itanium2 Dual Core Core 2 8.5 GB/s Memory Controller (Shub 2.0) 6.4 GB/s 6.4 GB/s NL Port Plane 1 NL Port Plane 2 4 GByte per Compute Core 15

Dual Socket Blades - Density Compute Blades Core 1 Itanium2 Dual Core Socket A Core 2 Core 1 Itanium2 Dual Core Socket A Core 2 8.5 GB/s Memory Controller (Shub 2.0) 4 GByte per Compute Core 6.4 GB/s 6.4 GB/s NL Port Plane 1 NL Port Plane 2 16

One 256 Socket Partition: ccnuma Shared Memory Single System Image, Fat Tree Connection 32*6.4 GByte/s = 204.8 Gbyte/s 32*6.4 GByte/s = 204.8 Gbyte/s 2*0.8 Gbyte/s per Socket pair 32*6.4 GByte/s = 204.8 Gbyte/s 17

2-D Torus Mesh topology. Rank 18

HLRB II Interconnect Fat Tree Topology 19

Topology Batch Scheduler must be topology aware PBSpro 8.0 early access Fat Tree Placements of jobs is optimized according to topology Topology is stored in placement sets 2D-Torus 20

Petascale Software Key Issue is use of Excess Parallelism: Programming Model, Language and Compiler Conventional Parallelizing (Including JIT) Parallelizing with Directives New Parallel Languages (Transactional Memory,...) Use of Threads Functional Speculative Assist/Helper (Prefetching, Monitoring, Debugging, Tools, Virtualization, Security, FT-Lockstep,...) Scalable Tool Models 21

You tell us! This is the Academic Forum, so you tell us. What are the solutions? Where are the new language ideas? Can you design a statically checked race-free language which is useful? Can naїve users really use functional languages? Any language which talks about threads is too low level for most users. So how do we raise the language level? Should we be doing message passing inside the node? It s the only demonstrated way to achieve high scalability Do we really need to bring back Occam? James Cownie Intel Academic Forum, Budapest, 13.06.2007 22

Some Examples of Research at MMI Cache-Prefetching with Helper-Core (up to 39 %) Cache-Behavior with shared/separate Caches Helper Cores for Fault Tolerance 23

Prefetching Bandwidth obtainable for Matrix-Multiplication (Blocking with small amount of caching) Matrix Dimension 24

Synthetic Benchmark: Advantages of Shared Caches Latency for continuous Writes onto the same Memory area (2 cores) 25

Lockstepping of Virtual Machines Synchronized Identical Execution of Applications on 2 Processors Virtual Environment allows for Monitoring and Control by Second Core Minimal Performance Loss 26

Lockstepping of Virtual Machines 27

Bavarian Graduate School of Computational Engineering Elitenetzwerk Bayern Welcome to the Bavarian Graduate School of Computational Engineering Computational Engineering... refers to all activities in engineering that use computers as their main tool. Typical tasks in Computational Engineering are the solution of differential equations that model certain physical phenomena, the optimization of processes in engineering, or the stochastic simulation of a complex system. The Bavarian Graduate School... of Computational Engineering is an association of three Master programs: Computational Engineering (CE) at the Friedrich-Alexander-Universität Erlangen-Nürnberg, Computational Mechanics (COME), and Computational Science and Engineering (CSE) at the Technische Universität München. Our goal is to push forward the rapidly growing field of Computational Engineering by offering high-quality master programs for students who are interested in the field. The Elitenetzwerk Bayern... is an initiative of the state of Bavaria to support the education and advancement of highly talented students. With the help of the Elitenetzwerk Bayern, we are able to offer an "elite" degree program for the best students in our master programs. Outstanding performance in one of the three Master's programs will be honoured by the newly formed academic degree of a "Master of Science with Honours". For further information about these elite program, please read on in our section "Infomation for students". IGSSE BGCE is a partner of IGSSE, TUM's International Graduate School of Science and Engineering 28

Summary Multicore (and Heterogeneous): Disruptive Technologies Many Choices in System Architecture and Programming Models HPC Systems will be massively parallel: Scalability is Challenge for Application-Algorithms, System Software, Tools and Architectures Interesting Times Ahead! 29