Carlo Cavazzoni, HPC department, CINECA www.cineca.it

CINECA HPC Infrastructure: state of the art and road map Carlo Cavazzoni, HPC department, CINECA www.cineca.it

Installed HPC Engines
Eurora (Eurotech): hybrid cluster, 64 nodes, 1024 SandyBridge cores, 64 K20 GPUs, 64 Xeon PHI coprocessors, 150 TFlops peak
FERMI (IBM BGQ): 10240 nodes, 163840 PowerA2 cores, 2 PFlops peak
PLX (IBM iDataPlex): hybrid cluster, 274 nodes, 3288 Westmere cores, 548 NVIDIA M2070 (Fermi) GPUs, 300 TFlops peak

FERMI @ CINECA
PRACE Tier-0 System
Architecture: 10 BGQ Frame
Model: IBM-BG/Q
Processor Type: IBM PowerA2, 1.6 GHz
Computing Cores: 163840
Computing Nodes: 10240
RAM: 1 GByte / core
Internal Network: 5D Torus
Disk Space: 2 PByte of scratch space
Peak Performance: 2 PFlop/s
Available for ISCRA & PRACE call for projects

The PRACE RI provides access to distributed, persistent, pan-European, world-class HPC computing and data management resources and services. Expertise in efficient use of the resources is available through participating centers throughout Europe. Available resources are announced for each Call for Proposals.
Tiers: Tier 0 (European), Tier 1 (National), Tier 2 (Local)
Peer reviewed open access: PRACE Projects (Tier-0), PRACE Preparatory (Tier-0), DECI Projects (Tier-1)

1. Chip: 16 P cores
2. Single Chip Module
3. Compute card: one chip module, 16 GB DDR3 memory
4. Node card: 32 compute cards, optical modules, link chips, torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards w/16 GB, 8 PCIe Gen2 x8 slots
6. Rack: 2 midplanes
7. System: 20 PF/s
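As a quick consistency check, the packaging hierarchy above can be multiplied out to recover the node and core counts quoted for FERMI. This is a minimal sketch in Python, not part of the original slides: the constant and function names are illustrative, it assumes one compute node per compute card, it counts only the 16 application cores per chip, and it treats the "10 BGQ Frame" configuration as 10 racks.

```python
# Counts taken from the packaging hierarchy listed above.
CORES_PER_CHIP = 16                # application cores per chip
COMPUTE_CARDS_PER_NODE_CARD = 32
NODE_CARDS_PER_MIDPLANE = 16
MIDPLANES_PER_RACK = 2


def nodes_per_rack() -> int:
    """Compute nodes in one rack (assumption: one node per compute card)."""
    return (COMPUTE_CARDS_PER_NODE_CARD
            * NODE_CARDS_PER_MIDPLANE
            * MIDPLANES_PER_RACK)


def system_size(racks: int) -> tuple:
    """Return (nodes, cores) for a system built from `racks` racks."""
    nodes = racks * nodes_per_rack()
    return nodes, nodes * CORES_PER_CHIP


if __name__ == "__main__":
    # Assumption: FERMI's 10 frames correspond to 10 racks.
    nodes, cores = system_size(racks=10)
    print(f"{nodes} nodes, {cores} cores")
```

With 10 racks this gives 10240 nodes and 163840 cores, which agrees with the FERMI figures quoted earlier.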

BG/Q I/O architecture [figure labels: BG/Q compute racks, PCI_E, BG/Q IO, IB, Switch, IB, File system servers, IB, SAN]

I/O drawers: PCIe I/O nodes, 8 I/O nodes
At least one I/O node for each partition/job
Minimum partition/job size: 64 nodes, 1024 cores

PowerA2 chip, basic info
64-bit RISC processor
Power instruction set (Power1 to Power7, PowerPC)
4 floating point units per core & 4-way MT
16 cores + 1 + 1 (17th processor core for system functions)
1.6 GHz
32 MByte cache
System-on-a-chip design
16 GByte of RAM at 1.33 GHz
Peak performance: 204.8 gigaflops
Power draw of 55 watts
45 nanometer copper/SOI process (same as Power7)
Water cooled

PowerA2 FPU
Each FPU on each core has four pipelines
Executes scalar floating point instructions, four-wide SIMD instructions, and two-wide complex arithmetic SIMD instructions
Six-stage pipeline
Maximum of eight concurrent floating point operations per clock, plus a load and a store
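The peak figures quoted for the chip and for the whole machine follow from the core count, the clock, and the eight floating point operations per clock. The sketch below reproduces that arithmetic under those stated numbers; the function names and default arguments are illustrative, not taken from the slides.

```python
def chip_peak_gflops(cores: int = 16,
                     clock_ghz: float = 1.6,
                     flops_per_clock: int = 8) -> float:
    """Peak GFlop/s of one chip: cores * clock (GHz) * flops per clock."""
    return cores * clock_ghz * flops_per_clock


def system_peak_pflops(nodes: int) -> float:
    """Peak PFlop/s of a system with one chip per node."""
    return nodes * chip_peak_gflops() / 1e6  # 1 PFlop/s = 1e6 GFlop/s


if __name__ == "__main__":
    print(f"chip:   {chip_peak_gflops():.1f} GFlop/s")
    print(f"system: {system_peak_pflops(10240):.2f} PFlop/s")
```

For 16 cores at 1.6 GHz this yields 204.8 GFlop/s per chip, and for 10240 nodes about 2.1 PFlop/s, consistent with the 2 PFlop/s peak stated for FERMI.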

EURORA: #1 in the Green500 List, June 2013
What does EURORA stand for? EURopean many integrated core Architecture
What is EURORA? A prototype project funded by the PRACE 2IP EU project (grant agreement number RI-283493), co-designed by CINECA and EUROTECH
Where is EURORA? EURORA is installed at CINECA
When was EURORA installed? March 2013
Who is using EURORA? All Italian and EU researchers, through the PRACE prototype grant access program
3,200 MFLOPS/W, 30 kW

Why EURORA? (project objectives)
Address today's HPC constraints: Flops/Watt, Flops/m2, Flops/Dollar.
Efficient cooling technology: hot water cooling (free cooling); measure power efficiency, evaluate PUE & TCO.
Improve application performance: at the same rate as in the past (~Moore's Law); new programming models.
Evaluate hybrid (accelerated) technology: Intel Xeon Phi; NVIDIA Kepler.
Custom interconnection technology: 3D Torus network (FPGA); evaluation of accelerator-to-accelerator communications.

EURORA prototype configuration
64 compute cards
128 Xeon SandyBridge (2.1 GHz, 95 W and 3.1 GHz, 150 W)
16 GByte DDR3 1600 MHz per node
160 GByte SSD per node
1 FPGA (Altera Stratix V) per node
IB QDR interconnect
3D Torus interconnect
128 accelerator cards (NVIDIA K20 and Intel PHI)

Node card [figure labels: K20, Xeon PHI]

Node Energy Efficiency Decreases!

HPC Service

HPC Engines and HPC Services [diagram; labels as extracted]
FERMI (IBM BGQ): #12 Top500, 2 PFlops peak, 163840 cores, 163 TByte RAM, Power 1.6 GHz
Eurora (Eurotech hybrid): #1 Green500, 0.17 PFlops peak, 1024 x86 cores, 64 Intel PHI, 64 NVIDIA K20
PLX (IBM x86+GPU): 0.3 PFlops peak, ~3500 x86 procs, 548 NVIDIA GPU, 20 NVIDIA Quadro, 16 fat nodes
HPC workloads: PRACE, LISA, ISCRA, projects, agreements, industry, training, labs
Data processing workloads: high throughput, viz, big mem, DB, web services
HPC data store: Tape 1.5 PB, Repository 1.8 PByte, Workspace 3.6 PByte; data movers, processing
Cloud and front ends: NUBES cloud services, FEC, web, archive, FTP
External data sources: PRACE, EUDAT, labs, projects
Network: custom, IB, GbE, Fibre (connecting FERMI, EURORA, PLX, Store, Nubes, infrastructure, Internet)

CINECA services
High Performance Computing
Computational workflow
Storage
Data analytics
Data preservation (long term)
Data access (web/app)
Remote visualization
HPC training
HPC consulting
HPC hosting
Monitoring and metering
For academia and industry

Road Map

(Data centric) infrastructure (Q3 2014) [diagram; labels as extracted]
External data sources: PRACE, EUDAT, laboratories, Human Brain Prj, other data sources
Cloud service: SaaS APP
Core data store: Repository 5 PByte, Tape 5+ PByte, Workspace 3.6 PByte, new storage
Internal data sources
Core data processing: viz, big mem, DB, data mover, processing, web services, web, archive, FTP, analytics APP, new analytics
Scale-out data processing: FERMI, x86 cluster, parallel APP

New Tier 1 CINECA
Procurement Q3 2014
High-level system requirements
Electrical power draw: 400 kW
Physical size of the system: 5 racks
System peak performance (CPU+GPU): on the order of 1 PFlops
System peak performance (CPU only): on the order of 300 TFlops

Tier 1 CINECA
High-level system requirements
CPU architecture: Intel Xeon Ivy Bridge
Number of cores per CPU: 8 @ >3 GHz, or 12 @ 2.4 GHz. The choice of frequency and core count depends on the socket TDP, the system density, and the cooling capacity.
Number of servers: 500-600 (Peak perf = 600 * 2 sockets * 12 cores * 3 GHz * 8 Flop/clk = 345 TFlops). The number of servers may depend on the cost or on the geometry of the configuration in terms of the number of CPU-only nodes and CPU+GPU nodes.
GPU architecture: Nvidia K40
Number of GPUs: >500 (Peak perf = 700 * 1.43 TFlops = 1 PFlops). The number of GPU cards may depend on the cost or on the geometry of the configuration in terms of the number of CPU-only nodes and CPU+GPU nodes.
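The two peak estimates on this slide are simple products, and it can help to see them next to the two CPU options actually listed. The sketch below is only an illustration of that arithmetic: the function names are hypothetical, and the 8 Flop/clk and 1.43 TFlops per K40 values are the ones quoted on the slide. Note that the slide's own CPU formula combines 12 cores with 3 GHz, so it reads as an upper bound rather than as either listed option.

```python
def cpu_peak_tflops(servers: int, sockets: int, cores: int,
                    clock_ghz: float, flops_per_clock: int = 8) -> float:
    """CPU peak in TFlops: servers * sockets * cores * GHz * Flop/clk."""
    gflops = servers * sockets * cores * clock_ghz * flops_per_clock
    return gflops / 1000.0  # GFlops -> TFlops


def gpu_peak_tflops(gpus: int, tflops_per_gpu: float = 1.43) -> float:
    """GPU peak in TFlops, using the per-card value quoted on the slide."""
    return gpus * tflops_per_gpu


if __name__ == "__main__":
    # Formula exactly as written on the slide (12 cores at 3 GHz).
    print(f"slide formula: {cpu_peak_tflops(600, 2, 12, 3.0):.1f} TFlops")
    # The two CPU options the slide lists.
    print(f"8 cores @ 3.0 GHz:  {cpu_peak_tflops(600, 2, 8, 3.0):.1f} TFlops")
    print(f"12 cores @ 2.4 GHz: {cpu_peak_tflops(600, 2, 12, 2.4):.1f} TFlops")
    print(f"700 GPUs: {gpu_peak_tflops(700):.0f} TFlops")
```

Under these inputs the slide formula gives about 345.6 TFlops, while the two listed CPU options come out lower (about 230 and 276 TFlops for 600 servers); 700 GPUs give about 1 PFlops.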

Tier 1 CINECA
High-level system requirements
Identified vendors: IBM, Eurotech
DRAM memory: 1 GByte/core. The option of having a subset of nodes with a larger amount of memory will be requested.
Local non-volatile memory: >500 GByte SSD/HD, depending on the cost and the configuration of the system
Cooling: liquid cooling system with a free cooling option
Scratch disk space: >300 TByte (provided by CINECA)

Roadmap 50PFlops [timeline diagram, 2013 to 2020; labels as extracted]
Power consumption: EURORA 50 kW, PLX 350 kW, BGQ 1000 kW + ENI; then EURORA or PLX upgrade 400 kW, BGQ 1000 kW, data repository 200 kW, - ENI
R&D: Eurora; EuroExa STM / ARM board; EuroExa STM / ARM prototype; PCP proto, 1 PF in a rack; EuroExa STM / ARM PF platform; ETP proto, towards exascale board
Deployment: Eurora industrial prototype, 150 TF; Eurora or PLX upgrade, 1 PF peak, 350 TF scalar; multi-petaflop system; Tier-0, 50 PF; Tier-1, towards exascale
Time line: 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020

Roadmap to Exascale (architectural trends)

HPC architectures: two models
Hybrid: server class processors (server class nodes, special purpose nodes); accelerator devices (Nvidia, Intel, AMD, FPGA)
Homogeneous: server class nodes (standard processors); special purpose nodes (special purpose processors)

Architectural trends
Peak performance: Moore's law
FPU performance: Dennard's law
Number of FPUs: Moore + Dennard
App. parallelism: Amdahl's law

Programming models
Fundamental paradigms: message passing, multi-threads
Consolidated standards: MPI & OpenMP
New task-based programming models
Special purpose for accelerators: CUDA, Intel offload directives, OpenACC, OpenCL, etc. NO consolidated standard
Scripting: Python

But! 14 nm VLSI, 0.54 nm Si lattice: 300 atoms! There will still be 4~6 cycles (or technology generations) left until we reach 11 ~ 5.5 nm technologies, at which point we will reach the downscaling limit, in some year between 2020 and 2030 (H. Iwai, IWJT2008).

Thank you

Dennard scaling law (downscaling), going from the old VLSI generation (unprimed) to the new one (primed):
$L' = L / 2$
$V' = V / 2$
$F' = F \cdot 2$
$D' = 1 / L'^2 = 4 D$
$P' = P$
These relations do not hold anymore! The core frequency and performance no longer grow following Moore's law:
$L' = L / 2$
$V' \approx V$
$F' \approx F \cdot 2$
$D' = 1 / L'^2 = 4 D$
$P' = 4 P$
The power crisis! Increase the number of cores to keep the evolution of architectures on Moore's law. Programming crisis!
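The two outcomes for power (unchanged under classic Dennard scaling, four times higher once voltage stops scaling) can be reproduced with a small model. The sketch below is an illustration, not the author's derivation: it assumes dynamic power per transistor proportional to C * V^2 * F, with capacitance C proportional to the feature size L and transistor density proportional to 1 / L^2. The function name and parameters are hypothetical.

```python
def relative_power_density(voltage_factor: float,
                           frequency_factor: float,
                           length_factor: float = 0.5) -> float:
    """Power per unit area of the new generation relative to the old one.

    Assumed model (not from the slides): power per transistor ~ C * V^2 * F,
    capacitance C ~ L, transistor density D ~ 1 / L^2.
    """
    capacitance = length_factor                  # C' / C
    density = 1.0 / length_factor ** 2           # D' / D
    per_transistor = capacitance * voltage_factor ** 2 * frequency_factor
    return density * per_transistor              # P' / P


if __name__ == "__main__":
    # Classic Dennard scaling: V halves, F doubles.
    print("Dennard holds:  P'/P =", relative_power_density(0.5, 2.0))
    # Voltage no longer scales: V stays constant, F still doubles.
    print("Dennard broken: P'/P =", relative_power_density(1.0, 2.0))
```

Under this model the first case gives a ratio of 1 and the second a ratio of 4, matching the two sets of relations on the slide.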

Moore's Law: economic and market law
Stacy Smith, Intel's chief financial officer, later gave some more detail on the economic benefits of staying on the Moore's Law race. "The cost per chip is going down more than the capital intensity is going up," Smith said, suggesting Intel's profit margins should not suffer because of heavy capital spending. "This is the economic beauty of Moore's Law." And Intel has a good handle on the next production shift, shrinking circuitry to 10 nanometers. Holt said the company has test chips running on that technology. "We are projecting similar kinds of improvements in cost out to 10 nanometers," he said. So, despite the challenges, Holt could not be induced to say there's any looming end to Moore's Law, the invention race that has been a key driver of electronics innovation since first defined by Intel's co-founder in the mid-1960s. (From WSJ)
It is all about the number of chips per Si wafer!

What about applications?
In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law).
The maximum speedup tends to $1 / (1 - P)$, where $P$ is the parallel fraction.
For 1,000,000 cores: $P = 0.999999$, serial fraction $= 0.000001$.
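Amdahl's law can be evaluated directly to see how demanding the million-core case is. The sketch below uses the standard finite-core form, speedup = 1 / ((1 - P) + P / N), whose limit for large N is 1 / (1 - P); the function names are illustrative.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup on `cores` cores for a given parallel fraction P."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def max_speedup(parallel_fraction: float) -> float:
    """Upper bound 1 / (1 - P), reached as the core count grows without limit.

    Requires P < 1; P == 1 would mean unbounded speedup.
    """
    return 1.0 / (1.0 - parallel_fraction)


if __name__ == "__main__":
    p, n = 0.999999, 1_000_000
    print(f"speedup on {n} cores: {amdahl_speedup(p, n):,.0f}")
    print(f"upper bound:          {max_speedup(p):,.0f}")
```

Even with a serial fraction of only one millionth, the speedup on 1,000,000 cores is roughly 500,000, about half of the ideal value, while the asymptotic bound is about 1,000,000.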

HPC architectures: two models
Hybrid, but ...
Homogeneous, but ...
Which 100 PFlops systems will we see? My guess:
IBM (hybrid): Power8 + Nvidia GPU
Cray (homo/hybrid): with Intel only!
Intel (hybrid): Xeon + MIC
Arm (homo): only Arm chip, but ...
Nvidia/Arm (hybrid): Arm + Nvidia
Fujitsu (homo): SPARC, high density, low power
China (homo/hybrid): with Intel only
Room for AMD console chips

Chip architecture
Strongly market driven: mobile, TV sets, screens, video/image processing
Intel: new architecture to compete with ARM; less Xeon, but PHI
ARM: main focus on low power mobile chips (Qualcomm, Texas Instruments, Nvidia, ST, etc.); new HPC market, server market
NVIDIA: GPU alone will not last long; ARM+GPU, Power+GPU
Power: embedded market; Power+GPU, only chance for HPC
AMD: console market; still some chance for HPC