Supercomputing 2004 - Status und Trends (Conference Report) Peter Wegner




(Conference Report) Peter Wegner

Agenda: SC2004 conference; Top500 list; BG/L; Moore's Law and problems of recent architectures; solutions; interconnects; software; Lattice QCD machines; DESY @ SC2004; QCDOC; conclusions

Technical Program SC2004 Exhibition: Research exhibitors ~150, Industry exhibitors ~200

Supercomputing = (Processor Performance) + Network Speed + Storage Capacity

SCinet: the aggregate WAN transport delivered to the Industry and Research Exhibitors is expected to exceed 150 Gbps.

StorCloud: 1 PetaByte of randomly accessible storage, 1 GigaByte per second backup bandwidth.

Top 500 List (Linpack Benchmark; Jack Dongarra, Erich Strohmaier, Hans W. Meuer; http://www.top500.org). Rmax/Rpeak in GFlops.

1. BlueGene/L DD2 beta-system (0.7 GHz PowerPC 440), 32768 processors, IBM; Rmax 70720, Rpeak 91750; United States/2004, Research
2. Columbia, SGI Altix 1.5 GHz, Voltaire Infiniband, 10160 processors, SGI; Rmax 51870, Rpeak 60960; NASA/Ames Research Center/NAS, United States/2004, Research
3. Earth-Simulator, 5120 processors, NEC; Rmax 35860, Rpeak 40960; The Earth Simulator Center, Japan/2002, Research
4. MareNostrum, eServer BladeCenter JS20 (PowerPC970 2.2 GHz), Myrinet, 3564 processors, IBM; Rmax 20530, Rpeak 31363; Barcelona Supercomputer Center, Spain/2004, Research
5. Thunder, Intel Itanium2 Tiger4 1.4 GHz, Quadrics, 4096 processors, California Digital Corporation; Rmax 19940, Rpeak 22938; Lawrence Livermore National Laboratory, United States/2004, Research
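The Rmax/Rpeak ratio is the Linpack efficiency of a system. A minimal Python sketch computing it from the numbers above (all values in GFlops):

```python
# Linpack efficiency (Rmax / Rpeak) for the top 5 of the November 2004 list.
# Figures in GFlops, taken from the list above.
systems = {
    "BlueGene/L beta": (70720, 91750),
    "Columbia":        (51870, 60960),
    "Earth-Simulator": (35860, 40960),
    "MareNostrum":     (20530, 31363),
    "Thunder":         (19940, 22938),
}

for name, (rmax, rpeak) in systems.items():
    print(f"{name:16s} {rmax / rpeak:6.1%}")
```

The vector-based Earth-Simulator reaches the highest efficiency (about 88%), the BlueGene/L beta system about 77%; such spreads are one reason the Linpack ranking alone does not tell the whole story.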

BG/L: IBM CMOS CU-11 0.13 µm technology.

HITACHI: SR11000 model H1, 16 POWER5 processors (1.9 GHz) per node (http://www.hitachi.co.jp/prod/comp/hpc/sr_e/11ktop_e.html)
NEC: SX-7 Vector Computer (http://www.nec.co.jp/press/en/0210/0901-01.html)
Cray: X1 Vector Computer (http://www.cray.com/products/x1/index.html)

Moore's Law: Gordon Moore of Intel predicted, at the start of IC mass production in the 1970s, that chip complexity (the number of transistors on one chip) would double every 18 months. The prediction says nothing about power consumption, performance, or efficiency. Problems: memory bandwidth, performance loss due to growing pipeline depth, cooling. Solutions: multi-core CPUs, multithreaded architectures (or both), software prefetch (HITACHI).
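The 18-month doubling rule is simple compounding; a small sketch, where the 2300-transistor 1971 starting point (roughly the first Intel microprocessor) is an illustrative assumption, not a figure from the talk:

```python
# Moore's law as stated on the slide: transistor count doubles every 18 months.
def transistors(years_elapsed, start=2300, doubling_months=18):
    """Projected transistor count after `years_elapsed` years."""
    return start * 2 ** (years_elapsed * 12 / doubling_months)

# Two doublings in three years: 2300 -> 9200.
print(transistors(3))
```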

CPU efficiency at increasing clock speed [chart: CPU efficiency (performance / MHz) vs. clock speed for PIII and P4; Marc Tremblay, SUN]

Multi-core multithreaded Architectures Marc Tremblay, SUN

Reconfigurable co-processors, Cray XD1:
- Compute System: 12 AMD Opteron 32/64-bit x86 processors per shelf, packaged as six 2-way SMPs, very densely packaged, Linux
- RapidArray Interconnect System: 12 communications processors, 1 terabit per second switch fabric, 30x Gigabit Ethernet performance
- Active Management System: dedicated processor, high availability, single system management and control
- Application Acceleration System: 6 FPGA co-processors


Reconfigurable co-processors Clearspeed: Linux Networks?

X86 architectures, bottlenecks

X86 architectures, dual-chipset XEON motherboards (Akira Ukawa, University of Tsukuba, German-Japanese Workshop, 27 November 2004). A similar architecture using an Infiniband interconnect was presented at SC2004 by James Ballew of Raytheon.

Interconnects:
- Infiniband is the absolute winner, but low-latency 10 GBit Ethernet has a good chance in the future (10Gig will get to copper in a year or two).
- PathScale HTX-Adaptor: HyperTransport connector, reduced Infiniband protocol accepted by Infiniband switches, 1.5 µs MPI latency, FPGA ready, ASIC in Q2/2005.
- Also: Myrinet, QsNET, Dolphin Interconnect (SCALI).
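A common way to compare such interconnects is the alpha-beta model, transfer time = latency + size / bandwidth. The sketch below uses the 1.5 µs PathScale MPI latency quoted above; the 900 MB/s effective bandwidth is an assumed, era-typical value, not a figure from the report:

```python
# Alpha-beta model of point-to-point message transfer time.
def transfer_time_us(size_bytes, latency_us=1.5, bandwidth_mb_per_s=900.0):
    """Transfer time in microseconds: fixed latency plus serialization time."""
    bytes_per_us = bandwidth_mb_per_s  # 900 MB/s == 900 bytes per microsecond
    return latency_us + size_bytes / bytes_per_us

# Small messages are latency-bound, large ones bandwidth-bound:
print(transfer_time_us(8))        # 8-byte message: dominated by the 1.5 us latency
print(transfer_time_us(1 << 20))  # 1 MiB message: dominated by bandwidth
```

This is why low MPI latency, not just raw link speed, decided the Gigabit Ethernet vs. Infiniband comparison for fine-grained parallel codes.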

Software:
- Compilers: PathScale EKOPATH Compiler Suite for AMD64 and EM64T, PGI Compiler, Intel Compiler, Absoft
- Debuggers: TotalView (Etnus), Allinea DDT (Distributed Debugging Tool)
- Batch systems: PBS Pro, SUN Gridengine (SUN, RAYTHEON), LSF

NIC (FZJ, DESY):
- NIC/FZJ: JUMP Supercomputer (IBM Regatta, 8.9 TFlops, rank 30 of the Top500 list), graphical user interface for job monitoring
- NIC/DESY: APE massively parallel computers (550 GFlops / 3 TFlops)

Lattice QCD, NIC/DESY Zeuthen poster: APEmille

NIC/DESY FZJ

Lattice QCD: QCDOC (Columbia University, RIKEN/Brookhaven National Laboratory (BNL), UKQCD, IBM)

QCDOC (Columbia University, RIKEN/Brookhaven National Laboratory (BNL), UKQCD, IBM):
- MIMD machine with distributed memory, system-on-a-chip design (QCDOC = QCD On a Chip)
- Technology: IBM SA27E = CMOS 7SF = 0.18 µm lithography process
- The ASIC combines existing IBM components and QCD-specific, custom-designed logic:
  - 500 MHz PowerPC 440 processor core with a 64-bit, 1 GFlops FPU
  - 4 MB on-chip memory (embedded DRAM)
  - nearest-neighbor serial communications unit (SCU)
  - 6-dimensional communication network (4-D physics, 2-D partitioning)
- Silicon chip about 11 mm square, consuming 2 W
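The 6-dimensional nearest-neighbor network can be pictured as a periodic grid in which every node has 12 neighbors (one step in either direction along each of six dimensions). A sketch of neighbor addressing on a hypothetical 6-D torus; the function and grid sizes are illustrative, not QCDOC's actual addressing scheme:

```python
# Nearest neighbors of a node on a periodic (torus) grid of arbitrary dimension.
def neighbors(coord, dims):
    """Return the 2 * len(dims) coordinates reached by +/-1 steps."""
    result = []
    for d in range(len(dims)):
        for step in (-1, 1):
            n = list(coord)
            n[d] = (n[d] + step) % dims[d]  # wrap around: torus topology
            result.append(tuple(n))
    return result

# A 6-D grid (4 physics + 2 partitioning dimensions) gives 12 neighbors per node:
print(len(neighbors((0, 0, 0, 0, 0, 0), (4, 4, 4, 4, 2, 2))))  # prints 12
```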

Conclusions

A wide range of supercomputing architectures:
- high-end supercomputers: Hitachi SR11000, Cray X1, NEC SX7/8
- commodity clusters as HPC solution: Linux Networks, IBM, DELL, SGI, SUN, HP, Rack Server; more than 20 systems with more than 1500 dual-CPU nodes in the Top500 list
- blade servers
- SMP systems: IBM (Power5), HP (Itanium), SUN (Sparc, Opteron), SGI (Itanium)
- special designs with integrated commodity CPUs: Cray XD1, Linux Networks?
- special-purpose systems: BG/L, QCDOC, apenext, GRAPE6 (Japan)

No significant increase in system performance from classical technologies, therefore:
- direct CPU-memory connection (AMD)
- multi-core CPUs (IBM, Intel, AMD)
- multi-core multithreaded CPUs (SUN, MTA)

Increasing fraction of Opteron HPC systems. Infiniband as a special high-performance link beats products like Myrinet and QsNet; in the future, 10 GBit Ethernet? All commercial compiler vendors are offering optimized compilers for 64-bit x86 systems (Opteron, EM64T).