Highly Productive HPC on Modern Vector Supercomputers: Current and Future
1 Highly Productive HPC on Modern Vector Supercomputers: Current and Future. Hiroaki Kobayashi, Director and Professor, Cyberscience Center, Tohoku University. Moscow, Russia.
2 About Tohoku University: founded in 1907 as the 3rd Imperial University, in Sendai City, Japan. 3,116 faculty members, 18,000 graduate and undergraduate students.
3 Missions of Cyberscience Center as a National Supercomputer Center. High-performance computing center founded in 1969, offering leading-edge high-performance computing environments to academic users nationwide in Japan: 24/7 operation of large-scale vector-parallel and scalar-parallel systems; 1,500 users registered in AY 2014. User support: benchmarking, analyzing, and tuning users' programs; holding seminars and lectures. Supercomputing R&D in collaboration with NEC: designing next-generation high-performance computing systems and their applications for highly productive supercomputing, building on a 57-year history of collaboration between Tohoku University and NEC on high-performance vector computing. Education: teaching and supervising BS, MS, and Ph.D. students as a cooperative laboratory of the Graduate School of Information Sciences, Tohoku University. Installed vector systems: SX-1 in 1985, SX-2 in 1989, SX-3 in 1994, SX-4 in 1998, SX-7 in 2003, SX-9 in 2008.
4
5 New HPC Building Construction and System Installation (2014.7~2015.2)
6 Cooling Facility of HPC Building [diagram: evaporative cooling towers and fresh-air intake on the roof, coolant loop feeding heat exchangers and air-handling units (AHU) on 1F/2F, cold air supplied to the capped cold aisles of Clusters 1-5, hot air returned]
7 Organization of Tohoku Univ. SX-ACE System (Clusters 0-4 connected via IXS).
Size: 1 core; 4 cores = 1 CPU (socket); 1 CPU = 1 node; 512 nodes = 1 cluster; 5 clusters = total system.
Performance (VPU+SPU): 69 GFlop/s per core (68 GF + 1 GF); 276 GFlop/s per CPU/node (272 GF + 4 GF); 141 Tflop/s per cluster (139 TF + 2 TF); 707 Tflop/s total (697 TF + 10 TF).
Memory BW: 256 GB/s per node; 131 TB/s per cluster; 655 TB/s total.
Memory capacity: 64 GB per node; 32 TB per cluster; 160 TB total.
IXS node BW: 4 GB/s x2.
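For reference, the cluster- and system-level figures follow directly from the per-core and per-node numbers above:

```latex
4 \times 69\ \mathrm{Gflop/s} = 276\ \mathrm{Gflop/s\ per\ node}, \quad
512 \times 276\ \mathrm{Gflop/s} \approx 141\ \mathrm{Tflop/s\ per\ cluster}, \quad
5 \times 141\ \mathrm{Tflop/s} \approx 707\ \mathrm{Tflop/s\ total};
\qquad
512 \times 256\ \mathrm{GB/s} \approx 131\ \mathrm{TB/s}, \quad
5 \times 131\ \mathrm{TB/s} \approx 655\ \mathrm{TB/s}; \qquad
512 \times 64\ \mathrm{GB} \approx 32\ \mathrm{TB}, \quad
5 \times 32\ \mathrm{TB} = 160\ \mathrm{TB}.
```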
8 Features of the SX-ACE Vector Processor. 4-core configuration, each core with a high-performance vector-processing unit (VPU) and a scalar processing unit (SPU): 272 Gflop/s of VPU + 4 Gflop/s of SPU per socket, i.e. 68 Gflop/s + 1 Gflop/s per core. 1MB private ADB per core (4MB per socket): software-controlled on-chip memory for vector load/store, 4x the capacity of SX-9, 4-way set-associative, with a 512-entry MSHR (address+data); 256GB/s to/from the vector registers, i.e. 4B/F for multiply-add operations. 256GB/s memory bandwidth shared by the 4 cores: 1B/F for 4-core multiply-add operations, up to 4B/F for 1-core multiply-add operations; 128 memory banks per socket. Other improvements and new mechanisms enhance vector processing capability, especially for efficient handling of short-vector operations and indirect memory accesses: out-of-order execution of vector load/store operations, advanced data forwarding in vector pipe chaining, and shorter memory latency than SX-9. [block diagram of the SX-ACE processor architecture: four cores (VPU, SPU, 1MB ADB, MSHR) connected through a crossbar to the memory controllers (MC) at 256GB/s, with an RCU providing 2 x 4GB/s IXS links; source: NEC]
9 Features of Tohoku Univ. SX-ACE System.
CPU performance, SX-9 (2008) vs SX-ACE (2014): number of cores 1 vs 4 (4x); flop/s per CPU 118.4 Gflop/s vs 276 Gflop/s (2.3x); memory bandwidth 256 GB/s vs 256 GB/s (1x); ADB capacity 256 KB vs 4 MB (16x).
Total performance, footprint, power consumption, SX-9 vs SX-ACE: total flop/s 34.1 Tflop/s vs 706.6 Tflop/s (20.7x); total memory bandwidth 73.7 TB/s vs 655 TB/s (8.9x); total memory capacity 18 TB vs 160 TB (8.9x); power consumption 590 kVA vs 1,080 kVA (1.8x); footprint 293 m2 vs 430 m2 (1.5x).
CPU (node) comparison, SX-ACE (2014) vs K (2011): clock frequency 1 GHz vs 2 GHz (0.5x); flop/s per core 64 Gflop/s vs 16 Gflop/s (4x); cores per CPU 4 vs 8 (0.5x); flop/s per CPU 256 Gflop/s vs 128 Gflop/s (2x); memory bandwidth 256 GB/s vs 64 GB/s (4x); bytes per flop (B/F) 1.0 vs 0.5 (2x); memory capacity 64 GB vs 16 GB (4x).
10 High Demands for Vector Systems in Memory-Intensive Science and Engineering Applications. [chart: floating-point operations serviced per year, from SX-2R through SX-9 & SX-ACE, for industry, academia outside Tohoku Univ., and academia inside Tohoku Univ., plotted against the upper limit of 24/7 service; representative applications include advanced CFD models for digital flight, turbine design, atmosphere analysis and weather forecasting, combustion, earthquake analysis, electromagnetic field analysis, catalyzer design, and magnetic recording device design]
11 Expanding Industrial Use. Featured in the TV program Close-up GENDAI by NHK ( ). Examples of industry use: advanced perpendicular magnetic recording hard drives, highly efficient turbines for power plants, exhaust gas catalysts, base material for PCBs, and a regional jet.
12 Performance Evaluations of SX-ACE 12
13 Specifications of Evaluated Systems (no. of sockets per node, perf. per socket, no. of cores, perf. per core, memory BW, on-chip memory, network BW, system B/F):
SX-ACE: 1MB ADB/core; IXS, 2 x 4 GB/s; system B/F 1.0.
SX-9: 256KB ADB/core; IXS, 2 x 128 GB/s; system B/F 2.5.
ES2: 256KB ADB/core; IXS, 2 x 64 GB/s; system B/F 2.5.
LX 406 (Ivy Bridge): 256KB L2/core + 30MB shared L3; IB, 5 GB/s; system B/F 0.26.
FX10 (SPARC64IXfx): 12MB shared L2; Tofu NW.
K (SPARC64VIIIfx): 6MB shared L2; Tofu NW, 5 GB/s.
SR16K M1 (Power7): 256KB L2/core + 32MB shared L3; custom NW, x2; system B/F 0.52.
Remarks: listed performances are based on the total multiply-add performance of each system.
14 Sustained Memory Bandwidth: STREAM (TRIAD). [chart: sustained STREAM memory bandwidth (GB/s) vs. number of threads (activated cores) for SX-ACE, SR16K M1 (Power7), LX406 (Ivy Bridge), and FX10 (SPARC64IXfx)]
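For readers unfamiliar with the benchmark, STREAM TRIAD measures the bandwidth sustained by the simple kernel sketched below. This is a minimal single-threaded illustration (the array size, single pass, and timing are arbitrary choices), not the official STREAM code, which repeats the kernel and reports the best of several runs.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    // ~16.8M elements per array, ~400 MB in total: large enough to defeat caches.
    const std::size_t n = 1 << 24;
    const double q = 3.0;
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 3.0);

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i)
        a[i] = b[i] + q * c[i];              // TRIAD: 2 flops per iteration
    auto t1 = std::chrono::steady_clock::now();

    double sec = std::chrono::duration<double>(t1 - t0).count();
    // STREAM counts 24 bytes per iteration (read b, read c, write a),
    // ignoring any extra write-allocate traffic.
    double gbytes = 24.0 * static_cast<double>(n) / 1e9;
    std::printf("TRIAD: %.1f GB/s (check %.1f)\n", gbytes / sec, a[n / 2]);
    return 0;
}
```

Because the kernel performs only two flops per 24 bytes moved, the chart above is essentially a comparison of memory subsystems rather than of peak flop/s.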
15 Sustained Single CPU Performance. [bar chart: sustained performance (Gflop/s) and efficiency (%) of SX-ACE, SX-9, LX406 (Ivy Bridge), and SR16K M1 (Power7) on QSFDM, Barotropic, MHD (FDM), Seism3D, MHD (Spectral), TURBINE, and BCM; each code is annotated with its B/F requirement (memory intensity); plotted efficiencies range from about 2.5% to 47%]
16 Sustained Performance of Barotropic Ocean Model on Multi-Node Systems. [chart: performance (Gflop/s, up to about 1,200) vs. number of MPI processes (nodes), SMP-MPI hybrid, for ES2, SX-ACE, LX406 (Ivy Bridge), SR16K M1 (Power7), and FX10 (SPARC64IXfx)]
17 Performance Evaluation of SX-ACE Using HPCG. HPCG (High Performance Conjugate Gradients) is designed to exercise computational and data-access patterns that more closely match a broad set of important applications, and to give computer system designers an incentive to invest in capabilities that will have an impact on the collective performance of these applications. HPL, used for the TOP500, is increasingly unreliable as a true measure of system performance for a growing collection of important science and engineering applications. HPCG is a complete, stand-alone code that measures the performance of basic operations in a unified code: sparse matrix-vector multiplication; sparse triangular solve; vector updates; global dot products; local symmetric Gauss-Seidel smoother. It is driven by a multigrid-preconditioned conjugate gradient algorithm that exercises these key kernels on a nested set of coarse grids. The reference implementation is written in C++ with MPI and OpenMP support.
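To illustrate the kind of data-access pattern HPCG exercises, here is a generic sparse matrix-vector product over a CSR-format matrix, the first kernel in the list above; it is an illustrative sketch, not code taken from the HPCG reference implementation.

```cpp
#include <cstddef>
#include <vector>

// y = A * x with A stored in CSR (compressed sparse row) format.
// Each nonzero costs 2 flops but roughly 16-20 bytes of traffic
// (matrix value, column index, and a gathered element of x),
// so the kernel is bound by memory bandwidth and gather efficiency.
void spmv_csr(const std::vector<double>& val,
              const std::vector<int>& col_idx,
              const std::vector<int>& row_ptr,
              const std::vector<double>& x,
              std::vector<double>& y) {
    const std::size_t nrows = row_ptr.size() - 1;
    for (std::size_t i = 0; i < nrows; ++i) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];   // indirect (gather) access to x
        y[i] = sum;
    }
}
```

Kernels of this shape reward exactly the properties highlighted on the SX-ACE slides: high memory bandwidth per flop and efficient indirect memory access.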
18 Efficiency Evaluation of HPCG Performance. [two charts across the HPCG list entries (systems at CINECA, Cambridge, TACC, Oak Ridge, K Computer, Tohoku U, HLRS, and others), grouped by architecture (SX-ACE, Xeon, K/FX, GPU, Xeon Phi, Blue Gene): efficiency of sustained performance relative to peak [%], and power efficiency [Mflop/s/W]]
19 Efficiency Evaluation of HPCG Performance. [two charts across the same HPCG list entries, grouped by architecture (SX-ACE, Xeon, K/FX, GPU, Xeon Phi, Blue Gene): peak performance needed to obtain the same sustained performance, and number of cores needed to obtain the same sustained performance, both normalized to SX-ACE]
20 A Real-Time Tsunami Inundation Forecasting System on SX-ACE for Tsunami Disaster Prevention and Mitigation 20
21 Motivation: Serious Damage to the Sendai Area Due to the 2011 Tsunami. A huge tsunami hit the coastal area of the Tohoku region. [photo: inundation area around Sendai Airport; courtesy of Prof. Koshimura, Tohoku University]
22 It's Not the End: High Probability of Big Earthquakes in Japan. Japan may be hit by severe earthquakes and large tsunamis in the next 30 years. [map of seismogenic zones around Japan: the 2011 Great Tohoku Earthquake near Ishinomaki and Sendai, plus zones with 70% and 88% probability of a major earthquake in the next 30 years near Tokyo, Shizuoka, and Kochi; courtesy of Prof. Koshimura, Tohoku University]
23 Design and Development of a Real-Time Tsunami Inundation Forecasting System. Pipeline: GPS observation and fault estimation based on GPS data (< 8 min); inundation simulation on SX-ACE using 10-m mesh models of coastal cities (< 8 min); information delivery of inundation depth etc., with just-in-time access to visualized information by local governments (< 4 min); total < 20 min.
24 Demo: Visualization of Simulation Results 24
25 Demo: Visualization of Simulation Results Miyagi 25
26 Scalability of the TUNAMI Code. [chart: execution time (sec) vs. number of cores (processes, 1K to 13K) for K computer (SPARC64VIIIfx), LX 406 (Ivy Bridge), and SX-ACE, with the 8-minute target marked]
27 Future Vector Systems R&D* *This work is partially conducted with NEC, but the contents do not reflect any future products of NEC
28 Big Problem in HPC System Design: Brain Infarction of Modern HEC Systems. The imbalance between the peak flop/s rate and the memory bandwidth of HEC systems results in inefficient computing: only a small portion of peak flop/s contributes to the execution of practical applications in many important science and engineering fields, and a large amount of sleeping flop/s power is wasted! So far this was acceptable because Moore's law worked, but it no longer makes sense as the end of Moore's law approaches. We therefore have to be much smarter in designing future HEC systems, because it is very hard to obtain cost reductions from Moore's law, given in particular the sky-rocketing cost of semiconductor fabrication at 20nm and finer technologies. The silicon budget for computing is not free any more! Exploit the sleeping flop/s efficiently by redesigning/reinventing memory subsystems to protect HEC systems from serious brain infarction, and use the precious silicon budget (plus advanced device technologies) to design mechanisms that can supply data to the computing units smoothly.
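As a rough, back-of-the-envelope illustration of the argument above (the numbers are illustrative assumptions, not any specific product), a bandwidth-bound code is capped at the machine's memory bandwidth divided by the code's byte-per-flop requirement:

```latex
P_{\mathrm{sustained}} \;\le\; \min\!\left(P_{\mathrm{peak}},\ \frac{B_{\mathrm{mem}}}{b_{\mathrm{code}}}\right),
\qquad \text{e.g.}\quad
\frac{0.1\ \mathrm{TB/s}}{4\ \mathrm{B/F}} = 25\ \mathrm{Gflop/s} \approx 2.5\%\ \text{of a 1 Tflop/s peak.}
```

A code that needs 4 B/F on a 0.1 B/F machine can therefore use only a few percent of the peak no matter how well it is tuned; this is the sleeping flop/s the slide refers to.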
29 Toward the Next-Generation Vector System Design. Focus much more on sustained performance by increasing processing efficiency, rather than relying heavily on peak performance, in the design of the next-generation vector systems. Architecture design of high-performance vector cores connected to an advanced memory subsystem at a high sustained bandwidth: find a good balance between computing performance and memory performance of future HEC systems to maximize computing efficiency across a wide variety of science and engineering applications; achieve a certain level of sustained performance with a moderate number of high-performance cores/nodes within a lower power budget; shorter-diameter, lower-latency, high-bandwidth networks also become adoptable. [concept figure: 3D vector chip with vector-core layers, on-chip memory layers, and I/O layers connected by an intra-socket network (electrical/optical), in 3D/5.5D integration]
30 Feasibility Study of the Next-Generation Vector System, Targeting Around 2018+. System spec: each node delivers 4 Tflop/s from four 1 Tflop/s CPUs, where 1 TF = 256 Gflop/s x 4 cores; nodes are connected by a hierarchical network and backed by a hierarchical distributed storage system. Per processor: 4 threads, 1 TF peak. Per 100,000 processors: 100 PF peak, 25,000 nodes, 200 PB/s total memory bandwidth, 3.2 TB total cache capacity. Each core sees ~1 TB/s (4 B/F) to a 32 MB VLSB (vector load/store buffer), and each CPU sees ~2 TB/s (2 B/F) to ~256 GB of 2.5D/3D die-stacked shared memory.
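The bandwidth and capacity targets in this sketch are consistent with the stated B/F goals, assuming the 256 Gflop/s cores and 1 Tflop/s CPUs given above:

```latex
256\ \mathrm{Gflop/s} \times 4\ \mathrm{B/F} \approx 1\ \mathrm{TB/s}\ (\text{core} \to \text{VLSB}), \qquad
1\ \mathrm{Tflop/s} \times 2\ \mathrm{B/F} = 2\ \mathrm{TB/s}\ (\text{CPU} \to \text{stacked shared memory}),
\qquad
100{,}000 \times 2\ \mathrm{TB/s} = 200\ \mathrm{PB/s\ total\ memory\ BW}, \qquad
100{,}000 \times 32\ \mathrm{MB} = 3.2\ \mathrm{TB\ total\ VLSB\ capacity}.
```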
31 Feasibility Study of the Next-Generation Vector System, Targeting Around 2018+ (packaging). Node (1U, about 30cm x 40cm, with cooling unit): 4 TF, 512 GB~1 TB of memory, 4 TB/s~8 TB/s, 800~1,000 W. Rack (about 60cm x 100cm x 200cm, 64 vector nodes plus NW switches and PSU): 256 TF, 32 TB~64 TB, 256 TB/s~512 TB/s, 52~64 kW. System (about 40m x 40m floor, 400+ racks): 100 PF (~1 EF), 12 PB~25 PB, 100 PB/s~200 PB/s, 20~30 MW.
32 Performance Comparison of Our Target System with a Commodity-Based Future System. Our system: SMP (UMA) node architecture; 1 Tflop/s sockets with 1~2 TB/s each (2~4 B/F); node totals 4 TF, ~8 TB/s (~2 B/F), ~512 GB, 1~1.6 kW; interconnect 40 GB/s x2, high-dimensional torus. Commodity-based system: NUMA node architecture; 1.6 Tflop/s sockets (90 Gflop/s cores) with 0.05~0.1 TB/s each (0.1 B/F) and a 20 GB/s link between the two sockets; node totals 3.3 TF, 0.2 TB/s (0.1 B/F), 128 GB x 2, 0.4 kW; interconnect 10 GB/s x2, fat-tree.
33 Performance Estimation. [chart: speedup and speedup/Watt of our system, normalized to the future Xeon-based architecture.] Our system vs. Xeon-based (ratio): threads per proc 4 vs 4; peak PF per proc 1 TF vs 0.36 TF (2.79x); peak PF per 100,000 procs 100 PF vs 35.84 PF (2.79x); nodes per 100,000 procs 25,000 vs 11,…; total memory BW 200 PB/s vs 3.4 PB/s (58.6x); total cache capacity 3.2 TB vs 0.8 TB (4x). Up to ~8x in performance per watt in Seism3D and RSGDX because of their high B/F requirement; comparable performance per watt in Turbine.
34 Performance Estimation (2/2). [chart: execution time (sec) vs. number of processes for our system and the future commodity-based system; the commodity system's scalability stalls.] The Xeon-based system needs 6.4M processes, i.e. a peak of 3.2 EF, to achieve sustained performance equivalent to that of our 100 PF system.
35 Summary. SX-ACE shows high sustained performance compared with SX-9, in particular for short-vector processing and indirect memory accesses. Well-balanced HEC systems in terms of memory performance are the key to realizing high productivity in science and engineering simulations in the post-peta-scale era. We are exploring the great potential of a new generation of vector architectures for future HPC systems, together with new device technologies such as 2.5D/3D die-stacking: high sustained memory bandwidth to fuel the vector function units at lower power/energy, and an on-chip vector load/store unit that can boost sustained memory bandwidth energy-efficiently. When will such new technologies be available in production systems? Design tools, fabrication, and markets steer the future of these technologies!
36 Final Remarks: Now It's Time to Think Different! Make HPC Systems Much More Comfortable and Friendly. Target HPC system design at the entry and middle classes of the HPC community in daily use, not at top, flop/s-oriented, special-purpose use. Spend much more effort and resources on exploiting the potential of the system, even with a moderate number of nodes and/or cores, for daily use that requires high productivity in simulation, even though this approach sacrifices exa-flop/s-level peak performance. Seeking exa-flop/s with accelerators and then downsizing that design for the entry and middle classes is NOT a smart solution in the post-peta-scale era. Let's make The Supercomputer for the Rest of Us happen!