Power-Aware High-Performance Scientific Computing
1 Power-Aware High-Performance Scientific Computing
Padma Raghavan
Scalable Computing Laboratory, Department of Computer Science and Engineering, The Pennsylvania State University
Supported by NSF STHEC: PxP: Co-Managing Performance x Power
2 Trends: Microprocessor Design & HPC
Microprocessor design
- Gordon Moore, 1965: 2X the transistors every 18 months
- Focus on peak rates; LINPACK benchmarks with dense codes
- Patrick Gelsinger, 2004 DAC keynote: power is the only real limiter
HPC and science through simulation
- High costs of installation and cooling
- A petascale system is infeasible without new low-power designs (Simon, Boku)
- Gap between peak (TOP500) and sustained rates on real workloads
- Petascale instrument vs. desktop supercomputing
- CMPs/multicores and performance, power, and productivity issues
3 Why Sparse Scientific Codes?
- Sparse codes (irregular meshes, matrices, graphs), unlike tuned dense codes, do not operate at peak rates (despite tuning)
- Sparse codes represent scalable formulations for many applications, but:
  - Limited data locality and data re-use
  - Memory- and network-latency bound
  - Load imbalances despite partitioning/re-partitioning
  - Multiple algorithms and implementations with different quality/performance trade-offs
- They present many opportunities for adaptive Q(uality) x P(erformance) x P(ower) tuning
4 Sparse Codes and Data
Example: sparse y = Ax
- Used in many PDE simulations: in explicit codes, in implicit codes with linear system solution, and in data clustering with K-means
- Ordering (RCM) to get locality of access in x
- Data locality and data reuse for elements of x (see the sketch below)
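A minimal sketch of the sparse y = Ax kernel in compressed sparse row (CSR) storage, assuming conventional array names (row_ptr, col_idx, vals) that are ours, not the talk's. It shows why the indexed reads of x are the locality-sensitive part that RCM ordering improves:

```c
/* Sparse y = Ax in CSR form: illustrative sketch only. */
#include <stddef.h>

void spmv_csr(size_t n, const size_t *row_ptr, const size_t *col_idx,
              const double *vals, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        /* Nonzeros of row i are contiguous: streaming access to
         * vals/col_idx, but irregular (indexed) access to x. */
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += vals[k] * x[col_idx[k]];
        y[i] = sum;
    }
}
```

With RCM, the column indices in each row cluster near the diagonal, so successive x[col_idx[k]] reads tend to fall in the same cache lines, giving the locality and reuse the slide refers to.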
5 This Presentation
- Microprocessor/network architectural optimizations x application features
- PxP results for sparse scientific computing
  - Optimizing CPU + memory for sparse PxP
  - PxP models for adaptive feature selection
  - PxP trends on MPPs with CPU + link scaling
- Summary and conclusions
6 PxP Results - I
Characterizing power reductions and performance improvements for a single node, i.e., CPU + memory
There is locality of data access in many sparse codes when matrices are reordered, the right data structures are used, etc.
Konrad Malkowski (lead)
7 Power-Aware + High-Performance Computing
Power of CMOS chips: P = C * V_dd^2 * f + V_dd * I_leak
Typically, higher performance means higher f with higher transistor counts, up to thermal limits
Tuning power:
- DVS: dynamic voltage and frequency scaling for CPUs
- Drowsy/low-power modes of caches and DRAM memory banks
- ABB: adaptive body biasing, which reduces I_leak
If these low-power knobs are exposed in the ISA, they can be used to control power in applications
If some of the power savings are directed to memory/network optimizations, we can increase performance while lowering power, for PxP reductions in energy (a worked sketch follows)
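A worked sketch of the slide's power model, P = C * V_dd^2 * f + V_dd * I_leak, with illustrative constants (the capacitance, leakage current, and voltage/frequency pairs below are assumptions, not measurements from the study). It shows why DVS attacks the dynamic term roughly cubically when voltage and frequency scale together:

```c
/* CMOS power model from the slide: dynamic term + leakage term. */
#include <stdio.h>

static double cmos_power(double C, double vdd, double f, double i_leak)
{
    return C * vdd * vdd * f + vdd * i_leak;
}

int main(void)
{
    double C = 1.0e-9;    /* effective switched capacitance (F), assumed */
    double i_leak = 0.05; /* leakage current (A), assumed */

    /* DVS lowers f and Vdd together, so the dynamic term falls
     * roughly with the cube of frequency. */
    double p_base = cmos_power(C, 1.2, 1.0e9, i_leak); /* 1 GHz at 1.2 V */
    double p_dvs  = cmos_power(C, 0.9, 6.0e8, i_leak); /* 600 MHz at 0.9 V */
    printf("base: %.3f W  dvs: %.3f W\n", p_base, p_dvs);
    return 0;
}
```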
8 Methodology
Cycle-accurate architectural emulations using SimpleScalar, Wattch, and CACTI
- Emulate a CPU with caches + off-chip DRAM memory, starting from a PowerPC-like core (like a BG/L processor)
- Emulate low-power modes
  - Model DVS by scaling frequency and supply voltage
  - Model low-power cache modes by emulating smaller caches
- Emulate memory subsystem optimizations
  - Extend SimpleScalar/Wattch with structures for optimizations that reduce memory latency
9 Base (B) Architecture
- PowerPC-like, 1 GHz core
- 4 MB SRAM L3 (26-cycle latency)
- 2 KB SRAM L2 (7-cycle latency)
- 32 KB SRAM L1 instruction and data caches (1-cycle latency)
- Memory bus: 64 bits
- Memory size: 256 MB (9 x 256 Mbit x 8-pin DRAM)
10 Architectural Extensions
- Wider memory bus: 128 bits, up from the original 64 (W)
- Memory page policy: open or closed (MO)
- Prefetcher (stride 1) in the memory controller (MP)
- Prefetcher (stride 1) in the L2 cache (LP)
- Load Miss Predictor in the L1 cache (LMP)
Prefetchers can reduce latency if there is locality of access
If the sparse matrix is highly irregular (inherently or from the implementation), an LMP can avoid the latency of the cache hierarchy
Developed an LMP similar to a branch-prediction structure
11 Memory Prefetcher (MP)
- Added a prefetch buffer to the memory controller
- 16-entry table with 128-byte cache lines
- LRU replacement (see the sketch below)
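A hedged sketch of such a prefetch buffer: 16 entries, 128-byte lines, LRU replacement, stride-1 (next-line) prefetch, as the slide describes. The data structure and field names are guesses at the emulated hardware, not the authors' implementation:

```c
#include <stdint.h>
#include <stdbool.h>

#define PF_ENTRIES 16
#define LINE_BYTES 128

typedef struct {
    uint64_t tag;      /* line-aligned address */
    uint64_t last_use; /* timestamp for LRU */
    bool valid;
} pf_entry_t;

static pf_entry_t buf[PF_ENTRIES];
static uint64_t now;

static void pf_fill(uint64_t line_addr)
{
    int victim = 0;
    for (int i = 0; i < PF_ENTRIES; i++) {      /* LRU victim search */
        if (!buf[i].valid) { victim = i; break; }
        if (buf[i].last_use < buf[victim].last_use) victim = i;
    }
    buf[victim] = (pf_entry_t){ line_addr, now, true };
}

/* On each memory access: report hit/miss in the buffer and stage the
 * next sequential line (stride 1) if it is not already buffered. */
bool pf_access(uint64_t addr)
{
    uint64_t line = addr / LINE_BYTES;
    bool hit = false, have_next = false;
    now++;
    for (int i = 0; i < PF_ENTRIES; i++) {
        if (!buf[i].valid) continue;
        if (buf[i].tag == line)     { buf[i].last_use = now; hit = true; }
        if (buf[i].tag == line + 1) have_next = true;
    }
    if (!have_next)
        pf_fill(line + 1);
    return hit;
}
```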
12 L2 Cache Prefetcher (LP)
- Benefits codes with locality of data access but poor data re-use
13 Memory Page Policy: Open / Closed (MO)
- Accesses to open rows have lower latency
- Memory control is more complex
- Access latencies are less predictable (see the sketch below)
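A small sketch contrasting the two page policies under assumed DRAM timing parameters (the cycle counts are placeholders): the open policy wins on row-buffer hits but pays precharge plus activate on conflicts, which is where the extra controller complexity and the less predictable latency come from:

```c
#include <stdint.h>

#define T_CAS 10 /* column access, assumed cycles */
#define T_RCD 10 /* activate (row-to-column delay), assumed */
#define T_RP  10 /* precharge, assumed */

static int64_t open_row = -1;

/* Open-page policy: keep the last row in the row buffer. */
int dram_access_open_policy(int64_t row)
{
    if (row == open_row)        /* row-buffer hit: fast */
        return T_CAS;
    open_row = row;             /* conflict: precharge + activate */
    return T_RP + T_RCD + T_CAS;
}

/* Closed-page policy: row is precharged after every access, so the
 * latency is uniform (activate + column access). */
int dram_access_closed_policy(int64_t row)
{
    (void)row;
    return T_RCD + T_CAS;
}
```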
14 Load Miss Predictor
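The talk describes the LMP only as being similar to a branch-prediction structure (slide 10); here is a hedged sketch assuming a PC-indexed table of 2-bit saturating counters, analogous to a bimodal branch predictor. The table size and indexing scheme are assumptions:

```c
#include <stdint.h>
#include <stdbool.h>

#define LMP_ENTRIES 1024 /* assumed table size */

static uint8_t lmp_table[LMP_ENTRIES]; /* 2-bit counters, 0..3 */

static inline unsigned lmp_index(uint64_t load_pc)
{
    return (unsigned)((load_pc >> 2) % LMP_ENTRIES);
}

/* Predict: counter >= 2 means "this load will miss in the caches",
 * so it can issue directly to memory and skip the L2/L3 lookup. */
bool lmp_predict_miss(uint64_t load_pc)
{
    return lmp_table[lmp_index(load_pc)] >= 2;
}

/* Train on the actual outcome, saturating at 0 and 3. */
void lmp_update(uint64_t load_pc, bool missed)
{
    uint8_t *c = &lmp_table[lmp_index(load_pc)];
    if (missed && *c < 3) (*c)++;
    else if (!missed && *c > 0) (*c)--;
}
```

For highly irregular sparse accesses, a load predicted to miss bypasses the cache hierarchy entirely, which is how the LMP avoids its stacked lookup latencies.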
15 Experiments
Features: Base (B), wider path (W), memory page policy (MO), memory prefetcher (MP), L2 prefetcher (LP), Load Miss Prediction (LMP); Base (B) at 1000 MHz
Sparse codes:
- SMV-U: no blocking, RCM ordering, 4 matrices
- SMV-O: Sparsity SMV, 2x2 blocking, RCM ordering, 4 matrices
- NAS MG benchmark
- Full-scale application: driven cavity flow
Metrics: time, power, energy, Ops/J (shown relative to the code at B, 1000 MHz, 4 MB L3 cache)
16 Relative Time: All Features, 300 MHz to 1 GHz, 256 KB L3
- Values < 1 are faster than at base
17 Relative Time at 600 MHz, Smaller L3
- X-axis: features added incrementally (B, +W, +MO, +MP, +LP, +LMP) to include all; time for each code at B set to 1
- Over 40% performance improvement with all features
- Without the optimizations, 40% performance degradation
18 Relative Power at 600 MHz, Smaller L3
- X-axis: features added incrementally to include all; power for each code at B set to 1
- Over 66% power saved from DVS (600 MHz) and the smallest cache, with no performance penalty
19 Relative Energy at 600 MHz, Smaller L3
- X-axis: features added incrementally to include all; energy for each code at B set to 1
- Over 80% improvement with all features
- Without the optimizations, 40% savings but with a performance penalty
20 Ops/J at 600 MHz, Smaller L3
- X-axis: features added incrementally to include all; Ops/J for each code at B set to 1
- Factor of 5 improvement in energy efficiency
21 PxP Results - II
PxP for a real driven-cavity flow application with typical complex code/algorithm features
Sayaka Akioka (lead)
22 Driven Cavity: Relative Time, Energy
(charts: relative time and energy as features +W, +MO, +MP, +LP, +LMP are added, through "all")
With all features, the code is faster by 20% even at 400 MHz, with 60% less power and energy
23 PxP Results - III
- Models to select optimal sets of features subject to performance/power constraints
- Detecting phases in the application
- Adaptively selecting a feature set for each application phase:
  - Reduce power subject to a performance constraint
  - Reduce time subject to a power constraint
Konrad Malkowski (lead)
24 Optimal Feature Sets
Least-squares fit to derive models of power or time per code as a function of the feature-set combination F:
  T = sum_{i=1}^{N} a_i * F_i, with F_i = 1 if feature i is on, 0 otherwise
- Errors of less than 5%
- Define a workload, then select the optimal configuration under power constraints
Example: best-time 2-feature set, even workload, < 50% of base power
- At 600 MHz: W + LP
- At 800 MHz: MO + MP
(a selection sketch follows)
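A hedged sketch of configuration selection with fitted linear models of this form: enumerate all 2^N feature combinations, predict time and power from per-feature coefficients, and keep the fastest set that meets the power constraint. The coefficient values and the budget below are illustrative placeholders, not the study's fitted numbers:

```c
#include <stdio.h>

#define NFEAT 5 /* W, MO, MP, LP, LMP */

int main(void)
{
    const char *name[NFEAT] = { "W", "MO", "MP", "LP", "LMP" };
    /* assumed per-feature time deltas (relative) and power deltas (W) */
    double t0 = 1.0, t[NFEAT] = { -0.10, -0.05, -0.15, -0.08, -0.12 };
    double p0 = 0.4, p[NFEAT] = {  0.03,  0.01,  0.04,  0.02,  0.02 };
    double p_budget = 0.5; /* assumed power budget */

    int best = -1; double best_t = 1e30;
    for (int s = 0; s < (1 << NFEAT); s++) {      /* all 2^N subsets */
        double T = t0, P = p0;
        for (int i = 0; i < NFEAT; i++)
            if (s & (1 << i)) { T += t[i]; P += p[i]; }
        if (P < p_budget && T < best_t) { best_t = T; best = s; }
    }
    if (best < 0) { printf("no feasible set\n"); return 0; }
    printf("best predicted time %.2f with features:", best_t);
    for (int i = 0; i < NFEAT; i++)
        if (best & (1 << i)) printf(" %s", name[i]);
    printf("\n");
    return 0;
}
```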
25 S/W Phases & Their H/W Detection
Different S/W phases can benefit from different H/W features
Challenges:
- How do known S/W phases correspond to H/W-detectable phases?
- What H/W metric can be used to detect a phase change? (must be lightweight; see the sketch below)
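A hedged sketch of one lightweight detection scheme: sample a single hardware metric once per fixed cycle window (the next two slides use 100K- and 10M-cycle windows) and flag a phase change when the sample deviates from a running average by more than a threshold. The smoothing factor and threshold are assumptions, not the authors' method:

```c
#include <stdbool.h>
#include <math.h>

typedef struct {
    double avg;       /* running average of the metric */
    double threshold; /* relative deviation that signals a new phase */
    bool   primed;
} phase_detector_t;

/* Feed one per-window sample; returns true on a detected phase change. */
bool phase_sample(phase_detector_t *d, double metric)
{
    if (!d->primed) { d->avg = metric; d->primed = true; return false; }
    bool changed = fabs(metric - d->avg) > d->threshold * d->avg;
    /* exponential smoothing keeps the detector cheap and online */
    d->avg = 0.75 * d->avg + 0.25 * metric;
    return changed;
}
```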
26 NAS MG: LSQ with a 10M-cycle window
27 NAS MG: LSQ with a 100K-cycle window
28 MG: Min Power, T Constraint
Per-phase optimal configuration table; columns: Phase, Time, Frequency (MHz), L3 size (MB), Page policy (MO = open, MC = closed), LP, MP, LMP, T, P. Each MG phase (Restriction, Interp, Remainder) is assigned its own frequency, L3 size, page policy, and prefetcher/LMP selection; the numeric entries are not recoverable from the transcription.
29 All vs. Adaptive (Using LSQ)
(chart compares three configurations:)
- Min power, T constraint
- Min time, P constraint
- All features on
30 PxP Results: MPPs + MPI Codes
Utilizing load imbalance in tree-structured parallel sparse computations for energy savings
Apps run for days/weeks, so even a small percentage of the ideal load per processor amounts to hours/days of slack
Mahmut Kandemir, F. Li, G. Chen
31 Tree-Based Parallel Sparse Computation
- Tree node = dense/sparse data-parallel operations
- Tree structure dictates data dependencies
  - A node depends only on the subtree rooted at that node
  - Computation in disjoint subtrees can proceed independently
- Imbalance (despite the best data mapping) can be 10% of the ideal load/processor
- Exploit task parallelism at lower levels and data parallelism at higher levels
- Represents Barnes-Hut and FMM N-body tree codes, sparse solvers, ...
32 Example
(figure: a task tree with per-node weights (computation/communication) and participating-processor sets, mapped onto processors P0-P6, with the critical path marked)
- Routing requirements cause conflicts
- Integrated link/CPU voltage scaling converts imbalance to energy savings without performance penalties (recursive scheme, multiple passes)
- Network topology constrains link scaling (a slack-conversion sketch follows)
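A hedged sketch of the basic slack-to-frequency step behind such a scheme: a processor (or link) whose work finishes before the critical path can run proportionally slower and still finish on time. The talk's actual algorithm is recursive over the task tree with multiple passes; this shows only the per-processor conversion, with assumed frequencies and times:

```c
#include <stdio.h>

/* Scale frequency so busy time stretches to fill the critical path,
 * clamped to an assumed minimum supported frequency. */
double scaled_freq(double f_max, double f_min,
                   double t_busy, double t_critical)
{
    double f = f_max * (t_busy / t_critical);
    return (f < f_min) ? f_min : f;
}

int main(void)
{
    double t_crit = 100.0;                     /* critical-path time (s) */
    double busy[4] = { 100.0, 90.0, 70.0, 55.0 };
    for (int p = 0; p < 4; p++) {
        /* e.g., 70 s of work in a 100 s window -> run at 0.7 * f_max */
        double f = scaled_freq(1.0e9, 3.0e8, busy[p], t_crit);
        printf("P%d: %.0f MHz\n", p, f / 1e6);
    }
    return 0;
}
```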
33 Energy Consumption
Average savings: CPU-VS (27%), LINK-VS (23%), CPU-LINK-VS (40%)
34 Other Results
Non-uniform cache architectures (NUCA) and CMPs
- NUCA configurations for scientific computing
- Utilizing a network-on-chip (NoC) with NUCA
- Sayaka Akioka (in progress)
Modeling network PxP
- TorusSim tool by Sarah Conner
- For a single collective communication, link shutdown is possible for 55%-97% of the time
- No performance penalty, plus energy savings
35 Summary
Substantial single-processor PxP improvements
- For kernels, codes, and full applications
- Time: 30%-50% faster
- Power/energy: 50%-80% lower
- Further savings from LSQ-based H/W adaptivity
Multiprocessor (MPP) PxP scaling trends from CPU-link scaling are promising
- Near-ideal conversion of slack to savings
- Link shutdown possible 60%-97% per collective communication