Keys to node-level performance analysis and threading in HPC applications

1 Keys to node-level performance analysis and threading in HPC applications
Thomas GUILLET (Intel; Exascale Computing Research)
IFERC seminar, 18 March 2015

2 Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED AS IS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #

3 Application performance: a multiscale problem
Scales: Microarch, Core, Socket, Node, Cluster
  Multicore: vector ISA, cores, cache hierarchies
  Manycore: new vector ISAs, MPI+OMP?, memory/core?
Optimization space is getting larger
Goal of this presentation:
  Provide keys to application performance and threading analysis
  Based on characterization & projection experience with full applications

4 Node-level performance
Choice of algorithm or scheme -> Source code implementation -> Binary code -> Actual execution
  Programmer: Data access patterns
  Compiler: Vectorization, Code generation
  Architecture: Cache behavior, Execution pathologies
  Memory bandwidth/data reuse optimizations
  Vectorization/code quality optimizations
2 main performance factors (at first order):
  Memory (DRAM) bandwidth demand
  Computation: Flops (but also non-flop instructions sometimes), use of execution units
Key questions:
  What are the requirements of my algorithm, in terms of compute vs. memory transfers?
  What performance can I expect?
  Where am I with respect to ideal performance?
  How can I get closer to ideal?

5 Flops, bytes & arithmetic intensity
Arithmetic intensity = Flop/byte: a measure of compute vs. ideal data transfer balance for a particular kernel
DAXPY (Triad)
    do i=1,n
      y(i) = y(i) + a*x(i)
    end do
  Read x: 8N bytes
  Read y: 8N bytes
  Compute y: 2N Flops
  Write y: 8N bytes
  Flop/byte = 2/24 ≈ 0.083
3D Stencil (Gauss-Seidel)
    do k=1,n
      do j=1,n
        do i=1,n
          x(i,j,k) = ONE_SIXTH * ( &
            x(i+1,j,k) + x(i-1,j,k) + &
            x(i,j+1,k) + x(i,j-1,k) + &
            x(i,j,k+1) + x(i,j,k-1))
        end do
      end do
    end do
  Read x: 8N^3 bytes
  Compute update: 6N^3 Flops
  Write new x: 8N^3 bytes
  Flop/byte = 6/16 = 0.375
Source code level analysis:
  Count floating point operations
  Count bytes (arrays) read & written, assume perfect reuse (infinite cache): ideal case
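The source-level counting on this slide can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function name `arithmetic_intensity` and the constant `BYTES_PER_DOUBLE` are assumptions introduced here, and the counts are the ideal ones from the slide (each array read or written once, 8-byte reals).

```python
BYTES_PER_DOUBLE = 8  # assumption: double-precision arrays, as implied by "8N bytes"


def arithmetic_intensity(flops_per_point, bytes_per_point):
    """Ideal flop/byte ratio, assuming perfect reuse (infinite cache)."""
    return flops_per_point / bytes_per_point


# DAXPY: read x, read y, write y; one multiply and one add per element.
daxpy_ai = arithmetic_intensity(2, 3 * BYTES_PER_DOUBLE)

# 3D stencil: read x, write new x; five adds and one multiply per cell.
stencil_ai = arithmetic_intensity(6, 2 * BYTES_PER_DOUBLE)

print(f"DAXPY   flop/byte = {daxpy_ai:.3f}")
print(f"Stencil flop/byte = {stencil_ai:.3f}")
```

Both values are well below 1 flop/byte, which is why kernels of this kind tend to be limited by memory bandwidth rather than by the execution units.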

6 Compute vs. bandwidth analysis
References: Quantitative System Performance, D. Lazowska, J. Zahorjan, G. Graham, K. Sevcik; Williams et al.
[Figure: roofline plot. x axis: log Flop/byte = arithmetic intensity; y axis: log GFLOP/s = performance. Labels: Compute bound; Ideal execution (Theoretical Flop/byte); Actual execution (Actual Flop/byte); Vectorization, Code generation; Data reuse, Cache optims]
Actual vs. ideal execution:
  Efficiency (% peak) depends on microarch.
  Finite cache size will reduce flop/byte
Measuring data for actual execution:
  GFlop/s derived from code performance: GFlop/s = Gcells/s × Flops/cell
  DRAM bandwidth (GB/s), measured with:
    Intel VTune Amplifier XE
    Open source tools, e.g.
    Requires root access or special kernel module
  Flop/byte = (GFlop/s) / (GB/s)
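A minimal sketch of the roofline bound and of the two measured quantities named on this slide follows. All machine numbers (`PEAK_GFLOPS`, `PEAK_GBS`) and measurements (2 Gcells/s, 48 GB/s) are hypothetical values chosen to make the arithmetic visible, not data from the presentation; the function names are likewise assumptions.

```python
def roofline_gflops(flop_per_byte, peak_gflops, peak_gbs):
    """Attainable GFlop/s: the lower of the compute roof and the bandwidth roof."""
    return min(peak_gflops, flop_per_byte * peak_gbs)


def measured_point(gcells_per_s, flops_per_cell, dram_gbs):
    """Return (GFlop/s, actual flop/byte) from code throughput and DRAM traffic."""
    gflops = gcells_per_s * flops_per_cell
    return gflops, gflops / dram_gbs


# Hypothetical node: peak compute and peak DRAM bandwidth (illustrative only).
PEAK_GFLOPS = 300.0
PEAK_GBS = 80.0

# Ideal bound for the stencil, using its theoretical flop/byte of 0.375.
ideal_bound = roofline_gflops(0.375, PEAK_GFLOPS, PEAK_GBS)

# Hypothetical measurement: 2 Gcells/s at 6 flops/cell with 48 GB/s of DRAM traffic.
gflops, actual_ai = measured_point(gcells_per_s=2.0, flops_per_cell=6, dram_gbs=48.0)
actual_bound = roofline_gflops(actual_ai, PEAK_GFLOPS, PEAK_GBS)

print(f"ideal bound  : {ideal_bound:.1f} GFlop/s at {0.375:.3f} flop/byte")
print(f"measured     : {gflops:.1f} GFlop/s at {actual_ai:.3f} flop/byte")
print(f"actual bound : {actual_bound:.1f} GFlop/s")
```

In this made-up case the measured flop/byte is lower than the theoretical one, which is the effect of finite cache size noted on the slide: extra DRAM traffic moves the point to the left and lowers the attainable bound.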

7 Illustration: GYSELA kernels on Xeon
2 sockets, Xeon E (Sandy Bridge, 2.6 GHz)
This kernel is BW bound when vectorized, but compute bound when not vectorized!

8 Illustration: GYSELA kernels on Xeon Phi
Xeon Phi 7120 (16 GB GDDR, 61 cores, 1.2 GHz)
Efficiency drops for complex loop bodies
Smaller caches incur more memory traffic

9 Node-level characterization: Wrap Up
Simple compute vs. bandwidth characterization («roofline»)
  Helps determine max performance expectations
  Helps identify optimization directions
Can be complemented by quick analysis tricks
  Measure time on 1 full node (available bandwidth = $BW_1$), and write: $T_{1\,\mathrm{full}} = T_{\mathrm{compute}} + T_{\mathrm{bw}}$
  Measure time on 2 half-filled nodes (available bandwidth = $BW_2 > BW_1$), and write: $T_{2\,\mathrm{half}} = T_{\mathrm{compute}} + T_{\mathrm{bw}} \, (BW_1 / BW_2)$
  Solve for $T_{\mathrm{compute}}$ and $T_{\mathrm{bw}}$ to estimate the «memory-boundedness» of the app on this architecture
  Also useful for quick projections across similar architectures
General trends on Xeon Phi
  Smaller caches incur more memory demand
  In-order core, complex vector ISA: compiler and code generation matter
So far, we assumed good parallelism (no threading or MPI issues)
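The two-measurement trick on this slide is a pair of linear equations in two unknowns. Subtracting the second from the first gives T_bw = (T_1full - T_2half) / (1 - BW_1/BW_2), and T_compute follows from the first equation. The sketch below solves it and reuses the result for a projection; the function names and the timings are hypothetical, and the model assumes compute time and bandwidth time simply add, as written on the slide.

```python
def split_compute_and_bandwidth(t1_full, t2_half, bw1, bw2):
    """Solve T1 = Tc + Tbw and T2 = Tc + Tbw * (bw1 / bw2) for (Tc, Tbw)."""
    ratio = bw1 / bw2
    if not 0.0 < ratio < 1.0:
        raise ValueError("expected bw2 > bw1 > 0")
    t_bw = (t1_full - t2_half) / (1.0 - ratio)
    t_compute = t1_full - t_bw
    return t_compute, t_bw


def project_time(t_compute, t_bw, bw_measured, bw_target):
    """Project run time to a similar architecture with a different bandwidth."""
    return t_compute + t_bw * (bw_measured / bw_target)


# Hypothetical timings: 100 s on one full node, 70 s on two half-filled nodes,
# where each rank sees twice the bandwidth in the second run.
t_compute, t_bw = split_compute_and_bandwidth(100.0, 70.0, bw1=1.0, bw2=2.0)
memory_boundedness = t_bw / (t_compute + t_bw)

print(f"T_compute = {t_compute:.1f} s, T_bw = {t_bw:.1f} s")
print(f"memory-boundedness = {memory_boundedness:.0%}")
print(f"projected time at 1.5x bandwidth = {project_time(t_compute, t_bw, 1.0, 1.5):.1f} s")
```

Only the bandwidth ratio matters, so relative units are enough. The estimate is first-order: it ignores overlap between computation and memory transfers.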

10 Shared memory: To thread or not to thread?
Why is threading interesting in applications?
  Allows «larger» MPI ranks (for domain decomp.) for the same problem
    May improve surface/volume ratio
    Amortizes memory footprint of MPI runtime
  Allows dynamic load balancing for imbalanced applications
What could possibly go wrong?
  Amdahl's law strikes back
    On computation: getting good coverage is hard
    On communications
  MPI+X is not intrinsically «better» than MPI
    4x1 vs. 1x4

11 Illustration: CFD application
Configurations with {#ranks} x {#threads} = 24 cores
[Figure: temporal loop wtime [s], footprint/core [MB] and app instructions/core vs. OMP threads/rank; series: Measured [s], Amdahl projection]
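For readers unfamiliar with the "Amdahl projection" series, the sketch below applies the standard Amdahl formula to the thread count of one rank. It is not necessarily how the projection in the figure was computed; the function name, the 90% threaded fraction and the 100 s baseline are assumptions made up for illustration.

```python
def amdahl_time(t_one_thread, threaded_fraction, n_threads):
    """Projected wall time when only `threaded_fraction` of the work is threaded."""
    serial_part = (1.0 - threaded_fraction) * t_one_thread
    threaded_part = threaded_fraction * t_one_thread / n_threads
    return serial_part + threaded_part


# Hypothetical rank: 100 s of computation with one thread, 90% of it inside
# OpenMP parallel regions (the "coverage").
T_ONE_THREAD = 100.0
COVERAGE = 0.9

projection = {n: amdahl_time(T_ONE_THREAD, COVERAGE, n) for n in (1, 2, 4, 6, 12)}

for n, t in projection.items():
    print(f"{n:2d} threads: {t:6.1f} s (speedup {T_ONE_THREAD / t:.2f}x)")
```

With 90% coverage the speedup on 12 threads stays below 6x, which shows why the non-threaded fraction dominates at high thread counts.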

12 Illustration: CFD application
Configurations with {#ranks} x {#threads} = 24 cores
[Figure: CFD app example, wall time [s] on master thread, split into OMP, Serial and MPI, for Ranks x Threads = x1, 12x2, 6x4, 4x6, 2x12]
  OMP: wtime spent inside OpenMP parallel regions
  Serial: non-threaded computation wtime («Amdahl's law on threads»)
  MPI: wtime spent in MPI library grows with # threads

13 Can threading help with imbalance?
[synthetic data for illustration]
Imbalance time = max - mean
[Figure: two panels of per-core time vs. core id]
  Small-scale 50% imbalance: shared mem dynamic load balancing may be effective against imbalance
  Large-scale 50% imbalance: shared mem dynamic load balancing ineffective alone against imbalance
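The contrast between the two panels can be reproduced with synthetic numbers. The sketch below uses the slide's definition (imbalance time = max - mean) and an idealized model of shared-memory dynamic load balancing in which the threads of one rank share their work perfectly, so each core ends up with its rank's mean time. The per-core times, the 8-core layout and the function names are all assumptions for illustration.

```python
def imbalance_time(times):
    """Imbalance time as defined on the slide: max minus mean."""
    return max(times) - sum(times) / len(times)


def balance_within_ranks(core_times, threads_per_rank):
    """Idealized dynamic balancing: every core of a rank gets the rank's mean time."""
    balanced = []
    for start in range(0, len(core_times), threads_per_rank):
        group = core_times[start:start + threads_per_rank]
        balanced.extend([sum(group) / len(group)] * len(group))
    return balanced


# Synthetic per-core times on 8 cores, both with max = 1.5 x mean (50% imbalance).
small_scale = [0.5, 1.5, 0.5, 1.5, 0.5, 1.5, 0.5, 1.5]  # varies between neighbours
large_scale = [0.5, 0.5, 0.5, 0.5, 1.5, 1.5, 1.5, 1.5]  # varies between halves

for name, times in (("small-scale", small_scale), ("large-scale", large_scale)):
    before = imbalance_time(times)
    after = imbalance_time(balance_within_ranks(times, threads_per_rank=4))
    print(f"{name}: imbalance {before:.2f} -> {after:.2f} with 2 ranks x 4 threads")
```

When the imbalance varies on a scale smaller than one rank, sharing work among that rank's threads removes it. When whole ranks are slow or fast, balancing inside each rank changes nothing and the imbalance has to be addressed between ranks.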

14 Threading and imbalance: Highly imbalanced adaptive mesh refinement code
[Figure: wall time [s] on master thread, rank, split into OMP, Serial and MPI, for Ranks x Threads = x1, 12x2, 8x3, 6x4, 4x6, 2x12]
  OMP computation scales less than ideally
  Threading helps reduce extreme MPI imbalance
  But Amdahl's law still overtakes at high thread counts

15 OpenMP: things to watch for in apps
Code coverage (a.k.a. Amdahl's law)
  Extensive coverage is critical for scalability
  Can be very tedious/impossible to achieve for flat-profile applications
  Coarse threading (rather than loop-level) helps, but reimplementing MPI doesn't
Granularity
  Important metric = average wall time of OpenMP regions
  Compare to OpenMP barrier/sync time
Both points grow in importance on Xeon Phi
  Lots of threads: coverage grows in importance
  Limited memory/core: short loops
VTune profiling can help diagnose both issues
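The granularity check on this slide amounts to comparing two durations. As a rough sketch, assuming each OpenMP region pays one barrier/sync cost on top of its useful work, the fraction of region time lost to synchronization can be estimated as below. The function name and all timings are hypothetical; real values would come from a profiler.

```python
def sync_overhead_fraction(avg_region_time_s, barrier_time_s):
    """Share of a region's wall time spent in barrier/sync, one barrier per region."""
    return barrier_time_s / (avg_region_time_s + barrier_time_s)


# Hypothetical barrier cost of 10 microseconds, compared against a coarse
# region (1 ms of work) and a short loop (20 microseconds of work).
BARRIER_S = 10e-6
coarse = sync_overhead_fraction(1e-3, BARRIER_S)
fine = sync_overhead_fraction(20e-6, BARRIER_S)

print(f"coarse region: {coarse:.1%} of time in sync")
print(f"short loop   : {fine:.1%} of time in sync")
```

Short loops are where the sync cost starts to matter, which is consistent with the slide's remark that limited memory per core leads to short loops on Xeon Phi.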

16 Wrap-up
Careful performance analysis is essential to guide code optimizations
  Set pragmatic performance targets
  Collect data on application behavior
Simple compute vs. bandwidth model can provide:
  Robust first-order characterization
  Insights into specific or second-order effects
Threading can help address some strong-scaling issues
  Amortize halo overheads, level out imbalance
  No magic: obtaining good coverage is hard work
Threading: an important adjustment variable for
  Heterogeneous computing resources (e.g. symmetric mode)
  Available memory/core


Performance Metrics and Scalability Analysis. Performance Metrics and Scalability Analysis Performance Metrics and Scalability Analysis 1 Performance Metrics and Scalability Analysis Lecture Outline Following Topics will be discussed Requirements in performance and cost Performance metrics Work

More information

IT@Intel. Comparing Multi-Core Processors for Server Virtualization

IT@Intel. Comparing Multi-Core Processors for Server Virtualization White Paper Intel Information Technology Computer Manufacturing Server Virtualization Comparing Multi-Core Processors for Server Virtualization Intel IT tested servers based on select Intel multi-core

More information

SAP * Mobile Platform 3.0 Scaling on Intel Xeon Processor E5 v2 Family

SAP * Mobile Platform 3.0 Scaling on Intel Xeon Processor E5 v2 Family White Paper SAP* Mobile Platform 3.0 E5 Family Enterprise-class Security SAP * Mobile Platform 3.0 Scaling on Intel Xeon Processor E5 v2 Family Delivering Incredible Experiences to Mobile Users Executive

More information

Intel 965 Express Chipset Family Memory Technology and Configuration Guide

Intel 965 Express Chipset Family Memory Technology and Configuration Guide Intel 965 Express Chipset Family Memory Technology and Configuration Guide White Paper - For the Intel 82Q965, 82Q963, 82G965 Graphics and Memory Controller Hub (GMCH) and Intel 82P965 Memory Controller

More information

Measuring Processor Power

Measuring Processor Power White Paper Intel Xeon Processor Processor Architecture Analysis Measuring Processor Power TDP vs. ACP Specifications for the power a microprocessor can consume and dissipate can be complicated and may

More information

LS-DYNA Scalability on Cray Supercomputers. Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp.

LS-DYNA Scalability on Cray Supercomputers. Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp. LS-DYNA Scalability on Cray Supercomputers Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp. WP-LS-DYNA-12213 www.cray.com Table of Contents Abstract... 3 Introduction... 3 Scalability

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis

More information

INTEL PARALLEL STUDIO EVALUATION GUIDE. Intel Cilk Plus: A Simple Path to Parallelism

INTEL PARALLEL STUDIO EVALUATION GUIDE. Intel Cilk Plus: A Simple Path to Parallelism Intel Cilk Plus: A Simple Path to Parallelism Compiler extensions to simplify task and data parallelism Intel Cilk Plus adds simple language extensions to express data and task parallelism to the C and

More information

Intel Cloud Builder Guide to Cloud Design and Deployment on Intel Xeon Processor-based Platforms

Intel Cloud Builder Guide to Cloud Design and Deployment on Intel Xeon Processor-based Platforms Intel Cloud Builder Guide to Cloud Design and Deployment on Intel Xeon Processor-based Platforms Enomaly Elastic Computing Platform, * Service Provider Edition Executive Summary Intel Cloud Builder Guide

More information

MPI Application Tune-Up Four Steps to Performance

MPI Application Tune-Up Four Steps to Performance MPI Application Tune-Up Four Steps to Performance Abstract Cluster systems continue to grow in complexity and capability. Getting optimal performance can be challenging. Making sense of the MPI communications,

More information

HPC & Big Data THE TIME HAS COME FOR A SCALABLE FRAMEWORK

HPC & Big Data THE TIME HAS COME FOR A SCALABLE FRAMEWORK HPC & Big Data THE TIME HAS COME FOR A SCALABLE FRAMEWORK Barry Davis, General Manager, High Performance Fabrics Operation Data Center Group, Intel Corporation Legal Disclaimer Today s presentations contain

More information

OpenFOAM: Computational Fluid Dynamics. Gauss Siedel iteration : (L + D) * x new = b - U * x old

OpenFOAM: Computational Fluid Dynamics. Gauss Siedel iteration : (L + D) * x new = b - U * x old OpenFOAM: Computational Fluid Dynamics Gauss Siedel iteration : (L + D) * x new = b - U * x old What s unique about my tuning work The OpenFOAM (Open Field Operation and Manipulation) CFD Toolbox is a

More information

21152 PCI-to-PCI Bridge

21152 PCI-to-PCI Bridge Product Features Brief Datasheet Intel s second-generation 21152 PCI-to-PCI Bridge is fully compliant with PCI Local Bus Specification, Revision 2.1. The 21152 is pin-to-pin compatible with Intel s 21052,

More information

Bandwidth Calculations for SA-1100 Processor LCD Displays

Bandwidth Calculations for SA-1100 Processor LCD Displays Bandwidth Calculations for SA-1100 Processor LCD Displays Application Note February 1999 Order Number: 278270-001 Information in this document is provided in connection with Intel products. No license,

More information

CLOUD SECURITY: Secure Your Infrastructure

CLOUD SECURITY: Secure Your Infrastructure CLOUD SECURITY: Secure Your Infrastructure 1 Challenges to security Security challenges are growing more complex. ATTACKERS HAVE EVOLVED TECHNOLOGY ARCHITECTURE HAS CHANGED NIST, HIPAA, PCI-DSS, SOX INCREASED

More information

Introducing the First Datacenter Atom SOC

Introducing the First Datacenter Atom SOC Introducing the First Datacenter Atom SOC Diane Bryant Intel Vice President, General Manager, Datacenter & Connected Systems Group, Intel Jason Waxman General Manager, Cloud Platform Group, Intel December

More information

CPU Session 1. Praktikum Parallele Rechnerarchtitekturen. Praktikum Parallele Rechnerarchitekturen / Johannes Hofmann April 14, 2015 1

CPU Session 1. Praktikum Parallele Rechnerarchtitekturen. Praktikum Parallele Rechnerarchitekturen / Johannes Hofmann April 14, 2015 1 CPU Session 1 Praktikum Parallele Rechnerarchtitekturen Praktikum Parallele Rechnerarchitekturen / Johannes Hofmann April 14, 2015 1 Overview Types of Parallelism in Modern Multi-Core CPUs o Multicore

More information

Create Natural User Interfaces with the Next-Generation Intel Perceptual Computing SDK

Create Natural User Interfaces with the Next-Generation Intel Perceptual Computing SDK Create Natural User Interfaces with the Next-Generation Intel Perceptual Computing SDK Ryan Tabrah, Group Manager, UX Developer Products @PerceptualSDK Intel Innovation: Transforming the Game Intel's Vision

More information

Intel True Scale Fabric Architecture. Enhanced HPC Architecture and Performance

Intel True Scale Fabric Architecture. Enhanced HPC Architecture and Performance Intel True Scale Fabric Architecture Enhanced HPC Architecture and Performance 1. Revision: Version 1 Date: November 2012 Table of Contents Introduction... 3 Key Findings... 3 Intel True Scale Fabric Infiniband

More information

Best Practices for Increasing Ceph Performance with SSD

Best Practices for Increasing Ceph Performance with SSD Best Practices for Increasing Ceph Performance with SSD Jian Zhang Jian.zhang@intel.com Jiangang Duan Jiangang.duan@intel.com Agenda Introduction Filestore performance on All Flash Array KeyValueStore

More information

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it Informa(on & Communica(on Technology Sec(on (ICTS) Interna(onal Centre for Theore(cal Physics (ICTP) Mul(ple Socket

More information

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/2015. 2015 CAE Associates

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/2015. 2015 CAE Associates High Performance Computing (HPC) CAEA elearning Series Jonathan G. Dudley, Ph.D. 06/09/2015 2015 CAE Associates Agenda Introduction HPC Background Why HPC SMP vs. DMP Licensing HPC Terminology Types of

More information

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK Steve Oberlin CTO, Accelerated Computing US to Build Two Flagship Supercomputers SUMMIT SIERRA Partnership for Science 100-300 PFLOPS Peak Performance

More information

Intel Cloud Builders Guide to Cloud Design and Deployment on Intel Platforms

Intel Cloud Builders Guide to Cloud Design and Deployment on Intel Platforms Intel Cloud Builders Guide Intel Xeon Processor-based Servers RES Virtual Desktop Extender Intel Cloud Builders Guide to Cloud Design and Deployment on Intel Platforms Client Aware Cloud with RES Virtual

More information

Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp

Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp Welcome! Who am I? William (Bill) Gropp Professor of Computer Science One of the Creators of

More information

What is in Your Workstation?

What is in Your Workstation? Product Brief E3-1200 Family What is in Your Workstation? Why choose E3-based workstations versus i3, i5 and i7 -based desktops -based workstations represent the premier platform used by industry innovators

More information

<Insert Picture Here> An Experimental Model to Analyze OpenMP Applications for System Utilization

<Insert Picture Here> An Experimental Model to Analyze OpenMP Applications for System Utilization An Experimental Model to Analyze OpenMP Applications for System Utilization Mark Woodyard Principal Software Engineer 1 The following is an overview of a research project. It is intended

More information

CUDA programming on NVIDIA GPUs

CUDA programming on NVIDIA GPUs p. 1/21 on NVIDIA GPUs Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view

More information

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching

More information