# Keys to node-level performance analysis and threading in HPC applications

## Transcription

Slide 1: Keys to node-level performance analysis and threading in HPC applications. Thomas GUILLET (Intel; Exascale Computing Research). IFERC seminar, 18 March 2015.

Slide 3: Application performance: a multiscale problem

- Scales: microarchitecture, core, socket, node, cluster.
- Multicore: vector ISA, cores, cache hierarchies.
- Manycore: new vector ISAs, MPI+OMP?, memory/core?
- The optimization space is getting larger.
- Goal of this presentation: provide keys to application performance and threading analysis, based on characterization and projection experience with full applications.

Slide 4: Node-level performance

- From algorithm to execution: choice of algorithm or scheme, source code implementation, binary code, actual execution.
- Programmer: data access patterns.
- Compiler: vectorization, code generation.
- Architecture: cache behavior, execution pathologies.
- Optimization directions: memory bandwidth/data reuse optimizations; vectorization/code quality optimizations.

Two main performance factors (at first order):

- Memory (DRAM) bandwidth demand.
- Computation: flops (but sometimes also non-flop instructions), use of execution units.

Key questions:

- What are the requirements of my algorithm, in terms of compute vs. memory transfers?
- What performance can I expect?
- Where am I with respect to ideal performance?
- How can I get closer to ideal?

Slide 5: Flops, bytes & arithmetic intensity

Arithmetic intensity = flop/byte: a measure of the balance between compute and ideal data transfer for a particular kernel.

DAXPY (Triad):

    do i=1,n
      y(i) = y(i) + a*x(i)
    end do

- Read x: 8N bytes
- Read y: 8N bytes
- Compute y: 2N flops
- Write y: 8N bytes
- Flop/byte = 2/24 ≈ 0.083

3D stencil (Gauss-Seidel):

    do k=1,n
      do j=1,n
        do i=1,n
          x(i,j,k) = ONE_SIXTH * ( &
            x(i+1,j,k) + x(i-1,j,k) + &
            x(i,j+1,k) + x(i,j-1,k) + &
            x(i,j,k+1) + x(i,j,k-1))
        end do
      end do
    end do

- Read x: 8N^3 bytes
- Compute update: 6N^3 flops
- Write new x: 8N^3 bytes
- Flop/byte = 6/16 = 0.375

Source code level analysis:

- Count floating point operations.
- Count bytes (arrays) read and written, assuming perfect reuse (infinite cache): this is the ideal case.
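The source-level counting described on this slide can be reproduced in a few lines. The sketch below is only an illustration in Python (the function name `arithmetic_intensity` and its parameters are made up for this example); it assumes double precision (8 bytes per value) and perfect reuse, exactly as in the ideal case above.

```python
def arithmetic_intensity(flops_per_point, doubles_moved_per_point, bytes_per_double=8):
    """Ideal flop/byte ratio of a kernel, assuming perfect reuse (infinite cache)."""
    return flops_per_point / (doubles_moved_per_point * bytes_per_double)

# DAXPY: read x, read y, write y -> 3 doubles moved and 2 flops per element.
daxpy_ai = arithmetic_intensity(flops_per_point=2, doubles_moved_per_point=3)

# 3D stencil: read x, write new x -> 2 doubles moved and 6 flops per cell
# (5 additions and 1 multiplication).
stencil_ai = arithmetic_intensity(flops_per_point=6, doubles_moved_per_point=2)

print(f"DAXPY:      {daxpy_ai:.3f} flop/byte")
print(f"3D stencil: {stencil_ai:.3f} flop/byte")
```

On a real machine the finite cache moves more bytes than this count, so the measured ratio is typically lower than these ideal values.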

Slide 6: Compute vs. bandwidth analysis

References cited on the slide: Quantitative System Performance, D. Lazowska, J. Zahorjan, G. Graham, K. Sevcik; Williams et al.

[Figure: roofline plot. x-axis: log flop/byte (arithmetic intensity); y-axis: log GFLOP/s (performance). Labels: compute bound; ideal execution at the theoretical flop/byte; actual execution at the actual flop/byte; "Vectorization, code generation"; "Data reuse, cache optimizations".]

Actual vs. ideal execution:

- Efficiency (% of peak) depends on the microarchitecture.
- Finite cache size will reduce flop/byte.

Measuring data for actual execution:

- GFlop/s derived from code performance: GFlop/s = Gcells/s × Flops/cell
- DRAM bandwidth: Flop/byte = (GFlop/s) / (GB/s)
  - Intel VTune Amplifier XE
  - Open source tools
  - Requires root access or a special kernel module
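As a rough sketch of how the compute vs. bandwidth bound and the measured quantities fit together, the Python snippet below applies the standard roofline formula, min(peak, intensity × bandwidth), and derives the actual flop/byte from a measured cell rate and DRAM bandwidth. The function names and all numeric values (300 GFLOP/s peak, 80 GB/s bandwidth, 1 Gcell/s, 24 GB/s) are hypothetical and are not taken from the presentation.

```python
def roofline_gflops(flop_per_byte, peak_gflops, bandwidth_gbs):
    """Upper bound on performance: compute roof or bandwidth roof, whichever is lower."""
    return min(peak_gflops, flop_per_byte * bandwidth_gbs)

def measured_point(gcells_per_s, flops_per_cell, dram_gbs):
    """Return (GFlop/s, actual flop/byte) from code performance and measured DRAM traffic."""
    gflops = gcells_per_s * flops_per_cell
    return gflops, gflops / dram_gbs

# Hypothetical node: 300 GFLOP/s peak, 80 GB/s DRAM bandwidth (assumed values).
PEAK_GFLOPS = 300.0
BANDWIDTH_GBS = 80.0

# Ideal 3D stencil intensity (6 flops per 16 bytes).
ideal_bound = roofline_gflops(0.375, PEAK_GFLOPS, BANDWIDTH_GBS)

# Hypothetical measurement: 1 Gcell/s at 6 flops/cell while drawing 24 GB/s from DRAM.
gflops, actual_ai = measured_point(1.0, 6, 24.0)

print(f"ideal bound: {ideal_bound:.1f} GFLOP/s")
print(f"measured: {gflops:.1f} GFLOP/s at {actual_ai:.3f} flop/byte")
```

With these made-up numbers the kernel sits on the bandwidth roof, and the measured intensity (0.25) is below the ideal one (0.375), which is the kind of gap the slide attributes to finite cache size.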

Slide 7: Illustration: GYSELA kernels on Xeon

- System: 2 sockets, Xeon E (Sandy Bridge, 2.6 GHz).
- This kernel is bandwidth (BW) bound when vectorized, but compute bound when not vectorized!

Slide 8: Illustration: GYSELA kernels on Xeon Phi

- System: Xeon Phi 7120 (16 GB GDDR, 61 cores, 1.2 GHz).
- Efficiency drops for complex loop bodies.
- Smaller caches incur more memory traffic.

Slide 9: Node-level characterization: Wrap Up

Simple compute vs. bandwidth characterization («roofline»):

- Helps determine maximum performance expectations.
- Helps identify optimization directions.

It can be complemented by quick analysis tricks:

- Measure the time on 1 full node (available bandwidth $BW_1$), and write: $T_{1,\text{full}} = T_\text{compute} + T_\text{bw}$
- Measure the time on 2 half-filled nodes (available bandwidth $BW_2 > BW_1$), and write: $T_{2,\text{half}} = T_\text{compute} + T_\text{bw} \cdot (BW_1 / BW_2)$
- Solve for $T_\text{compute}$ and $T_\text{bw}$ to estimate the «memory-boundedness» of the application on this architecture.
- Also useful for quick projections across similar architectures.

General trends on Xeon Phi:

- Smaller caches incur more memory demand.
- In-order core, complex vector ISA → compiler and code generation matter.

So far, we assumed good parallelism (no threading or MPI issues).
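The two-measurement trick is a pair of linear equations in T compute and T bw, so it can be solved directly. The Python sketch below does this under the slide's additive model (no overlap between compute and memory time); the function name and the sample timings are invented for illustration.

```python
def split_compute_bandwidth(t_full, t_half, bw_full, bw_half):
    """Solve  t_full = Tc + Tb  and  t_half = Tc + Tb * (bw_full / bw_half)  for (Tc, Tb)."""
    ratio = bw_full / bw_half
    if not 0 < ratio < 1:
        raise ValueError("the half-filled run must have more available bandwidth")
    t_bw = (t_full - t_half) / (1 - ratio)   # subtract the two equations
    t_compute = t_full - t_bw
    return t_compute, t_bw

# Hypothetical timings: 100 s on one full node, 70 s on two half-filled nodes,
# assuming each process sees twice the bandwidth in the half-filled case.
t_compute, t_bw = split_compute_bandwidth(t_full=100.0, t_half=70.0,
                                          bw_full=1.0, bw_half=2.0)
memory_boundedness = t_bw / (t_compute + t_bw)

print(f"T_compute = {t_compute:.1f} s, T_bw = {t_bw:.1f} s, "
      f"memory-boundedness = {memory_boundedness:.0%}")
```

With these made-up timings the model attributes 60 s to bandwidth and 40 s to compute. Only the ratio of the two bandwidths enters the solution, so relative values are enough.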

Slide 10: Shared memory: To thread or not to thread?

Why is threading interesting in applications?

- Allows «larger» MPI ranks (for domain decomposition) for the same problem.
  - May improve the surface/volume ratio.
  - Amortizes the memory footprint of the MPI runtime.
- Allows dynamic load balancing for imbalanced applications.

What could possibly go wrong?

- Amdahl's law strikes back:
  - On computation: getting good coverage is hard.
  - On communications.
- MPI+X is not intrinsically «better» than MPI (4x1 vs. 1x4).
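To make the Amdahl's law point concrete, here is a small Python sketch of the textbook formula applied to threads within one rank. The 90% threaded coverage is a hypothetical value chosen for illustration, not a figure from the presentation.

```python
def amdahl_speedup(parallel_fraction, threads):
    """Textbook Amdahl speedup when only `parallel_fraction` of the work is threaded."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / threads)

def amdahl_time(t_one_thread, parallel_fraction, threads):
    """Projected wall time: the serial part is unchanged, the threaded part scales ideally."""
    return t_one_thread / amdahl_speedup(parallel_fraction, threads)

# Hypothetical coverage: 90% of the per-rank compute time is inside OpenMP regions.
COVERAGE = 0.9
for threads in (1, 2, 4, 6, 12):
    print(f"{threads:2d} threads: speedup {amdahl_speedup(COVERAGE, threads):.2f}, "
          f"efficiency {amdahl_speedup(COVERAGE, threads) / threads:.0%}")
```

Even with 90% coverage, efficiency falls below 50% at 12 threads in this model, which is why coverage matters more as thread counts grow.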

Slide 11: Illustration: CFD application

Configurations with {#ranks} x {#threads} = 24 cores.

[Figure: temporal loop wtime [s], footprint/core [MB] and app instructions/core as a function of OMP threads/rank; measured time [s] compared with an Amdahl projection.]

Slide 12: Illustration: CFD application

Configurations with {#ranks} x {#threads} = 24 cores.

[Figure: CFD app example, wall time [s] on the master thread, split into OMP, Serial and MPI, for ranks x threads configurations with 1, 2, 4, 6 and 12 threads per rank.]

- OMP: wtime spent inside OpenMP parallel regions.
- MPI: wtime spent in the MPI library grows with the number of threads.
- Serial: non-threaded computation wtime («Amdahl's law on threads»).

Slide 13: Can threading help with imbalance?

[Figure: synthetic data for illustration; two panels, per-core time vs. core id.]

Imbalance time = max - mean.

- Small-scale 50% imbalance: shared-memory dynamic load balancing may be effective against imbalance.
- Large-scale 50% imbalance: shared-memory dynamic load balancing is ineffective alone against imbalance.
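The difference between small-scale and large-scale imbalance can be reproduced with synthetic numbers. In the Python sketch below, the per-core times, the group size of 4 cores per rank and the assumption that threads within a rank share work perfectly are all illustrative choices, not data from the presentation.

```python
def imbalance_time(times):
    """Imbalance time as defined on the slide: max - mean."""
    return max(times) - sum(times) / len(times)

def balance_within_groups(times, group_size):
    """Idealized shared-memory balancing: cores of one rank split their work evenly."""
    balanced = []
    for start in range(0, len(times), group_size):
        group = times[start:start + group_size]
        balanced.extend([sum(group) / len(group)] * len(group))
    return balanced

# 24 cores, mean time 1.0, max 1.5 -> 50% imbalance in both cases (synthetic data).
small_scale = [1.5 if core % 2 == 0 else 0.5 for core in range(24)]  # neighbours differ
large_scale = [1.5] * 12 + [0.5] * 12                                # half the machine is slow

for name, times in (("small-scale", small_scale), ("large-scale", large_scale)):
    after = balance_within_groups(times, group_size=4)
    print(f"{name}: imbalance {imbalance_time(times):.2f} -> {imbalance_time(after):.2f}")
```

When the slow and fast cores alternate, each group of 4 averages out and the imbalance vanishes; when the slow cores are clustered, every group is uniformly slow or fast and balancing inside a rank changes nothing.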

Slide 14: Threading and imbalance: highly imbalanced adaptive mesh refinement code

[Figure: wall time [s] on the master thread, split into OMP, Serial and MPI, for ranks x threads configurations with 1, 2, 3, 4, 6 and 12 threads per rank.]

- OMP computation scales less than ideally.
- Threading helps reduce extreme MPI imbalance.
- But Amdahl's law still overtakes at high thread counts.

Slide 15: OpenMP: things to watch for in apps

Code coverage (a.k.a. Amdahl's law):

- Extensive coverage is critical for scalability.
- It can be very tedious, or even impossible, to achieve for flat-profile applications.
- Coarse threading (as opposed to loop-level) helps, but reimplementing MPI doesn't.

Granularity:

- Important metric = average wall time of OpenMP regions.
- Compare it to the OpenMP barrier/sync time.

Both points grow in importance on Xeon Phi:

- Lots of threads → coverage grows in importance.
- Limited memory/core → short loops.

VTune profiling can help diagnose both issues.
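The granularity check amounts to comparing two durations. As a minimal sketch (the function name and the timings are hypothetical, chosen only to show the effect), the fraction of time lost to synchronization per OpenMP region can be estimated like this in Python:

```python
def sync_overhead_fraction(avg_region_time_s, barrier_time_s):
    """Share of each OpenMP region's wall time spent in barrier/sync rather than work."""
    return barrier_time_s / (avg_region_time_s + barrier_time_s)

# Hypothetical barrier cost of 10 microseconds, compared with two region sizes.
BARRIER_S = 10e-6
coarse = sync_overhead_fraction(avg_region_time_s=1e-3, barrier_time_s=BARRIER_S)
fine = sync_overhead_fraction(avg_region_time_s=20e-6, barrier_time_s=BARRIER_S)

print(f"1 ms regions:  {coarse:.1%} of the time in sync")
print(f"20 us regions: {fine:.1%} of the time in sync")
```

Short loops push the average region time toward the barrier cost, which is the regime the slide warns about on many-thread, limited-memory-per-core hardware.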

Slide 16: Wrap-up

Careful performance analysis is essential to guide code optimizations:

- Set pragmatic performance targets.
- Collect data on application behavior.

A simple compute vs. bandwidth model can provide:

- Robust first-order characterization.
- Insights into specific or second-order effects.

Threading can help address some strong-scaling issues:

- Amortize halo overheads, level out imbalance.
- No magic: obtaining good coverage is hard work.

Threading is an important adjustment variable for:

- Heterogeneous computing resources (e.g. symmetric mode).
- Available memory/core.
