Ct Technology: Steps Toward Starting a Productivity Revolution in Multi-core Computing




Ct Technology: Steps Toward Starting a Productivity Revolution in Multi-core Computing
Anwar Ghuloum, Intel Corporation
http://www.intel.com/go/ct
3/1/2010

Agenda
- Why dynamic compilation is important for multi-core
- What is Ct? (And how it solves this.)

The Product Lifecycle in Throughput (aka Multi-core) Computing
(Diagram of the lifecycle: research on algorithms and next-generation technology (~6-12 months) feeds performance tuning / core technologies, i.e. optimized libraries, frameworks, and algorithms (~6-12 months), which feed app/ISV developer use and programming, with performance tuning concentrated on specific platform(s). Product development takes 12-18 months, ending in product deployment/ship. Productivity languages and libraries sit on one side of the cycle and high-performance languages and libraries on the other, with roughly 6-12 months spent refactoring out low-performance productivity paths.)

The Product Lifecycle in Throughput (aka Multi-core) Computing
(The same lifecycle diagram, overlaid with the question:) What will it take to close this gap without sacrificing performance and scalability?

Language Trends: Do We Really Only Care About C and Fortran?
Languages with some commercial adoption: Java, C#, OCaml, F#, Ruby, Python, Lua, PHP, JavaScript/ECMAScript, ActionScript, OpenCL, Scala, OpenMP, C for CUDA, Cilk, R, D. A new language appears every 12-18 months!
(Chart: cumulative new languages since 1990, plotted from 1985 to 2010. A spike of scripting and early web-development languages in the 1990s, and a possible productive-parallel-computing spike after ca. 2005; parallel computing was mostly off the radar until ca. 2005. Not including web-app frameworks, custom scripting engines in game platforms, etc.)

Domain Specific Languages and Libraries
Domain-specific languages: why accept narrow applicability? Because the tradeoff between performance and productivity can only be relaxed by leveraging domain knowledge. Multi-language development and new-language adoption aren't the barriers we once thought they were.
Blurring the line between languages and libraries: modern language mechanisms allow library development that significantly extends the capabilities of a language, which lowers developer resistance to adoption.
Examples:
- Domain-specific libraries: ITK, CTL, QuantLib (high functionality and modularity, low performance)
- Template metaprogrammed libraries: uBLAS
- Dynamically metaprogrammed APIs: Ct

C++ Developers: Is This an Issue?
It's not just a single kernel: productivity craters when many kernels have to be tuned. Focusing energy on one algorithm makes sense only if it is the dominant algorithm in one place. Widely used libraries often give up performance for well-designed generic interfaces (examples: ITK, QuantLib), which inherently spreads the compute across many (virtual) functions.

Reducing the Impact of Modularity
Providing user programmability at high performance: libraries with highly configurable interfaces often have reduced performance due to the dynamic overhead of late binding and parameter generality. This is relatively more punishing for multi-core!
Example: QuantLib is a financial modeling package designed to allow quantitative analysts to model and then price complex financial instruments. It provides a variety of ways to configure pricing and process models, often with user-provided functions and parameters.
Test case: binomial tree option pricing (see the sketch below). Simple recurrence structure; user-configurable spot price and process functions.
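To make the recurrence structure concrete, here is a minimal, self-contained sketch of binomial (CRR-style) option pricing in plain C++. It is not QuantLib's or Ct's code; the function and parameter names are illustrative, and the user-configurable pieces are modeled as a std::function callback to mirror the late binding the slide refers to.

#include <cmath>
#include <functional>
#include <vector>

// Minimal binomial-lattice pricer (illustrative only, not QuantLib code).
// 'payoff' stands in for the user-configurable functions whose late binding
// the slide identifies as a source of overhead.
double binomial_price(double spot, double r, double vol, double expiry, int steps,
                      const std::function<double(double)>& payoff) {
    double dt = expiry / steps;
    double u = std::exp(vol * std::sqrt(dt));      // up factor
    double d = 1.0 / u;                            // down factor
    double p = (std::exp(r * dt) - d) / (u - d);   // risk-neutral up probability
    double disc = std::exp(-r * dt);

    // Terminal payoffs at the leaves of the lattice.
    std::vector<double> v(steps + 1);
    for (int i = 0; i <= steps; ++i)
        v[i] = payoff(spot * std::pow(u, i) * std::pow(d, steps - i));

    // Simple backward recurrence: each node depends on its two children.
    for (int t = steps - 1; t >= 0; --t)
        for (int i = 0; i <= t; ++i)
            v[i] = disc * (p * v[i + 1] + (1.0 - p) * v[i]);

    return v[0];
}

// Example use: a European call with strike 40 on spot 36, matching the numbers
// on the next slide:
//   double npv = binomial_price(36.0, 0.05, 0.2, 1.0, 1500,
//                               [](double s) { return std::max(s - 40.0, 0.0); });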

Financial Example: High-Level Interface
Financial analysts want a high-level interface for modeling instruments, processes, and pricing; they are concerned with the mathematics, not the details of parallelization. Ct Technology can not only parallelize and vectorize, but can also remove the overhead of C++ modularity.

Real expiry = 1;
Real strike = 40.0, spot = 36.0;
Real vol = 0.2, r = 0.05;
shared_ptr<Payoff> callPay(new PayoffCall(strike));
shared_ptr<Exercise> euExercise(new EuropeanExercise(expiry));
shared_ptr<Option> euCallOpt(new VanillaOption(callPay, euExercise));
shared_ptr<StochasticProcess> bsm(new BlackScholesProcess(r, vol, spot));
float *npvArray = new float[binomial.get_numjobs()];
BOPMEngine<LocalArrayEvaluator> binomial_lattice(euCallOpt, bsm);
binomial_lattice.npv(npvArray, npvArray + binomial.get_numjobs());

Performance Without De-architecting Software
Software is often architected for reuse, replacement, and extension: use of generic algorithms, abstract classes, virtual function calls, C++ iterators, and indirection is the norm. As a result, performance paths are often spread across many objects and files.
(Diagram: performance paths threading through many objects and files.)
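As a deliberately small illustration of this pattern (not code from ITK or QuantLib; the names are invented for the example), the hot loop below crosses an abstract interface on every element, so the work an optimizer would like to vectorize is hidden behind a virtual call:

#include <memory>
#include <vector>

// Abstract "process" interface of the kind a reusable library exposes.
struct Process {
    virtual ~Process() = default;
    virtual double step(double x) const = 0;   // user-replaceable behavior
};

struct GeometricBrownian : Process {
    double drift, vol;
    GeometricBrownian(double d, double v) : drift(d), vol(v) {}
    double step(double x) const override { return x * (1.0 + drift) + vol; }
};

// The performance path: a tight loop whose body is a virtual call.
// The compiler cannot easily inline or vectorize across 'p->step'.
void evolve(std::vector<double>& xs, const Process* p) {
    for (double& x : xs)
        x = p->step(x);
}

// A dynamic code generator (as the later slides describe for Ct) can observe
// the concrete Process at run time, freeze it, and emit a flat, vectorized loop.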

Performance Without De-architecting Software
Performance tools typically want to see everything: you end up looking at all possible or likely paths. De-architecting/flattening the code for performance makes it brittle, difficult to maintain, difficult to extend, and difficult to program.

Binomial Lattice: Performance
Relative performance across implementations:
- QL: QuantLib baseline, not parallel; modular library; Microsoft Visual Studio* at O2.
- pc: written in plain C, not parallel; modularity flattened by hand; scalar code is 10.6x faster than QL; Microsoft Visual Studio* at O2.
- Ct: using Ct Technology; scalar performance slightly better than pc; on 4 cores, 4.3x faster than pc.
(Chart: relative performance of Ct, pc, and QL versus number of threads, 1 to 8, with 2 threads per core.)
Intel Core i7-920 quad-core microprocessor @ 2.67 GHz, double precision. Binomial lattice for 1024 options with 1500 timesteps each.
*Other names and brands may be claimed as the property of others.

Performance Without De-architecting Software
Combine good software practices and performance with Ct: pepper your models/classes with Ct, and Ct's VM takes care of generatively collecting the performance paths at run time (more later).
(Diagram: Ct in your classes.)


So What Is Intel Ct Technology?
Ct adds parallel collection objects and methods to C++. It is a library interface and is fully ANSI/ISO-compliant (works with ICC, VC++, GCC). Ct abstracts away architectural details: vector ISA width, core count, memory model, cache sizes. Focus on what to do, not how to do it, with sequential semantics. Ct forward-scales software written today: it is designed to be dynamically retargetable to SSE, AVX, LRB, ... Ct is safe by default, but with expert controls to override for performance. Programmers think sequential, not parallel.
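To give a flavor of the programming model, here is a minimal sketch in the style of the code on the later slides. Only Vec<F32>, the element-wise operators, and the idea of wrapping flat C arrays are taken from this deck; the header name and the exact constructor signature are assumptions, not documented Ct API.

// Minimal Ct-style sketch; "ct.h" and the Vec constructor are assumed.
#include <ct.h>

void combine(const float* src1, const float* src2, int n) {
    Vec<F32> a(src1, n);    // parallel collections wrapping flat C arrays
    Vec<F32> b(src2, n);
    Vec<F32> c = a + b;     // sequential semantics: say what, not how
    Vec<F32> d = c * a;     // the Ct VM vectorizes and threads these operations
    // ... results are consumed / copied back through the Ct runtime
}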

Operations over Parallel Collections
Regular Vecs: Vec, Vec2D, Vec3D, Vec<Tuple<...>>
Irregular Vecs: VecNested, VecIndexed
... and growing. Priorities: VecSparse, Vec2DSparse, VecND.
Operations: repeatcol, shuffle, transpose, swaprows, shift, rotate, scatter, ...

Parallel Operations on Ct Collections
The Ct runtime automates the transformation between these levels, or programmers can choose the desired level of abstraction.

Vector processing (linear algebra, global data movement/communication):
Vec<F32> A, B, C, D;
A += B/C * D;

Kernel processing (embarrassingly parallel, shaders, image processing):
Elt<F32> kernel(Elt<F32> a, b, c, d) {
  return a + (b/c)*d;
}
Vec<F32> A, B, C, D;
A = map(kernel)(A, B, C, D);

Native/intrinsic coding:
NVec<F32> native(NVec<F32> ...) {
  asm { CMP VPREFETCH FMADD INC JMP ... }
}
Vec<F32> A, B, C, D;
A = map(native)(A, B, C, D);

IDE Support
Use Ct Technology in your favorite development environment. User-controlled display of Ct data structures: choose your display format (e.g., image, spreadsheet) and invoke it either from Ct operators or interactively from the IDE (e.g., Microsoft Visual Studio). There's no substitute for being able to visualize the results of transformation steps; this is a key non-performance productivity feature.
*Other names and brands may be claimed as the property of others.

3D Order-6 Stencil
(Side-by-side code comparison: original code vs. Ct code; the code images are not transcribed.)
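Since the code images are not in the transcription, here is a rough sketch of what a serial "original code" kernel for a 3D order-6 star stencil (radius 3 in each dimension) typically looks like. The coefficients, array layout, and function name are illustrative assumptions, not the code from the slide.

// Illustrative serial 3D order-6 star stencil (radius 3 per axis).
// Coefficients c[0..3] and the flat row-major layout are assumptions.
void stencil3d_order6(const float* in, float* out,
                      int nx, int ny, int nz, const float c[4]) {
    auto idx = [=](int x, int y, int z) { return (z * ny + y) * nx + x; };
    for (int z = 3; z < nz - 3; ++z)
        for (int y = 3; y < ny - 3; ++y)
            for (int x = 3; x < nx - 3; ++x) {
                float v = c[0] * in[idx(x, y, z)];
                for (int r = 1; r <= 3; ++r)
                    v += c[r] * (in[idx(x - r, y, z)] + in[idx(x + r, y, z)] +
                                 in[idx(x, y - r, z)] + in[idx(x, y + r, z)] +
                                 in[idx(x, y, z - r)] + in[idx(x, y, z + r)]);
                out[idx(x, y, z)] = v;
            }
}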

Back Projection
(Side-by-side code comparison: original code vs. Ct code; the code images are not transcribed.)

What Can It Be Used For?
- Bioinformatics: genomics and sequence analysis; molecular dynamics
- Engineering design: finite element and finite difference simulation; Monte Carlo simulation
- Financial analytics: option and instrument pricing; risk analysis
- Oil and gas: seismic reconstruction; reservoir simulation
- Medical imaging: image and volume reconstruction; analysis and computer-aided detection (CAD)
- Enterprise: database search; business information
- Visual computing: digital content creation (DCC); physics engines and advanced rendering; visualization; compression/decompression
- Signal and image processing: computer vision; radar and sonar processing; microscopy and satellite image processing
- Science and research: machine learning and artificial intelligence; climate and weather simulation; planetary exploration and astrophysics

Conclusions
Dynamic code generation can significantly reduce the performance impact of high levels of abstraction and modularity: it eliminates the cost of late binding of functions, freezes control flow once parameters are known, and freezes the size of dynamically sized data structures. Dynamic code generation can therefore support high performance in productivity languages. It also allows radical, program-driven, hardware-adaptive restructuring of data flow at fine granularities, in order to improve data locality while respecting the limits of the microarchitecture, and it supports autotuning as a mainstream programming technology. A compile-time sketch of the specialization idea follows.
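To illustrate "freezing control flow once parameters are known", here is a compile-time analogue of what a dynamic code generator such as Ct's VM does at run time. This is a conceptual sketch with invented names, not Ct code.

#include <vector>

// Generic routine: the branch on 'antithetic' sits inside the hot loop,
// and 'step' is a late-bound callable.
template <class Step>
void simulate(std::vector<double>& path, Step step, bool antithetic) {
    for (double& x : path) {
        x = step(x);
        if (antithetic)      // control flow that never changes during the loop
            x = -x;
    }
}

// "Frozen" variants: once the parameter is known, the branch disappears and
// 'step' can be inlined. A JIT performs this specialization dynamically, per
// call site, after the parameters have actually been observed.
template <class Step>
void simulate_plain(std::vector<double>& path, Step step) {
    for (double& x : path) x = step(x);
}

template <class Step>
void simulate_antithetic(std::vector<double>& path, Step step) {
    for (double& x : path) x = -step(x);
}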

Fini
Questions? http://www.intel.com/go/ct

Medical Imaging: Deformable Registration
Contrast optical flow, portable to both multi-core and many-core. The Ct Technology implementation of basic optical flow is approximately 2x faster than an also-parallelized ITK baseline. Algorithmic improvements (multigrid) give an additional approximately 10x speedup.

How Does It Really Work?
Ct is really a high-level API that streams opcodes to an optimizing virtual machine. The source (front-end) can be anything: a new language, a bytecode parser (experiments with Python and HLSL), an application-specific library, or a compiler front-end.

The Ct VM
(Diagram of the stack: the Ct API, aimed at the average C++ developer, and other languages sit on top of the VM IR, aimed at language implementors, and CVI, for hand tuning. Below them, Ct's hardware abstraction layer comprises the Ct opcode API, the Ct JIT/compiler, an IA-based virtual ISA, a backend JIT/compiler, debug/performance services, a memory manager, and a task/threading runtime, with back-ends for C, C2, Ci*, LRB, hybrid, and other targets.)

Runtime Evaluation Model: Generative Programming

float src1[], src2[], dest[];
Vec<F32> a(src1, N), b(src2, N);
rcall(foo)(a, b);

void foo(Vec<F32> a, Vec<F32> b) {
  Vec<F32> c = a + b;
  Vec<F32> d = c * a;
  return;
}

(Diagram: the Ct calls feed an IR builder, which records the dataflow graph for c = a + b and d = c * a. rcall triggers the JIT runtime compiler: high-level optimizer, low-level optimizer, and CVI* code generation targeting SSE, AVX, and LRB. The Ct dynamic engine provides the memory manager, data partitioning, the parallel runtime, and the thread scheduler, on all Intel platforms. *CVI = Converged Vector Intrinsics.)
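The mechanism behind this generative model can be sketched with plain C++ operator overloading: instead of computing eagerly, the collection type records an expression graph, and a later trigger (rcall in Ct) hands that graph to a JIT. This is a conceptual illustration only, with invented types, not Ct's actual implementation.

#include <memory>
#include <string>
#include <vector>

// Toy IR node: enough to record a dataflow graph of element-wise ops.
struct Node {
    std::string op;                           // "input", "add", "mul", ...
    std::vector<std::shared_ptr<Node>> args;
};

// Toy parallel collection: operators build IR instead of doing work.
struct VecF32 {
    std::shared_ptr<Node> ir;
    static VecF32 input() { return {std::make_shared<Node>(Node{"input", {}})}; }
};

inline VecF32 operator+(const VecF32& a, const VecF32& b) {
    return {std::make_shared<Node>(Node{"add", {a.ir, b.ir}})};
}
inline VecF32 operator*(const VecF32& a, const VecF32& b) {
    return {std::make_shared<Node>(Node{"mul", {a.ir, b.ir}})};
}

// The "trigger": in a real system this is where the graph would be optimized,
// JIT-compiled for the current ISA (SSE/AVX/...), and run by a threaded runtime.
void compile_and_run(const VecF32& result) { /* walk result.ir, emit code, execute */ }

int main() {
    VecF32 a = VecF32::input(), b = VecF32::input();
    VecF32 c = a + b;        // records "add", computes nothing yet
    VecF32 d = c * a;        // records "mul"
    compile_and_run(d);      // entire fused graph handed to the (hypothetical) JIT
}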

Ct Virtual Machine Interface
The VM interface has a human-readable, editable, C++-like form and an extensible bytecode interface for compact storage. It is extensible with compiler metadata for encoding domain knowledge, and it is not specific to the C++ Ct API. There is also a lower-level unmanaged interface: CVI. The result is a reusable infrastructure that lets others focus on vertical value-add rather than infrastructure, i.e. infrastructure to close the productivity gap!

deffunc vaddf32(in = vec<f32> a, vec<f32> b; out = vec<f32> c) {
  c = add<vec<f32>>(a, b);
}

Productivity/Scripting Language Proof Points
An Excel front-end via VB, a Python byte-code translator, an HLSL compiler, and more to come!
