Agile High-Performance Software Development




Chris Mueller and Andrew Lumsdaine
Open Systems Lab, Indiana University
RIDMS-2, February 10, 2007, Phoenix, AZ

Modern Processors
Intel Core Duo
IBM Cell BE

For all your multi-core and SIMD programming needs
* Featuring *
Advanced make build system!
Cutting edge gdb debugger!
Unparalleled C standard library!
Works with any text editor!
*Auto-parallelizing, auto-simdizing, optimizing compiler not yet available. For maximum SIMD performance, use of assembly may be required. Void where prohibited, prohibited where void.

A Brief History of High Performance Computing
(Commodity hardware and language edition)

1950s, 1970s, 1980s
FORTRAN (John Backus, et al.): captures and improves common assembly practices for scientific computing
(BCPL)/C (Dennis Ritchie, et al.): captures and simplifies best assembly practices for systems programming
Hardware: (mini/micro computers), (personal computers)

1990s, 2000s
Java (James Gosling, et al.): abstract, single-processor machine model + runtime optimizer for all computing tasks; provides rich environment for Web applications
VB/Python/Perl (van Rossum, Wall, et al.): scripting language + low-level language for rapid application development
Hardware: (commodity SIMD, dual processor), (heterogeneous multi-core pushes C to its semantic limits)


State of the Art for High Performance Computing
(Commodity hardware and language edition)
1950s: FORTRAN (John Backus, et al.): captures and improves common assembly practices for scientific computing
1970s: (BCPL)/C (Dennis Ritchie, et al.) (mini/micro computers): captures and simplifies best assembly practices for systems programming
Is there an alternative? (short of developing a new language, of course)

Our Approach
Take a modern programming technique (Python + C + agile development),
provide direct access to the hardware (somewhere between machine code and assembly),
and let programmers explore the SIMD and multi-core design spaces. (power to the people?)

CorePy
A layered collection of Python libraries for generating and executing high-performance code at run-time.

Layers, top to bottom:
Types, Control Flow, and Optimizers: Variables, Iterators, Extended Instructions, Memory Models
Hardware/OS Abstractions: Instruction Set Architecture (ISA), Instruction Stream, Processor, Memory
Supported Platforms: PPC, AltiVec/VMX, SPU, Linux, OS X

A Simple Example
r = ((0 + 31) + 11)

    c = InstructionStream()
    ppc.set_active_code(c)
    ppc.addi(gp_return, 0, 31)
    ppc.addi(gp_return, gp_return, 11)
    p = Processor()
    r = p.execute(c)
    print r
    --> 42
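The slide separates two phases: instructions are first recorded into a stream, and only later handed to a processor for execution. The following is a minimal pure-Python sketch of that idea. ToyStream and ToyProcessor are hypothetical stand-ins written for this illustration, not CorePy classes, and the interpreter models only the PowerPC convention that addi with a source of 0 reads the constant zero.

```python
# Toy model of "build instructions now, run them later".
# ToyStream and ToyProcessor are hypothetical names, not CorePy API.

class ToyStream:
    """Records instructions as (opcode, dest, src, immediate) tuples."""

    def __init__(self):
        self.instructions = []

    def addi(self, dest, src, imm):
        # Nothing is computed here; the instruction is only recorded.
        self.instructions.append(("addi", dest, src, imm))


class ToyProcessor:
    """Interprets a ToyStream; a source register of 0 reads as constant 0."""

    RETURN_REG = 3  # assumed return register for this toy

    def execute(self, stream):
        regs = {}
        for op, dest, src, imm in stream.instructions:
            if op != "addi":
                raise ValueError("unknown opcode: " + op)
            base = 0 if src == 0 else regs.get(src, 0)
            regs[dest] = base + imm
        return regs.get(self.RETURN_REG, 0)


code = ToyStream()
code.addi(ToyProcessor.RETURN_REG, 0, 31)                        # r3 = 0 + 31
code.addi(ToyProcessor.RETURN_REG, ToyProcessor.RETURN_REG, 11)  # r3 = r3 + 11
result = ToyProcessor().execute(code)
print(result)  # 42
```

The point of the sketch is only the ordering: both addi calls finish before any arithmetic happens, which is what lets a library inspect or transform the stream before running it.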

Variables
CorePy Variables encapsulate a register, backing store, and valid operations for a user-defined data type.

Scalar example:

    a = SignedWord(11)
    b = SignedWord(31)
    c = SignedWord(0, reg=gp_return)
    c.v = (a + b) * 10
    --> c = 420

Vector example:

    a = VecWord([2,3,4,5])
    b = VecWord([3,3,3,3])
    c = VecWord(0)
    c.v = vmin(a, b) * b + 10
    --> c = [16, 19, 19, 19]
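One way to read an expression such as c.v = (a + b) * 10 is that the operators do not compute anything in Python; they record instructions that compute the value later. The sketch below illustrates that reading with operator overloading. ToyVar, new_reg, program, and run are hypothetical names invented for this example, and the slides do not show how CorePy implements its variables.

```python
# Hypothetical sketch: operators on a variable record instructions instead of
# computing a value. None of these names are CorePy API.

program = []      # the instruction stream being built
_next_reg = [0]   # trivial register allocator


def new_reg():
    _next_reg[0] += 1
    return "r%d" % _next_reg[0]


class ToyVar:
    def __init__(self, value=None):
        self.reg = new_reg()
        if value is not None:
            program.append(("load", self.reg, value))

    def _emit(self, op, other):
        result = ToyVar()
        rhs = other.reg if isinstance(other, ToyVar) else other
        program.append((op, result.reg, self.reg, rhs))
        return result

    def __add__(self, other):
        return self._emit("add", other)

    def __mul__(self, other):
        return self._emit("mul", other)


def run(stream):
    """Interpret the recorded stream and return the register file."""
    regs = {}

    def val(x):
        # Register names are strings; anything else is an immediate.
        return regs[x] if isinstance(x, str) else x

    for ins in stream:
        if ins[0] == "load":
            regs[ins[1]] = ins[2]
        elif ins[0] == "add":
            regs[ins[1]] = val(ins[2]) + val(ins[3])
        elif ins[0] == "mul":
            regs[ins[1]] = val(ins[2]) * val(ins[3])
    return regs


a = ToyVar(11)
b = ToyVar(31)
c = (a + b) * 10          # emits one add and one mul; no arithmetic yet
regs = run(program)
print(len(program), regs[c.reg])  # 4 420
```

Four instructions are recorded (two loads, an add, a mul), and the value 420 exists only after the stream is interpreted.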

Iterators
Iterators enable user-defined loop semantics.

    # Basic Iteration
    a = SignedWord(c, 0)
    for i in syn_iter(c, 5):
        for j in syn_iter(c, 5, mode = ctr):
            a.v = a + 1
    proc.execute(c)
    --> a = 25
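A useful mental model for these iterators is that the Python loop body runs once, while code is being generated, and the iterator wraps whatever the body emitted in loop instructions. The sketch below shows this with a generator that yields exactly once. toy_iter, matching_end, and run are hypothetical names for this illustration, and the tuple-based instruction format is an assumption, not how CorePy encodes loops.

```python
# Hypothetical sketch of a "synthetic" iterator. The for-loop body executes
# once at generation time; the generator brackets it with loop markers.

program = []


def toy_iter(count):
    program.append(("loop_begin", count))
    yield None                       # the body emits its instructions here
    program.append(("loop_end",))


def matching_end(stream, begin):
    """Index of the loop_end that closes the loop_begin at `begin`."""
    depth = 0
    for i in range(begin, len(stream)):
        if stream[i][0] == "loop_begin":
            depth += 1
        elif stream[i][0] == "loop_end":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unterminated loop")


def run(stream, start=0, stop=None):
    """Interpret stream[start:stop]; returns the sum of executed increments."""
    stop = len(stream) if stop is None else stop
    total = 0
    i = start
    while i < stop:
        ins = stream[i]
        if ins[0] == "inc":
            total += ins[1]
            i += 1
        elif ins[0] == "loop_begin":
            end = matching_end(stream, i)
            for _ in range(ins[1]):
                total += run(stream, i + 1, end)
            i = end + 1
        else:
            raise ValueError("unexpected instruction: %r" % (ins,))
    return total


# Code generation: each Python loop body runs exactly once.
for i in toy_iter(5):
    for j in toy_iter(5):
        program.append(("inc", 1))

a = run(program)
print(len(program), a)  # 5 25
```

Only five instructions are generated, yet interpreting them performs 25 increments, matching the result on the slide.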

Iterator Examples

    # Array iteration
    for x in var_iter(c, a): sum.v = sum + x
    for x in vec_iter(c, a): sum.v = sum + x

    # Data stream merge
    for x,y,z,r in zip_iter(c, X,Y,Z,R):
        r = vmadd(x,y,z)

    # Loop unrolling
    for x in unroll(vec_iter(c, a), 3): body(x)

    # Auto-parallelization
    for x in parallel(vec_iter(c, a)): body(x)
    t1 = proc.execute(c, mode=async, params=[0,2,0])
    t2 = proc.execute(c, mode=async, params=[1,2,0])
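The last example launches the same code twice with different params. The slide does not say what the three values mean; the sketch below assumes, purely for illustration, that the first two are a worker rank and a worker count, so each launch processes its own block of the data. It uses Python's standard concurrent.futures module rather than CorePy, and partial_sum is a hypothetical helper.

```python
# Sketch of rank-based data partitioning across two workers.
# Assumption: each worker receives (rank, worker_count) and owns one
# contiguous block of the input. This is not taken from CorePy itself.
from concurrent.futures import ThreadPoolExecutor


def partial_sum(data, rank, workers):
    """Sum the contiguous block of `data` owned by worker `rank`."""
    chunk = (len(data) + workers - 1) // workers   # ceiling division
    return sum(data[rank * chunk:(rank + 1) * chunk])


data = list(range(10))
with ThreadPoolExecutor(max_workers=2) as pool:
    t1 = pool.submit(partial_sum, data, 0, 2)   # elements 0..4
    t2 = pool.submit(partial_sum, data, 1, 2)   # elements 5..9
    total = t1.result() + t2.result()
print(total)  # 45
```

The same function body serves every worker; only the rank changes, which is the property that lets a single generated instruction stream be reused across cores.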

CorePy Research Model
Use CorePy to develop real applications
Use Python for coarse-grained application and data flow
Use CorePy libraries for high-performance code sections
Identify common implementation patterns, esp. SIMD/multi-core
Generalize patterns into library components
Develop a user community

Example: Particle System
Development iterations:
v1: Numeric Python (~20k particles/sec)
v2: CorePy asm (~200k particles/sec)
v3: CorePy variables/iters (~200k particles/sec)

    for vel, point in parallel(zip_iter(c, vels, points)):
        # Forces - Gravity and air resistance
        vel.v = vel + gravity
        vel.v = vel + vmadd(vsel(one, negone, (zero > vel)), air, zero)
        point.v = point + vel

        # Bounce off the zero extents (floor and left wall)
        # and positive extents (ceiling and right wall)
        vel.v = vmadd(vel, vsel(one, floor, (zero > point)), zero)
        vel.v = vmadd(vel, vsel(one, negone, (point > extents)), zero)

        # Add a 'floor' at y = 1.0
        point.v = vsel(point, one, (one > point))
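The update step avoids branches by combining a compare, a select, and a multiply-add. The plain-Python sketch below traces one step on a single two-component vector. It assumes AltiVec-style semantics, where vsel(a, b, mask) returns b where the mask is set and a elsewhere, and vmadd(a, b, c) returns a * b + c. The helper names vadd and vgt, the step function, and all numeric values are made up for this illustration.

```python
# Plain-Python emulation of one particle update, using lists as vectors.
# Assumed semantics: vsel picks b where mask is true, vmadd is a * b + c.

def vsel(a, b, mask):
    return [y if m else x for x, y, m in zip(a, b, mask)]


def vmadd(a, b, c):
    return [x * y + z for x, y, z in zip(a, b, c)]


def vadd(a, b):
    return [x + y for x, y in zip(a, b)]


def vgt(a, b):
    return [x > y for x, y in zip(a, b)]


def step(vel, point, gravity, air, floor, extents):
    n = len(vel)
    zero, one, negone = [0.0] * n, [1.0] * n, [-1.0] * n
    # Forces: gravity, then air resistance scaled by the sign of vel
    vel = vadd(vel, gravity)
    vel = vadd(vel, vmadd(vsel(one, negone, vgt(zero, vel)), air, zero))
    point = vadd(point, vel)
    # Bounce: scale vel by `floor` below zero, negate it beyond `extents`
    vel = vmadd(vel, vsel(one, floor, vgt(zero, point)), zero)
    vel = vmadd(vel, vsel(one, negone, vgt(point, extents)), zero)
    # Clamp every component to at least 1.0
    point = vsel(point, one, vgt(one, point))
    return vel, point


new_vel, new_point = step(
    vel=[1.0, -2.0], point=[9.5, 1.5],
    gravity=[0.0, -1.0], air=[-0.25, -0.25],   # illustrative values
    floor=[-0.5, -0.5], extents=[10.0, 10.0],
)
print(new_vel, new_point)  # [-0.75, 1.375] [10.25, 1.0]
```

In this run the x component overshoots the right wall, so its velocity is negated, while the y component drops below zero, so its velocity is reversed and damped by the floor factor before the position is clamped to 1.0.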

Example: BLASTP on the Cell
CorePy components:
Cell SPU support
Blocked memory components
Stream shift iterator
Instruction replication
Python multi-core control components

Community Projects
Cell SPU Big Num library (Andrew Friedley): ~5G inst/s on 1 SPU
Image processing, fractals (Ben Martin)
DGEMM/BLAS (Andrew Lumsdaine)
Generic Convolution Framework (Alex Breuer)
Alex: "CorePy makes assembly fun!"

Thank You!
Funding: Lilly Endowment
Support and Feedback: IBM Cell Ecosystem Team, especially Hema Reddy, Gordon Ellison, Jennifer Turner, Bob Arenburg
Ben Martin, Andrew Friedley, Alex Breuer, Jeremiah Willcock
More Information: www.synthetic-programming.org, chemuell@cs.indiana.edu