Agile High-Performance Software Development
  Chris Mueller and Andrew Lumsdaine
  Open Systems Lab/Indiana University
  RIDMS-2, February 10, 2007, Phoenix, AZ
Modern Processors
  Intel Core Duo
  IBM Cell BE
For all your multi-core and SIMD programming needs
  * Featuring *
  Advanced make build system!
  Cutting edge gdb debugger!
  Unparalleled C standard library!
  Works with any text editor!
  *Auto-parallelizing, auto-simdizing, optimizing compiler not yet available. For maximum SIMD performance, use of assembly may be required. Void where prohibited, prohibited where void.
A Brief History of High Performance Computing
(Commodity hardware and language edition)
  1950s: FORTRAN (John Backus et al.)
    Captures and improves common assembly practices for scientific computing
  1970s: (BCPL)/C (Dennis Ritchie et al.) (mini/micro computers)
    Captures and simplifies best assembly practices for systems programming
  1980s: (personal computers)
  1990s, 2000s:
    Java (James Gosling et al.)
      Abstract, single-processor machine model + runtime optimizer for all computing tasks; provides a rich environment for Web applications
    VB/Python/Perl (van Rossum, Wall, et al.)
      Scripting language + low-level language for rapid application development
    (commodity SIMD, dual processor)
    (heterogeneous multi-core pushes C to its semantic limits)
State of the Art for High Performance Computing
(Commodity hardware and language edition)
  1950s: FORTRAN (John Backus et al.)
    Captures and improves common assembly practices for scientific computing
  1970s: (BCPL)/C (Dennis Ritchie et al.) (mini/micro computers)
    Captures and simplifies best assembly practices for systems programming
  Is there an alternative? (short of developing a new language, of course)
Our Approach
  Take a modern programming technique (Python + C + agile development),
  provide direct access to the hardware (somewhere between machine code and assembly),
  and let programmers explore the SIMD and multi-core design spaces. (power to the people?)
CorePy
  A layered collection of Python libraries for generating and executing high-performance code at run-time.
  Types, Control Flow, and Optimizers: Variables, Iterators, Extended Instructions, Memory Models
  Instruction Set Architecture (ISA)
  Hardware/OS Abstractions: Instruction Stream, Processor, Memory
  Supported Platforms: PPC, AltiVec/VMX, SPU, Linux, OS X
A Simple Example
  r = ((0 + 31) + 11)

    c = InstructionStream()
    ppc.set_active_code(c)
    ppc.addi(gp_return, 0, 31)
    ppc.addi(gp_return, gp_return, 11)
    p = Processor()
    r = p.execute(c)
    print r
    --> 42
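The example above follows a "record instructions, then execute them" pattern. A minimal plain-Python sketch of that pattern is shown below. It does not use CorePy: TinyStream and run are hypothetical names introduced only for illustration, and the register is modeled as a dictionary entry. The one hardware fact it relies on is that PowerPC addi treats a source operand of 0 as the literal value 0.

```python
# Plain-Python model of "record instructions, then execute them".
# TinyStream and run are hypothetical names, not CorePy classes.

class TinyStream:
    """Collects instructions without executing them."""

    def __init__(self):
        self.instructions = []

    def addi(self, dest, src, imm):
        # Record the instruction; nothing is computed yet.
        self.instructions.append(("addi", dest, src, imm))


def run(stream):
    """Interpret the recorded instructions and return the 'gp_return' value."""
    regs = {}
    for op, dest, src, imm in stream.instructions:
        if op != "addi":
            raise ValueError("unsupported instruction: %r" % (op,))
        # As with PowerPC addi, a source operand of 0 means the literal 0.
        base = 0 if src == 0 else regs[src]
        regs[dest] = base + imm
    return regs.get("gp_return", 0)


code = TinyStream()
code.addi("gp_return", 0, 31)            # gp_return = 0 + 31
code.addi("gp_return", "gp_return", 11)  # gp_return = gp_return + 11
result = run(code)
print(result)
```

The point of the split is that building the stream and running it are separate steps, so the same stream can be inspected, transformed, or executed more than once.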
Variables
  CorePy Variables encapsulate a register, backing store, and valid operations for a user-defined data type.

  Scalar example:
    a = SignedWord(11)
    b = SignedWord(31)
    c = SignedWord(0, reg=gp_return)
    c.v = (a + b) * 10
    --> c = 420

  Vector example:
    a = VecWord([2,3,4,5])
    b = VecWord([3,3,3,3])
    c = VecWord(0)
    c.v = vmin(a, b) * b + 10
    --> c = [16, 19, 19, 19]
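The arithmetic in the vector example can be checked with a small plain-Python model based on operator overloading. This is only a sketch of the idea of a type that carries its own valid operations: Vec4 and this vmin are hypothetical stand-ins that compute values directly, whereas CorePy variables generate machine instructions.

```python
# Plain-Python stand-in for a vector variable type.
# Vec4 and vmin are hypothetical; they compute values instead of emitting code.

class Vec4:
    def __init__(self, values):
        self.values = list(values)

    def _lift(self, other):
        # Broadcast a scalar operand across all elements.
        if isinstance(other, Vec4):
            return other.values
        return [other] * len(self.values)

    def __add__(self, other):
        return Vec4(x + y for x, y in zip(self.values, self._lift(other)))

    def __mul__(self, other):
        return Vec4(x * y for x, y in zip(self.values, self._lift(other)))


def vmin(a, b):
    """Element-wise minimum of two vectors."""
    return Vec4(min(x, y) for x, y in zip(a.values, b.values))


a = Vec4([2, 3, 4, 5])
b = Vec4([3, 3, 3, 3])
c = vmin(a, b) * b + 10
print(c.values)
```

Step by step: vmin(a, b) gives [2, 3, 3, 3], multiplying by b gives [6, 9, 9, 9], and adding 10 gives the result shown on the slide.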
Iterators
  Iterators enable user-defined loop semantics.

    # Basic Iteration
    a = SignedWord(c, 0)
    for i in syn_iter(c, 5):
        for j in syn_iter(c, 5, mode = ctr):
            a.v = a + 1
    proc.execute(c)
    --> a = 25
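One way to read the loop above: the Python for body runs only once, while code is being generated, and the iterator brackets whatever the body emits with loop-begin and loop-end markers. The sketch below models that in plain Python. emit_loop, execute, and the tuple opcodes are hypothetical and are not CorePy's API; whether syn_iter works exactly this way is an assumption.

```python
# Sketch: a generator that emits loop markers around a body generated once.
# emit_loop, execute, and the opcodes are hypothetical, not CorePy's API.

def emit_loop(code, count):
    """Yield exactly once; the caller's body emits code between the markers."""
    code.append(("begin", count))
    yield
    code.append(("end",))


def execute(code):
    """Interpret the emitted code. Loop counts are assumed to be >= 1."""
    env = {"a": 0}
    stack = []  # one [body_start, remaining_iterations] entry per active loop
    pc = 0
    while pc < len(code):
        op = code[pc]
        if op[0] == "begin":
            stack.append([pc + 1, op[1]])
        elif op[0] == "end":
            stack[-1][1] -= 1
            if stack[-1][1] > 0:
                pc = stack[-1][0]  # jump back to the start of the loop body
                continue
            stack.pop()
        elif op[0] == "inc":
            env[op[1]] += 1
        pc += 1
    return env


code = []
for _ in emit_loop(code, 5):
    for _ in emit_loop(code, 5):
        code.append(("inc", "a"))  # emitted once, executed 25 times

env = execute(code)
print(len(code), env["a"])
```

Only five entries are generated, yet the increment executes 25 times, which is the distinction between the Python-level loop and the loop in the generated code.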
Iterator Examples

    # Array iteration
    for x in var_iter(c, a): sum.v = sum + x
    for x in vec_iter(c, a): sum.v = sum + x

    # Data stream merge
    for x, y, z, r in zip_iter(c, X, Y, Z, R):
        r = vmadd(x, y, z)

    # Loop unrolling
    for x in unroll(vec_iter(c, a), 3): body(x)

    # Auto-parallelization
    for x in parallel(vec_iter(c, a)): body(x)
    t1 = proc.execute(c, mode= async, params=[0,2,0])
    t2 = proc.execute(c, mode= async, params=[1,2,0])
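The two execute calls in the auto-parallelization example differ only in their first parameter, which suggests each thread receives its rank and the thread count and works on its own share of the data. That reading of params is an assumption; the slide does not define the fields. Under that assumption, a block partition can be sketched in plain Python as follows (block_range is a hypothetical helper, not part of CorePy).

```python
# Sketch of splitting an index range among threads by (rank, size).
# block_range is a hypothetical helper; the meaning of params is assumed.

def block_range(n, rank, size):
    """Return the contiguous index range owned by `rank` out of `size` workers."""
    base, extra = divmod(n, size)
    # The first `extra` ranks each take one additional element.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)


data = list(range(10))
# Each "thread" sums only its own block; the partial sums are then combined.
partial = [sum(data[i] for i in block_range(len(data), rank, 2))
           for rank in (0, 1)]
print(partial, sum(partial))
```

Because the blocks are disjoint and cover the whole range, the same generated loop body can run on every thread without coordination until the partial results are combined.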
CorePy Research Model
  Use CorePy to develop real applications
    Use Python for coarse-grained application and data flow
    Use CorePy libraries for high-performance code sections
  Identify common implementation patterns, esp. SIMD/multi-core
  Generalize patterns into library components
  Develop a user community
Example: Particle System
  Development Iterations:
    v1: Numeric Python (~20k particles/sec)
    v2: CorePy asm (~200k particles/sec)
    v3: CorePy variables/iters (~200k particles/sec)

    for vel, point in parallel(zip_iter(c, vels, points)):
        # Forces - Gravity and air resistance
        vel.v = vel + gravity
        vel.v = vel + vmadd(vsel(one, negone, (zero > vel)), air, zero)
        point.v = point + vel

        # Bounce off the zero extents (floor and left wall)
        # and positive extents (ceiling and right wall)
        vel.v = vmadd(vel, vsel(one, floor, (zero > point)), zero)
        vel.v = vmadd(vel, vsel(one, negone, (point > extents)), zero)

        # Add a 'floor' at y = 1.0
        point.v = vsel(point, one, (one > point))
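The loop body above is branch-free: every condition becomes a select between two values. A scalar plain-Python sketch of one update step for a single coordinate is given below. It assumes vsel(a, b, mask) returns b where the mask is true and a otherwise (the AltiVec vec_sel convention) and that vmadd(a, b, c) computes a * b + c. The function step and the numeric constants are hypothetical and chosen only for illustration.

```python
# Scalar model of one particle update step for a single coordinate.
# Assumes AltiVec-style semantics: vsel picks b where mask is true,
# and vmadd(a, b, c) = a * b + c. The constants below are illustrative.

def vsel(a, b, mask):
    return b if mask else a


def vmadd(a, b, c):
    return a * b + c


def step(vel, point, gravity, air, floor, extents):
    # Forces: gravity, then an air term whose sign follows the velocity's sign.
    vel = vel + gravity
    vel = vel + vmadd(vsel(1.0, -1.0, 0.0 > vel), air, 0.0)
    point = point + vel
    # Bounce: scale the velocity by `floor` below zero, negate it past `extents`.
    vel = vmadd(vel, vsel(1.0, floor, 0.0 > point), 0.0)
    vel = vmadd(vel, vsel(1.0, -1.0, point > extents), 0.0)
    # Clamp the position so it never drops below 1.0.
    point = vsel(point, 1.0, 1.0 > point)
    return vel, point


slow = step(0.0, 1.5, gravity=-1.0, air=-0.25, floor=-0.5, extents=10.0)
fast = step(-2.0, 1.0, gravity=-1.0, air=-0.25, floor=-0.5, extents=10.0)
print(slow, fast)
```

In the second call the particle passes below zero, so its velocity is multiplied by the floor factor and changes sign, while the first call only hits the clamp at 1.0.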
Example: BLASTP on the Cell
  CorePy Components:
    Cell SPU support
    Blocked memory components
    Stream shift iterator
    Instruction replication
    Python multi-core control components
Community Projects
  Cell SPU Big Num library (Andrew Friedley): ~5G inst/s on 1 SPU
  Image processing, fractals (Ben Martin)
  DGEMM/BLAS (Andrew Lumsdaine)
  Generic Convolution Framework (Alex Breuer)
  Alex: CorePy makes assembly fun!
Thank You!
  Funding: Lilly Endowment
  Support and Feedback:
    IBM Cell Ecosystem Team, especially: Hema Reddy, Gordon Ellison, Jennifer Turner, Bob Arenburg
    Ben Martin, Andrew Friedley, Alex Breuer, Jeremiah Willcock
  More Information:
    www.synthetic-programming.org
    chemuell@cs.indiana.edu