Developing Scalable Parallel Applications in X10

Transcription

1 Developing Scalable Parallel Applications in X10. David Grove, Vijay Saraswat, Olivier Tardieu (IBM TJ Watson Research Center). Based on material from previous X10 Tutorials by Bard Bloom, David Cunningham, Evelyn Duesterwald, Steve Fink, Nate Nystrom, Igor Peshansky, Christoph von Praun, and Vivek Sarkar. Additional contributions from Anju Kambadur, Josh Milthorpe, and Juemin Zhang. This material is based upon work supported in part by the Defense Advanced Research Projects Agency under its Agreement No. HR Please see for the most up-to-date version of this tutorial.

2 Tutorial Objectives Learn to develop scalable parallel applications using X10 X10 language and class library fundamentals Effective patterns of concurrency and distribution Case studies: draw material from real working code Exposure to Asynchronous Partitioned Global Address Space (APGAS) Programming Model Concepts broader than any single language Move beyond Single Program/Multiple Data (SPMD) to more flexible paradigms Updates on the X10 project and user/research community Highlight what is new in X10 since SC 11 Gather feedback from you About your application domains About X10

3 Tutorial Outline X10 Overview (70 minutes) Programming model concepts Key language features Implementation highlights Scaling SPMD-style computations in X10 (40 minutes) Scaling finish/async Scaling communication Collective operations Unbalanced computations (20 minutes) Global Load Balancing & UTS Application frameworks (20 minutes) Global Matrix Library SAT X10 Summary and Discussion (20 minutes)

4 X10 Genesis: DARPA HPCS Program (2004) Central Challenge: Productive Programming of Large Scale Supercomputers. Clustered systems: 1000s of SMP nodes connected by a high-performance interconnect; large aggregate memory, disk storage, etc. [Figure: Massively Parallel Processor Systems (IBM Blue Gene) and SMP Clusters (SMP nodes, each with PEs and memory, joined by an interconnect)]

5 Flash forward a few years: Big Data and Commercial HPC Central Challenge: Productive programming of large commodity clusters. Clustered systems: 100s to 1000s of SMP nodes connected by a high-performance network; large aggregate memory, disk storage, etc. [Figure: SMP Clusters (SMP nodes, each with PEs and memory, joined by a network)] A commodity cluster != MPP system, but the programming model problem is highly similar.

6 X10 Performance and Productivity at Scale An evolution of Java for concurrency, scale out, and heterogeneity Language focuses on high productivity and high performance Bring productivity gains from commercial world to HPC developers The X10 language provides: Java-like language (statically typed, object oriented, garbage-collected) Ability to specify scale-out computations (multiple places, exploit modern networks) Ability to specify fine-grained concurrency (exploit multi-core) Single programming model for computation offload and heterogeneity (exploit GPUs) Migration path X10 concurrency/distribution idioms can be realized in other languages via library APIs that wrap the X10 runtime X10 interoperability with Java and C/C++ enables reuse of existing libraries

7 Partitioned Global Address Space (PGAS) Languages In clustered systems, memory is only accessible to the CPUs on its node. Managing local vs. remote memory is a key programming task. PGAS combines a single logical global address space with locality awareness. PGAS Languages: Titanium, UPC, CAF, X10, Chapel

8 X10 combines PGAS with asynchrony (APGAS)
[Figure: Place 0 through Place N, each with a local heap and its own activities; a global reference points across places]
Fine-grained concurrency: async S
Place-shifting operations: at (P) S; at (P) { e }
Sequencing: finish S
Atomicity: when (c) S; atomic S

9 Hello Whole World
    class HelloWholeWorld {
        public static def main(args:Rail[String]) {
            finish
                for (p in Place.places())
                    at (p)
                        async
                            Console.OUT.println(p + " says " + args(0));
        }
    }
    % x10c++ HelloWholeWorld.x10
    % X10_NPLACES=4; ./a.out hello
    Place 0 says hello
    Place 2 says hello
    Place 3 says hello
    Place 1 says hello

10 Sequential Monty Pi
    import x10.io.Console;
    import x10.util.Random;
    class MontyPi {
        public static def main(args:Array[String](1)) {
            val N = Int.parse(args(0));
            val r = new Random();
            var result:Double = 0;
            for (1..N) {
                val x = r.nextDouble();
                val y = r.nextDouble();
                // count points that fall inside the unit quarter circle
                if (x*x + y*y <= 1) result++;
            }
            val pi = 4*result/N;
            Console.OUT.println("The value of pi is " + pi);
        }
    }

11 Concurrent Monty Pi
    import x10.io.Console;
    import x10.util.Random;
    class MontyPi {
        public static def main(args:Array[String](1)) {
            val N = Int.parse(args(0));
            val P = Int.parse(args(1));
            val result = new Cell[Double](0);
            finish for (1..P) async {
                val r = new Random();
                var myresult:Double = 0;
                for (1..(N/P)) {
                    val x = r.nextDouble();
                    val y = r.nextDouble();
                    if (x*x + y*y <= 1) myresult++;
                }
                // fold this activity's count into the shared cell
                atomic result() += myresult;
            }
            val pi = 4*(result())/N;
            Console.OUT.println("The value of pi is " + pi);
        }
    }

12 Concurrent Monty Pi (Collecting Finish)
    import x10.io.Console;
    import x10.util.Random;
    class MontyPi {
        public static def main(args:Array[String](1)) {
            val N = Int.parse(args(0));
            val P = Int.parse(args(1));
            val result = finish (Reducible.SumReducer[Double]()) {
                for (1..P) async {
                    val r = new Random();
                    var myresult:Double = 0;
                    for (1..(N/P)) {
                        val x = r.nextDouble();
                        val y = r.nextDouble();
                        if (x*x + y*y <= 1) myresult++;
                    }
                    // the enclosing finish sums every offered value
                    offer myresult;
                }
            };
            val pi = 4*result/N;
            Console.OUT.println("The value of pi is " + pi);
        }
    }

13 Distributed Monty Pi (Collecting Finish)
    import x10.io.Console;
    import x10.util.Random;
    class MontyPi {
        public static def main(args:Array[String](1)) {
            val N = Int.parse(args(0));
            val result = finish (Reducible.SumReducer[Double]()) {
                for (p in Place.places()) at (p) async {
                    val r = new Random();
                    var myresult:Double = 0;
                    for (1..(N/Place.MAX_PLACES)) {
                        val x = r.nextDouble();
                        val y = r.nextDouble();
                        if (x*x + y*y <= 1) myresult++;
                    }
                    offer myresult;
                }
            };
            val pi = 4*result/N;
            Console.OUT.println("The value of pi is " + pi);
        }
    }

14 Distributed Monty Pi (GlobalRef)
    import x10.io.Console;
    import x10.util.Random;
    class MontyPi {
        public static def main(args:Array[String](1)) {
            val N = Int.parse(args(0));
            val result = GlobalRef[Cell[Double]](new Cell[Double](0));
            finish for (p in Place.places()) at (p) async {
                val r = new Random();
                var myresult:Double = 0;
                for (1..(N/Place.MAX_PLACES)) {
                    val x = r.nextDouble();
                    val y = r.nextDouble();
                    if (x*x + y*y <= 1) myresult++;
                }
                // copy to a val: an at body cannot capture a var
                val valresult = myresult;
                // go back to the place that owns the cell before updating it
                at (result.home) atomic result()() += valresult;
            }
            val pi = 4*(result()())/N;
            Console.OUT.println("The value of pi is " + pi);
        }
    }
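The Monty Pi variants above all follow one pattern: split the samples over independent tasks, give each task its own random generator, wait for all of them, and sum the partial counts. The following is a minimal Python sketch of that pattern, not X10 and not the tutorial's code. The names `count_hits` and `monty_pi` and the per-task seeds are assumptions made for the illustration, and a thread pool stands in for `finish`/`async`.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def count_hits(samples, seed):
    """Count random points in the unit square that land inside the quarter circle."""
    rng = random.Random(seed)  # private generator per task (hypothetical seeding)
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def monty_pi(n, tasks):
    """Estimate pi from n samples split evenly over `tasks` concurrent workers."""
    per_task = n // tasks
    with ThreadPoolExecutor(max_workers=tasks) as pool:
        # Leaving the `with` block waits for every task, much as `finish` does.
        partial = list(pool.map(count_hits, [per_task] * tasks, range(tasks)))
    # Summing the partial counts plays the role of the sum reducer.
    return 4.0 * sum(partial) / (per_task * tasks)
```

Calling `monty_pi(400_000, 4)` returns an estimate close to pi. The sketch divides by the number of samples actually drawn (`per_task * tasks`) rather than by `n`, which only matters when `n` is not a multiple of the task count.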

15 Overview of Language Features
Many sequential features of Java inherited unchanged: classes (single inheritance); interfaces (multiple inheritance); instance and static fields; constructors; overloaded, overridable methods; automatic memory management (GC); familiar control structures, etc.
Structs: headerless, inlined, user-defined primitives
Closures: encapsulate a function as code block + dynamic environment
Type system extensions: dependent types; generic types (instantiation-based); type inference (local)
Concurrency: fine-grained concurrency (async S;); atomicity and synchronization (atomic S; when (c) S;); ordering (finish S;)
Distribution: place shifting (at (p) S;); distributed object model: automatic serialization of object graphs, GlobalRef[T] to create cross-place references, PlaceLocalHandle[T] to define objects with distributed state; class library support for RDMA, collectives, etc.

16 async
Stmt ::= async Stmt (cf. Cilk's spawn)
async S: Creates a new child activity that executes statement S. Returns immediately. S may reference variables in enclosing blocks. Activities cannot be named. An activity cannot be aborted or cancelled.
    // Compute the Fibonacci
    // sequence in parallel.
    def fib(n:Int):Int {
        if (n < 2) return 1;
        val f1:Int;
        val f2:Int;
        finish {
            async f1 = fib(n-1);
            f2 = fib(n-2);
        }
        return f1+f2;
    }

17 finish
Stmt ::= finish Stmt (cf. Cilk's sync)
finish S: Execute S, but wait until all (transitively) spawned asyncs have terminated. finish is useful for expressing synchronous operations on (local or) remote data.
Rooted exception model: trap all exceptions thrown by spawned activities. Throw an (aggregate) exception if any spawned async terminates abruptly.
There is an implicit finish around program main.
    // Compute the Fibonacci
    // sequence in parallel.
    def fib(n:Int):Int {
        if (n < 2) return 1;
        val f1:Int;
        val f2:Int;
        finish {
            async f1 = fib(n-1);
            f2 = fib(n-2);
        }
        return f1+f2;
    }

18 atomic
Stmt ::= atomic Statement; MethodModifier ::= atomic
atomic S: Execute statement S atomically. Atomic blocks are conceptually executed in a serialized order with respect to all other atomic blocks: isolation and weak atomicity.
An atomic block body (S): must be nonblocking; must not create concurrent activities (sequential); must not access remote data (local). Restrictions are checked dynamically.
    // target defined in lexically
    // enclosing scope.
    atomic def CAS(old:Object, n:Object) {
        if (target.equals(old)) {
            target = n;
            return true;
        }
        return false;
    }
    // push data onto concurrent
    // list-stack
    val node = new Node(data);
    atomic {
        node.next = head;
        head = node;
    }

19 when
Stmt ::= WhenStmt; WhenStmt ::= when ( Expr ) Stmt | WhenStmt or (Expr) Stmt
when (E) S: Activity suspends until a state in which the guard E is true. In that state, S is executed atomically and in isolation.
Guard E is a boolean expression that: must be nonblocking; must not create concurrent activities (sequential); must not access remote data (local); must not have side-effects (const). Restrictions are mostly checked dynamically; the const restriction is not checked.
    class OneBuffer {
        var datum:Object = null;
        var filled:Boolean = false;
        def send(v:Object) {
            when (!filled) {
                datum = v;
                filled = true;
            }
        }
        def receive():Object {
            when (filled) {
                val v = datum;
                datum = null;
                filled = false;
                return v;
            }
        }
    }
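For readers more familiar with condition variables, the guarded `when` blocks of the one-slot buffer can be approximated as follows. This is a Python analogue under stated assumptions, not X10 semantics: `threading.Condition.wait_for` plays the part of the guard, and the `demo` helper is a hypothetical driver added only to exercise the class.

```python
import threading


class OneBuffer:
    """One-slot buffer: send waits while full, receive waits while empty."""

    def __init__(self):
        self._cond = threading.Condition()
        self._datum = None
        self._filled = False

    def send(self, value):
        with self._cond:
            # Analogue of `when (!filled)`: sleep until the guard holds.
            self._cond.wait_for(lambda: not self._filled)
            self._datum = value
            self._filled = True
            self._cond.notify_all()

    def receive(self):
        with self._cond:
            # Analogue of `when (filled)`.
            self._cond.wait_for(lambda: self._filled)
            value = self._datum
            self._datum = None
            self._filled = False
            self._cond.notify_all()
            return value


def demo(items):
    """Pass items through the buffer from a producer thread to the caller."""
    buf = OneBuffer()
    producer = threading.Thread(target=lambda: [buf.send(i) for i in items])
    producer.start()
    received = [buf.receive() for _ in items]
    producer.join()
    return received
```

Unlike `when`, the Python version has to notify waiters by hand after each state change; in X10 the runtime re-evaluates guards itself.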

20 at
Stmt ::= at (p) Stmt
at (p) S: Execute statement S at place p. The current activity is blocked until S completes. Deep copy of local object graph to target place; the variables referenced from S define the roots of the graph to be copied. GlobalRef[T] is used to create remote pointers across at boundaries.
    class C {
        var x:Int;
        def this(n:Int) { x = n; }
    }
    // Increment remote counter
    def inc(c:GlobalRef[C]) {
        at (c.home) c().x++;
    }
    // Create GR of C
    static def make(init:Int) {
        val c = new C(init);
        return GlobalRef[C](c);
    }

21 Distributed Object Model Objects live in a single place. Objects can only be accessed in the place where they live. Implementing at: the compiler analyzes the body of at and identifies roots to copy (exposed variables); the complete object graph reachable from the roots is serialized and sent to the destination place; a new (unrelated) copy of the object graph is created at the destination place. Controlling object graph serialization: instance fields of a class may be declared transient (won't be copied); GlobalRef[T] allows cross-place pointers to be created.

22 Take a breath Whirlwind tour through X10 language More details will be introduced as we work through the example codes See for specification, samples, etc. Next topic: brief overview of the X10 implementation Questions so far?

23 X10 Target Environments High-end large HPC clusters: BlueGene/P (since 2010); BlueGene/Q (in progress); Power7IH (aka PERCS machine); x86 + InfiniBand, Power + InfiniBand. Goal: deliver scalable performance competitive with C+MPI. Medium-scale commodity systems: ~100 nodes (~1000 cores and ~1 terabyte main memory). Goal: deliver main-memory performance with a simple programming model (accessible to Java programmers). Developer laptops: Linux, Mac OSX, Windows; Eclipse-based IDE, debugger, etc. Goal: support developer productivity.

24 X10 Implementation Summary X10 Implementations. C++ based ("Native X10"): multi-process (one place per process; multi-node); Linux, AIX, MacOS, Cygwin, BlueGene; x86, x86_64, PowerPC. JVM based ("Managed X10"): multi-process (one place per JVM process; multi-node); limitation on Windows to single process (single place); runs on any Java 6 JVM. X10DT (X10 IDE) available for Windows, Linux, Mac OS X: based on Eclipse 3.7; supports many core development tasks including remote build/execute facilities. IBM Parallel Debugger for X10 Programming: adds X10 language support to IBM Parallel Debugger; available on IBM developerWorks (Native X10 on Linux only).

25 X10 Compilation
[Figure: X10 compilation flow. X10 Compiler Front-End: X10 Source, Parsing / Type Check, X10 AST, AST Optimizations, AST Lowering. C++ Back-End (Native X10): C++ Code Generation, C++ Source, C++ Compiler, Native Code. Java Back-End (Managed X10): Java Code Generation, Java Source, Java Compiler, Bytecode, Java VMs. Other labels shown: XRC, XRJ, XRX, X10RT, JNI, Native Env]

26 X10 Runtime Software Stack
[Figure: X10 runtime stack: X10 Application Program; X10 Core Class Libraries; XRX Runtime; X10 Language Native Runtime; X10RT; PAMI, DCMF, MPI, TCP/IP]
Core Class Libraries: fundamental classes & primitives, arrays, core I/O, collections, etc. Written in X10; compiled to C++ or Java.
XRX (X10 Runtime in X10): APGAS functionality. Concurrency: async/finish (workstealing). Distribution: Places/at. Written in X10; compiled to C++ or Java.
X10 Language Native Runtime: runtime support for core sequential X10 language features. Two versions: C++ and Java.
X10RT: active messages, collectives, bulk data transfer. Implemented in C. Abstracts/unifies network layers (PAMI, DCMF, MPI, etc.) to enable X10 on a range of transports.

27 Performance Model: async, finish, and when
Pool of worker threads in each Place. Each worker maintains a queue of pending asyncs. A worker pops an async from its own queue, executes it, and repeats. If a worker's queue is empty, it steals from another worker's queue.
Finish typically does not block the worker (it runs a subtask from the queue instead of blocking). Finish may block because of thieves or remote asyncs: worker 1 reaches the end of a finish block while worker 2 is still running an async.
A blocking finish (or a when) incurs an OS-level context switch and may incur thread creation cost (to maintain the set of active workers).
Cost analysis. async: queue operations (fences, at worst CAS); heap-allocated task object (allocation + GC cost). finish (single place): synchronized counter; heap-allocated finish object (allocation + GC cost).

28 Performance Model: Places and at At runtime, each Place is a separate process. Costs of at: serialization/deserialization (function of size of captured state); communication (if destination != here); scheduling (worker thread may block) (if destination != here)

29 Performance Model: Native X10 vs. Managed X10
Native X10: base performance model is C++
Object-oriented features more expensive: C++ compilers don't optimize virtual dispatch, interface invocation, instanceof, etc.
Memory management: BDWGC. Allocation approximately like malloc. Conservative (non-moving) GC; parallel, but not concurrent.
Generics map to C++ templates.
Structs map to C++ classes with no virtual methods that are inlined in the containing object/array.
C++ compilation is mostly file-based, which limits cross-class optimization.
More deterministic execution (less runtime jitter).
Managed X10: base performance model is Java
Object-oriented features less expensive: JVMs do optimize virtual dispatch, interface invocation, instanceof, etc.
Memory management: very fast allocation; highly tuned garbage collectors.
Generics map to Java generics: mostly erased; primitives boxed.
Structs map to Java classes (no difference between X10 structs and X10 classes).
JVM dynamically optimizes the entire program.
Less deterministic execution (more jitter) due to JIT compilation, profile-directed optimization, GC.

31 Towards a Scalable HelloWholeWorld
    class HelloWholeWorld {
        public static def main(args:Rail[String]) {
            finish
                for (p in Place.places())
                    at (p)
                        async
                            Console.OUT.println(p + " says " + args(0));
        }
    }
Why won't HelloWholeWorld scale?
1. Using most general (default) implementation of finish
2. Sequential initiation of asyncs
3. Serializing more data than is actually necessary (minor improvement)

32 Towards a Scalable HelloWholeWorld (2)
    public static def main(args:Rail[String]) {
        finish for (p in Place.places()) {
            // capture only the string that is needed, not the whole args rail
            val arg = args(0);
            at (p) async Console.OUT.println("(At " + here + ") : " + arg);
        }
    }
Minor change: optimize serialization to reduce message size

33 Towards a Scalable HelloWholeWorld (3)
    public static def main(args:Rail[String]) {
        finish for (p in Place.places()) {
            val arg = args(0);
            at (p) async Console.OUT.println("(At " + here + ") : " + arg);
        }
    }
Optimize finish implementation via Pragma annotation

34 Towards a Scalable HelloWholeWorld (4)
    public static def main(args:Rail[String]) {
        val arg = args(0);
        // one async per group of up to 32 places, started from the group's last place
        finish for (var i:Int=Place.numPlaces()-1; i>=0; i-=32) {
            at (Place(i)) async {
                val max = Runtime.hereInt();
                val min = Math.max(max-31, 0);
                // the group leader fans out to the places in its own group
                finish for (var j:Int=min; j<=max; ++j) {
                    at (Place(j)) async Console.OUT.println("(At " + here + ") : " + arg);
                }
            }
        }
    }
Parallelize async initiation: this scales nicely, but it is getting complicated!

35 Finally: a Simple and Scalable HelloWholeWorld
    public static def main(args:Rail[String]) {
        val arg = args(0);
        PlaceGroup.WORLD.broadcastFlat(()=>{
            Console.OUT.println("(At " + here + ") : " + arg);
        });
    }
Put the complexity in the standard library! broadcastFlat encapsulates the chunked loop & pragma; use a closure (function) to get a clean API.
Recurring theme in X10: simple primitives + strong abstraction = new powerful user-defined primitives

36 K-Means: description of the problem Given: a set of points (black dots) in D-dimensional space (example, D = 2) and a desired number of partitions K (example, K = 4). Calculate: the K cluster locations (red blobs). Minimize average distance to cluster.

37 Lloyd's Algorithm
Computes good (not best) clusters. Initialize with a guess for the K positions. Iteratively improve until we reach a fixed point.
Algorithm:
    while (!bored) {
        classify points by nearest cluster
        average equivalence classes for improved clusters
    }

39 Summary of Scalable KMeans in X10 Natural formulation as an SPMD-style computation Evenly distribute data points across Places Iteratively Each Place assigns its points to the closest cluster Each Place computes new cluster centroids Global reduction to average together computed centroids Use X10 Team APIs for scalable reductions Complete benchmark: 150 lines of X10 code
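The SPMD structure summarized above (each place classifies its own points, then a global reduction combines the per-cluster sums and counts) can be sketched sequentially. The following is a small Python illustration, not the benchmark's X10 code: a list of shards stands in for places, a plain sum stands in for the Team all-reduce, and the function names and the rule of keeping an empty cluster's old centre are assumptions of this sketch.

```python
def nearest(point, clusters):
    """Index of the cluster centre closest to point (squared Euclidean distance)."""
    return min(range(len(clusters)),
               key=lambda k: sum((p - c) ** 2 for p, c in zip(point, clusters[k])))


def local_step(points, clusters):
    """Work done by one place: per-cluster coordinate sums and point counts."""
    dim = len(clusters[0])
    sums = [[0.0] * dim for _ in clusters]
    counts = [0] * len(clusters)
    for point in points:
        k = nearest(point, clusters)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += point[d]
    return sums, counts


def kmeans(shards, clusters, iterations):
    """Lloyd's algorithm over data already split into one shard per place."""
    for _ in range(iterations):
        partials = [local_step(shard, clusters) for shard in shards]
        new_clusters = []
        for k in range(len(clusters)):
            # Element-wise sum over places stands in for the all-reduce.
            count = sum(counts[k] for _, counts in partials)
            total = [sum(sums[k][d] for sums, _ in partials)
                     for d in range(len(clusters[k]))]
            # Keep the old centre when a cluster attracts no points (assumption).
            new_clusters.append([t / count for t in total] if count else clusters[k])
        clusters = new_clusters
    return clusters
```

Because only sums and counts cross shard boundaries, the data exchanged per iteration depends on the number of clusters and dimensions, not on the number of points.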

40 KMEANS Goal: unsupervised clustering. Performance: 1 task: 6.13s; 1470 hosts: 6.27s; 97% efficiency. Configuration: 40000*#tasks points; 4096 clusters; 12 dimensions; run time for 5 iterations. Performance on 1470 node IBM Power775.

41 KMeans initialization
    PlaceGroup.WORLD.broadcastFlat(()=>{
        val role = Runtime.hereInt();
        val random = new Random(role);
        val host_points = new Rail[Float](num_slice_points*dim, (Int)=>random.next());
        val host_clusters = new Rail[Float](num_clusters*dim);
        val host_cluster_counts = new Rail[Int](num_clusters);
        val team = Team.WORLD;
        // place 0 seeds the clusters from its first points
        if (role == 0)
            for (k in 0..(num_clusters-1))
                for (d in 0..(dim-1))
                    host_clusters(k*dim+d) = host_points(k+d*num_slice_points);
        val old_clusters = new Rail[Float](num_clusters*dim);
        // all other places hold zeros, so the sum distributes place 0's seeds
        team.allreduce(role, host_clusters, 0, host_clusters, 0, host_clusters.size, Team.ADD);
        ... Main loop implementing Lloyd's algorithm (next slide) ...
    });

42 KMeans compute kernel
    for (iter in 0..(iterations-1)) {
        Array.copy(host_clusters, old_clusters);
        host_clusters.clear();
        host_cluster_counts.clear();
        // Assign each point to the closest cluster
        for (p in ) {
            sequential compute kernel elided
        }
        // Compute new clusters
        team.allreduce(role, host_clusters, 0, host_clusters, 0, host_clusters.size, Team.ADD);
        team.allreduce(role, host_cluster_counts, 0, host_cluster_counts, 0, host_cluster_counts.size, Team.ADD);
        for (k in 0..(num_clusters-1))
            for (d in 0..(dim-1))
                host_clusters(k*dim+d) /= host_cluster_counts(k);
        team.barrier(role);
    }

43 LU Factorization HPL-like algorithm 2-d block-cyclic data distribution right-looking variant with row partial pivoting (no look ahead) recursive panel factorization with pivot search and column broadcast combined
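As a reminder of what a 2-d block-cyclic distribution means, the block-to-place mapping can be written in a few lines. This is a generic Python sketch, not taken from the LU benchmark: the function names are hypothetical, and numbering places row-major over a `px` by `py` grid is an assumption of the illustration.

```python
def owner(block_row, block_col, px, py):
    """Place that owns block (block_row, block_col) on a px-by-py grid of places."""
    # Blocks are dealt out cyclically along each dimension.
    return (block_row % px) * py + (block_col % py)


def local_blocks(place, blocks_x, blocks_y, px, py):
    """All blocks of a blocks_x-by-blocks_y block matrix that one place owns."""
    return [(i, j)
            for i in range(blocks_x)
            for j in range(blocks_y)
            if owner(i, j, px, py) == place]
```

For a 4 x 4 block matrix on a 2 x 2 grid, `local_blocks(0, 4, 4, 2, 2)` yields blocks (0,0), (0,2), (2,0), and (2,2). Spreading every block row and block column over all places keeps the load balanced as the factorization shrinks the active submatrix.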

44 Summary of Scalable LU in X10 SPMD-style implementation: one top-level async per place (broadcastFlat); globally synchronous progress; collectives: barriers, broadcast, reductions (row, column, WORLD); point-to-point RDMAs for row swaps; uses BLAS for local computations. Complete benchmark: 650 lines of X10 plus C++ code to wrap native BLAS

45 LU Goal: LU decomposition. Performance: 1 task: gflops; 1 host: gflops; 1 drawer: gflops; 1024 hosts: gflops; 80% efficiency. Configuration: 71% memory (16M pages); p*p or 2p*p grid; block size: 360; panel block size: 20. Performance on 1470 node IBM Power775.

46 Accessing Native code class LU native static def blocktrisolve(me:rail[double], diag:rail[double], native static def blocktrisolvediag(diag:rail[double], min:int, max:int, B:Int):void;... // Use of blocktrisolve: blocktrisolvediag(diag, min, max, B);

47 Allocation of Distributed BlockedArray (in Huge Pages)
    public final class BlockedArray implements (Int,Int)=>Double {
        public def this(I:Int, J:Int, bx:Int, by:Int, rand:Random) {
            min_x = I*bx; min_y = J*by;
            max_x = (I+1)*bx-1; max_y = (J+1)*by-1;
            delta = by; offset = (I*bx+J)*by;
            // back the block with zeroed storage allocated in huge pages
            raw = new Rail[Double](
                IndexedMemoryChunk.allocateZeroed[Double](bx*by, 8, IndexedMemoryChunk.hugePages()));
        }
        public static def make(M:Int, N:Int, bx:Int, by:Int, px:Int, py:Int) {
            // one BlockedArray per place, reached through a PlaceLocalHandle
            return PlaceLocalHandle.makeFlat[BlockedArray](Dist.makeUnique(),
                ()=>new BlockedArray(M, N, bx, by, px, py));
        }
    }
    // In main program:
    val A = BlockedArray.make(M, N, B, B, px, py);

48 Row exchange via asynchronous RDMAs
    def exchange(row1:Int, row2:Int, min:Int, max:Int, dest:Int) {
        val source = here;
        ready = false;
        val size = A_here.getRow(row1, min, max, buffer);
        val _buffers = buffers; // use local variable to avoid serializing this!
        val _A = A;
        val _remoteBuffer = remoteBuffer; // RemoteArray (GlobalRef) wrapping
        finish {
            at (Place(dest)) async {
                // pull row1 from the source place and wait for it to arrive
                finish {
                    Array.asyncCopy[Double](_remoteBuffer, 0, _buffers(), 0, size);
                }
                val size2 = _A().swapRow(row2, min, max, _buffers());
                // push the displaced row2 back to the source place
                Array.asyncCopy[Double](_buffers(), 0, _remoteBuffer, 0, size2);
            }
        }
        A_here.setRow(row1, min, max, buffer);
    }

50 Global Load Balancing A significant class of compute-intensive problems can be cast as Collecting Finish (or "Map Reduce") problems (significantly simpler than Hadoop Map Shuffle Reduce problems). A parallel Map phase spawns work (may be driven by examining data in some shared global data structure). Each mapper may contribute 0 or more pieces of information to a result. These pieces can be combined by a commutative, associative reduction operator (cf. collecting finish). Q: How do you efficiently execute such programs if the amount of work associated with each mapper could vary dramatically? Examples: State-space search, Unbalanced Tree Search (UTS), Branch and Bound Search.

51 Global Load Balancing: Central Problem Goal: Efficiently (~ 90% efficiency) execute collecting finish programs on a large number of nodes. Problems to be solved: balance work across a large number of places efficiently; detect termination of the computation quickly. Conventional approach: a worker steals from a randomly chosen victim when local work is exhausted. Key problem: When does a worker know there is no more work?

52 Global Load Balance: The Lifeline Solution Intuition: X10 already has a framework for distributed termination detection: finish. Use it! But this requires that activities terminate at some point! Idea: Let activities terminate after w rounds of failed steals, but ensure that a worker with some work can distribute work (= use at(p) async S) to nodes known to be idle. Lifeline graphs: before a worker dies it creates z lifelines to other nodes. These form a distribution overlay graph. What kind of graph? Need a low out-degree, low diameter graph. Our paper/implementation uses a hypercube.
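To make the "low out-degree, low diameter" requirement concrete, here is a sketch of lifeline targets in a plain binary hypercube, written in Python. It is an illustration only: the source says a hypercube is used but does not give its exact construction, so the bit-flipping rule, the handling of place counts that are not powers of two, and the name `lifelines` are all assumptions.

```python
def lifelines(place, num_places):
    """Lifeline targets of `place`: its neighbours in a binary hypercube."""
    targets = []
    bit = 1
    while bit < num_places:
        neighbour = place ^ bit  # flip one bit of the place id
        if neighbour < num_places:  # skip ids beyond the last place
            targets.append(neighbour)
        bit <<= 1
    return targets
```

With 8 places, place 5 gets lifelines to places 4, 7, and 1. Each place has at most about log2(P) lifelines, and work pushed along lifelines can reach any place in at most that many hops.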

53 Algorithm sketch for Lifeline-based GLB Framework Main activity spawns an activity per place within a finish. Augment work-stealing with work-distributing: Each activity processes. When no work left, steals at most w-times. (Thief) If no work found, visits at most z-lifelines in turn. Donor records thief needs work (Thief) If no work found, dies. (Donor) Pushes work to recorded thieves, when it has some. When all activities die, main activity continues to completion. Caveat: This scheme assumes that work is relocatable

54 UTS: Unbalanced Tree Search Characteristics: highly irregular => need for global load balancing; geometric fixed law (-t 1 -a 3) => wide tree but shallow depth. Implementation: work queue of SHA1 hash of node + depth + interval of child ids; global work stealing: n random victims (among m possible targets), then p lifelines (hypercube); thief steals half of work (half of each node in the local work queue); finish counts lifeline steals only (25% performance improvement at scale); single thread (thief and victim cooperate). Complete benchmark: 560 lines of X10 (+ 800 lines of sequential C for SHA1 hash)

55 UTS Goal: unbalanced tree search. Performance: 1 task: 10.9 M nodes/s; 1470 hosts: 505 B nodes/s; 98% efficiency. Configuration: geometric fixed law; depth between 14 and 22; at scale (1470 hosts): -t 1 -a 3 -b 4 -r 19 -d 22, 69,312,400,211,958 nodes. Performance on 1470 node IBM Power775.

56 UTS Main Loop
    final def processatmostn() {
        var i:Int=0;
        for (; (i<n) && (queue.size>0); ++i) {
            queue.expand();
        }
        queue.count += i;
        return queue.size > 0;
    }
    final def processstack(st:PlaceLocalHandle[UTS]) {
        do {
            // work in batches of n, serving thieves between batches
            while (processatmostn()) {
                Runtime.probe();
                distribute(st);
            }
            reject(st);
        } while (steal(st));
    }

57 UTS: steal
    def steal(st:PlaceLocalHandle[UTS]) {
        if (P == 1) return false;
        val p = Runtime.hereInt();
        empty = true;
        // up to w random steal attempts
        for (var i:Int=0; i < w && empty; ++i) {
            waiting = true;
            at async st().request(st, p, false);
            while (waiting) Runtime.probe();
        }
        // then ask each lifeline that is not already activated
        for (var i:Int=0; (i<lifelines.size) && empty && (0<=lifelines(i)); ++i) {
            val lifeline = lifelines(i);
            if (!lifelinesactivated(lifeline)) {
                lifelinesactivated(lifeline) = true;
                waiting = true;
                at async st().request(st, p, true);
                while (waiting) Runtime.probe();
            }
        }
        return !empty;
    }

58 UTS: give
    def give(st:PlaceLocalHandle[UTS], loot:Queue.Fragment) {
        val victim = Runtime.hereInt();
        if (thieves.size() > 0) {
            val thief = thieves.pop();
            if (thief >= 0) {
                at async { st().deal(st, loot, victim); st().waiting = false; }
            } else {
                at async { st().deal(st, loot, -1); st().waiting = false; }
            }
        } else {
            val thief = lifelinethieves.pop();
            at (Place(thief)) async st().deal(st, loot, victim);
        }
    }

59 UTS: distribute & reject
    def distribute(st:PlaceLocalHandle[UTS]) {
        var loot:Queue.Fragment;
        while ((lifelinethieves.size() + thieves.size() > 0) && (loot = queue.grab()) != null) {
            give(st, loot);
        }
        reject(st);
    }
    def reject(st:PlaceLocalHandle[UTS]) {
        while (thieves.size() > 0) {
            val thief = thieves.pop();
            if (thief >= 0) {
                lifelinethieves.push(thief);
                at async { st().waiting = false; }
            } else {
                at async { st().waiting = false; }
            }
        }
    }

60 UTS: deal
    def deal(st:PlaceLocalHandle[UTS], loot:Queue.Fragment, source:Int) {
        val lifeline = source >= 0;
        if (lifeline) lifelinesactivated(source) = false;
        if (active) {
            empty = false;
            processloot(loot, lifeline);
        } else {
            // an idle place comes back to life when work arrives
            active = true;
            processloot(loot, lifeline);
            processstack(st);
            active = false;
        }
    }

62 X10 Global Matrix Library Easy-to-use matrix library in X10 supporting parallel execution on multiple places. Presents standard sequential interface (internal parallelism and distribution). Matrix data structures: dense matrix, symmetric and triangular matrix (column-wise storage); sparse matrix (CSC-LT); vector, distributed and duplicated vector; block matrix, distributed and duplicated block matrix. Matrix operations: scaling, cell-wise add, subtract, multiply, and divide; matrix multiplication; norm, trace, linear equation solver. Approximately 45,000 lines of X10 code (open source)

63 Global Matrix Library
[Figure: GML software stack. Components shown: GML; Vector, SparseCSC, Dense; Block matrix, Dupl. block, Distr. block; Dense matrix X10 driver, Sparse matrix X10 driver, Dense matrix BLAS wrap; X10; Native C/C++ back end; Managed Java back end; Team; MPI, Socket, PAMI, PGAS; BLAS, LAPACK, multi-thread BLAS (GotoBLAS); 3rd party C-MPI library]

64 Matrix Partitioning and Distribution
Grid partitioning: balanced partitioning; user-specified partitioning.
Block matrix: dense, sparse, or mixed blocks.
Block distribution (block-to-place mapping): unique, constant, cyclic, grid, vertical, horizontal, and random.
[Figure: an m x n grid of blocks B0 ... Bm*n-1, numbered column by column, mapped onto places P0 ... P3]

67 GML Example: PageRank
DML:
    while (I < max_iteration) {
        p = alpha*(G%*%p)+(1-alpha)*(e%*%u%*%p);
    }
X10:
    def step(G:, P:, GP:, dupGP:, UP:, P:, E, : ) {
        for (1..iteration) {
            GP.mult(G,P).scale(alpha).copyInto(dupGP); // broadcast
            P.local().mult(E, UP.mult(U,P.local())).scale(1-alpha).cellAdd(dupGP.local()).sync(); // broadcast
        }
    }
Programmer determines distribution of the data: G is block distributed per row; P is replicated.
Programmer writes sequential, distribution-aware code.
Perform usual optimizations: allocate big data structures outside the loop.
Could equally well have been in Java.
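The update on this slide, p = alpha*(G p) + (1 - alpha)*(e (u p)), can be checked on a tiny dense example. The Python sketch below is a sequential illustration only, not GML code. It assumes that e is a column vector and u a row vector (so u p is a scalar), that G is column-stochastic, and that iteration starts from the uniform vector; the function names are hypothetical.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def pagerank_step(G, p, e, u, alpha):
    """One update p <- alpha*(G p) + (1 - alpha)*e*(u . p)."""
    gp = matvec(G, p)
    up = sum(ui * pi for ui, pi in zip(u, p))  # u is a row vector, so u.p is a scalar
    return [alpha * g + (1 - alpha) * ei * up for g, ei in zip(gp, e)]


def pagerank(G, e, u, alpha, iterations):
    """Iterate from the uniform vector for a fixed number of steps."""
    n = len(G)
    p = [1.0 / n] * n
    for _ in range(iterations):
        p = pagerank_step(G, p, e, u, alpha)
    return p
```

In the distributed version the product G p is the expensive, block-distributed part, while the rank-one term only needs the replicated vector, which matches the split between the two statements in the X10 loop body.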

68 PageRank Performance
Test platforms: Triloka x86 cluster (TK): Managed X10 on socket transport, Native X10 on socket transport. Watson4P (BlueGene/P): Native X10 on BlueGene DCMF transport.
[Figure: two PageRank plots. Time (sec) vs. nonzeros in G (million), 64 places; Time (sec) vs. number of places (processors), 1000 million nonzeros in G. Series: Java-socket TK, Native-socket TK, Native-pgas BGP]

69 GML Example: Gaussian Non-Negative Matrix Factorization
Key kernel for topic modeling.
Involves factoring a large (D x W) matrix: D ~ 100M; W ~ 100K, but sparse (0.001).
Key decision is representation for matrix, and its distribution. Note: app code is polymorphic in this choice.
Iterative algorithm, involves distributed sparse matrix multiplication, cell-wise matrix operations.
[Figure: matrices V, W, and H distributed over places P0, P1, P2, ..., Pn]
X10:
    for (1..iteration) {
        H.cellMult(WV.transMult(W,V,tW).cellDiv(WWH.mult(WW.transMult(W,W),H)));
        W.cellMult(VH.multTrans(V,H).cellDiv(WHH.mult(W,HH.multTrans(H,H))));
    }
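Read as matrix algebra, the two X10 statements are the familiar multiplicative updates H <- H .* (W^T V) ./ (W^T W H) and W <- W .* (V H^T) ./ (W H H^T). The Python sketch below spells that out on small dense lists. It is an illustration under assumptions, not GML code: the helper names are hypothetical and a small `eps` is added to denominators to avoid division by zero. Note that in the code W is a factor matrix, not the dimension called W in the slide text.

```python
def matmul(A, B):
    """Dense matrix product on lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def cell_update(X, numer, denom, eps=1e-12):
    """Cell-wise X * numer / denom; eps guards against division by zero."""
    return [[x * n / (d + eps) for x, n, d in zip(rx, rn, rd)]
            for rx, rn, rd in zip(X, numer, denom)]


def gnmf_step(V, W, H):
    """One round of the multiplicative updates for V ~ W H."""
    Wt = transpose(W)
    H = cell_update(H, matmul(Wt, V), matmul(matmul(Wt, W), H))
    Ht = transpose(H)
    W = cell_update(W, matmul(V, Ht), matmul(W, matmul(H, Ht)))
    return W, H


def error(V, W, H):
    """Squared Frobenius norm of V - W H."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh))
```

Grouping the products as (W^T W) H and W (H H^T), as the X10 code also does, keeps the intermediate matrices small when the inner dimension is small.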

70 GNMF Performance
Test platforms: Triloka Cluster (TK): 16 Quad-Core AMD Opteron(tm) 2356 nodes, 1-Gb/s ethernet; Watson4P BlueGene/P system (BGP): 2 cards, 64 nodes.
Test settings: run on 64 places. V: sparsity 0.001, (1,000,000 x 100,000) ~ (10,000,000 x 1,000,000). W: dense matrix (1,000,000 x 10) ~ (10,000,000 x 10). H: dense matrix (10 x 1,000,000).
[Figure: two plots, GNMF runtime and GNMF computation time. Time (sec) vs. nonzeros (million) in V. Series: Java-socket TK, Native-socket TK, Native-pgas BGP]

71 Global Matrix Library Summary Easy to use Simplified matrix operation interface Managed back end simplifies development, while native back end provides high performance computing capability Hiding concurrency and distribution details from apps developers Hierarchical parallel computing capability Within a node: multi-thread BLAS/LAPACK for intensive computation Between nodes: collective communication for coarse-grained parallel computing Adaptability Data layout compatible to existing specifications (BLAS, CSC-LT). Portability Supporting MPI, socket, PAMI and BlueGene transport Availability Open source; x10.gml project in X10 project main svn on SourceForge.net

72 SatX10
Framework to combine existing sequential SAT solvers into a parallel portfolio solver
- Diversity: small number of base solvers, but different parameterizations
- Knowledge sharing: exchange learned clauses between solvers
Structure of SatX10
- Minimal modifications to each existing solver (add callbacks for knowledge sharing)
- Small X10/C++ interoperability wrapper for each solver
- Small (100s of lines) X10 program for communication/distribution
Allows the parallel solver to run on a single machine with multiple cores and across multiple machines, sharing information such as learned clauses
More information:
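The portfolio idea can be illustrated with a toy example: run the same formula through several differently parameterized solvers at once, take the first answer, and stop the rest. The sketch below is an assumption-laden stand-in, not SatX10: it uses Python threads instead of X10 places, a tiny DPLL procedure instead of Minisat or Glucose, and it omits clause sharing entirely. All names (simplify, dpll, solve_portfolio) and the (seed, prefer_true) parameterization are hypothetical.

```python
import random
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed

def simplify(clauses, lit):
    """Make lit true: drop satisfied clauses, remove -lit; None on an empty clause."""
    out = []
    for clause in clauses:
        if lit in clause:
            continue
        reduced = [x for x in clause if x != -lit]
        if not reduced:
            return None
        out.append(reduced)
    return out

def dpll(clauses, assignment, rng, prefer_true, killed):
    """Toy DPLL over non-empty clauses. Returns a list of true literals, or None."""
    if killed.is_set():
        return None                      # another solver already answered
    if not clauses:
        return assignment
    for clause in clauses:               # unit propagation
        if len(clause) == 1:
            rest = simplify(clauses, clause[0])
            if rest is None:
                return None
            return dpll(rest, assignment + [clause[0]], rng, prefer_true, killed)
    var = abs(rng.choice(rng.choice(clauses)))   # seed-dependent branching
    for lit in ((var, -var) if prefer_true else (-var, var)):
        rest = simplify(clauses, lit)
        if rest is not None:
            model = dpll(rest, assignment + [lit], rng, prefer_true, killed)
            if model is not None:
                return model
    return None

def solve_portfolio(clauses, configs):
    """Run one solver per (seed, prefer_true) config; the first answer wins."""
    killed = threading.Event()

    def run(config):
        seed, prefer_true = config
        model = dpll(clauses, [], random.Random(seed), prefer_true, killed)
        if model is None and killed.is_set():
            return None                  # stopped early, no verdict
        killed.set()                     # tell the other solvers to stop
        return ("SAT", model) if model is not None else ("UNSAT", None)

    with ThreadPoolExecutor(max_workers=len(configs)) as pool:
        futures = [pool.submit(run, c) for c in configs]
        for future in as_completed(futures):
            if future.result() is not None:
                return future.result()
    return None

if __name__ == "__main__":
    cnf = [[1, 2], [-1, 3], [-2, -3], [2, 3]]    # DIMACS-style integer literals
    print(solve_portfolio(cnf, [(0, True), (1, False), (2, True)]))
```

The shared Event plays the role of a kill signal: whichever solver reaches a verdict first sets it, and the others return at their next check without reporting a result.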

73 SatX10 Architecture
SatX10 Framework 1.0

SolverX10Callback
- SolverSatX10Base *
- Callback methods: x10_step(), x10_processoutgoingcls()

SolverSatX10Base
- Data objects: Callback *, placeid, maxlenshrcl, outbufsize, incomingclsq, outgoingclsq
- Pure virtual methods: x10_parsedimacs(), x10_nvars(), x10_nclauses(), x10_solve(), x10_printsoln(), x10_printstats()
- Other controls: x10_kill(), x10_waskilled(), x10_accessincbuf(), x10_accessoutbuf()

SatX10.x10
- Main X10 routines to launch solvers at various places: Glucose::Solver *, Minisat::Solver *

SatX10 Solver
- Implements callbacks of base class
- CallbackStats (data)
- Routines for X10 to interact with solvers: solve(), kill(), bufferincomingcls(), printinstanceinfo(), printresults()

Specialization for individual solvers

SatX10 Minisat
- Specialized solver: Minisat::Solver *
- Implements pure virtual methods of base class
- Other methods: bufferoutgoingcls(), processincomgcls()

SatX10 Glucose
- Specialized solver: Glucose::Solver *
- Implements pure virtual methods of base class
- Other methods: bufferoutgoingcls(), processincomgcls()
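The division of labor in this architecture (a base class that owns the clause buffers, per-solver subclasses that implement the solving methods, and a callback that carries learned clauses between places) can be sketched in a few lines. The Python below is a hypothetical analogue, not the SatX10 C++ or X10 code: the class and method names are invented for illustration, and reading maxlenshrcl as a cap on the length of shared clauses is an assumption.

```python
from abc import ABC, abstractmethod
from collections import deque

class SolverBase(ABC):
    """Hypothetical analogue of a solver base class with sharing buffers."""

    def __init__(self, place_id, max_len_shared, callback):
        self.place_id = place_id
        self.max_len_shared = max_len_shared   # assumed cap on shared clause length
        self.callback = callback               # called as callback(place_id, clause)
        self.incoming = deque()                # clauses received from other places
        self.outgoing = deque()                # clauses waiting to be sent
        self.killed = False

    @abstractmethod
    def solve(self):
        """Implemented by each specialized solver."""

    def kill(self):
        self.killed = True

    def buffer_outgoing(self, clause):
        # Only short clauses are shared; long ones stay local.
        if len(clause) <= self.max_len_shared:
            self.outgoing.append(tuple(clause))

    def process_outgoing(self):
        # Hand every buffered clause to the callback for distribution.
        while self.outgoing:
            self.callback(self.place_id, self.outgoing.popleft())

    def buffer_incoming(self, clause):
        self.incoming.append(clause)

class ToySolver(SolverBase):
    """Stand-in for a specialized solver; 'learns' a fixed list of clauses."""

    def __init__(self, place_id, max_len_shared, callback, learned):
        super().__init__(place_id, max_len_shared, callback)
        self.learned = learned

    def solve(self):
        for clause in self.learned:
            if self.killed:
                break
            self.buffer_outgoing(clause)
        self.process_outgoing()

class Hub:
    """Stand-in for the distribution program: forwards clauses to all other places."""

    def __init__(self):
        self.solvers = {}

    def register(self, solver):
        self.solvers[solver.place_id] = solver

    def share(self, sender_id, clause):
        for place_id, solver in self.solvers.items():
            if place_id != sender_id:
                solver.buffer_incoming(clause)

if __name__ == "__main__":
    hub = Hub()
    a = ToySolver(0, 1, hub.share, [[1], [1, 2, 3]])
    b = ToySolver(1, 1, hub.share, [])
    hub.register(a)
    hub.register(b)
    a.solve()
    print(list(b.incoming))
```

Keeping the buffers and the callback hook in the base class is what lets each wrapped solver stay nearly unmodified: it only has to call the buffering method when it learns a clause.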

74 Preliminary Empirical Results: Same Machine Same machine, 8 cores, clause lengths=1 and 8 Note: Promising but preliminary results; focus so far has been on developing the framework, not on producing a highly competitive solver

75 Preliminary Empirical Results: Multiple Machines 8 machines/8 cores vs 16 machines/64 cores, clause lengths=1 and 8 Same executable as for single machine, just different parameters!

77 X10 Project Highlights since SC 11
- Three major releases: 2.2.2, 2.2.3,
  - Maintained backwards compatibility with X language (June 2011)
  - Language backwards compatibility with will be maintained in future releases
- Smooth Java interoperability
  - Major feature of X release
- Application work at IBM
  - M3R: Main Memory Map Reduce (VLDB 12)
  - Global Matrix Library (open sourced Oct 2011; available in x10 svn)
  - SatX10 (SAT 12)
  - HPC benchmarks (for PERCS) run at scale on 47,000 core Power775

78 Summary of X10 12 Workshop
Second annual X10 workshop (co-located with PLDI/ECOOP in June 2012)
Papers
1. Fast Method Dispatch and Effective Use of Primitives for Reified Generics in Managed X10
2. Distributed Garbage Collection for Managed X10
3. StreamX10: A Stream Programming Framework on X10
4. X10-based Massive Parallel Large-Scale Traffic Flow Simulation
5. Characterization of Smith-Waterman Sequence Database Search in X10
6. Introducing ScaleGraph: An X10 Library for Billion Scale Graph Analytics
Invited talks
1. A tutorial on X10 and its implementation
2. M3R: Increased Performance for In-Memory Hadoop Jobs
X10 13 will be held in June 2013 co-located with PLDI 13

79 X10 Community: StreamX10 Slide by Haitao Wei, X