Developing Scalable Parallel Applications in X10

Size: px
Start display at page:

Download "Developing Scalable Parallel Applications in X10"

Transcription

1 Developing Scalable Parallel Applications in X10 David Grove, Vijay Saraswat, Olivier Tardieu IBM TJ Watson Research Center Based on material from previous X10 Tutorials by Bard Bloom, David Cunningham, Evelyn Duesterwald Steve Fink, Nate Nystrom, Igor Peshansky, Christoph von Praun, and Vivek Sarkar Additional contributions from Anju Kambadur, Josh Milthorpe and Juemin Zhang This material is based upon work supported in part by the Defense Advanced Research Projects Agency under its Agreement No. HR Please see for the most up-to-date version of this tutorial.

2 Tutorial Objectives Learn to develop scalable parallel applications using X10 X10 language and class library fundamentals Effective patterns of concurrency and distribution Case studies: draw material from real working code Exposure to Asynchronous Partitioned Global Address Space (APGAS) Programming Model Concepts broader than any single language Move beyond Single Program/Multiple Data (SPMD) to more flexible paradigms Updates on the X10 project and user/research community Highlight what is new in X10 since SC 11 Gather feedback from you About your application domains About X10

3 Tutorial Outline X10 Overview (70 minutes) Programming model concepts Key language features Implementation highlights Scaling SPMD-style computations in X10 (40 minutes) Scaling finish/async Scaling communication Collective operations Unbalanced computations (20 minutes) Global Load Balancing & UTS Application frameworks (20 minutes) Global Matrix Library SAT X10 Summary and Discussion (20 minutes)

4 X10 Genesis: DARPA HPCS Program (2004) Central Challenge: Productive Programming of Large Scale Supercomputers Clustered systems 1000 s of SMP Nodes connected by high-performance interconnect Large aggregate memory, disk storage, etc. Massively Parallel Processor Massively Parallel Processor Systems Systems SMP SMPClusters Clusters SMP Node SMP Node PEs, PEs, PEs, PEs, Memory... Memory Interconnect IBM IBMBlue BlueGene Gene

5 Flash forward a few years Big Data and Commercial HPC Central Challenge: Productive programming of large commodity clusters Clustered systems 100 s to 1000 s of SMP Nodes connected by high-performance network Large aggregate memory, disk storage, etc. SMP SMPClusters Clusters SMP Node PEs, PEs,... SMP Node... Memory PEs, PEs,... Memory Network Commodity cluster!= MPP system but the programming model problem is highly similar

6 X10 Performance and Productivity at Scale An evolution of Java for concurrency, scale out, and heterogeneity Language focuses on high productivity and high performance Bring productivity gains from commercial world to HPC developers The X10 language provides: Java-like language (statically typed, object oriented, garbage-collected) Ability to specify scale-out computations (multiple places, exploit modern networks) Ability to specify fine-grained concurrency (exploit multi-core) Single programming model for computation offload and heterogeneity (exploit GPUs) Migration path X10 concurrency/distribution idioms can be realized in other languages via library APIs that wrap the X10 runtime X10 interoperability with Java and C/C++ enables reuse of existing libraries

7 Partitioned Global Address Space (PGAS) Languages In clustered systems, memory is only accessible to the CPUs on its node Managing local vs. remote memory is a key programming task PGAS combines a single logical global address with locality awareness PGAS Languages: Titanium, UPC, CAF, X10, Chapel

8 X10 combines PGAS with asynchrony (APGAS) Global Reference Local Heap Local Heap Activities Activities Place 0 Place N Fine grained concurrency async S Place-shifting operations at (P) S at (P) { e Sequencing finish S Atomicity when (c) S atomic S

9 Hello Whole World 1/class HelloWholeWorld { 2/ public static def main(args:rail[string]) { 3/ finish 4/ for (p in Place.places()) 5/ at (p) 6/ async 7/ Console.OUT.println(p+" says " +args(0)); 8/ 9/ % x10c++ HelloWholeWorld.x10 % X10_NPLACES=4;./a.out hello Place 0 says hello (" Place 2 says hello Place 3 says hello Place 1 says hello

10 Sequential Monty Pi import x10.io.console; import x10.util.random; class MontyPi { public static def main(args:array[string](1)) { val N = Int.parse(args(0)); val r = new Random(); var result:double = 0; for (1..N) { val x = r.nextdouble(); val y = r.nextdouble(); if (x*x + y*y <= 1) result++; val pi = 4*result/N; Console.OUT.println( The value of pi is + pi);

11 Concurrent Monty Pi import x10.io.console; import x10.util.random; class MontyPi { public static def main(args:array[string](1)) { val N = Int.parse(args(0)); val P = Int.parse(args(1)); val result = new Cell[Double](0); finish for (1..P) async { val r = new Random(); var myresult:double = 0; for (1..(N/P)) { val x = r.nextdouble(); val y = r.nextdouble(); if (x*x + y*y <= 1) myresult++; atomic result() += myresult; val pi = 4*(result())/N; Console.OUT.println( The value of pi is + pi);

12 Concurrent Monty Pi (Collecting Finish) import x10.io.console; import x10.util.random; class MontyPi { public static def main(args:array[string](1)) { val N = Int.parse(args(0)); val P = Int.parse(args(1)); val result = finish (Reducible.SumReducer[Double]()) { for (1..P) async { val r = new Random(); var myresult:double = 0; for (1..(N/P)) { val x = r.nextdouble(); val y = r.nextdouble(); if (x*x + y*y <= 1) myresult++; offer myresult; ; val pi = 4*result/N; Console.OUT.println( The value of pi is + pi);

13 Distributed Monty Pi (Collecting Finish) import x10.io.console; import x10.util.random; class MontyPi { public static def main(args:array[string](1)) { val N = Int.parse(args(0)); val result = finish (Reducible.SumReducer[Double]()) { for (p in Place.places()) at (p) async { val r = new Random(); var myresult:double = 0; for (1..(N/Place.MAX_PLACES)) { val x = r.nextdouble(); val y = r.nextdouble(); if (x*x + y*y <= 1) myresult++; offer myresult; ; val pi = 4*result/N; Console.OUT.println( The value of pi is + pi);

14 Distributed Monty Pi (GlobalRef) import x10.io.console; import x10.util.random; class MontyPi { public static def main(args:array[string](1)) { val N = Int.parse(args(0)); val result = GlobalRef[Cell[Double]](new Cell[Double](0)); finish for (p in Place.places()) at (p) async { val r = new Random(); var myresult:double = 0; for (1..(N/Place.MAX_PLACES)) { val x = r.nextdouble(); val y = r.nextdouble(); if (x*x + y*y <= 1) myresult++; val valresult = myresult; at (result) atomic result()() += valresult; val pi = 4*(result()())/N; Console.OUT.println( The value of pi is + pi);

15 Overview of Language Features Many sequential features of Java inherited unchanged Classes (single inheritance) Interfaces (multiple inheritance) Instance and static fields Constructors Overloaded, over-rideable methods Automatic memory management (GC) Familiar control structures, etc Structs Headerless, inlined, user-defined primitives Closures Encapsulate function as code block + dynamic environment Type system extensions Dependent types Generic types (instantiation-based) Type inference (local) Concurrency Fine-grained concurrency async S; Atomicity and synchronization atomic S; when (c) S; Ordering finish S; Distribution Place shifting at (p) S; Distributed object model Automatic serialization object graphs GlobalRef[T] to create cross-place references PlaceLocalHandle[T] to define objects with distributed state Class library support for RDMA, collectives, etc.

16 async Stmt ::= async Stmt cf Cilk s spawn async S Creates a new child activity that executes statement S Returns immediately S may reference variables in enclosing blocks Activities cannot be named Activity cannot be aborted or cancelled 16 // Compute the Fibonacci // sequence in parallel. def fib(n:int):int { if (n < 2) return 1; val f1:int; val f2:int; finish { async f1 = fib(n-1); f2 = fib(n-2); return f1+f2;

17 finish Stmt ::= finish Stmt cf Cilk s sync finish S Execute S, but wait until all (transitively) spawned asyncs have terminated. finish is useful for expressing synchronous operations on (local or) remote data. Rooted exception model 17 Trap all exceptions thrown by spawned activities. Throw an (aggregate) exception if any spawned async terminates abruptly. // Compute the Fibonacci // sequence in parallel. def fib(n:int):int { if (n < 2) return 1; val f1:int; val f2:int; finish { async f1 = fib(n-1); f2 = fib(n-2); return f1+f2; implicit finish around program main

18 atomic atomic S Execute statement S atomically Atomic blocks are conceptually executed in a serialized order with respect to all other atomic blocks: isolation and weak atomicity. An atomic block body (S)... must be nonblocking must not create concurrent activities (sequential) must not access remote data (local) restrictions checked dynamically 18 Stmt ::= atomic Statement MethodModifier ::= atomic // target defined in lexically // enclosing scope. atomic def CAS(old:Object, n:object) { if (target.equals(old)) { target = n; return true; return false; // push data onto concurrent // list-stack val node = new Node(data); atomic { node.next = head; head = node;

19 when when (E) S Activity suspends until a state in which the guard E is true. In that state, S is executed atomically and in isolation. Guard E is a boolean expression must be nonblocking must not create concurrent activities (sequential) must not access remote data (local) must not have side-effects (const) restrictions mostly checked dynamically; const restriction is not checked. 19 Stmt ::= WhenStmt WhenStmt ::= when ( Expr ) Stmt WhenStmt or (Expr) Stmt class OneBuffer { var datum:object = null; var filled:boolean = false; def send(v:object) { when (!filled ) { datum = v; filled = true; def receive():object { when (filled) { val v = datum; datum = null; filled = false; return v;

20 at Stmt ::= at (p) Stmt at (p) S Execute statement S at place p Current activity is blocked until S completes Deep copy of local object graph to target place; the variables referenced from S define the roots of the graph to be copied. GlobalRef[T] used to create remote pointers across at boundaries class C { var x:int; def this(n:int) { x = n; // Increment remote counter def inc(c:globalref[c]) { at (c.home) c().x++; // Create GR of C static def make(init:int) { val c = new C(init); return GlobalRef[C](c); 20

21 Distributed Object Model Objects live in a single place Objects can only be accessed in the place where they live Implementing at Compiler analyzes the body of at and identifies roots to copy (exposed variables) Complete object graph reachable from roots is serialized and sent to destination place A new (unrelated) copy of the object graph is created at the destination place Controlling object graph serialization Instance fields of class may be declared transient (won t be copied) GlobalRef[T] allows cross-place pointers to be created

22 Take a breath Whirlwind tour through X10 language More details will be introduced as we work through the example codes See for specification, samples, etc. Next topic: brief overview of the X10 implementation Questions so far?

23 X10 Target Environments High-end large HPC clusters BlueGene/P (since 2010); BlueGene/Q (in progress) Power7IH (aka PERCS machine) x86 + InfiniBand, Power + InfiniBand Goal: deliver scalable performance competitive with C+MPI Medium-scale commodity systems ~100 nodes (~1000 core and ~1 terabyte main memory) Goal: deliver main-memory performance with simple programming model (accessible to Java programmers) Developer laptops Linux, Mac OSX, Windows. Eclipse-based IDE, debugger, etc Goal: support developer productivity

24 X10 Implementation Summary X10 Implementations C++ based ( Native X10 ) Multi-process (one place per process; multi-node) Linux, AIX, MacOS, Cygwin, BlueGene x86, x86_64, PowerPC JVM based ( Managed X10 ) Multi-process (one place per JVM process; multi-node) Limitation on Windows to single process (single place) Runs on any Java 6 JVM X10DT (X10 IDE) available for Windows, Linux, Mac OS X Based on Eclipse 3.7 Supports many core development tasks including remote build/execute facilities IBM Parallel Debugger for X10 Programming Adds X10 language support to IBM Parallel Debugger Available on IBM developerworks (Native X10 on Linux only)

25 X10 Compilation X10 Compiler Front-End Parsing / Type Check X10 Source X10 AST AST Optimizations AST Lowering X10 AST C++ Back-End XRC C++ Code Generation Java Code Generation C++ Source Java Source C++ Compiler XRX Java Compiler Native Code XRJ Bytecode Native X10 Managed X10 JNI Native Env Java Back-End X10RT Java VMs

26 X10 Runtime Software Stack X10 Runtime X10 Application Program X10 Core Class Libraries XRX Runtime X10 Language Native Runtime X10RT PAMI DCMF MPI TCP/IP Core Class Libraries Fundamental classes & primitives, Arrays, core I/O, collections, etc Written in X10; compiled to C++ or Java XRX (X10 Runtime in X10) APGAS functionality Concurrency: async/finish (workstealing) Distribution: Places/at Written in X10; compiled to C++ or Java X10 Language Native Runtime Runtime support for core sequential X10 language features Two versions: C++ and Java X10RT Active messages, collectives, bulk data transfer Implemented in C Abstracts/unifies network layers (PAMI, DCMF, MPI, etc) to enable X10 on a range of transports

27 Performance Model: async, finish, and when Pool of worker threads in each Place Each worker maintains a queue of pending asyncs Worker pops async from its own queue, executes it, repeat If worker s queue empty, steal from another worker s queue Finish typically does not block worker (run subtask in queue instead of blocking) Finish may block because of thieves or remote async worker 1 reaches end of finish block while worker 2 is still running async A blocking finish (or a when) incurs OS-level context switch May incur thread creation cost (maintain set of active workers) Cost analysis async queue operations (fences, at worst CAS) heap allocated task object (allocation + GC cost) finish (single place) synchronized counter heap allocated finish object (allocation + GC cost)

28 Performance Model: Places and at At runtime, each Place is a separate process Costs of at Serialization/deserialization (function of size of captured state) Communication (if destination!= here) Scheduling (worker thread may block) (if destination!= here)

29 Performance Model: Native X10 vs. Managed X10 Native X10: base performance model is C++ Managed X10: base performance model is Java Object-oriented features more expensive C++ compilers don t optimize virtual dispatch, interface invocation, instanceof, etc Memory management: BDWGC Allocation approx like malloc Conservative GC, (non-moving) Parallel, but not concurrent Object-oriented features less expensive JVMs do optimize virtual dispatch, interface invocation, instanceof, etc Memory management Very fast allocation Highly tuned garbage collectors Generics map to C++ templates Structs map to C++ classes with no virtual methods that are inlined in containing object/array C++ compilation is mostly file-based, which limits cross-class optimization More deterministic execution (less runtime jitter) Generics map to Java generics Mostly erased Primitives boxed Structs map to Java classes (no difference between X10 structs and X10 classes) JVM dynamically optimizes entire program Less deterministic execution (more jitter) due to JIT compilation, profile-directed optimization, GC

30 Tutorial Outline X10 Overview (70 minutes) Programming model concepts Key language features Implementation highlights Scaling SPMD-style computations in X10 (40 minutes) Scaling finish/async Scaling communication Collective operations Unbalanced computations (20 minutes) Global Load Balancing & UTS Application frameworks (20 minutes) Global Matrix Library SAT X10 Summary and Discussion (20 minutes)

31 Towards a Scalable HelloWholeWorld 1/class HelloWholeWorld { 2/ public static def main(args:rail[string]) { 3/ finish 4/ for (p in Place.places()) 5/ at (p) 6/ async 7/ Console.OUT.println(p+" says " +args(0)); 8/ 9/ Why won t HelloWholeWorld scale? 1. Using most general (default) implementation of finish 2. Sequential initiation of asyncs 3. Serializing more data than is actually necessary (minor improvement)

32 Towards a Scalable HelloWholeWorld (2) public static def main(args:rail[string]) { finish for (p in Place.places()) { val arg = args(0); at (p) async Console.OUT.println("(At " + here + ") : " + arg); Minor change: optimize serialization to reduce message size

33 Towards a Scalable HelloWholeWorld (3) public static def main(args:rail[string]) finish for (p in Place.places()) { val arg = args(0); at (p) async Console.OUT.println("(At " + here + ") : " + arg); Optimize finish implementation via Pragma annotation

34 Towards a Scalable HelloWholeWorld (4) public static def main(args:rail[string]) finish for(var i:int=place.numplaces()-1; i>=0; i-=32) { at (Place(i)) async { val max = Runtime.hereInt(); val min = Math.max(max-31, finish for (var j:int=min; j<=max; ++j) { at (Place(j)) async Console.OUT.println("(At "+here+ ") : " + arg); Parallelize async initiation this scales nicely, but it is getting complicated!

35 Finally: a Simple and Scalable HelloWholeWorld public static def main(args:rail[string]) { val arg = args(0); PlaceGroup.WORLD.broadcastFlat(()=>{ Console.OUT.println("(At " + here + ") : " + arg); ); Put the complexity in the standard library! broadcastflat encapsulates the chunked loop & pragma; use a closure (function) to get a clean API Recurring theme in X10: simple primitives + strong abstraction = new powerful user-defined primitives

36 K-Means: description of the problem Given: A set of points (black dots) in D-dimensional space (example, D = 2) Desired number of partitions K (example, K = 4) Calculate: The K cluster locations (red blobs). Minimize average distance to cluster 36

37 Lloyd's Algorithm Computes good (not best) clusters. Initialize with a guess for the K positions. Iteratively improve until we reach fixed point. Algorithm: while (!bored) { classify points by nearest cluster average equivalence classes for improved clusters 37

38 Lloyd's Algorithm Computes good (not best) clusters. Initialize with a guess for the K positions. Iteratively improve until we reach fixed point. Algorithm: while (!bored) { classify points by nearest cluster average equivalence classes for improved clusters 38

39 Summary of Scalable KMeans in X10 Natural formulation as an SPMD-style computation Evenly distribute data points across Places Iteratively Each Place assigns its points to the closest cluster Each Place computes new cluster centroids Global reduction to average together computed centroids Use X10 Team APIs for scalable reductions Complete benchmark: 150 lines of X10 code

40 KMEANS Goal unsupervised clustering Performance 1 task: 6.13s 1470 hosts: 6.27s 97% efficiency Configuration 40000*#tasks points 4096 clusters 12 dimensions run time for 5 iterations Performance on 1470 node IBM Power775 40

41 KMeans initialization PlaceGroup.WORLD.broadcastFlat(()=>{ val role = Runtime.hereInt(); val random = new Random(role); val host_points = new Rail[Float](num_slice_points*dim, (Int)=>random.next()); val host_clusters = new Rail[Float](num_clusters*dim); val host_cluster_counts = new Rail[Int](num_clusters); val team = Team.WORLD; if (role == 0) for (k in 0..(num_clusters-1)) for (d in 0..(dim-1)) host_clusters(k*dim+d) = host_points(k+d*num_slice_points); val old_clusters = new Rail[Float](num_clusters*dim); team.allreduce(role, host_clusters, 0, host_clusters, 0, host_clusters.size, Team.ADD);... Main loop implementing Lloyd s algorithm (next slide)... );

42 KMeans compute kernel for (iter in 0..(iterations-1)) { Array.copy(host_clusters, old_clusters); host_clusters.clear(); host_cluster_counts.clear(); // Assign each point to the closest cluster for (p in ) { sequential compute kernel elided // Compute new clusters team.allreduce(role, host_clusters, 0, host_clusters, 0, host_clusters.size, Team.ADD); team.allreduce(role, host_cluster_counts, 0, host_cluster_counts, 0, host_cluster_counts.size, Team.ADD); for (k in 0..(num_clusters-1)) for (d in 0..(dim-1)) host_clusters(k*dim+d) /= host_cluster_counts(k); team.barrier(role);

43 LU Factorization HPL-like algorithm 2-d block-cyclic data distribution right-looking variant with row partial pivoting (no look ahead) recursive panel factorization with pivot search and column broadcast combined

44 Summary of Scalable LU in X10 SPMD-style implementation one top-level async per place (broadcastflat) globally synchronous progress collectives: barriers, broadcast, reductions (row, column, WORLD) point-to-point RDMAs for row swaps uses BLAS for local computations Complete Benchmark: 650 lines of X lines of C++ to wrap native BLAS

45 LU Goal LU decomposition Performance 1 task: gflops 1 host: gflops 1 drawer: gflops 1024 hosts: gflops 80% efficiency Configuration 71% memory (16M pages) p*p or 2p*p grid block size: 360 panel block size: 20 Performance on 1470 node IBM Power775 45

46 Accessing Native code class LU native static def blocktrisolve(me:rail[double], diag:rail[double], native static def blocktrisolvediag(diag:rail[double], min:int, max:int, B:Int):void;... // Use of blocktrisolve: blocktrisolvediag(diag, min, max, B);

47 Allocation of Distributed BlockedArray (in Huge Pages) public final class BlockedArray implements (Int,Int)=>Double { public def this(i:int, J:Int, bx:int, by:int, rand:random) { min_x = I*bx; min_y = J*by; max_x = (I+1)*bx-1; max_y = (J+1)*by-1; delta = by; offset = (I*bx+J)*by; raw = new Rail[Double]( IndexedMemoryChunk.allocateZeroed[Double](bx*by, 8, IndexedMemoryChunk.hugePages())); public static def make(m:int, N:Int, bx:int, by:int, px:int, py:int) { return PlaceLocalHandle.makeFlat[BlockedArray](Dist.makeUnique(), ()=>new BlockedArray(M, N, bx, by, px,py)); // In main program: val A = BlockedArray.make(M, N, B, B, px, py);

48 Row exchange via asynchronous RDMAs def exchange(row1:int, row2:int, min:int, max:int, dest:int) { val source = here; ready = false; val size = A_here.getRow(row1, min, max, buffer); val _buffers = buffers; // use local variable to avoid serializing this! val _A = A; val _remotebuffer = remotebuffer; // RemoteArray (GlobalRef) wrapping finish { at (Place(dest)) async finish { Array.asyncCopy[Double](_remoteBuffer, 0, _buffers(), 0, size); val size2 = _A().swapRow(row2, min, max, _buffers()); Array.asyncCopy[Double](_buffers(), 0, _remotebuffer, 0, size2); A_here.setRow(row1, min, max, buffer);

49 Tutorial Outline X10 Overview (70 minutes) Programming model concepts Key language features Implementation highlights Scaling SPMD-style computations in X10 (40 minutes) Scaling finish/async Scaling communication Collective operations Unbalanced computations (20 minutes) Global Load Balancing & UTS Application frameworks (20 minutes) Global Matrix Library SAT X10 Summary and Discussion (20 minutes)

50 Global Load Balancing A significant class of compute-intensive problems can be cast as Collecting Finish (or Map Reduce ) problems (significantly simpler than Hadoop Map Shuffle Reduce problems) A parallel Map phases spawns work (may be driven by examining data in some shard global data structure) Each mapper may contribute 0 or more pieces of information to a result. These pieces can be combined by a commutative, associative reduction operator Cf collecting finish Q: How do you efficiently execute such programs if the amount of work associated with each mapper could vary dramatically? Example: State-space search, Unbalanced Tree Search (UTS), Branch and Bound Search.

51 Global Load Balancing: Central Problem Goal: Efficiently (~ 90% efficiency) execute collecting finish programs on a large number of node. Problems to be solved: Balance work across a large number of places efficiently Detect termination of the computation quickly. Conventional Approach: Worker steals from randomly chosen victim when local work is exhausted. Key problem: When does a worker know there is no more work?

52 Global Load Balance: The Lifeline Solution Intuition: X0 already has a framework for distributed termination detection finish. Use it! But this requires activities terminate at some point! Idea: Let activities terminate after w rounds of failed steals. But ensure that a worker with some work can distribute work (= use at(p) async S) to nodes known to be idle. Lifeline graphs. Before a worker dies it creates z lifelines to other nodes. These form a distribution overlay graph. What kind of graph? Need low out-degree, low diameter graph. Our paper/implementation uses hyper-cube.

53 Algorithm sketch for Lifeline-based GLB Framework Main activity spawns an activity per place within a finish. Augment work-stealing with work-distributing: Each activity processes. When no work left, steals at most w-times. (Thief) If no work found, visits at most z-lifelines in turn. Donor records thief needs work (Thief) If no work found, dies. (Donor) Pushes work to recorded thieves, when it has some. When all activities die, main activity continues to completion. Caveat: This scheme assumes that work is relocatable

54 UTS: Unbalanced Tree Search Characteristics highly irregular => need for global load balancing geometric fixed law ( t 1 a 3) => wide tree but shallow depth Implementation work queue of SHA1 hash of node + depth + interval of child ids global work stealing n random victims (among m possible targets) then p lifelines (hypercube) thief steals half of work (half of each node in the local work queue) finish counts lifeline steals only (25% performance improvement at scale) single thread (thief and victim cooperate) Complete benchmark: 560 lines of X10 (+ 800 lines of sequential C for SHA1 hash) 54

55 UTS Goal unbalanced tree search Performance 1 task: 10.9 M nodes/s 1470 hosts: 505 B nodes/s 98% efficiency Configuration geometric fixed law depth between 14 and 22 at scale (1470 hosts) -t 1 -a 3 -b 4 -r 19 -d 22 69,312,400,211,958 nodes Performance on 1470 node IBM Power775 55

56 UTS Main Loop: final def processatmostn() { var i:int=0; for (; (i<n) && (queue.size>0); ++i) { queue.expand(); queue.count += i; return queue.size > 0; final def processstack(st:placelocalhandle[uts]) { do { while (processatmostn()) { Runtime.probe(); distribute(st); reject(st); while (steal(st));

57 UTS: steal def steal(st:placelocalhandle[uts]) { if (P == 1) return false; val p = Runtime.hereInt(); empty = true; for (var i:int=0; i < w && empty; ++i) { waiting = true; at async st().request(st, p, false); while (waiting) Runtime.probe(); for (var i:int=0; (i<lifelines.size) && empty && (0<=lifelines(i)); ++i) { val lifeline = lifelines(i); if (!lifelinesactivated(lifeline)) { lifelinesactivated(lifeline) = true; waiting = true; at async st().request(st, p, true); while (waiting) Runtime.probe(); return!empty;

58 UTS: give def give(st:placelocalhandle[uts], loot:queue.fragment) { val victim = Runtime.hereInt(); if (thieves.size() > 0) { val thief = thieves.pop(); if (thief >= 0) { at async { st().deal(st, loot, victim); st().waiting = false; else { at async { st().deal(st, loot, -1); st().waiting = false; else { val thief = lifelinethieves.pop(); at (Place(thief)) async st().deal(st, loot, victim);

59 UTS: distribute & reject def distribute(st:placelocalhandle[uts]) { var loot:queue.fragment; while ((lifelinethieves.size() + thieves.size() > 0) && (loot = queue.grab())!= null) { give(st, loot); reject(st); def reject(st:placelocalhandle[uts]) { while (thieves.size() > 0) { val thief = thieves.pop(); if (thief >= 0) { lifelinethieves.push(thief); at async { st().waiting = false; else { at async { st().waiting = false;

60 UTS: deal def deal(st:placelocalhandle[uts], loot:queue.fragment, source:int) { val lifeline = source >= 0; if (lifeline) lifelinesactivated(source) = false; if (active) { empty = false; processloot(loot, lifeline); else { active = true; processloot(loot, lifeline); processstack(st); active = false;

61 Tutorial Outline X10 Overview (70 minutes) Programming model concepts Key language features Implementation highlights Scaling SPMD-style computations in X10 (40 minutes) Scaling finish/async Scaling communication Collective operations Unbalanced computations (20 minutes) Global Load Balancing & UTS Application frameworks (20 minutes) Global Matrix Library SAT X10 Summary and Discussion (20 minutes)

62 X10 Global Matrix Library Easy-use matrix library in X10 supporting parallel execution on multiple places Presents standard sequential interface (internal parallelism and distribution). Matrix data structures: Dense matrix symmetric and triangular matrix: column-wise storage Sparse matrix: CSC-LT Vector, distributed and duplicated vector Block matrix, distributed and duplicated block matrix Matrix operations scaling, cellwise add, subtract, multiply and division Matrix multiplication Norm, trace, linear equation solver Approximately 45,000 lines of X10 code (open source)

63 Global Matrix Library GML Vector SparseCSC Dense Block matrix Dupl. block Distr. block Dense matrix BLAS wrap X10 Native C/C++ back end Team MPI Dense matrix X10 driver Sparse matrix X10 driver Socket PAMI BLAS Managed Java back end PGAS Socket Multi-thread BLAS (GotoBLAS) 3 party C-MPI library rd LAPACK

64 Matrix Partitioning and Distribution Grid partitioning B0 Bm Bm*(n-1) B1 Balanced partitioning User-specified partitioning Block matrix Bm-1 Bm*n-1 Dense, sparse, or mixed blocks Block distribution: block - place mapping Unique, constant, cyclic, grid, vertical, horizontal and random B0 P0 B1 P1 B2 P2 B3 P3

65 Matrix Partitioning and Distribution Grid partitioning B0 Bm Bm*(n-1) B1 Balanced partitioning User-specified partitioning Block matrix Bm-1 Dense, sparse, or mixed blocks P0 Block distribution: block - place mapping Bm*n-1 P1 Pn-1 Unique, constant, cyclic, grid, vertical, horizontal and random B0 B1 Bm-1 P0 P1 P2 P3

66 Matrix Partitioning and Distribution Grid partitioning P0 Balanced partitioning B0 Bm Bm*(n-1) B1 User-specified partitioning Block matrix Dense, sparse, or mixed blocks Pm-1 Bm-1 Bm*n-1 Block distribution: block - place mapping Unique, constant, cyclic, grid, vertical, horizontal and random Bm*(n-1) B0 Bm B1 P0 P1 P2 P3

67 GML Example: PageRank while (I < max_iteration) { p = alpha*(g%*%p)+(1-alpha)*(e%*%u%*%p); Programmer determines distribution of the data. G is block distributed per Row P is replicated Programmer writes sequential, distribution-aware code. Perform usual optimizations allocate big data-structures outside loop Could equally well have been in Java. DML def step(g:, P:, GP:, dupgp:, UP:, P:, E, : ) { for (1..iteration) { GP.mult(G,P).scale(alpha).copyInto(dupGP);//broadcast P.local().mult(E, UP.mult(U,P.local())).scale(1-alpha).cellAdd(dupGP.local()).sync(); // broadcast X10

68 PageRank Performance Triloka x86 cluster (TK) Managed X10 on socket transport Native X10 on socket transport Watson4P (BlueGene/P) Native X10 on BlueGene DCMF transport Page Rank Page Rank 64 -place, G nonz ero sparsity Java-socket TK 4 Native-socket TK 3.5 Native-pgas BGP 3 14 Java-socket TK 12 Native-socket TK Native-pgas BGP 10 Time(sec) Time(sec) 1000 million nonz eros in G Nonzeros in G (million) Number of places (processors)

69 GML Example: Gaussian Non-Negative Matrix Multiplication Key kernel for topic modeling Involves factoring a large (D x W) matrix D ~ 100M Key decision is representation for matrix, and its distribution. Note: app code is polymorphic in this choice. W ~ 100K, but sparse (0.001) Iterative algorithm, involves distributed sparse matrix multiplication, cell-wise matrix operations. H V W H H H P0 P1 P2 Pn for (1..iteration) { H.cellMult(WV.transMult(W,V,tW).cellDiv(WWH.mult(WW.transMult(W,W),H))); W.cellMult(VH.multTrans(V,H).cellDiv(WHH.mult(W,HH.multTrans(H,H)))); X10

70 GNMF Performance Test platforms: Triloka Cluster (TK): 16 Quad-Core AMD Opteron(tm) 2356 nodes, 1-Gb/s ethernet Watson4P BlueGene/P system (BGP): 2 cards 64 nodes Test settings: run on 64-place V: sparsity 0.001, (1,000,000 x 100,000) ~ (10,000,000 x 1,000,000) W: dense matrix (1,000,000 x 10) ~ (10,000,000 x 10) H: dense matrix (10 x 1,000,000) GNMF runtime Java-socket TK Native-socket TK Native-socket TK Java-socket TK Native-pgas BGP Native-pgas BGP Time(sec) Time(sec) GNMF Computation time Nonzero (million) in V Nonzero (million) in V

71 Global Matrix Library Summary Easy to use Simplified matrix operation interface Managed back end simplifies development, while native back end provides high performance computing capability Hiding concurrency and distribution details from apps developers Hierarchical parallel computing capability Within a node: multi-thread BLAS/LAPACK for intensive computation Between nodes: collective communication for coarse-grained parallel computing Adaptability Data layout compatible to existing specifications (BLAS, CSC-LT). Portability Supporting MPI, socket, PAMI and BlueGene transport Availability Open source; x10.gml project in X10 project main svn on SourceForge.net

72 Sat X10 Framework to combine existing sequential SAT solvers into a parallel portfolio solver Diversity: small number of base solvers, but different parameterization Knowledge Sharing: exchange learned clauses between solvers Structure of SatX10 Minimal modifications to existing solver (add callbacks for knowledge sharing) Small X10/C++ interoperability wrapper for each solver Small (100s of lines) X10 program for communication/distribution Allows parallel solver to run on a single machine with multiple cores and across multiple machines, sharing information such as learned clauses More information:

73 SATX10 Architecture SatX10 Framework 1.0 SolverX10Callback SolverX10Callback SolverSatX10Base SolverSatX10Base ** SolverSatX10Base SolverSatX10Base Data Data Objects Objects Pure Pure Virtual Virtual Methods Methods Other Other Controls Controls Callback Callback ** placeid placeid maxlenshrcl maxlenshrcl outbufsize outbufsize incomingclsq incomingclsq outgoingclsq outgoingclsq x10_parsedimacs() x10_parsedimacs() x10_nvars() x10_nvars() x10_nclauses() x10_nclauses() x10_solve() x10_solve() x10_printsoln() x10_printsoln() x10_printstats() x10_printstats() x10_kill() x10_kill() x10_waskilled() x10_waskilled() x10_accessincbuf() x10_accessincbuf() x10_accessoutbuf() x10_accessoutbuf() Callback Callback Methods Methods x10_step() x10_step() x10_processoutgoingcls() x10_processoutgoingcls() SatX10.x10 SatX10.x10 Main Main X10 X10 routines routines to launch to launch solvers solvers at at various various places: places: Glucose::Solver Glucose::Solver ** Minisat::Solver Minisat::Solver ** SatX10 Solver SatX10 Solver Implement Implement callbacks callbacks of of base base class class CallbackStats CallbackStats (data) (data) Routines Routines for for X10 X10 to to interact with solvers interact with solvers solve() solve() kill() kill() bufferincomingcls() bufferincomingcls() printinstanceinfo() printinstanceinfo() printresults() printresults() SatX10 Minisat SatX10 Minisat SatX10 Glucos SatX10 Glucos e Specializedesolver: Specialized Specialized solver: solver: Specialized solver: Minisat::Solver **Confidential Glucose::Solver IBM Minisat::Solver Glucose::Solver ** Minisat::Solver Minisat::Solver Implement Implement pure pure virtual virtual methods methods of of base base class class Other Other Methods Methods bufferoutgoingcls() bufferoutgoingcls() processincomgcls() processincomgcls() Glucose::Solver Glucose::Solver Implement Implement pure pure virtual virtual methods methods of of base base class class Other Other Methods Methods bufferoutgoingcls() bufferoutgoingcls() processincomgcls() processincomgcls() Specialization for individual solvers

74 Preliminary Empirical Results: Same Machine Same machine, 8 cores, clause lengths=1 and 8 Note: Promising but preliminary results; focus so far has been on developing the framework, not on producing a highly competitive solver

75 Preliminary Empirical Results: Multiple Machines 8 machine/8 cores vs 16 machines/64 cores, clause lengths=1 and 8 Same executable as for single machine --- just different parameters!

76 Tutorial Outline X10 Overview (70 minutes) Programming model concepts Key language features Implementation highlights Scaling SPMD-style computations in X10 (40 minutes) Scaling finish/async Scaling communication Collective operations Unbalanced computations (20 minutes) Global Load Balancing & UTS Application frameworks (20 minutes) Global Matrix Library SAT X10 Summary and Discussion (20 minutes)

77 X10 Project Highlights since SC 11 Three major releases: 2.2.2, 2.2.3, Maintained backwards compatibility with X language (June 2011) Language backwards compatibility with will be maintained in future releases Smooth Java interoperability Major feature of X release Application work at IBM M3R: Main Memory Map Reduce (VLDB 12) Global Matrix Library (open sourced Oct 2011; available in x10 svn) SatX10 (SAT 12) HPC benchmarks (for PERCS) run at scale on 47,000 core Power775

78 Summary of X10 12 Workshop Second annual X10 workshop (co-located with PLDI/ECOOP in June 2012) Papers 1. Fast Method Dispatch and Effective Use of Primitives for Reified Generics in Managed X10 2. Distributed Garbage Collection for Managed X10, 3. StreamX10: A Stream Programming Framework on X10, 4. X10-based Massive Parallel Large-Scale Traffic Flow Simulation 5. Characterization of Smith-Waterman Sequence Database Search in X10 6. Introducing ScaleGraph : An X10 Library for Billion Scale Graph Analytics Invited talks 1. A tutorial on X10 and its implementation 2. M3R: Increased Performance for In-Memory Hadoop Jobs X10 13 will be held in June 2013 co-located with PLDI 13

79 X10 Community: StreamX10 Slide by Haitao Wei, X

A Tutorial on X10 and its Implementation

A Tutorial on X10 and its Implementation A Tutorial on X10 and its Implementation David Grove IBM TJ Watson Research Center This material is based upon work supported in part by the Defense Advanced Research Projects Agency under its Agreement

More information

Towards Performance and Productivity at Scale

Towards Performance and Productivity at Scale Reflections on X10 Towards Performance and Productivity at Scale David Grove IBM TJ Watson Research Center This material is based upon work supported in part by the Defense Advanced Research Projects Agency

More information

Programming Language X10 on Multiple JVMs

Programming Language X10 on Multiple JVMs Workshop on the State of Art in Software Research, at UC Irvine Programming Language on Multiple JVMs 2013/05/15 Kiyokuni (KIYO) Kawachiya and Mikio Takeuchi IBM Research - Tokyo This Talk Focuses on Managed

More information

Main Memory Map Reduce (M3R)

Main Memory Map Reduce (M3R) Main Memory Map Reduce (M3R) PL Day, September 23, 2013 Avi Shinnar, David Cunningham, Ben Herta, Vijay Saraswat, Wei Zhang In collaboralon with: Yan Li*, Dave Grove, Mikio Takeuchi**, Salikh Zakirov**,

More information

Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes

Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes Eric Petit, Loïc Thebault, Quang V. Dinh May 2014 EXA2CT Consortium 2 WPs Organization Proto-Applications

More information

Programming Languages for Large Scale Parallel Computing. Marc Snir

Programming Languages for Large Scale Parallel Computing. Marc Snir Programming Languages for Large Scale Parallel Computing Marc Snir Focus Very large scale computing (>> 1K nodes) Performance is key issue Parallelism, load balancing, locality and communication are algorithmic

More information

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

More information

Scheduling Task Parallelism" on Multi-Socket Multicore Systems"

Scheduling Task Parallelism on Multi-Socket Multicore Systems Scheduling Task Parallelism" on Multi-Socket Multicore Systems" Stephen Olivier, UNC Chapel Hill Allan Porterfield, RENCI Kyle Wheeler, Sandia National Labs Jan Prins, UNC Chapel Hill Outline" Introduction

More information

BlobSeer: Towards efficient data storage management on large-scale, distributed systems

BlobSeer: Towards efficient data storage management on large-scale, distributed systems : Towards efficient data storage management on large-scale, distributed systems Bogdan Nicolae University of Rennes 1, France KerData Team, INRIA Rennes Bretagne-Atlantique PhD Advisors: Gabriel Antoniu

More information

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework

More information

Scientific Computing Programming with Parallel Objects

Scientific Computing Programming with Parallel Objects Scientific Computing Programming with Parallel Objects Esteban Meneses, PhD School of Computing, Costa Rica Institute of Technology Parallel Architectures Galore Personal Computing Embedded Computing Moore

More information

Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

More information

- An Essential Building Block for Stable and Reliable Compute Clusters

- An Essential Building Block for Stable and Reliable Compute Clusters Ferdinand Geier ParTec Cluster Competence Center GmbH, V. 1.4, March 2005 Cluster Middleware - An Essential Building Block for Stable and Reliable Compute Clusters Contents: Compute Clusters a Real Alternative

More information

Performance Monitoring of Parallel Scientific Applications

Performance Monitoring of Parallel Scientific Applications Performance Monitoring of Parallel Scientific Applications Abstract. David Skinner National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory This paper introduces an infrastructure

More information

Cloud Computing. Up until now

Cloud Computing. Up until now Cloud Computing Lecture 11 Virtualization 2011-2012 Up until now Introduction. Definition of Cloud Computing Grid Computing Content Distribution Networks Map Reduce Cycle-Sharing 1 Process Virtual Machines

More information

Fundamentals of Java Programming

Fundamentals of Java Programming Fundamentals of Java Programming This document is exclusive property of Cisco Systems, Inc. Permission is granted to print and copy this document for non-commercial distribution and exclusive use by instructors

More information

GTask Developing asynchronous applications for multi-core efficiency

GTask Developing asynchronous applications for multi-core efficiency GTask Developing asynchronous applications for multi-core efficiency February 2009 SCALE 7x Los Angeles Christian Hergert What Is It? GTask is a mini framework to help you write asynchronous code. Dependencies

More information

High Performance Computing. Course Notes 2007-2008. HPC Fundamentals

High Performance Computing. Course Notes 2007-2008. HPC Fundamentals High Performance Computing Course Notes 2007-2008 2008 HPC Fundamentals Introduction What is High Performance Computing (HPC)? Difficult to define - it s a moving target. Later 1980s, a supercomputer performs

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

More information

Spark: Cluster Computing with Working Sets

Spark: Cluster Computing with Working Sets Spark: Cluster Computing with Working Sets Outline Why? Mesos Resilient Distributed Dataset Spark & Scala Examples Uses Why? MapReduce deficiencies: Standard Dataflows are Acyclic Prevents Iterative Jobs

More information

MPI and Hybrid Programming Models. William Gropp www.cs.illinois.edu/~wgropp

MPI and Hybrid Programming Models. William Gropp www.cs.illinois.edu/~wgropp MPI and Hybrid Programming Models William Gropp www.cs.illinois.edu/~wgropp 2 What is a Hybrid Model? Combination of several parallel programming models in the same program May be mixed in the same source

More information

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin

More information

Petascale Software Challenges. Piyush Chaudhary piyushc@us.ibm.com High Performance Computing

Petascale Software Challenges. Piyush Chaudhary piyushc@us.ibm.com High Performance Computing Petascale Software Challenges Piyush Chaudhary piyushc@us.ibm.com High Performance Computing Fundamental Observations Applications are struggling to realize growth in sustained performance at scale Reasons

More information

Load Balancing Techniques

Load Balancing Techniques Load Balancing Techniques 1 Lecture Outline Following Topics will be discussed Static Load Balancing Dynamic Load Balancing Mapping for load balancing Minimizing Interaction 2 1 Load Balancing Techniques

More information

Petascale Software Challenges. William Gropp www.cs.illinois.edu/~wgropp

Petascale Software Challenges. William Gropp www.cs.illinois.edu/~wgropp Petascale Software Challenges William Gropp www.cs.illinois.edu/~wgropp Petascale Software Challenges Why should you care? What are they? Which are different from non-petascale? What has changed since

More information

Distributed R for Big Data

Distributed R for Big Data Distributed R for Big Data Indrajit Roy HP Vertica Development Team Abstract Distributed R simplifies large-scale analysis. It extends R. R is a single-threaded environment which limits its utility for

More information

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas Apache Flink Next-gen data analysis Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research

More information

BSC vision on Big Data and extreme scale computing

BSC vision on Big Data and extreme scale computing BSC vision on Big Data and extreme scale computing Jesus Labarta, Eduard Ayguade,, Fabrizio Gagliardi, Rosa M. Badia, Toni Cortes, Jordi Torres, Adrian Cristal, Osman Unsal, David Carrera, Yolanda Becerra,

More information

Introduction to Virtual Machines

Introduction to Virtual Machines Introduction to Virtual Machines Introduction Abstraction and interfaces Virtualization Computer system architecture Process virtual machines System virtual machines 1 Abstraction Mechanism to manage complexity

More information

CSCI E 98: Managed Environments for the Execution of Programs

CSCI E 98: Managed Environments for the Execution of Programs CSCI E 98: Managed Environments for the Execution of Programs Draft Syllabus Instructor Phil McGachey, PhD Class Time: Mondays beginning Sept. 8, 5:30-7:30 pm Location: 1 Story Street, Room 304. Office

More information

UTS: An Unbalanced Tree Search Benchmark

UTS: An Unbalanced Tree Search Benchmark UTS: An Unbalanced Tree Search Benchmark LCPC 2006 1 Coauthors Stephen Olivier, UNC Jun Huan, UNC/Kansas Jinze Liu, UNC Jan Prins, UNC James Dinan, OSU P. Sadayappan, OSU Chau-Wen Tseng, UMD Also, thanks

More information

To Java SE 8, and Beyond (Plan B)

To Java SE 8, and Beyond (Plan B) 11-12-13 To Java SE 8, and Beyond (Plan B) Francisco Morero Peyrona EMEA Java Community Leader 8 9...2012 2020? Priorities for the Java Platforms Grow Developer Base Grow Adoption

More information

Fast Analytics on Big Data with H20

Fast Analytics on Big Data with H20 Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,

More information

Introduction to DISC and Hadoop

Introduction to DISC and Hadoop Introduction to DISC and Hadoop Alice E. Fischer April 24, 2009 Alice E. Fischer DISC... 1/20 1 2 History Hadoop provides a three-layer paradigm Alice E. Fischer DISC... 2/20 Parallel Computing Past and

More information

PARALLEL JAVASCRIPT. Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology)

PARALLEL JAVASCRIPT. Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology) PARALLEL JAVASCRIPT Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology) JAVASCRIPT Not connected with Java Scheme and self (dressed in c clothing) Lots of design errors (like automatic semicolon

More information

Distributed File Systems

Distributed File Systems Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.

More information

Delivering Quality in Software Performance and Scalability Testing

Delivering Quality in Software Performance and Scalability Testing Delivering Quality in Software Performance and Scalability Testing Abstract Khun Ban, Robert Scott, Kingsum Chow, and Huijun Yan Software and Services Group, Intel Corporation {khun.ban, robert.l.scott,

More information

Chapter 6, The Operating System Machine Level

Chapter 6, The Operating System Machine Level Chapter 6, The Operating System Machine Level 6.1 Virtual Memory 6.2 Virtual I/O Instructions 6.3 Virtual Instructions For Parallel Processing 6.4 Example Operating Systems 6.5 Summary Virtual Memory General

More information

Chapter 11 I/O Management and Disk Scheduling

Chapter 11 I/O Management and Disk Scheduling Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 11 I/O Management and Disk Scheduling Dave Bremer Otago Polytechnic, NZ 2008, Prentice Hall I/O Devices Roadmap Organization

More information

A Comparison Of Shared Memory Parallel Programming Models. Jace A Mogill David Haglin

A Comparison Of Shared Memory Parallel Programming Models. Jace A Mogill David Haglin A Comparison Of Shared Memory Parallel Programming Models Jace A Mogill David Haglin 1 Parallel Programming Gap Not many innovations... Memory semantics unchanged for over 50 years 2010 Multi-Core x86

More information

Software-defined Storage Architecture for Analytics Computing

Software-defined Storage Architecture for Analytics Computing Software-defined Storage Architecture for Analytics Computing Arati Joshi Performance Engineering Colin Eldridge File System Engineering Carlos Carrero Product Management June 2015 Reference Architecture

More information

General Introduction

General Introduction Managed Runtime Technology: General Introduction Xiao-Feng Li (xiaofeng.li@gmail.com) 2012-10-10 Agenda Virtual machines Managed runtime systems EE and MM (JIT and GC) Summary 10/10/2012 Managed Runtime

More information

Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing. October 29th, 2015

Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing. October 29th, 2015 E6893 Big Data Analytics Lecture 8: Spark Streams and Graph Computing (I) Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing

More information

Sourcery Overview & Virtual Machine Installation

Sourcery Overview & Virtual Machine Installation Sourcery Overview & Virtual Machine Installation Damian Rouson, Ph.D., P.E. Sourcery, Inc. www.sourceryinstitute.org Sourcery, Inc. About Us Sourcery, Inc., is a software consultancy founded by and for

More information

Parallel Data Preparation with the DS2 Programming Language

Parallel Data Preparation with the DS2 Programming Language ABSTRACT Paper SAS329-2014 Parallel Data Preparation with the DS2 Programming Language Jason Secosky and Robert Ray, SAS Institute Inc., Cary, NC and Greg Otto, Teradata Corporation, Dayton, OH A time-consuming

More information

Bringing Big Data Modelling into the Hands of Domain Experts

Bringing Big Data Modelling into the Hands of Domain Experts Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks david.willingham@mathworks.com.au 2015 The MathWorks, Inc. 1 Data is the sword of the

More information

Jonathan Worthington Scarborough Linux User Group

Jonathan Worthington Scarborough Linux User Group Jonathan Worthington Scarborough Linux User Group Introduction What does a Virtual Machine do? Hides away the details of the hardware platform and operating system. Defines a common set of instructions.

More information

! E6893 Big Data Analytics Lecture 9:! Linked Big Data Graph Computing (I)

! E6893 Big Data Analytics Lecture 9:! Linked Big Data Graph Computing (I) ! E6893 Big Data Analytics Lecture 9:! Linked Big Data Graph Computing (I) Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Mgr., Dept. of Network Science and

More information

Load balancing. David Bindel. 12 Nov 2015

Load balancing. David Bindel. 12 Nov 2015 Load balancing David Bindel 12 Nov 2015 Inefficiencies in parallel code Poor single processor performance Typically in the memory system Saw this in matrix multiply assignment Overhead for parallelism

More information

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing /35 Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing Zuhair Khayyat 1 Karim Awara 1 Amani Alonazi 1 Hani Jamjoom 2 Dan Williams 2 Panos Kalnis 1 1 King Abdullah University of

More information

Java Virtual Machine: the key for accurated memory prefetching

Java Virtual Machine: the key for accurated memory prefetching Java Virtual Machine: the key for accurated memory prefetching Yolanda Becerra Jordi Garcia Toni Cortes Nacho Navarro Computer Architecture Department Universitat Politècnica de Catalunya Barcelona, Spain

More information

In-Memory Computing for Iterative CPU-intensive Calculations in Financial Industry In-Memory Computing Summit 2015

In-Memory Computing for Iterative CPU-intensive Calculations in Financial Industry In-Memory Computing Summit 2015 In-Memory Computing for Iterative CPU-intensive Calculations in Financial Industry In-Memory Computing Summit 2015 June 29-30, 2015 Contacts Alexandre Boudnik Senior Solution Architect, EPAM Systems Alexandre_Boudnik@epam.com

More information

Virtual Machine Learning: Thinking Like a Computer Architect

Virtual Machine Learning: Thinking Like a Computer Architect Virtual Machine Learning: Thinking Like a Computer Architect Michael Hind IBM T.J. Watson Research Center March 21, 2005 CGO 05 Keynote 2005 IBM Corporation What is this talk about? Virtual Machines? 2

More information

HPC ABDS: The Case for an Integrating Apache Big Data Stack

HPC ABDS: The Case for an Integrating Apache Big Data Stack HPC ABDS: The Case for an Integrating Apache Big Data Stack with HPC 1st JTC 1 SGBD Meeting SDSC San Diego March 19 2014 Judy Qiu Shantenu Jha (Rutgers) Geoffrey Fox gcf@indiana.edu http://www.infomall.org

More information

Improving the performance of data servers on multicore architectures. Fabien Gaud

Improving the performance of data servers on multicore architectures. Fabien Gaud Improving the performance of data servers on multicore architectures Fabien Gaud Grenoble University Advisors: Jean-Bernard Stefani, Renaud Lachaize and Vivien Quéma Sardes (INRIA/LIG) December 2, 2010

More information

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

Seeking Opportunities for Hardware Acceleration in Big Data Analytics Seeking Opportunities for Hardware Acceleration in Big Data Analytics Paul Chow High-Performance Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Toronto Who

More information

Overlapping Data Transfer With Application Execution on Clusters

Overlapping Data Transfer With Application Execution on Clusters Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm reid@cs.toronto.edu stumm@eecg.toronto.edu Department of Computer Science Department of Electrical and Computer

More information

PGAS, Global Arrays, and MPI-3 How do they fit together? Brad Chamberlain Chapel Team, Cray Inc. Global Arrays Technical Meeting May 7, 2010

PGAS, Global Arrays, and MPI-3 How do they fit together? Brad Chamberlain Chapel Team, Cray Inc. Global Arrays Technical Meeting May 7, 2010 PGAS, Global Arrays, and MPI-3 How do they fit together? Brad Chamberlain Chapel Team, Cray Inc. Global Arrays Technical Meeting May 7, 2010 Disclaimer This talk s contents should be considered my personal

More information

Compute Cluster Server Lab 3: Debugging the parallel MPI programs in Microsoft Visual Studio 2005

Compute Cluster Server Lab 3: Debugging the parallel MPI programs in Microsoft Visual Studio 2005 Compute Cluster Server Lab 3: Debugging the parallel MPI programs in Microsoft Visual Studio 2005 Compute Cluster Server Lab 3: Debugging the parallel MPI programs in Microsoft Visual Studio 2005... 1

More information

Best practices for efficient HPC performance with large models

Best practices for efficient HPC performance with large models Best practices for efficient HPC performance with large models Dr. Hößl Bernhard, CADFEM (Austria) GmbH PRACE Autumn School 2013 - Industry Oriented HPC Simulations, September 21-27, University of Ljubljana,

More information

Fachbereich Informatik und Elektrotechnik SunSPOT. Ubiquitous Computing. Ubiquitous Computing, Helmut Dispert

Fachbereich Informatik und Elektrotechnik SunSPOT. Ubiquitous Computing. Ubiquitous Computing, Helmut Dispert Ubiquitous Computing Ubiquitous Computing The Sensor Network System Sun SPOT: The Sun Small Programmable Object Technology Technology-Based Wireless Sensor Networks a Java Platform for Developing Applications

More information

Parallel Computing for Data Science

Parallel Computing for Data Science Parallel Computing for Data Science With Examples in R, C++ and CUDA Norman Matloff University of California, Davis USA (g) CRC Press Taylor & Francis Group Boca Raton London New York CRC Press is an imprint

More information

SWARM: A Parallel Programming Framework for Multicore Processors. David A. Bader, Varun N. Kanade and Kamesh Madduri

SWARM: A Parallel Programming Framework for Multicore Processors. David A. Bader, Varun N. Kanade and Kamesh Madduri SWARM: A Parallel Programming Framework for Multicore Processors David A. Bader, Varun N. Kanade and Kamesh Madduri Our Contributions SWARM: SoftWare and Algorithms for Running on Multicore, a portable

More information

Replication on Virtual Machines

Replication on Virtual Machines Replication on Virtual Machines Siggi Cherem CS 717 November 23rd, 2004 Outline 1 Introduction The Java Virtual Machine 2 Napper, Alvisi, Vin - DSN 2003 Introduction JVM as state machine Addressing non-determinism

More information

16.1 MAPREDUCE. For personal use only, not for distribution. 333

16.1 MAPREDUCE. For personal use only, not for distribution. 333 For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

More information

BLM 413E - Parallel Programming Lecture 3

BLM 413E - Parallel Programming Lecture 3 BLM 413E - Parallel Programming Lecture 3 FSMVU Bilgisayar Mühendisliği Öğr. Gör. Musa AYDIN 14.10.2015 2015-2016 M.A. 1 Parallel Programming Models Parallel Programming Models Overview There are several

More information

Next Generation GPU Architecture Code-named Fermi

Next Generation GPU Architecture Code-named Fermi Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time

More information

Lecture 2 Parallel Programming Platforms

Lecture 2 Parallel Programming Platforms Lecture 2 Parallel Programming Platforms Flynn s Taxonomy In 1966, Michael Flynn classified systems according to numbers of instruction streams and the number of data stream. Data stream Single Multiple

More information

Hybrid Software Architectures for Big Data. Laurence.Hubert@hurence.com @hurence http://www.hurence.com

Hybrid Software Architectures for Big Data. Laurence.Hubert@hurence.com @hurence http://www.hurence.com Hybrid Software Architectures for Big Data Laurence.Hubert@hurence.com @hurence http://www.hurence.com Headquarters : Grenoble Pure player Expert level consulting Training R&D Big Data X-data hot-line

More information

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions Slide 1 Outline Principles for performance oriented design Performance testing Performance tuning General

More information

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person

More information

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 F# Applications to Computational Financial and GPU Computing May 16th Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 Today! Why care about F#? Just another fashion?! Three success stories! How Alea.cuBase

More information

A Multi-layered Domain-specific Language for Stencil Computations

A Multi-layered Domain-specific Language for Stencil Computations A Multi-layered Domain-specific Language for Stencil Computations Christian Schmitt, Frank Hannig, Jürgen Teich Hardware/Software Co-Design, University of Erlangen-Nuremberg Workshop ExaStencils 2014,

More information

KITES TECHNOLOGY COURSE MODULE (C, C++, DS)

KITES TECHNOLOGY COURSE MODULE (C, C++, DS) KITES TECHNOLOGY 360 Degree Solution www.kitestechnology.com/academy.php info@kitestechnology.com technologykites@gmail.com Contact: - 8961334776 9433759247 9830639522.NET JAVA WEB DESIGN PHP SQL, PL/SQL

More information

System Structures. Services Interface Structure

System Structures. Services Interface Structure System Structures Services Interface Structure Operating system services (1) Operating system services (2) Functions that are helpful to the user User interface Command line interpreter Batch interface

More information

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture Last Class: OS and Computer Architecture System bus Network card CPU, memory, I/O devices, network card, system bus Lecture 3, page 1 Last Class: OS and Computer Architecture OS Service Protection Interrupts

More information

11.1 inspectit. 11.1. inspectit

11.1 inspectit. 11.1. inspectit 11.1. inspectit Figure 11.1. Overview on the inspectit components [Siegl and Bouillet 2011] 11.1 inspectit The inspectit monitoring tool (website: http://www.inspectit.eu/) has been developed by NovaTec.

More information

bigdata Managing Scale in Ontological Systems

bigdata Managing Scale in Ontological Systems Managing Scale in Ontological Systems 1 This presentation offers a brief look scale in ontological (semantic) systems, tradeoffs in expressivity and data scale, and both information and systems architectural

More information

Multi-core Programming System Overview

Multi-core Programming System Overview Multi-core Programming System Overview Based on slides from Intel Software College and Multi-Core Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts,

More information

The Design and Implementation of Scalable Parallel Haskell

The Design and Implementation of Scalable Parallel Haskell The Design and Implementation of Scalable Parallel Haskell Malak Aljabri, Phil Trinder,and Hans-Wolfgang Loidl MMnet 13: Language and Runtime Support for Concurrent Systems Heriot Watt University May 8,

More information

Performance Analysis and Optimization Tool

Performance Analysis and Optimization Tool Performance Analysis and Optimization Tool Andres S. CHARIF-RUBIAL andres.charif@uvsq.fr Performance Analysis Team, University of Versailles http://www.maqao.org Introduction Performance Analysis Develop

More information

Glossary of Object Oriented Terms

Glossary of Object Oriented Terms Appendix E Glossary of Object Oriented Terms abstract class: A class primarily intended to define an instance, but can not be instantiated without additional methods. abstract data type: An abstraction

More information

Supercomputing and Big Data: Where are the Real Boundaries and Opportunities for Synergy?

Supercomputing and Big Data: Where are the Real Boundaries and Opportunities for Synergy? HPC2012 Workshop Cetraro, Italy Supercomputing and Big Data: Where are the Real Boundaries and Opportunities for Synergy? Bill Blake CTO Cray, Inc. The Big Data Challenge Supercomputing minimizes data

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

ELEC 377. Operating Systems. Week 1 Class 3

ELEC 377. Operating Systems. Week 1 Class 3 Operating Systems Week 1 Class 3 Last Class! Computer System Structure, Controllers! Interrupts & Traps! I/O structure and device queues.! Storage Structure & Caching! Hardware Protection! Dual Mode Operation

More information

Running a Workflow on a PowerCenter Grid

Running a Workflow on a PowerCenter Grid Running a Workflow on a PowerCenter Grid 2010-2014 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise)

More information

HIGH PERFORMANCE BIG DATA ANALYTICS

HIGH PERFORMANCE BIG DATA ANALYTICS HIGH PERFORMANCE BIG DATA ANALYTICS Kunle Olukotun Electrical Engineering and Computer Science Stanford University June 2, 2014 Explosion of Data Sources Sensors DoD is swimming in sensors and drowning

More information

C#5.0 IN A NUTSHELL. Joseph O'REILLY. Albahari and Ben Albahari. Fifth Edition. Tokyo. Sebastopol. Beijing. Cambridge. Koln.

C#5.0 IN A NUTSHELL. Joseph O'REILLY. Albahari and Ben Albahari. Fifth Edition. Tokyo. Sebastopol. Beijing. Cambridge. Koln. Koln C#5.0 IN A NUTSHELL Fifth Edition Joseph Albahari and Ben Albahari O'REILLY Beijing Cambridge Farnham Sebastopol Tokyo Table of Contents Preface xi 1. Introducing C# and the.net Framework 1 Object

More information

Spark. Fast, Interactive, Language- Integrated Cluster Computing

Spark. Fast, Interactive, Language- Integrated Cluster Computing Spark Fast, Interactive, Language- Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica UC

More information

Spring 2011 Prof. Hyesoon Kim

Spring 2011 Prof. Hyesoon Kim Spring 2011 Prof. Hyesoon Kim Today, we will study typical patterns of parallel programming This is just one of the ways. Materials are based on a book by Timothy. Decompose Into tasks Original Problem

More information

Hadoop. History and Introduction. Explained By Vaibhav Agarwal

Hadoop. History and Introduction. Explained By Vaibhav Agarwal Hadoop History and Introduction Explained By Vaibhav Agarwal Agenda Architecture HDFS Data Flow Map Reduce Data Flow Hadoop Versions History Hadoop version 2 Hadoop Architecture HADOOP (HDFS) Data Flow

More information

Symmetric Multiprocessing

Symmetric Multiprocessing Multicore Computing A multi-core processor is a processing system composed of two or more independent cores. One can describe it as an integrated circuit to which two or more individual processors (called

More information

Software for High Performance. Computing. Requirements & Research Directions. Marc Snir

Software for High Performance. Computing. Requirements & Research Directions. Marc Snir Software for High Performance Requirements & Research Directions Computing Marc Snir May 2006 Outline Petascale hardware Petascale operating system Programming models 2 Jun-06 Petascale Systems are Coming

More information

Apache Hama Design Document v0.6

Apache Hama Design Document v0.6 Apache Hama Design Document v0.6 Introduction Hama Architecture BSPMaster GroomServer Zookeeper BSP Task Execution Job Submission Job and Task Scheduling Task Execution Lifecycle Synchronization Fault

More information

Client/Server Computing Distributed Processing, Client/Server, and Clusters

Client/Server Computing Distributed Processing, Client/Server, and Clusters Client/Server Computing Distributed Processing, Client/Server, and Clusters Chapter 13 Client machines are generally single-user PCs or workstations that provide a highly userfriendly interface to the

More information

Data Centric Systems (DCS)

Data Centric Systems (DCS) Data Centric Systems (DCS) Architecture and Solutions for High Performance Computing, Big Data and High Performance Analytics High Performance Computing with Data Centric Systems 1 Data Centric Systems

More information

Expansion of a Framework for a Data- Intensive Wide-Area Application to the Java Language

Expansion of a Framework for a Data- Intensive Wide-Area Application to the Java Language Expansion of a Framework for a Data- Intensive Wide-Area Application to the Java Sergey Koren Table of Contents 1 INTRODUCTION... 3 2 PREVIOUS RESEARCH... 3 3 ALGORITHM...4 3.1 APPROACH... 4 3.2 C++ INFRASTRUCTURE

More information

Principles and characteristics of distributed systems and environments

Principles and characteristics of distributed systems and environments Principles and characteristics of distributed systems and environments Definition of a distributed system Distributed system is a collection of independent computers that appears to its users as a single

More information

Cognos8 Deployment Best Practices for Performance/Scalability. Barnaby Cole Practice Lead, Technical Services

Cognos8 Deployment Best Practices for Performance/Scalability. Barnaby Cole Practice Lead, Technical Services Cognos8 Deployment Best Practices for Performance/Scalability Barnaby Cole Practice Lead, Technical Services Agenda > Cognos 8 Architecture Overview > Cognos 8 Components > Load Balancing > Deployment

More information