Developing Scalable Parallel Applications in X10
1 Developing Scalable Parallel Applications in X10
David Grove, Vijay Saraswat, Olivier Tardieu, IBM TJ Watson Research Center
Based on material from previous X10 tutorials by Bard Bloom, David Cunningham, Evelyn Duesterwald, Steve Fink, Nate Nystrom, Igor Peshansky, Christoph von Praun, and Vivek Sarkar. Additional contributions from Anju Kambadur, Josh Milthorpe, and Juemin Zhang.
This material is based upon work supported in part by the Defense Advanced Research Projects Agency under its Agreement No. HR. Please see for the most up-to-date version of this tutorial.
2 Tutorial Objectives
- Learn to develop scalable parallel applications using X10
  - X10 language and class library fundamentals
  - Effective patterns of concurrency and distribution
  - Case studies: draw material from real working code
- Exposure to the Asynchronous Partitioned Global Address Space (APGAS) programming model
  - Concepts broader than any single language
  - Move beyond Single Program/Multiple Data (SPMD) to more flexible paradigms
- Updates on the X10 project and user/research community
  - Highlight what is new in X10 since SC 11
- Gather feedback from you
  - About your application domains
  - About X10
3 Tutorial Outline
- X10 Overview (70 minutes): programming model concepts, key language features, implementation highlights
- Scaling SPMD-style computations in X10 (40 minutes): scaling finish/async, scaling communication, collective operations
- Unbalanced computations (20 minutes): Global Load Balancing & UTS
- Application frameworks (20 minutes): Global Matrix Library, SatX10
- Summary and Discussion (20 minutes)
4 X10 Genesis: DARPA HPCS Program (2004)
Central challenge: productive programming of large-scale supercomputers.
- Clustered systems: 1000s of SMP nodes connected by a high-performance interconnect
- Large aggregate memory, disk storage, etc.
(Diagram: massively parallel processor systems and SMP clusters; each SMP node contains PEs and memory, joined by an interconnect, e.g. IBM Blue Gene.)
5 Flash forward a few years: Big Data and Commercial HPC
Central challenge: productive programming of large commodity clusters.
- Clustered systems: 100s to 1000s of SMP nodes connected by a high-performance network
- Large aggregate memory, disk storage, etc.
(Diagram: SMP clusters; each SMP node contains PEs and memory, joined by a network.)
A commodity cluster != an MPP system, but the programming model problem is highly similar.
6 X10: Performance and Productivity at Scale
An evolution of Java for concurrency, scale-out, and heterogeneity. The language focuses on high productivity and high performance, bringing productivity gains from the commercial world to HPC developers.
The X10 language provides:
- A Java-like language (statically typed, object-oriented, garbage-collected)
- The ability to specify scale-out computations (multiple places, exploit modern networks)
- The ability to specify fine-grained concurrency (exploit multi-core)
- A single programming model for computation offload and heterogeneity (exploit GPUs)
Migration path:
- X10 concurrency/distribution idioms can be realized in other languages via library APIs that wrap the X10 runtime
- X10 interoperability with Java and C/C++ enables reuse of existing libraries
7 Partitioned Global Address Space (PGAS) Languages
In clustered systems, memory is only accessible to the CPUs on its node, so managing local vs. remote memory is a key programming task. PGAS combines a single logical global address space with locality awareness. PGAS languages: Titanium, UPC, CAF, X10, Chapel.
8 X10 combines PGAS with asynchrony (APGAS)
(Diagram: local heaps and activities at Place 0 through Place N, connected by global references.)
- Fine-grained concurrency: async S
- Place-shifting operations: at (p) S; at (p) e
- Sequencing: finish S
- Atomicity: when (c) S; atomic S
9 Hello Whole World

  class HelloWholeWorld {
    public static def main(args:Rail[String]) {
      finish
        for (p in Place.places())
          at (p)
            async
              Console.OUT.println(p + " says " + args(0));
    }
  }

  % x10c++ HelloWholeWorld.x10
  % X10_NPLACES=4 ./a.out hello
  Place 0 says hello
  Place 2 says hello
  Place 3 says hello
  Place 1 says hello
10 Sequential Monty Pi

  import x10.io.Console;
  import x10.util.Random;

  class MontyPi {
    public static def main(args:Array[String](1)) {
      val N = Int.parse(args(0));
      val r = new Random();
      var result:Double = 0;
      for (1..N) {
        val x = r.nextDouble();
        val y = r.nextDouble();
        if (x*x + y*y <= 1) result++;
      }
      val pi = 4*result/N;
      Console.OUT.println("The value of pi is " + pi);
    }
  }
11 Concurrent Monty Pi

  import x10.io.Console;
  import x10.util.Random;

  class MontyPi {
    public static def main(args:Array[String](1)) {
      val N = Int.parse(args(0));
      val P = Int.parse(args(1));
      val result = new Cell[Double](0);
      finish for (1..P) async {
        val r = new Random();
        var myResult:Double = 0;
        for (1..(N/P)) {
          val x = r.nextDouble();
          val y = r.nextDouble();
          if (x*x + y*y <= 1) myResult++;
        }
        atomic result() += myResult;
      }
      val pi = 4*(result())/N;
      Console.OUT.println("The value of pi is " + pi);
    }
  }
12 Concurrent Monty Pi (Collecting Finish)

  import x10.io.Console;
  import x10.util.Random;

  class MontyPi {
    public static def main(args:Array[String](1)) {
      val N = Int.parse(args(0));
      val P = Int.parse(args(1));
      val result = finish (Reducible.SumReducer[Double]()) {
        for (1..P) async {
          val r = new Random();
          var myResult:Double = 0;
          for (1..(N/P)) {
            val x = r.nextDouble();
            val y = r.nextDouble();
            if (x*x + y*y <= 1) myResult++;
          }
          offer myResult;
        }
      };
      val pi = 4*result/N;
      Console.OUT.println("The value of pi is " + pi);
    }
  }
13 Distributed Monty Pi (Collecting Finish)

  import x10.io.Console;
  import x10.util.Random;

  class MontyPi {
    public static def main(args:Array[String](1)) {
      val N = Int.parse(args(0));
      val result = finish (Reducible.SumReducer[Double]()) {
        for (p in Place.places()) at (p) async {
          val r = new Random();
          var myResult:Double = 0;
          for (1..(N/Place.MAX_PLACES)) {
            val x = r.nextDouble();
            val y = r.nextDouble();
            if (x*x + y*y <= 1) myResult++;
          }
          offer myResult;
        }
      };
      val pi = 4*result/N;
      Console.OUT.println("The value of pi is " + pi);
    }
  }
14 Distributed Monty Pi (GlobalRef)

  import x10.io.Console;
  import x10.util.Random;

  class MontyPi {
    public static def main(args:Array[String](1)) {
      val N = Int.parse(args(0));
      val result = GlobalRef[Cell[Double]](new Cell[Double](0));
      finish for (p in Place.places()) at (p) async {
        val r = new Random();
        var myResult:Double = 0;
        for (1..(N/Place.MAX_PLACES)) {
          val x = r.nextDouble();
          val y = r.nextDouble();
          if (x*x + y*y <= 1) myResult++;
        }
        val valResult = myResult;
        at (result) atomic result()() += valResult;
      }
      val pi = 4*(result()())/N;
      Console.OUT.println("The value of pi is " + pi);
    }
  }
15 Overview of Language Features
Many sequential features of Java inherited unchanged:
- Classes (single inheritance), interfaces (multiple inheritance)
- Instance and static fields, constructors, overloaded and overridable methods
- Automatic memory management (GC)
- Familiar control structures, etc.
Additions and extensions:
- Structs: headerless, inlined, user-defined primitives
- Closures: encapsulate a function as code block + dynamic environment
- Type system extensions: dependent types, generic types (instantiation-based), local type inference
Concurrency:
- Fine-grained concurrency: async S;
- Atomicity and synchronization: atomic S; when (c) S;
- Ordering: finish S;
Distribution:
- Place shifting: at (p) S;
- Distributed object model: automatic serialization of object graphs; GlobalRef[T] to create cross-place references; PlaceLocalHandle[T] to define objects with distributed state
- Class library support for RDMA, collectives, etc.
16 async
Stmt ::= async Stmt (cf. Cilk's spawn)
async S creates a new child activity that executes statement S and returns immediately. S may reference variables in enclosing blocks. Activities cannot be named, aborted, or cancelled.

  // Compute the Fibonacci
  // sequence in parallel.
  def fib(n:Int):Int {
    if (n < 2) return 1;
    val f1:Int;
    val f2:Int;
    finish {
      async f1 = fib(n-1);
      f2 = fib(n-2);
    }
    return f1+f2;
  }
17 finish
Stmt ::= finish Stmt (cf. Cilk's sync)
finish S executes S, but waits until all (transitively) spawned asyncs have terminated. finish is useful for expressing synchronous operations on (local or) remote data. There is an implicit finish around the program's main.
Rooted exception model: trap all exceptions thrown by spawned activities; throw an (aggregate) exception if any spawned async terminates abruptly.

  // Compute the Fibonacci
  // sequence in parallel.
  def fib(n:Int):Int {
    if (n < 2) return 1;
    val f1:Int;
    val f2:Int;
    finish {
      async f1 = fib(n-1);
      f2 = fib(n-2);
    }
    return f1+f2;
  }
18 atomic
Stmt ::= atomic Statement
MethodModifier ::= atomic
atomic S executes statement S atomically. Atomic blocks are conceptually executed in a serialized order with respect to all other atomic blocks: isolation and weak atomicity. An atomic block body (S) must be nonblocking, must not create concurrent activities (sequential), and must not access remote data (local); these restrictions are checked dynamically.

  // target defined in lexically
  // enclosing scope.
  atomic def CAS(old:Object, n:Object) {
    if (target.equals(old)) {
      target = n;
      return true;
    }
    return false;
  }

  // push data onto concurrent
  // list-stack
  val node = new Node(data);
  atomic {
    node.next = head;
    head = node;
  }
19 when
Stmt ::= WhenStmt
WhenStmt ::= when ( Expr ) Stmt | WhenStmt or ( Expr ) Stmt
when (E) S suspends the activity until a state in which the guard E is true; in that state, S is executed atomically and in isolation. The guard E is a boolean expression that must be nonblocking, must not create concurrent activities (sequential), must not access remote data (local), and must not have side effects (const). Restrictions are mostly checked dynamically; the const restriction is not checked.

  class OneBuffer {
    var datum:Object = null;
    var filled:Boolean = false;
    def send(v:Object) {
      when (!filled) {
        datum = v;
        filled = true;
      }
    }
    def receive():Object {
      when (filled) {
        val v = datum;
        datum = null;
        filled = false;
        return v;
      }
    }
  }
20 at
Stmt ::= at (p) Stmt
at (p) S executes statement S at place p. The current activity is blocked until S completes. A deep copy of the local object graph is made at the target place; the variables referenced from S define the roots of the graph to be copied. GlobalRef[T] is used to create remote pointers across at boundaries.

  class C {
    var x:Int;
    def this(n:Int) { x = n; }

    // Increment remote counter
    def inc(c:GlobalRef[C]) {
      at (c.home) c().x++;
    }

    // Create GR of C
    static def make(init:Int) {
      val c = new C(init);
      return GlobalRef[C](c);
    }
  }
21 Distributed Object Model
Objects live in a single place and can only be accessed in the place where they live.
Implementing at:
- The compiler analyzes the body of at and identifies the roots to copy (exposed variables)
- The complete object graph reachable from the roots is serialized and sent to the destination place
- A new (unrelated) copy of the object graph is created at the destination place
Controlling object graph serialization:
- Instance fields of a class may be declared transient (won't be copied)
- GlobalRef[T] allows cross-place pointers to be created
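The deep-copy semantics of at can be illustrated outside X10. The Python sketch below (using copy.deepcopy as a hypothetical stand-in for X10's serializer; Node is an invented toy class, not from the tutorial) shows the key property: the destination gets a new, unrelated copy of the object graph, so mutating it does not affect the original.

```python
import copy

class Node:
    """Toy object-graph node (hypothetical stand-in for an X10 object)."""
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

# The captured roots are deep-copied, as an at-body's exposed variables would be.
head = Node(1, Node(2))
copied = copy.deepcopy(head)   # models serialization to another place

copied.next.value = 99         # mutate the "remote" copy...
assert head.next.value == 2    # ...the original at the source place is unchanged
assert copied is not head      # the copy is a new, unrelated object graph
```

A transient field would simply be skipped during the copy, which is what X10's transient modifier expresses declaratively.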
22 Take a breath Whirlwind tour through X10 language More details will be introduced as we work through the example codes See for specification, samples, etc. Next topic: brief overview of the X10 implementation Questions so far?
23 X10 Target Environments
- High-end large HPC clusters: BlueGene/P (since 2010); BlueGene/Q (in progress); Power7IH (aka the PERCS machine); x86 + InfiniBand, Power + InfiniBand. Goal: deliver scalable performance competitive with C+MPI.
- Medium-scale commodity systems: ~100 nodes (~1000 cores and ~1 terabyte main memory). Goal: deliver main-memory performance with a simple programming model (accessible to Java programmers).
- Developer laptops: Linux, Mac OS X, Windows; Eclipse-based IDE, debugger, etc. Goal: support developer productivity.
24 X10 Implementation Summary
X10 implementations:
- C++ based ("Native X10"): multi-process (one place per process; multi-node); Linux, AIX, MacOS, Cygwin, BlueGene; x86, x86_64, PowerPC
- JVM based ("Managed X10"): multi-process (one place per JVM process; multi-node); limited on Windows to a single process (single place); runs on any Java 6 JVM
X10DT (X10 IDE) available for Windows, Linux, and Mac OS X; based on Eclipse 3.7; supports many core development tasks, including remote build/execute facilities.
IBM Parallel Debugger for X10 Programming adds X10 language support to the IBM Parallel Debugger; available on IBM developerWorks (Native X10 on Linux only).
25 X10 Compilation
(Diagram: the X10 compiler front end parses and type-checks X10 source into an X10 AST, then applies AST optimizations and lowering. The C++ back end generates C++ source, which a C++ compiler links against the XRC/XRX native runtime to produce native code (Native X10). The Java back end generates Java source, which a Java compiler turns into bytecode against the XRJ runtime for Java VMs (Managed X10), with JNI access to the native environment. Both back ends sit on X10RT.)
26 X10 Runtime Software Stack
The stack, top to bottom: X10 application program; X10 core class libraries; XRX runtime; X10 language native runtime; X10RT over PAMI, DCMF, MPI, or TCP/IP.
- Core class libraries: fundamental classes & primitives, arrays, core I/O, collections, etc.; written in X10, compiled to C++ or Java
- XRX (X10 Runtime in X10): APGAS functionality; concurrency: async/finish (work stealing); distribution: Places/at; written in X10, compiled to C++ or Java
- X10 language native runtime: runtime support for core sequential X10 language features; two versions, C++ and Java
- X10RT: active messages, collectives, bulk data transfer; implemented in C; abstracts/unifies network layers (PAMI, DCMF, MPI, etc.) to enable X10 on a range of transports
27 Performance Model: async, finish, and when
- A pool of worker threads runs in each Place; each worker maintains a queue of pending asyncs
- A worker pops an async from its own queue, executes it, and repeats; if its queue is empty, it steals from another worker's queue
- finish typically does not block the worker (it runs a subtask from the queue instead of blocking), but may block because of thieves or a remote async: worker 1 reaches the end of a finish block while worker 2 is still running an async
- A blocking finish (or a when) incurs an OS-level context switch and may incur thread-creation cost (to maintain the set of active workers)
Cost analysis:
- async: queue operations (fences, at worst a CAS); a heap-allocated task object (allocation + GC cost)
- finish (single place): a synchronized counter; a heap-allocated finish object (allocation + GC cost)
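The pop-own-queue / steal-when-empty discipline above can be sketched in a few lines. This Python toy (names Worker and run are invented; real XRX workers run concurrently and steal newest-first, details this single-threaded simulation glosses over) shows the essential flow: each worker drains its own deque, and idle workers take work from a victim's queue.

```python
from collections import deque
import random

class Worker:
    """Minimal work-stealing worker: an owned deque of pending tasks."""
    def __init__(self):
        self.tasks = deque()

def run(workers):
    """Round-robin simulation: run until every queue drains; count executions."""
    executed = 0
    while True:
        progress = False
        for w in workers:
            if w.tasks:
                w.tasks.pop()()          # pop from own queue, execute
                executed += 1
                progress = True
            else:                        # queue empty: steal from a victim
                victims = [v for v in workers if v is not w and v.tasks]
                if victims:
                    w.tasks.append(random.choice(victims).tasks.popleft())
                    progress = True
        if not progress:                 # no work anywhere: terminate
            return executed

workers = [Worker() for _ in range(4)]
results = []
for i in range(20):                      # unbalanced: all work starts on worker 0
    workers[0].tasks.append(lambda i=i: results.append(i))
total = run(workers)
assert total == 20 and sorted(results) == list(range(20))
```

The simulation's termination test (no queue has work) is the part that a real distributed runtime cannot check so cheaply, which is exactly why the lifeline scheme later in the tutorial leans on finish for termination detection.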
28 Performance Model: Places and at
At runtime, each Place is a separate process. Costs of at:
- Serialization/deserialization (a function of the size of the captured state)
- Communication (if destination != here)
- Scheduling: the worker thread may block (if destination != here)
29 Performance Model: Native X10 vs. Managed X10
Native X10 (base performance model is C++):
- Object-oriented features more expensive: C++ compilers don't optimize virtual dispatch, interface invocation, instanceof, etc.
- Memory management: BDWGC; allocation approximately like malloc; conservative (non-moving) GC; parallel, but not concurrent
- Generics map to C++ templates
- Structs map to C++ classes with no virtual methods, inlined in the containing object/array
- C++ compilation is mostly file-based, which limits cross-class optimization
- More deterministic execution (less runtime jitter)
Managed X10 (base performance model is Java):
- Object-oriented features less expensive: JVMs do optimize virtual dispatch, interface invocation, instanceof, etc.
- Memory management: very fast allocation; highly tuned garbage collectors
- Generics map to Java generics: mostly erased; primitives boxed
- Structs map to Java classes (no difference between X10 structs and X10 classes)
- The JVM dynamically optimizes the entire program
- Less deterministic execution (more jitter) due to JIT compilation, profile-directed optimization, and GC
30 Tutorial Outline X10 Overview (70 minutes) Programming model concepts Key language features Implementation highlights Scaling SPMD-style computations in X10 (40 minutes) Scaling finish/async Scaling communication Collective operations Unbalanced computations (20 minutes) Global Load Balancing & UTS Application frameworks (20 minutes) Global Matrix Library SAT X10 Summary and Discussion (20 minutes)
31 Towards a Scalable HelloWholeWorld

  class HelloWholeWorld {
    public static def main(args:Rail[String]) {
      finish
        for (p in Place.places())
          at (p)
            async
              Console.OUT.println(p + " says " + args(0));
    }
  }

Why won't HelloWholeWorld scale?
1. Using the most general (default) implementation of finish
2. Sequential initiation of asyncs
3. Serializing more data than is actually necessary (minor improvement)
32 Towards a Scalable HelloWholeWorld (2)

  public static def main(args:Rail[String]) {
    finish for (p in Place.places()) {
      val arg = args(0);
      at (p) async Console.OUT.println("(At " + here + ") : " + arg);
    }
  }

Minor change: optimize serialization to reduce message size.
33 Towards a Scalable HelloWholeWorld (3)

  public static def main(args:Rail[String]) {
    finish for (p in Place.places()) {
      val arg = args(0);
      at (p) async Console.OUT.println("(At " + here + ") : " + arg);
    }
  }

Optimize the finish implementation via a Pragma annotation.
34 Towards a Scalable HelloWholeWorld (4)

  public static def main(args:Rail[String]) {
    val arg = args(0);
    finish for (var i:Int = Place.numPlaces()-1; i >= 0; i -= 32) {
      at (Place(i)) async {
        val max = Runtime.hereInt();
        val min = Math.max(max-31, 0);
        finish for (var j:Int = min; j <= max; ++j) {
          at (Place(j)) async Console.OUT.println("(At " + here + ") : " + arg);
        }
      }
    }
  }

Parallelize async initiation. This scales nicely, but it is getting complicated!
35 Finally: a Simple and Scalable HelloWholeWorld

  public static def main(args:Rail[String]) {
    val arg = args(0);
    PlaceGroup.WORLD.broadcastFlat(()=>{
      Console.OUT.println("(At " + here + ") : " + arg);
    });
  }

Put the complexity in the standard library! broadcastFlat encapsulates the chunked loop & pragma; use a closure (function) to get a clean API. A recurring theme in X10: simple primitives + strong abstraction = powerful new user-defined primitives.
36 K-Means: description of the problem
Given: a set of points (black dots) in D-dimensional space (example: D = 2) and a desired number of partitions K (example: K = 4). Calculate: the K cluster locations (red blobs), minimizing the average distance from each point to its cluster.
37 Lloyd's Algorithm
Computes good (not best) clusters: initialize with a guess for the K positions, then iteratively improve until a fixed point is reached.
Algorithm:
  while (!bored) {
    classify points by nearest cluster
    average equivalence classes for improved clusters
  }
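One Lloyd iteration (classify by nearest center, then average each class) can be written compactly. This Python sketch (the helper name lloyd_step is invented; the tutorial's actual kernel is the distributed X10 code shown later) works in 2-D for readability:

```python
def lloyd_step(points, centers):
    """One Lloyd iteration: assign points to the nearest center, then average."""
    k = len(centers)
    sums = [[0.0, 0.0] for _ in range(k)]
    counts = [0] * k
    for (x, y) in points:
        # classify: index of the nearest center (squared Euclidean distance)
        j = min(range(k), key=lambda c: (x - centers[c][0])**2 + (y - centers[c][1])**2)
        sums[j][0] += x; sums[j][1] += y; counts[j] += 1
    # average each equivalence class; keep an empty cluster's center unchanged
    return [(sums[j][0] / counts[j], sums[j][1] / counts[j]) if counts[j] else centers[j]
            for j in range(k)]

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers = [(0.0, 0.0), (10.0, 10.0)]
for _ in range(5):                       # iterate toward the fixed point
    centers = lloyd_step(points, centers)
assert centers == [(0.0, 0.5), (10.0, 10.5)]
```

The distributed version on the following slides splits the point loop across places and replaces the averaging step with an allreduce over per-place sums and counts.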
39 Summary of Scalable KMeans in X10 Natural formulation as an SPMD-style computation Evenly distribute data points across Places Iteratively Each Place assigns its points to the closest cluster Each Place computes new cluster centroids Global reduction to average together computed centroids Use X10 Team APIs for scalable reductions Complete benchmark: 150 lines of X10 code
40 KMEANS
- Goal: unsupervised clustering
- Performance: 1 task: 6.13s; 1470 hosts: 6.27s; 97% efficiency
- Configuration: 40000*#tasks points, 4096 clusters, 12 dimensions; run time for 5 iterations
Performance on a 1470-node IBM Power775.
41 KMeans initialization

  PlaceGroup.WORLD.broadcastFlat(()=>{
    val role = Runtime.hereInt();
    val random = new Random(role);
    val host_points = new Rail[Float](num_slice_points*dim, (i:Int)=>random.next());
    val host_clusters = new Rail[Float](num_clusters*dim);
    val host_cluster_counts = new Rail[Int](num_clusters);
    val team = Team.WORLD;
    if (role == 0)
      for (k in 0..(num_clusters-1))
        for (d in 0..(dim-1))
          host_clusters(k*dim+d) = host_points(k+d*num_slice_points);
    val old_clusters = new Rail[Float](num_clusters*dim);
    team.allreduce(role, host_clusters, 0, host_clusters, 0,
                   host_clusters.size, Team.ADD);
    ... // Main loop implementing Lloyd's algorithm (next slide)
  });
42 KMeans compute kernel

  for (iter in 0..(iterations-1)) {
    Array.copy(host_clusters, old_clusters);
    host_clusters.clear();
    host_cluster_counts.clear();

    // Assign each point to the closest cluster
    for (p in ...) {
      // sequential compute kernel elided
    }

    // Compute new clusters
    team.allreduce(role, host_clusters, 0, host_clusters, 0,
                   host_clusters.size, Team.ADD);
    team.allreduce(role, host_cluster_counts, 0, host_cluster_counts, 0,
                   host_cluster_counts.size, Team.ADD);
    for (k in 0..(num_clusters-1))
      for (d in 0..(dim-1))
        host_clusters(k*dim+d) /= host_cluster_counts(k);
    team.barrier(role);
  }
43 LU Factorization
HPL-like algorithm:
- 2-d block-cyclic data distribution
- right-looking variant with row partial pivoting (no look-ahead)
- recursive panel factorization with pivot search and column broadcast combined
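The 2-d block-cyclic distribution mentioned above assigns block (I, J) to a place determined by both block indices modulo the place-grid dimensions. A minimal sketch (the helper owner and row-major place numbering are assumptions for illustration, not the benchmark's actual mapping code):

```python
def owner(I, J, px, py):
    """Place owning block (I, J) under a 2-d block-cyclic distribution
    over a px-by-py place grid (row-major place numbering assumed)."""
    return (I % px) * py + (J % py)

# A 4x4 grid of blocks on a 2x2 place grid: each place owns a cyclic
# sub-lattice of blocks, so work stays balanced as the active trailing
# submatrix shrinks during the factorization.
grid = [[owner(I, J, 2, 2) for J in range(4)] for I in range(4)]
assert grid[0] == [0, 1, 0, 1]
assert grid[1] == [2, 3, 2, 3]
assert grid[2] == [0, 1, 0, 1]

# every place owns the same number of blocks
from collections import Counter
counts = Counter(p for row in grid for p in row)
assert all(c == 4 for c in counts.values())
```

The cyclic wrap in both dimensions is what keeps every place busy in the late stages of LU, when only the bottom-right corner of the matrix is still being updated.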
44 Summary of Scalable LU in X10
SPMD-style implementation:
- one top-level async per place (broadcastFlat)
- globally synchronous progress
- collectives: barriers, broadcast, reductions (row, column, WORLD)
- point-to-point RDMAs for row swaps
- uses BLAS for local computations
Complete benchmark: 650 lines of X10 code, plus C++ to wrap the native BLAS.
45 LU
- Goal: LU decomposition
- Performance: 1 task: gflops; 1 host: gflops; 1 drawer: gflops; 1024 hosts: gflops; 80% efficiency
- Configuration: 71% memory (16M pages); p*p or 2p*p grid; block size: 360; panel block size: 20
Performance on a 1470-node IBM Power775.
46 Accessing Native code

  class LU {
    native static def blockTriSolve(me:Rail[Double], diag:Rail[Double], ...):void;
    native static def blockTriSolveDiag(diag:Rail[Double], min:Int, max:Int, B:Int):void;
    ...
    // Use of blockTriSolve:
    blockTriSolveDiag(diag, min, max, B);
  }
47 Allocation of Distributed BlockedArray (in Huge Pages)

  public final class BlockedArray implements (Int,Int)=>Double {
    public def this(I:Int, J:Int, bx:Int, by:Int, rand:Random) {
      min_x = I*bx; min_y = J*by;
      max_x = (I+1)*bx-1; max_y = (J+1)*by-1;
      delta = by;
      offset = (I*bx+J)*by;
      raw = new Rail[Double](
        IndexedMemoryChunk.allocateZeroed[Double](bx*by, 8,
          IndexedMemoryChunk.hugePages()));
    }
    public static def make(M:Int, N:Int, bx:Int, by:Int, px:Int, py:Int) {
      return PlaceLocalHandle.makeFlat[BlockedArray](Dist.makeUnique(),
        ()=>new BlockedArray(M, N, bx, by, px, py));
    }
  }

  // In main program:
  val A = BlockedArray.make(M, N, B, B, px, py);
48 Row exchange via asynchronous RDMAs

  def exchange(row1:Int, row2:Int, min:Int, max:Int, dest:Int) {
    val source = here;
    ready = false;
    val size = A_here.getRow(row1, min, max, buffer);
    val _buffers = buffers; // use local variable to avoid serializing this!
    val _A = A;
    val _remoteBuffer = remoteBuffer; // RemoteArray (GlobalRef) wrapping buffer
    finish {
      at (Place(dest)) async finish {
        Array.asyncCopy[Double](_remoteBuffer, 0, _buffers(), 0, size);
        val size2 = _A().swapRow(row2, min, max, _buffers());
        Array.asyncCopy[Double](_buffers(), 0, _remoteBuffer, 0, size2);
      }
    }
    A_here.setRow(row1, min, max, buffer);
  }
49 Tutorial Outline
- X10 Overview (70 minutes): programming model concepts, key language features, implementation highlights
- Scaling SPMD-style computations in X10 (40 minutes): scaling finish/async, scaling communication, collective operations
- Unbalanced computations (20 minutes): Global Load Balancing & UTS
- Application frameworks (20 minutes): Global Matrix Library, SatX10
- Summary and Discussion (20 minutes)
50 Global Load Balancing
A significant class of compute-intensive problems can be cast as Collecting Finish (or "Map Reduce") problems, significantly simpler than Hadoop Map-Shuffle-Reduce problems:
- A parallel Map phase spawns work (may be driven by examining data in some shared global data structure)
- Each mapper may contribute 0 or more pieces of information to a result; these pieces can be combined by a commutative, associative reduction operator (cf. collecting finish)
Q: How do you efficiently execute such programs if the amount of work associated with each mapper can vary dramatically? Examples: state-space search, Unbalanced Tree Search (UTS), branch-and-bound search.
51 Global Load Balancing: Central Problem
Goal: efficiently (~90% efficiency) execute collecting-finish programs on a large number of nodes. Problems to be solved: balance work across a large number of places efficiently, and detect termination of the computation quickly.
Conventional approach: a worker steals from a randomly chosen victim when local work is exhausted. Key problem: when does a worker know there is no more work?
52 Global Load Balance: The Lifeline Solution
Intuition: X10 already has a framework for distributed termination detection: finish. Use it! But this requires that activities terminate at some point.
Idea: let activities terminate after w rounds of failed steals, but ensure that a worker with work can distribute it (using at (p) async S) to nodes known to be idle.
Lifeline graphs: before a worker dies, it creates z lifelines to other nodes. These form a work-distribution overlay graph. What kind of graph? One with low out-degree and low diameter. Our paper/implementation uses a hypercube.
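In a hypercube overlay, a place's lifelines are the places whose ids differ from its own in exactly one bit, which gives out-degree and diameter both O(log P). A sketch (the helper name lifelines is invented; the clamp to ids < P is one plausible way to handle non-power-of-two place counts, not necessarily what the paper's implementation does):

```python
def lifelines(p, P):
    """Lifeline neighbors of place p in a hypercube overlay:
    flip each bit of p, keeping only ids < P."""
    out, bit = [], 1
    while bit < P:
        q = p ^ bit
        if q < P:
            out.append(q)
        bit <<= 1
    return out

assert lifelines(0, 8) == [1, 2, 4]
assert lifelines(5, 8) == [4, 7, 1]   # 5^1, 5^2, 5^4
# out-degree is log2(P): low out-degree, and any place reaches any
# other within log2(P) hops (low diameter)
assert all(len(lifelines(p, 8)) == 3 for p in range(8))
```

Low out-degree bounds how many idle places a donor must push work to, and low diameter bounds how quickly work spreads from one busy place to the whole machine.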
53 Algorithm sketch for the Lifeline-based GLB Framework
The main activity spawns an activity per place within a finish. Work-stealing is augmented with work-distributing:
- Each activity processes its local work; when no work is left, it steals at most w times
- (Thief) If no work is found, it visits at most z lifelines in turn; each donor records that the thief needs work
- (Thief) If still no work is found, it dies
- (Donor) When it has work, it pushes some to the recorded thieves
When all activities die, the main activity continues to completion. Caveat: this scheme assumes that work is relocatable.
54 UTS: Unbalanced Tree Search
Characteristics:
- highly irregular => need for global load balancing
- geometric fixed law (-t 1 -a 3) => wide but shallow tree
Implementation:
- work queue of SHA1 hash of node + depth + interval of child ids
- global work stealing: n random victims (among m possible targets), then p lifelines (hypercube)
- thief steals half of the work (half of each node in the local work queue)
- finish counts lifeline steals only (25% performance improvement at scale)
- single thread (thief and victim cooperate)
Complete benchmark: 560 lines of X10 (+ 800 lines of sequential C for the SHA1 hash).
55 UTS
- Goal: unbalanced tree search
- Performance: 1 task: 10.9 M nodes/s; 1470 hosts: 505 B nodes/s; 98% efficiency
- Configuration: geometric fixed law; depth between 14 and 22; at scale (1470 hosts): -t 1 -a 3 -b 4 -r 19 -d 22; 69,312,400,211,958 nodes
Performance on a 1470-node IBM Power775.
56 UTS Main Loop:

  final def processAtMostN() {
    var i:Int = 0;
    for (; (i<n) && (queue.size>0); ++i) {
      queue.expand();
    }
    queue.count += i;
    return queue.size > 0;
  }

  final def processStack(st:PlaceLocalHandle[UTS]) {
    do {
      while (processAtMostN()) {
        Runtime.probe();
        distribute(st);
      }
      reject(st);
    } while (steal(st));
  }
57 UTS: steal

  def steal(st:PlaceLocalHandle[UTS]) {
    if (P == 1) return false;
    val p = Runtime.hereInt();
    empty = true;
    for (var i:Int=0; i < w && empty; ++i) {
      waiting = true;
      at async st().request(st, p, false);
      while (waiting) Runtime.probe();
    }
    for (var i:Int=0; (i<lifelines.size) && empty && (0<=lifelines(i)); ++i) {
      val lifeline = lifelines(i);
      if (!lifelinesActivated(lifeline)) {
        lifelinesActivated(lifeline) = true;
        waiting = true;
        at async st().request(st, p, true);
        while (waiting) Runtime.probe();
      }
    }
    return !empty;
  }
58 UTS: give

  def give(st:PlaceLocalHandle[UTS], loot:Queue.Fragment) {
    val victim = Runtime.hereInt();
    if (thieves.size() > 0) {
      val thief = thieves.pop();
      if (thief >= 0) {
        at async {
          st().deal(st, loot, victim);
          st().waiting = false;
        }
      } else {
        at async {
          st().deal(st, loot, -1);
          st().waiting = false;
        }
      }
    } else {
      val thief = lifelineThieves.pop();
      at (Place(thief)) async st().deal(st, loot, victim);
    }
  }
59 UTS: distribute & reject

  def distribute(st:PlaceLocalHandle[UTS]) {
    var loot:Queue.Fragment;
    while ((lifelineThieves.size() + thieves.size() > 0)
        && (loot = queue.grab()) != null) {
      give(st, loot);
    }
    reject(st);
  }

  def reject(st:PlaceLocalHandle[UTS]) {
    while (thieves.size() > 0) {
      val thief = thieves.pop();
      if (thief >= 0) {
        lifelineThieves.push(thief);
        at async { st().waiting = false; }
      } else {
        at async { st().waiting = false; }
      }
    }
  }
60 UTS: deal

  def deal(st:PlaceLocalHandle[UTS], loot:Queue.Fragment, source:Int) {
    val lifeline = source >= 0;
    if (lifeline) lifelinesActivated(source) = false;
    if (active) {
      empty = false;
      processLoot(loot, lifeline);
    } else {
      active = true;
      processLoot(loot, lifeline);
      processStack(st);
      active = false;
    }
  }
61 Tutorial Outline
- X10 Overview (70 minutes): programming model concepts, key language features, implementation highlights
- Scaling SPMD-style computations in X10 (40 minutes): scaling finish/async, scaling communication, collective operations
- Unbalanced computations (20 minutes): Global Load Balancing & UTS
- Application frameworks (20 minutes): Global Matrix Library, SatX10
- Summary and Discussion (20 minutes)
62 X10 Global Matrix Library
An easy-to-use matrix library in X10 supporting parallel execution on multiple places. Presents a standard sequential interface (internal parallelism and distribution).
Matrix data structures:
- Dense matrix; symmetric and triangular matrix: column-wise storage
- Sparse matrix: CSC-LT
- Vector; distributed and duplicated vector
- Block matrix; distributed and duplicated block matrix
Matrix operations:
- scaling; cell-wise add, subtract, multiply, and divide
- matrix multiplication
- norm, trace, linear equation solver
Approximately 45,000 lines of X10 code (open source).
63 Global Matrix Library
(Diagram: the GML types (Vector, SparseCSC, Dense, block matrix, duplicated block, distributed block, dense matrix) sit over two back ends. The native C/C++ back end uses BLAS wrappers and Team collectives over MPI, socket, and PAMI transports, plus third-party multi-threaded BLAS (GotoBLAS) and LAPACK; the managed Java back end runs PGAS over sockets. Dense and sparse matrix X10 drivers exercise both back ends.)
64 Matrix Partitioning and Distribution
- Partitioning: grid partitioning, balanced partitioning, or user-specified partitioning of a block matrix (blocks B0 .. Bm*n-1; dense, sparse, or mixed blocks)
- Block distribution: a block-to-place mapping; unique, constant, cyclic, grid, vertical, horizontal, or random
67 GML Example: PageRank

  DML:
    while (i < max_iteration) {
      p = alpha*(G%*%p) + (1-alpha)*(E%*%u%*%p);
    }

The programmer determines the distribution of the data: G is block-distributed by row; P is replicated. The programmer writes sequential, distribution-aware code and performs the usual optimizations (e.g., allocate big data structures outside the loop). Could equally well have been written in Java.

  X10:
    def step(G:..., P:..., GP:..., dupGP:..., UP:..., U:..., E:...) {
      for (1..iteration) {
        GP.mult(G,P).scale(alpha).copyInto(dupGP); // broadcast
        P.local().mult(E, UP.mult(U,P.local()))
         .scale(1-alpha).cellAdd(dupGP.local())
         .sync(); // broadcast
      }
    }
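The update p = alpha*G*p + (1-alpha)*... above is one step of power iteration. A minimal sketch in plain Python (the function pagerank_step, the adjacency-dict representation, and the uniform teleport term are illustrative assumptions, not GML's API), using a tiny cycle graph whose fixed point is the uniform distribution:

```python
def pagerank_step(links, p, alpha=0.85):
    """One power-iteration step: p' = alpha*G*p + (1-alpha)*e/n,
    where G is the column-stochastic link matrix (uniform teleport)."""
    n = len(p)
    new_p = [(1 - alpha) / n] * n        # teleport term
    for src, outs in links.items():      # distribute src's rank to its out-links
        share = alpha * p[src] / len(outs)
        for dst in outs:
            new_p[dst] += share
    return new_p

# toy 3-page graph: 0 -> 1, 1 -> 2, 2 -> 0 (a cycle, so the fixed point is uniform)
links = {0: [1], 1: [2], 2: [0]}
p = [1/3] * 3
for _ in range(50):
    p = pagerank_step(links, p)
assert all(abs(x - 1/3) < 1e-9 for x in p)
assert abs(sum(p) - 1.0) < 1e-9          # rank remains a probability distribution
```

In the GML version, the G*p product is the distributed part (G block-distributed by row), while the replicated vector p is re-broadcast each iteration, which is exactly what the copyInto/sync calls in the X10 snippet express.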
68 PageRank Performance
Platforms: Triloka x86 cluster (TK) with Managed X10 and Native X10 on socket transport; Watson4P (BlueGene/P) with Native X10 on the BlueGene DCMF transport.
(Charts: Page Rank run time (sec) vs. millions of nonzeros in G at 64 places, and run time vs. number of places at 1000 million nonzeros in G, for Java-socket TK, Native-socket TK, and Native-pgas BGP.)
69 GML Example: Gaussian Non-Negative Matrix Multiplication
Key kernel for topic modeling. Involves factoring a large (D x W) matrix V, with D ~ 100M and W ~ 100K, but sparse (0.001). The key decision is the representation for the matrix and its distribution; note that the app code is polymorphic in this choice. Iterative algorithm involving distributed sparse matrix multiplication and cell-wise matrix operations (V ~ W * H, with the blocks of V distributed across places P0..Pn).

  for (1..iteration) {
    H.cellMult(WV.transMult(W,V,tW)
      .cellDiv(WWH.mult(WW.transMult(W,W),H)));
    W.cellMult(VH.multTrans(V,H)
      .cellDiv(WHH.mult(W,HH.multTrans(H,H))));
  }
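The loop above is the classic multiplicative NMF update (H scaled by W'V ./ W'WH, W scaled by VH' ./ WHH'). A small dense Python sketch (helper names matmul, nmf_step, err and the eps guard against division by zero are illustrative; GML runs the same algebra on distributed sparse/dense blocks):

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def nmf_step(V, W, H, eps=1e-9):
    """One multiplicative NMF update:
    H <- H .* (W'V) ./ (W'W H);  W <- W .* (VH') ./ (W H H')."""
    Wt = transpose(W)
    WtV, WtWH = matmul(Wt, V), matmul(matmul(Wt, W), H)
    H = [[H[i][j] * WtV[i][j] / (WtWH[i][j] + eps) for j in range(len(H[0]))]
         for i in range(len(H))]
    Ht = transpose(H)
    VHt, WHHt = matmul(V, Ht), matmul(W, matmul(H, Ht))
    W = [[W[i][j] * VHt[i][j] / (WHHt[i][j] + eps) for j in range(len(W[0]))]
         for i in range(len(W))]
    return W, H

def err(V, W, H):
    WH = matmul(W, H)
    return sum((V[i][j] - WH[i][j])**2 for i in range(len(V)) for j in range(len(V[0])))

V = [[5.0, 3.0, 0.5], [4.0, 0.2, 1.0], [1.0, 1.0, 5.0]]
W = [[1.0, 0.5], [0.8, 0.3], [0.2, 1.0]]
H = [[1.0, 1.0, 0.2], [0.3, 0.2, 1.0]]
e0 = err(V, W, H)
for _ in range(100):
    W, H = nmf_step(V, W, H)
assert err(V, W, H) < e0   # reconstruction error decreases
```

The cellMult/cellDiv calls in the X10 code correspond to the elementwise scale-by-ratio steps here; the expensive parts at scale are the distributed transMult/multTrans products on sparse V.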
70 GNMF Performance
Test platforms:
  Triloka Cluster (TK): 16 quad-core AMD Opteron 2356 nodes, 1-Gb/s Ethernet
  Watson4P BlueGene/P system (BGP): 2 cards, 64 nodes
Test settings: run on 64 places
  V: sparsity 0.001, (1,000,000 x 100,000) ~ (10,000,000 x 1,000,000)
  W: dense matrix (1,000,000 x 10) ~ (10,000,000 x 10)
  H: dense matrix (10 x 1,000,000)
[Charts: GNMF run time and computation time (sec) against nonzeros in V (millions) for Java-socket TK, Native-socket TK, and Native-pgas BGP]
71 Global Matrix Library Summary
Easy to use
  Simplified matrix operation interface
  Managed back end simplifies development, while native back end provides high-performance computing capability
  Hides concurrency and distribution details from application developers
Hierarchical parallel computing capability
  Within a node: multi-threaded BLAS/LAPACK for intensive computation
  Between nodes: collective communication for coarse-grained parallel computing
Adaptability
  Data layout compatible with existing specifications (BLAS, CSC-LT)
Portability
  Supports MPI, socket, PAMI, and BlueGene transports
Availability
  Open source; x10.gml project in the X10 project main svn on SourceForge.net
72 SatX10
Framework to combine existing sequential SAT solvers into a parallel portfolio solver
  Diversity: small number of base solvers, but different parameterizations
  Knowledge sharing: exchange learned clauses between solvers
Structure of SatX10
  Minimal modifications to an existing solver (add callbacks for knowledge sharing)
  Small X10/C++ interoperability wrapper for each solver
  Small (100s of lines) X10 program for communication/distribution
Allows the parallel solver to run on a single machine with multiple cores and across multiple machines, sharing information such as learned clauses
More information:
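The portfolio pattern above — differently parameterized solvers racing on the same instance, sharing learned clauses, first finisher wins — can be sketched in plain Python threads. This is a toy illustration only, not the SatX10 API; all names and the dummy "solvers" are invented for the sketch:

```python
# Toy sketch of a portfolio solver (plain Python threads, not SatX10):
# several solvers race on one instance, publish "learned clauses" to a
# shared queue, and the first to produce an answer wins.
import queue, threading

def run_portfolio(solvers, instance):
    result = {}
    done = threading.Event()
    shared = queue.Queue()            # learned clauses visible to all

    def work(sid, solve):
        ans = solve(instance, shared, done)
        if ans is not None and not done.is_set():
            result['answer'], result['winner'] = ans, sid
            done.set()                # tell the other solvers to stop

    threads = [threading.Thread(target=work, args=(i, s))
               for i, s in enumerate(solvers)]
    for t in threads: t.start()
    for t in threads: t.join()
    return result

# Dummy "solvers": both could decide the instance, at different speeds.
def fast(inst, shared, done):
    shared.put(('clause', tuple(inst)))  # share something it "learned"
    return sum(inst) % 2 == 0            # stand-in for a SAT check

def slow(inst, shared, done):
    done.wait(timeout=1.0)               # loses the race
    return None if done.is_set() else sum(inst) % 2 == 0

print(run_portfolio([fast, slow], [2, 4, 6]))
```

In SatX10 the same roles are played by X10 activities at different places, with the base-class callbacks moving clauses between the incoming/outgoing buffers instead of a Python queue.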
73 SATX10 Architecture
SatX10 Framework 1.0:
SolverX10Callback / SolverSatX10Base*
  Data objects: placeID, maxLenShrCl, outBufSize, incomingClsQ, outgoingClsQ
  Pure virtual methods: x10_parseDimacs(), x10_nVars(), x10_nClauses(), x10_solve(), x10_printSoln(), x10_printStats()
  Other controls: x10_kill(), x10_wasKilled(), x10_accessIncBuf(), x10_accessOutBuf()
  Callback methods: x10_step(), x10_processOutgoingCls()
SatX10.x10
  Main X10 routines to launch solvers at various places: Glucose::Solver*, Minisat::Solver*
SatX10 Solver
  Implements callbacks of the base class; CallbackStats (data)
  Routines for X10 to interact with solvers: solve(), kill(), bufferIncomingCls(), printInstanceInfo(), printResults()
SatX10 Minisat / SatX10 Glucose
  Specialized solvers (Minisat::Solver*, Glucose::Solver*) implementing the pure virtual methods of the base class
  Other methods: bufferOutgoingCls(), processIncomgCls()
Specialization for individual solvers
74 Preliminary Empirical Results: Same Machine
Same machine, 8 cores, clause lengths = 1 and 8
Note: promising but preliminary results; the focus so far has been on developing the framework, not on producing a highly competitive solver
75 Preliminary Empirical Results: Multiple Machines
8 machines/8 cores vs. 16 machines/64 cores, clause lengths = 1 and 8
Same executable as for a single machine, just different parameters!
76 Tutorial Outline X10 Overview (70 minutes) Programming model concepts Key language features Implementation highlights Scaling SPMD-style computations in X10 (40 minutes) Scaling finish/async Scaling communication Collective operations Unbalanced computations (20 minutes) Global Load Balancing & UTS Application frameworks (20 minutes) Global Matrix Library SAT X10 Summary and Discussion (20 minutes)
77 X10 Project Highlights since SC 11
Three major releases: 2.2.2, 2.2.3,
  Maintained backwards compatibility with the X10 language (June 2011)
  Language backwards compatibility will be maintained in future releases
Smooth Java interoperability
  Major feature of the X10 release
Application work at IBM
  M3R: Main Memory Map Reduce (VLDB 12)
  Global Matrix Library (open sourced Oct 2011; available in X10 svn)
  SatX10 (SAT 12)
  HPC benchmarks (for PERCS) run at scale on a 47,000-core Power 775
78 Summary of X10 '12 Workshop
Second annual X10 workshop (co-located with PLDI/ECOOP in June 2012)
Papers
  1. Fast Method Dispatch and Effective Use of Primitives for Reified Generics in Managed X10
  2. Distributed Garbage Collection for Managed X10
  3. StreamX10: A Stream Programming Framework on X10
  4. X10-based Massive Parallel Large-Scale Traffic Flow Simulation
  5. Characterization of Smith-Waterman Sequence Database Search in X10
  6. Introducing ScaleGraph: An X10 Library for Billion Scale Graph Analytics
Invited talks
  1. A Tutorial on X10 and its Implementation
  2. M3R: Increased Performance for In-Memory Hadoop Jobs
X10 '13 will be held in June 2013, co-located with PLDI '13
79 X10 Community: StreamX10
Slide by Haitao Wei
More informationReport: Declarative Machine Learning on MapReduce (SystemML)
Report: Declarative Machine Learning on MapReduce (SystemML) Jessica Falk ETH-ID 11-947-512 May 28, 2014 1 Introduction SystemML is a system used to execute machine learning (ML) algorithms in HaDoop,
More information