Background. Bulgaria. California 7/7/2014. Big Data, HPC & MapReduce 1

Transcription

1 Background Bulgaria California

2 Background Chapman U. Southern California

3 Background Chapman University Schmid College of Science and Technology School of Computational Sciences PhD & MS in Computational & Data Sciences BS in Computer Science, Computer Information Systems, Math, Math & Civil Engineering

4 Big Data, High-Performance Computing, and MapReduce Algorithms on Grids Atanas RADENSKI School of Computational Sciences Chapman University Orange, California, USA

5 Taxonomy Experiments & Measurements Theory Big Data Modeling & Simulation Data Science Computational Science Data-Intensive Applications Compute-Intensive Applications High-Performance Computing Systems

6 Experiments & Measurements 16th-century illustration of Archimedes Experiment on the Archimedes principle

7 Theory The Scientific Method Anzenbacher Research Group: Jokes

8 Modeling & Simulation Modeling & Simulation aims to replace physical experiments with virtual ones. Benefits: M & S can be cheaper, faster, safer, more scalable, more insightful. Examples: (1) Simulation of nuclear weapon performance, (2) living cell simulation and (3) human brain simulation. Scientific method extended by modeling & simulation

9 Modeling & Simulation The US Advanced Simulation & Computing Program (ASC) develops and employs science-based computer simulation capabilities for predictive simulation of the US nuclear stockpile, without physical tests. ASC (formerly ASCI) was created after the last US underground nuclear test. High-resolution 3D simulations help to assess the health of a B-61 nuclear bomb

10 Modeling & Simulation ASC employs several Top500 supercomputers, including three petascale systems (Sequoia, Cielo, Roadrunner, 2013). ASC LANL used 32K processors on the Cielo supercomputer for full-physics, full-geometry 3D simulation of nuclear blasts on Earth-threatening asteroids (2012). Asteroid killer simulation of a one megaton nuclear blast to explode the Itokawa asteroid ( m)

11 Modeling & Simulation Stanford used a 128-node cluster to run the world's first complete model of a living organism, the bacterium Mycoplasma genitalium (2012). The bacterium, which causes an STI, has only 525 genes, the fewest of any independently living organism. This single-cell model synchronizes 28 submodels, each simulating a different process. Such computational models could bring rational design to biology and support the wholesale creation of new microorganisms. A computational model of M. genitalium includes all of its molecular components and their interactions. Cell 2012, Elsevier

12 Modeling & Simulation The European Human Brain Project aims to reverse-engineer the complete human brain in detailed computer models and simulations ( ). Brain modeling & simulation will provide new data on brain functions & pathology. In particular, simulations will be used to study drug effects on brain disease before experiments with human subjects. The project is also expected to contribute to the design of novel high-performance computing technologies and systems.

13 Big Data Scientific experiments, measurements, modeling and simulation have long been known to generate big data (e.g., in astronomy and meteorology). The NASA Center for Climate Simulation (NCCS), for example, stores 32 petabytes of climate observations and simulations on the Discover supercomputing cluster. 32 petabytes = 3.2 × 10^16 bytes. Right: A snapshot from the NCCS's 2-year, 10-kilometer global atmospheric simulation, which included revisiting the extraordinary 2005 Atlantic hurricane season. The simulation spawned 23 storms, compared to 28 in reality. Putman & Kekesi, NASA/Goddard (2012)

14 Big Data Apart from the sciences, recent years have been characterized by explosive growth of big data in private enterprise and government. Giga = 10^9, Tera = 10^12, Peta = 10^15, Exa = 10^18, Zetta = 10^21. [Figures: growth of globally produced data in zettabytes; stored data in petabytes, 2010.] Not all produced data is stored.

15 Big Data Volume of stored data in petabytes (2010). Peta = 10^15. Big data informally refers to data that is too voluminous, complex, or dynamic to capture, store, manage, and analyze using on-hand data tools.

16 Big Data Left: Examples of data sources of various sizes. Retrieved Giga = 10^9, Tera = 10^12, Peta = 10^15, Exa = 10^18, Zetta = 10^21

17 Data Science and Computational Science
Computational Science is about computational models and simulations. Deals with big computations, typically over structured data (matrices). Numerical analysis.
Data Science is about generalizable extraction of knowledge from (big) data. Deals with (big) data, typically unstructured and possibly heterogeneous. Statistics.

18 Data Science Data science is multi-disciplinary / datacook.blogspot.com In Data Science, the concern is finding interesting and robust patterns that satisfy the data, where "interesting" is usually something unexpected and actionable and "robust" is a pattern expected to occur in the future [Dhar, CACM, No 12, 2013]. For example, diabetes complication patterns like this one can be extracted from a health-care database: Age > 36 and #Medications > 6 => Complication=Yes (100% confidence). The pattern represents the type of question we should ask the database, if only we knew what to ask [Dhar].

19 Predictive Analytics Predictive analytics encompasses a variety of statistical techniques from modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future, or otherwise unknown, events. [Wikipedia]

20 High-Performance Computing Data science practice materializes in data-intensive applications (they involve big data & are I/O bound). Popular platform: Hadoop. Computational science practice materializes in compute-intensive applications (they involve big computations & are processor bound). Popular platforms: MPI and OpenMP.

21 Exascale Computing Exascale performance calls for new technologies & architectures. Compute-intensive applications for advanced science, national security, and energy research demand exascale performance (DOE, 2009). One exaflop in 2020? Challenge: Based on current technology, scaling today's systems to an exaflop level would consume more than a gigawatt of power (DOE, 2010) (>$20B/year). Top 500 performance / en.wikipedia.org/wiki/top500

22 The Hadoop Big Data Platform Yahoo's Hadoop cluster. Apache Hadoop is a very popular open-source software platform for data-intensive applications. Hadoop runs on commodity clusters with high fault-tolerance. Hadoop's primary modules are MapReduce (MR) and the Hadoop Distributed File System (HDFS). MR implements a high-level, implicit parallel programming model. HDFS provides high-throughput access to big data.

23 MapReduce (MR) MR was originally implemented as a proprietary product to support Google's needs for large-scale distributed processing of unstructured text data. Since the implementation of MR as a module of the open-source Apache Hadoop platform, MR has been applied to various domains, such as: sets and graphs; AI, machine learning and data mining; bioinformatics; image and video; evolutionary computing; and statistics and numerical mathematics.

24 The MR Advantage With MR, users specify serial-only computation in terms of a map method and a reduce method, and the underlying implementation automatically: parallelizes the computation, handles machine failures, and schedules efficient inter-machine communication. MR offers simplicity, built-in fault-tolerance, and scalability.

25 MR Stages

26 MR Streaming versus Standard MR Algorithms in the standard MR model must be implemented with the MR Java API. Standard MR requires Java expertise and can be difficult for domain scientists to use. I chose to work with the alternative MR streaming model. MR streaming permits algorithm implementation in a variety of languages, including higher-level scripting languages, such as Python and Ruby. MR streaming is easier to use but can be less efficient and in greater need of optimization.

27 MR Streaming Example (1 of 2)
    Input       Task Number   Mapper OUT   Reducer IN   Reducer OUT
    to be or    1             to 1         be 1         be 2
                              be 1         be 1         not 1
                              or 1         not 1        or 1
    not to be   2             not 1        or 1         to 2
                              to 1         to 1
                              be 1         to 1

28 MR Streaming Example (2 of 2)
    class Mapper:
        method Map():
            for line in stdin:
                for word in line:
                    Emit(key=word, value=1)

    class Reducer:
        method Reduce():
            for (word, value) in stdin:
                if word is same as previous:
                    sum += value
                else:
                    if previous exists: Emit(previous, sum)
                    previous = word
                    sum = value
            Emit(previous, sum)  # last word
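The streaming word count above could be written in Python roughly as follows. This is a minimal sketch, not the author's code: the names `map_words` and `reduce_counts` are illustrative, and the Hadoop shuffle is imitated by a local `sorted` call. In a real streaming job the mapper and reducer would be separate scripts reading stdin and writing tab-separated lines to stdout.

```python
from typing import Iterable, Iterator, Tuple


def map_words(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Mapper: emit (word, 1) for every word of every input line."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Iterator[Tuple[str, int]]:
    """Streaming reducer: pairs arrive sorted by key but are not grouped."""
    previous = None
    total = 0
    for word, value in pairs:
        if word != previous:
            if previous is not None:
                yield previous, total  # key changed: flush the finished word
            previous, total = word, 0
        total += value
    if previous is not None:
        yield previous, total  # flush the last word


if __name__ == "__main__":
    # sorted() stands in for the shuffle phase (an assumption of this sketch).
    shuffled = sorted(map_words(["to be or", "not to be"]))
    for word, count in reduce_counts(shuffled):
        print(f"{word}\t{count}")
```

The reducer has to detect key changes itself because streaming delivers sorted but ungrouped records.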

29 Standard MR Example (1 of 2)
    Input       Task Number   Mapper OUT   Reducer IN   Reducer OUT
    to be or    1             to 1         be 1 1       be 2
                              be 1         not 1        not 1
                              or 1         or 1         or 1
    not to be   2             not 1        to 1 1       to 2
                              to 1
                              be 1
Grouping of values for the same key is a semantic difference between standard MR (groups) and MR streaming (does not group).

30 Standard MR Example (2 of 2)
    class Mapper:
        method Map(key, value):
            for word in value:
                Emit(key=word, value=1)

    class Reducer:
        method Reduce(key, list-of-values):
            sum = 0
            for value in list-of-values:
                sum += value
            Emit(key, value=sum)
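For contrast with the streaming model, the grouped semantics of standard MR can be imitated in plain Python. This is only an in-memory sketch: `run_job`, `map_record`, and `reduce_group` are hypothetical names, and `itertools.groupby` over sorted pairs stands in for the framework's shuffle and grouping. It does not use the Hadoop Java API.

```python
from itertools import groupby
from operator import itemgetter


def map_record(key, value):
    """Mapper: the value is a line of text; emit (word, 1) per word."""
    return [(word, 1) for word in value.split()]


def reduce_group(key, values):
    """Reducer: receives one key together with all of its values."""
    return key, sum(values)


def run_job(records):
    """Imitate map, shuffle (sort and group by key), and reduce in memory."""
    intermediate = [pair for key, value in records for pair in map_record(key, value)]
    intermediate.sort(key=itemgetter(0))
    return [
        reduce_group(key, [value for _, value in group])
        for key, group in groupby(intermediate, key=itemgetter(0))
    ]
```

Here the reducer never compares keys, because each call already receives the complete list of values for one key.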

31 Grid-Based Iterative Models In recent research, I have explored the potential of MapReduce streaming in the parallel simulation of grid-based iterative models (grid models, for brevity). Examples of grid-based models include iterative relaxation (such as Jacobi relaxation for the Laplace equation) and cellular automata (such as discrete and continuous life). Selected results from this research will be outlined in the rest of this presentation.

32 Grid-Based Iterative Models Informally, a grid model consists of a regular grid of cells that evolves at discrete steps in accordance with a state transition rule. The transition rule determines the next state of each cell as a function of the cell's current state and the current state of its neighborhood. The generic form of the state transition rule is:
$S(\mathrm{cell}, n+1) = F\bigl(S(\mathrm{cell}, n),\ \{\, S(c, n) \mid c \in \mathrm{Neighborhood}(\mathrm{cell}) \,\}\bigr)$
To specify a particular grid model, one needs to specify a state domain and define particular functions F and Neighborhood.

33 Grid Models: Laplace Relaxation Iterative relaxation methods (the simplest being Jacobi iteration) approximate the solutions of elliptic PDEs (the simplest being the Laplace equation). Laplace relaxation (a short term for Jacobi relaxation of the Dirichlet problem for the Laplace equation) is a grid model with the following state transition rule:
$S(\mathrm{cell}, n+1) = 0.25 \sum_{c \in \mathrm{Neighborhood}(\mathrm{cell})} S(c, n)$
Neighborhood(cell) generates the Von Neumann neighborhood of range 1. The above is an instance of the generic state transition rule:
$S(\mathrm{cell}, n+1) = F\bigl(S(\mathrm{cell}, n),\ \{\, S(c, n) \mid c \in \mathrm{Neighborhood}(\mathrm{cell}) \,\}\bigr)$
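A single Laplace relaxation step can be sketched serially before any MR machinery is involved. The function name `jacobi_step` and the list-of-lists grid layout are assumptions made for illustration; boundary cells are held fixed, which corresponds to the Dirichlet condition.

```python
def jacobi_step(grid):
    """Return the grid after one Jacobi relaxation step.

    Boundary cells keep their values; each interior cell becomes
    0.25 times the sum of its four Von Neumann neighbors.
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # copy, so every update reads step-n values
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            new[r][c] = 0.25 * (
                grid[r - 1][c] + grid[r + 1][c] + grid[r][c - 1] + grid[r][c + 1]
            )
    return new
```

Writing into a copy rather than in place is what distinguishes Jacobi iteration from Gauss-Seidel, and it is also what makes the step easy to distribute: every new state depends only on the previous grid.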

34 Grid Models: Laplace Relaxation Other methods are known to converge much faster than Jacobi relaxation, yet I chose to parallelize Jacobi relaxation because of its simplicity, a quality that makes it appropriate as an initial test bed for distributed MR relaxation techniques. My goal was to explore whether it is at all possible to parallelize such methods with MR and to eventually provide an MR implementation that serves as an existence proof.

35 Laplace and Poisson Equations The Dirichlet problem for the Laplace equation, $\nabla^2 \varphi = 0$, is the problem of finding a function φ that solves the equation in the interior of a given region D and that is equal on the boundary of D to some given function g, the boundary condition. Intuitively, the Laplace equation defines the temperature equilibrium φ on D for a time-stationary heat flow in D (such that the temperature on D's boundary is time-independent and defined by g). Laplace's equation is a special case of Poisson's equation, $\nabla^2 \varphi = h$.
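The link between the Laplace equation and the 0.25-weighted neighbor sum used in Laplace relaxation follows from the standard five-point finite-difference approximation. The grid spacing δ and the indices i, j are notation introduced here for illustration and do not appear on the slides.

```latex
\nabla^2 \varphi(x_i, y_j) \approx
  \frac{\varphi_{i+1,j} + \varphi_{i-1,j} + \varphi_{i,j+1} + \varphi_{i,j-1} - 4\,\varphi_{i,j}}{\delta^2} = 0
\quad\Longrightarrow\quad
\varphi_{i,j} \approx \tfrac{1}{4}\bigl(\varphi_{i+1,j} + \varphi_{i-1,j} + \varphi_{i,j+1} + \varphi_{i,j-1}\bigr)
```

Jacobi relaxation turns the right-hand relation into an update, computing each interior value at step n+1 from its four neighbors at step n.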

36 Grid Models: Discrete Life Discrete life is a cellular automaton with two states, false/true (dead/alive), over an infinite 2D grid. Discrete life's most popular instance is Conway's Game of Life. Discrete life is a grid model with this state transition rule:
$S(\mathrm{cell}, n+1) = \bigl(S(\mathrm{cell}, n) \land \mathrm{AliveNeighbors}(\mathrm{cell}, n) \in A\bigr) \lor \bigl(\lnot S(\mathrm{cell}, n) \land \mathrm{AliveNeighbors}(\mathrm{cell}, n) \in B\bigr)$
AliveNeighbors(cell, n) is the total number of alive neighbors in the Moore neighborhood of range 1. A and B are given sets of integers; A = {2, 3} and B = {3} define Conway's Game of Life. Again, the above is an instance of the generic state transition rule:
$S(\mathrm{cell}, n+1) = F\bigl(S(\mathrm{cell}, n),\ \{\, S(c, n) \mid c \in \mathrm{Neighborhood}(\mathrm{cell}) \,\}\bigr)$
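The discrete life rule can be sketched serially over a set of live cells, which suits an unbounded grid. The name `life_step` and the parameter names `survive` and `birth` (playing the roles of the sets A and B) are illustrative. The defaults assume the usual Conway parameters: survival with 2 or 3 live neighbors, birth with exactly 3.

```python
from collections import Counter

# Offsets of the eight cells in the Moore neighborhood of range 1.
MOORE_OFFSETS = [
    (dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)
]


def life_step(alive, survive=frozenset({2, 3}), birth=frozenset({3})):
    """Return the set of live cells after one discrete life step."""
    alive = set(alive)
    # Each live cell adds 1 to the neighbor count of its eight neighbors.
    counts = Counter((r + dr, c + dc) for r, c in alive for dr, dc in MOORE_OFFSETS)
    candidates = alive | set(counts)
    return {
        cell
        for cell in candidates
        if (cell in alive and counts[cell] in survive)
        or (cell not in alive and counts[cell] in birth)
    }
```

Counting neighbors by letting every live cell contribute to its neighbors is the same idea that the MR message-passing formulation distributes across mappers and reducers.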

37 Grid Models: Continuous Life Continuous life is a cellular automaton with continuously valued states from the 0..1 range [Peper et al., 2010]. Continuous life generalizes Conway's discrete Game of Life. Continuous life is a grid model with this state transition rule:
$S(\mathrm{cell}, n+1) = G\Bigl(E\bigl(S(\mathrm{cell}, n) + 2 \sum_{c \in \mathrm{Neighborhood}(\mathrm{cell})} S(c, n)\bigr)\Bigr)$
Neighborhood(cell) generates the Moore neighborhood of range 1.
$G(z) = b / (b + 1)$, where $b = \exp(2z/T)$ and T is a parameter.
$E(x) = E_0 - (x - x_0)^2$, where $E_0$ and $x_0$ are parameters.
Again, the above is an instance of the generic state transition rule:
$S(\mathrm{cell}, n+1) = F\bigl(S(\mathrm{cell}, n),\ \{\, S(c, n) \mid c \in \mathrm{Neighborhood}(\mathrm{cell}) \,\}\bigr)$
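The continuous life rule translates almost directly into code. In this sketch the function names mirror the symbols G and E from the slide, while `continuous_life_cell` is a hypothetical helper name. The slide gives no values for T, E0, and x0, so they are required arguments here rather than defaults.

```python
import math


def G(z, T):
    """Sigmoid from the slide: b / (b + 1) with b = exp(2z / T)."""
    # May overflow if 2*z/T is very large; z never exceeds E0 in this model.
    b = math.exp(2.0 * z / T)
    return b / (b + 1.0)


def E(x, E0, x0):
    """Inverted parabola with maximum E0 at x = x0."""
    return E0 - (x - x0) ** 2


def continuous_life_cell(state, neighbor_states, T, E0, x0):
    """Next state of one cell from its state and its Moore-neighbor states."""
    x = state + 2.0 * sum(neighbor_states)
    return G(E(x, E0, x0), T)
```

Because E peaks at x0 and G is increasing, a cell's next state is largest when the weighted sum of its own state and its neighbors' states is close to x0.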

38 Grid Models on MR: Data Representation For MR simulation, cells can be identified with indexes, such as cell=(row, col) in the 2D case. Recall that both standard MR and MR streaming operate on key-value records. A whole grid can be represented as a set of records {(key=cell, value=state)}, where cell serves as an MR key and the current state of that cell serves as the MR value. Grid cells & states, represented as key-value pairs

39 Grid Models on MR: Message Passing A single state transition for the entire grid can be simulated in MR by means of the MR message passing method: For each input cell, mappers emit sequences of intermediate key-value pairs that are interpreted by a reducer as messages to the cell's neighbors. These messages carry the input cell's contribution to the calculation of its neighbors' next states. I have used message passing to develop basic and optimized MR streaming algorithms for Laplace relaxation, discrete life, and continuous life.

40 MR Laplace Relaxation Example (1 of 2) For each input record of the form (key=(row, col), value=state), a mapper emits four intermediate records of the form (key=neighbor, value=0.25*state). These four intermediate records are interpreted as messages from the input cell to its neighbors.

41 MR Laplace Relaxation Example (2 of 2) In the shuffle phase, messages to the same cell are dispatched to the same reducer, and each reducer receives all of its messages in sorted order. This enables the reducer to sum up all of the cell's neighbor contributions and then emit the cell and its newly calculated state.

42 Message-Passing Pattern for Grid Models
    class Mapper:
        method Map():
            for (input-cell, state-of-input-cell) in stdin:
                Emit(input-cell, state-of-input-cell)  # message to self
                for neighbor-cell in Neighborhood(input-cell):
                    Emit(neighbor-cell, contribution-of-input-cell)  # message to neighbor

    class Reducer:
        method Reduce():
            for (input-cell, input-value) in stdin:
                if Current-Input-Cell-Is-Different-From-Previous-Input-Cell():
                    if a previous cell exists:
                        Emit(previous-cell, completed-new-state-of-previous-cell)
                    Initialize-Current-Input-Cell-Processing()
                else:
                    Accumulate-Current-Input-State-Into-Partial-New-State()
            Emit(last-input-cell, completed-new-state-of-last-input-cell)

43 Basic & Optimized Grid Algorithms While a basic message-passing algorithm can handle data grids in a fault-tolerant distributed execution, it may also generate a large number of messages to be routed from mapper to reducer tasks. The volume of intermediate data in the MR network can become a performance bottleneck for larger-scale grids and thus offset the benefits of distributed MR execution. I have explored three optimization techniques that help reduce the volume of intermediate messages: local aggregation (LA), strip partitioning (SP), and message packing (MP). The optimization techniques are outlined next.

44 Basic & Optimized Laplace Relaxation [Figure: sample mapper input (all algorithms) and mapper output / reducer input for each algorithm; some neighborhood points are omitted in the samples.]
Basic algorithm: one message per neighbor, e.g. messages from (5 5), (4 5), and (6 5).
LA algorithm (reduces the number of messages): messages to the same cell are aggregated, e.g. aggregated messages from (4 5) and (6 5).
LA+SP (preserves data locality): key = strip index, row, col; messages are sorted by strip index, row, col. The sample shows strip index 0.
LA+SP+MP (packs to reduce the number of messages): key = strip index; messages are sorted by strip index alone and contain unsorted hashes. The sample shows strip indexes 0 and 1: 0 {(4,5):0.1,(6,5):0.1,(5,5):0.4} and 1 {(29,5):0.7,(27,5):0.2, }
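Message packing, as suggested by the samples on this slide, can be sketched as grouping all messages that belong to one strip into a single record whose value is a hash of cells to values. The function name `pack_messages` is hypothetical, and the strip length of 25 used in the usage note is an assumed value chosen only so that rows 4 to 6 fall in strip 0 and rows 27 to 29 in strip 1.

```python
from collections import defaultdict


def pack_messages(messages, strip_length):
    """Pack ((row, col), value) messages into one record per strip.

    Returns a list of (strip_index, {cell: value}) pairs sorted by strip
    index; values addressed to the same cell are summed while packing.
    """
    packed = defaultdict(dict)
    for (row, col), value in messages:
        strip = row // strip_length  # strip index of the destination row
        cells = packed[strip]
        cells[(row, col)] = cells.get((row, col), 0.0) + value
    return sorted(packed.items())
```

For example, `pack_messages([((4, 5), 0.1), ((29, 5), 0.7)], 25)` yields one record for strip 0 and one for strip 1, so the number of intermediate records depends on the number of strips rather than on the number of cells.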

45 Basic MR Relaxation: Mapper Pseudocode
    class Mapper:
        method Map():
            for line in stdin:
                (cell, in-state) = Parse(line)
                if Is-Boundary(cell):
                    Emit(cell, in-state)  # boundary cells keep their state
                out-val = 0.25 * in-state
                for neighbor in Neighborhood(cell):
                    if Is-Interior(neighbor):
                        Emit(neighbor, out-val)

46 Basic MR Relaxation: Reducer Pseudocode
    class Reducer:
        method Reduce():
            for line in stdin:
                (cell, in-val) = Parse(line)
                if cell is same as previous:
                    out-state += in-val
                else:
                    if previous exists: Emit(previous, out-state)
                    previous = cell
                    out-state = in-val
            Emit(previous, out-state)  # last cell
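The basic relaxation mapper and reducer can be exercised end to end in plain Python. This is a sketch under stated assumptions, not the author's implementation: the grid is a dict from (row, col) to state, the grid dimensions are passed explicitly, all function names are illustrative, and `sorted` stands in for the Hadoop shuffle.

```python
def neighborhood(cell):
    """Von Neumann neighborhood of range 1."""
    r, c = cell
    return [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]


def is_interior(cell, rows, cols):
    r, c = cell
    return 0 < r < rows - 1 and 0 < c < cols - 1


def is_boundary(cell, rows, cols):
    r, c = cell
    return r in (0, rows - 1) or c in (0, cols - 1)


def map_cells(records, rows, cols):
    """Mapper: boundary cells re-emit themselves; every cell sends
    0.25 * state to each interior neighbor."""
    for cell, state in records:
        if is_boundary(cell, rows, cols):
            yield cell, state
        out_val = 0.25 * state
        for neighbor in neighborhood(cell):
            if is_interior(neighbor, rows, cols):
                yield neighbor, out_val


def reduce_cells(messages):
    """Streaming reducer: sum consecutive messages addressed to one cell."""
    previous, out_state = None, 0.0
    for cell, value in messages:
        if cell != previous:
            if previous is not None:
                yield previous, out_state
            previous, out_state = cell, 0.0
        out_state += value
    if previous is not None:
        yield previous, out_state


def relax_once(grid, rows, cols):
    """One relaxation step: map, sort (imitated shuffle), reduce."""
    return dict(reduce_cells(sorted(map_cells(grid.items(), rows, cols))))
```

A boundary cell receives only its own message, because neighbors send contributions to interior cells only, so its state passes through the reducer unchanged.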

47 Relaxation with LA: Mapper Pseudocode
    class Mapper:
        method Map():
            hash = {}  # empty; missing entries are treated as 0
            for line in stdin:
                (cell, in-state) = Parse(line)
                if Is-Boundary(cell):
                    Emit(cell, in-state)
                out-val = 0.25 * in-state
                for neighbor in Neighborhood(cell):
                    if Is-Interior(neighbor):
                        hash[neighbor] += out-val
            for cell in hash:
                Emit(cell, value=hash[cell])
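Local aggregation can be sketched in Python with a `defaultdict` playing the role of the hash in the pseudocode. The 3 x 3 grid size (`ROWS`, `COLS`) and the function names are assumptions made for the example only.

```python
from collections import defaultdict

ROWS, COLS = 3, 3  # hypothetical grid size for this example


def is_interior(cell):
    r, c = cell
    return 0 < r < ROWS - 1 and 0 < c < COLS - 1


def is_boundary(cell):
    r, c = cell
    return r in (0, ROWS - 1) or c in (0, COLS - 1)


def map_with_local_aggregation(records):
    """Mapper with in-mapper aggregation: contributions to the same
    neighbor are summed locally and emitted once at the end."""
    partial = defaultdict(float)
    for (row, col), state in records:
        if is_boundary((row, col)):
            yield (row, col), state  # boundary cells keep their state
        out_val = 0.25 * state
        for neighbor in ((row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)):
            if is_interior(neighbor):
                partial[neighbor] += out_val
    yield from partial.items()
```

On a 3 x 3 grid the basic mapper emits 12 records (8 boundary cells plus 4 contributions to the center), while this version emits 9, because the four contributions to the center collapse into one. The saving grows with the number of neighboring cells that one mapper processes.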

48 Strip Partitioning The default MR partitioner tends to disperse neighborhoods and reduce data locality, which impairs local in-mapper aggregation and is detrimental to performance. With strip partitioning, a mapper sends whole strips of consecutive grid rows to the same reducer. Therefore, strips of adjacent points will remain in the same output file as produced by the reducer. This strategy preserves data locality during iterative simulation and promotes performance. Technically, mappers output intermediate records of the form (key = (strip, row, col), value), where strip is an index that identifies individual strips of grid rows. Given a strip-length parameter, the index can be calculated by integer division as strip = floor(row / strip-length).
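The strip index calculation and its effect on partitioning can be sketched as follows. Both function names are hypothetical, and assigning strips to reducers by `strip % num_reducers` is an assumption made for illustration; the slides do not say how the partitioner maps strips to reducers.

```python
def strip_key(row, col, strip_length):
    """Intermediate key (strip, row, col) for a message to cell (row, col)."""
    return (row // strip_length, row, col)


def reducer_for(key, num_reducers):
    """Pick a reducer from the strip index alone, so that every cell of a
    strip goes to the same reducer (assumed round-robin assignment)."""
    strip = key[0]
    return strip % num_reducers
```

With a strip length of 25, for instance, cells (4, 5) and (6, 5) both get strip index 0 and therefore land on the same reducer, whereas (29, 5) gets strip index 1.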

49 Performance: Empirical Evaluation We evaluated empirically the performance effects of the LA, SP, and MP optimization techniques. To do so, we developed eight algorithms, as outlined in this table: Basic LA LA+SP LA+SP+MP Laplace Discrete life + + Continuous life + + We implemented the algorithms in MR streaming with Python, then executed and timed the implementations on Amazon's Elastic MR Cloud with Amazon's Hadoop distribution.

50 Performance: Laplace Relaxation Experiments were performed on Amazon's Elastic MapReduce Cloud, on a cluster of 10 large instances, with grids of 10^8 points. Optimizations: Local Aggregation (LA), Strip Partitioning (SP), and Message Packing (MP)

51 Performance: Life Simulation Experiments were performed on Amazon's Elastic MapReduce Cloud on clusters of i large instances (i = 1, 2, 4, 8, 16), over randomly generated square grids of approximately 16*10^7 cells. [Figure: execution time in minutes versus number of nodes.] Optimizations: Local Aggregation (LA) and Strip Partitioning (SP)

52 Hadoop and MPI In comparison with MPI, Hadoop MR trades speed for convenience, including ease of use and fault-tolerance. Speed: In general, MPI can perform faster than Hadoop MR. While MPI is oriented towards memory-to-memory operations, Hadoop is oriented to file operations through its redundancy-based DFS. All intermediate key-value records always go to DFS. This difference makes Hadoop more robust and flexible but less efficient than MPI because of Hadoop's use of the DFS as a message-passing medium.

53 Hadoop and MPI Ease of use: The MPI user needs to express parallelism explicitly. The Hadoop MR user develops only sequential code and leaves parallelization to Hadoop. MR streaming takes usability one step further by enabling the user to select the programming language and development tools, instead of requiring the rigid Java API of standard MR. As a result, MR streaming is slower than standard MR.

54 Hadoop and MPI Fault tolerance: By default, an MPI application terminates upon a failure of a single process (unless the programmer develops custom fault-tolerance by means of checkpoints and intercommunicators). By contrast, Hadoop automatically detects failed worker nodes and resubmits incomplete work to live worker nodes, thus providing built-in fault-tolerance.

55 Hadoop and MPI According to Ding et al. [2011], MR streaming is slower than standard MR and both are considerably slower than MPI. Ding et al. timed word count, grep, sort, and Mandelbrot, and established that: Standard MR executes with consistently higher overhead compared with MPI (30%, 529%, 227%, 550% for word count, grep, sort, and Mandelbrot respectively), and that MR streaming has even higher overhead compared with MPI (153%, 734%, 553%, 791% for word count, grep, sort, and Mandelbrot respectively). The inefficiency of MR streaming in comparison to standard MR is attributed to the inefficient use of Unix pipes in the Hadoop streaming implementation.

56 Hadoop and MPI The significant performance advantage of MPI over standard MR as implemented in Hadoop has also been confirmed by Fox [2009] and by Ekanayake [2010] in empirical studies of a k-means clustering application, and by Ekanayake et al. [2008] for High Energy Physics data analysis. In general, MPI can be expected to significantly outperform Hadoop MR in iterative processing with frequent communication over data sets that fit in the cluster memory.

57 Post-Hadoop MR Frameworks Post-Hadoop MR frameworks replace the DFS as the medium for inter-task and inter-job communication with alternative, faster communication media, such as a network file system (NFS) or memory. Examples of such frameworks include MARIANE (MApReduce Implementation Adapted for HPC Environments), CGL-MapReduce, Spark, M3R, M3, GridGain, iMapReduce, Twister, and Phoenix. Such frameworks trade the DFS-supported fault-tolerance for speed, in some cases introducing alternative fault-tolerance mechanisms. For example, CGL-MapReduce is reported to perform close to MPI for large data sets in the case of k-means clustering and matrix multiplication.

58 Post-Hadoop: In-Memory MR In-memory MR implementations such as Spark, M3R, M3, and GridGain bypass the DFS to support in-memory processing. For problems that fit completely in available main memory, this approach would perform significantly better than standard MR. Spark, for example, is reported to perform word count up to 100 times faster than Hadoop. M3R can run Hadoop jobs unchanged, and run them considerably faster than the Hadoop engine itself, e.g. 45 times faster on several workloads for sparse matrix vector multiply.

59 Conclusions To the best of our knowledge, we are the first to propose and evaluate MR streaming algorithms for grid models in general, and strip partitioning in particular. Our work on distributed MR relaxation and life simulation builds on MR message passing ideas originally introduced for data-intensive graph algorithms, such as PageRank [Lin & Schatz, 2010]. Local in-mapper aggregation was originally proposed for data-intensive text processing, and MR message packing was first used in word co-occurrence counting [Lin & Dyer, 2010].

60 Conclusions In MR computing, convenience comes at the cost of performance: MR streaming is easy to use with Hadoop but Hadoop's streaming performance can be inadequate in scientific computing. As a new generation of MR frameworks is emerging, our MR streaming algorithms can be adapted to higher-performance MR parallelism models. In particular, converting our message-passing algorithms to the Spark model and empirically evaluating them in Spark execution is a plausible project.

61 Conclusions In conclusion, we agree that there is a clear algorithmic challenge to design more loosely coupled algorithms that are compatible with the map followed by reduce MR parallelism model, and more generally, to design algorithms compatible with the structure of clouds [Fox, 2010]. Although the trend may be quiet and distributed across only a relative few supercomputing sites, Hadoop and HPC are already hopping hand-in-hand more frequently [Hemsoth, 2014].

62 Selected References
Radenski, A., B. Norris. MapReduce Streaming Algorithms for Laplace Relaxation on the Cloud. In Parallel Computing: Accelerating Computational Science and Engineering, Advances in Parallel Computing 25, IOS Press, 2014 (in print).
Radenski, A. Using MapReduce Streaming for Distributed Life Simulation on the Cloud. In Advances in Artificial Life, ECAL 2013, MIT Press, 2013.
Radenski, A., L. Ehwerhemuepha. Speeding-Up Codon Analysis on the Cloud with Local MapReduce Aggregation. Information Sciences, Elsevier, 263 (2014).
