Background. Bulgaria. California 7/7/2014. Big Data, HPC & MapReduce 1

1 Background Bulgaria California Big Data, HPC & MapReduce 1

2 Background
Chapman U., Southern California

3 Background
Chapman University, Schmid College of Science and Technology, School of Computational Sciences. PhD & MS in Computational & Data Sciences; BS in Computer Science, Computer Information Systems, Math, Math & Civil Engineering.

4 Big Data, High-Performance Computing, and MapReduce Algorithms on Grids Atanas RADENSKI School of Computational Sciences Chapman University Orange, California, USA

5 Taxonomy
[Diagram labels: Experiments & Measurements; Theory; Big Data; Modeling & Simulation; Data Science; Computational Science; Data-Intensive Applications; Compute-Intensive Applications; High-Performance Computing Systems]

6 Experiments & Measurements
[Figures: a 16th-century illustration of Archimedes; an experiment on the Archimedes principle]

7 Theory
[Figure: The Scientific Method. Source: Anzenbacher Research Group: Jokes]

8 Modeling & Simulation
Modeling & simulation (M&S) aims to replace physical experiments with virtual ones. Benefits: M&S can be cheaper, faster, safer, more scalable, and more insightful. Examples: (1) simulation of nuclear weapon performance, (2) living cell simulation, and (3) human brain simulation.
[Figure: the scientific method extended by modeling & simulation]

9 Modeling & Simulation
The US Advanced Simulation & Computing Program (ASC) develops and employs science-based computer simulation capabilities for predictive simulation of the US nuclear stockpile, without physical tests. ASC (formerly ASCI) was created after the last US underground nuclear test.
[Figure: high-resolution 3D simulations help to assess the health of a B-61 nuclear bomb]

10 Modeling & Simulation
ASC employs several Top500 supercomputers, including three petascale systems (Sequoia, Cielo, Roadrunner, 2013). ASC LANL used 32K processors on the Cielo supercomputer for a full-physics, full-geometry 3D simulation of nuclear blasts on Earth-threatening asteroids (2012).
[Figure: "asteroid killer" simulation of a one-megaton nuclear blast to explode the Itokawa asteroid]

11 Modeling & Simulation
Stanford used a 128-node cluster to run the world's first complete model of a living organism, the bacterium Mycoplasma genitalium (2012). The bacterium, which causes a sexually transmitted infection, has only 525 genes, the fewest of any independently living organism. This single-cell model synchronizes 28 submodels, each simulating a different process. Such computational models could bring rational design to biology and support the wholesale creation of new microorganisms.
[Figure: a computational model of M. genitalium includes all of its molecular components and their interactions. Cell 2012, Elsevier]

12 Modeling & Simulation
The European Human Brain Project aims to reverse-engineer the complete human brain in detailed computer models and simulations. Brain modeling & simulation will provide new data on brain functions & pathology. In particular, simulations will be used to study drug effects on brain disease before experiments with human subjects. The project is also expected to contribute to the design of novel high-performance computing technologies and systems.

13 Big Data
Scientific experiments, measurements, modeling, and simulation have long been known to generate big data (e.g., in astronomy and meteorology). The NASA Center for Climate Simulation (NCCS), for example, stores 32 petabytes (32 × 10^15 bytes) of climate observations and simulations on the Discover supercomputing cluster.
[Figure: a snapshot from the NCCS's 2-year, 10-kilometer global atmospheric simulation, which included revisiting the extraordinary 2005 Atlantic hurricane season. The simulation spawned 23 storms, compared to 28 in reality. Putman & Kekesi, NASA/Goddard (2012)]

14 Big Data
Beyond the sciences, recent years have seen explosive growth of big data in private enterprise and government.
Giga = 10^9, Tera = 10^12, Peta = 10^15, Exa = 10^18, Zetta = 10^21
[Figures: growth of globally produced data in zettabytes; stored data in petabytes (2010)]
Not all produced data is stored.

15 Big Data
[Figure: volume of stored data in petabytes (2010); Peta = 10^15]
Big data informally refers to data that is too voluminous, complex, or dynamic to capture, store, manage, and analyze using on-hand data tools.

16 Big Data
[Figure: examples of data sources of various sizes]
Giga = 10^9, Tera = 10^12, Peta = 10^15, Exa = 10^18, Zetta = 10^21

17 Data Science and Computational Science
Computational Science is about computational models and simulations. It deals with big computations, typically over structured data (matrices). Underlying discipline: numerical analysis.
Data Science is about generalizable extraction of knowledge from (big) data. It deals with (big) data, typically unstructured and possibly heterogeneous. Underlying discipline: statistics.

18 Data Science
[Figure: data science is multi-disciplinary. Source: datacook.blogspot.com]
In Data Science, the concern is finding interesting and robust patterns that satisfy the data, where "interesting" is usually something unexpected and actionable and "robust" is a pattern expected to occur in the future [Dhar, CACM, No 12, 2013]. For example, diabetes complication patterns like this one can be extracted from a health-care database:
Age > 36 and #Medications > 6 → Complication = Yes (100% confidence)
The pattern represents the type of question we should ask the database, if only we knew what to ask [Dhar].

19 Predictive Analytics
Predictive analytics encompasses a variety of statistical techniques from modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future, or otherwise unknown, events. [Wikipedia]

20 High-Performance Computing
Data science practice materializes in data-intensive applications (they involve big data and are I/O bound). Popular platform: Hadoop.
Computational science practice materializes in compute-intensive applications (they involve big computations and are processor bound). Popular platforms: MPI and OpenMP.

21 Exascale Computing
Exascale performance calls for new technologies & architectures. Compute-intensive applications for advanced science, national security, and energy research demand exascale performance (DOE, 2009). One exaflop in 2020?
Challenge: Based on current technology, scaling today's systems to an exaflop level would consume more than a gigawatt of power (DOE, 2010) (>$20B/year).
[Figure: Top 500 performance. Source: en.wikipedia.org/wiki/top500]

22 The Hadoop Big Data Platform
Apache Hadoop is one of the most popular open-source software platforms for data-intensive applications. Hadoop runs on commodity clusters with high fault-tolerance. Hadoop's primary modules are MapReduce (MR) and the Hadoop Distributed File System (HDFS). MR implements a high-level, implicit parallel programming model. HDFS provides high-throughput access to big data.
[Figure: Yahoo's Hadoop cluster]

23 MapReduce (MR)
MR was originally implemented as a proprietary product to support Google's needs for large-scale distributed processing of unstructured text data. Since the implementation of MR as a module of the open-source Apache Hadoop platform, MR has been applied to various domains, such as: sets and graphs; AI, machine learning, and data mining; bioinformatics; image and video; evolutionary computing; and statistics and numerical mathematics.

24 The MR Advantage
With MR, users specify serial-only computation in terms of a map method and a reduce method, and the underlying implementation automatically parallelizes the computation, tends to machine failures, and schedules efficient inter-machine communication. MR offers simplicity, built-in fault-tolerance, and scalability.

25 MR Stages

26 MR Streaming versus Standard MR
Algorithms in the standard MR model must be implemented with the MR Java API. Standard MR requires Java expertise and can be difficult for domain scientists to use. I chose to work with the alternative MR streaming model. MR streaming permits algorithm implementation in a variety of languages, including higher-level scripting languages such as Python and Ruby. MR streaming is easier to use but can be less efficient and more in need of optimization.

27 MR Streaming Example (1 of 2)
Input, task 1: to be or
Input, task 2: not to be
Mapper OUT, task 1: (to, 1) (be, 1) (or, 1)
Mapper OUT, task 2: (not, 1) (to, 1) (be, 1)
Reducer IN: (be, 1) (be, 1) (not, 1) (or, 1) (to, 1) (to, 1)
Reducer OUT: (be, 2) (not, 1) (or, 1) (to, 2)

28 MR Streaming Example (2 of 2)
class Mapper:
    method Map ():
        for line in stdin:
            for word in line:
                Emit (key=word, value=1)

class Reducer:
    method Reduce ():
        for (word, value) in stdin:
            if word is same as previous:
                sum += value
            else:
                if previous exists: Emit (previous, sum)
                sum = value    # start the count for the new word
        Emit (word, sum)       # last word
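The streaming word count can be sketched in Python, the language used for the implementations later in this talk. This is a minimal illustration, not the code behind the slides: the function names and the tab-separated record format are assumptions, and `run_wordcount` merely imitates the framework's sort-based shuffle in memory.

```python
def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_records):
    """Reducer: sum the counts of consecutive records that share a key."""
    previous, total = None, 0
    for record in sorted_records:
        word, value = record.rsplit("\t", 1)
        if word == previous:
            total += int(value)
        else:
            if previous is not None:
                yield f"{previous}\t{total}"
            previous, total = word, int(value)
    if previous is not None:
        yield f"{previous}\t{total}"  # last word


def run_wordcount(lines):
    """Stand-in for the framework: map, sort by key (shuffle), reduce."""
    return list(reduce_lines(sorted(map_lines(lines))))
```

In an actual streaming job the mapper and reducer would be separate scripts that read `sys.stdin` and print records, with the framework doing the sorting in between.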

29 Standard MR Example (1 of 2)
Input, task 1: to be or
Input, task 2: not to be
Mapper OUT, task 1: (to, 1) (be, 1) (or, 1)
Mapper OUT, task 2: (not, 1) (to, 1) (be, 1)
Reducer IN: (be, [1, 1]) (not, [1]) (or, [1]) (to, [1, 1])
Reducer OUT: (be, 2) (not, 1) (or, 1) (to, 2)
Grouping of values for the same key is a semantic difference between standard MR (groups) and MR streaming (does not group).

30 Standard MR Example (2 of 2)
class Mapper:
    method Map (key, value):
        for word in value:
            Emit (key=word, value=1)

class Reducer:
    method Reduce (key, list-of-values):
        sum = 0
        for value in list-of-values:
            sum += value
        Emit (key, value=sum)
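For contrast with streaming, the grouped semantics of standard MR can be imitated in a few lines of Python. This is only a sketch: standard Hadoop MR uses a Java API, and the names `map_record`, `reduce_group`, and `run_standard_mr` are hypothetical stand-ins for that API and for the framework's grouping step.

```python
from collections import defaultdict


def map_record(key, value):
    """Map: the input key (a task or line id) is ignored; emit (word, 1)."""
    for word in value.split():
        yield word, 1


def reduce_group(key, values):
    """Reduce: receives a key together with ALL of its values, already grouped."""
    yield key, sum(values)


def run_standard_mr(records):
    """Stand-in for the framework: map, group values by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_record(key, value):
            groups[out_key].append(out_value)
    results = []
    for key in sorted(groups):
        results.extend(reduce_group(key, groups[key]))
    return results
```

Because the values arrive grouped, the reducer needs no "same as previous key" bookkeeping, which is exactly the bookkeeping a streaming reducer has to do itself.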

31 Grid-Based Iterative Models
In recent research, I have explored the potential of MapReduce streaming in the parallel simulation of grid-based iterative models, or grid models for brevity. Examples of grid-based models include iterative relaxation (such as Jacobi relaxation for the Laplace equation) and cellular automata (such as discrete and continuous life). Selected results from this research are outlined in the rest of this presentation.

32 Grid-Based Iterative Models
Informally, a grid model consists of a regular grid of cells that evolves at discrete steps in accordance with a state transition rule. The transition rule determines the next state of each cell as a function of the cell's current state and the current state of its neighborhood. The generic form of the state transition rule is:
S(cell, n+1) = F(S(cell, n), {S(cell', n) | cell' ∈ Neighborhood(cell)})
To specify a particular grid model, one needs to specify a state domain and define particular functions F and Neighborhood.
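The generic transition rule translates almost directly into a serial Python function. The sketch below is an illustration under assumptions: the grid is a dictionary from cell to state, neighbors that fall outside the grid are skipped, and the names `step` and `von_neumann` are chosen here rather than taken from the talk.

```python
def von_neumann(cell):
    """Von Neumann neighborhood of range 1 for a 2D cell (row, col)."""
    row, col = cell
    return [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]


def step(grid, transition, neighborhood):
    """Apply S(cell, n+1) = F(S(cell, n), neighbor states) to every cell.

    grid: dict mapping cell -> state at step n.
    transition: the function F(state, list_of_neighbor_states).
    neighborhood: the function Neighborhood(cell) -> list of cells.
    """
    new_grid = {}
    for cell, state in grid.items():
        neighbor_states = [grid[n] for n in neighborhood(cell) if n in grid]
        new_grid[cell] = transition(state, neighbor_states)
    return new_grid
```

A particular grid model is then obtained by passing a specific `transition` and `neighborhood`, mirroring the statement that a model is fixed by choosing F and Neighborhood.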

33 Grid Models: Laplace Relaxation
Iterative relaxation methods (the simplest being Jacobi iteration) approximate the solutions of elliptic PDEs (the simplest being the Laplace equation). Laplace relaxation (a short term for Jacobi relaxation of the Dirichlet problem for the Laplace equation) is a grid model with the following state transition rule:
S(cell, n+1) = 0.25 * SUM({S(cell', n) | cell' ∈ Neighborhood(cell)})
Neighborhood(cell) generates the Von Neumann neighborhood of range 1. The above is an instance of the generic state transition rule:
S(cell, n+1) = F(S(cell, n), {S(cell', n) | cell' ∈ Neighborhood(cell)})
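A serial Jacobi step makes the rule concrete before it is moved to MR. This is a minimal sketch that assumes a rectangular grid stored as a list of rows whose outermost cells hold the fixed Dirichlet boundary values; `jacobi_step` is a name introduced here for illustration.

```python
def jacobi_step(grid):
    """One Jacobi relaxation step; boundary cells are copied unchanged."""
    rows, cols = len(grid), len(grid[0])
    new_grid = [row[:] for row in grid]  # copy, so that boundary values carry over
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # 0.25 * sum over the Von Neumann neighborhood of range 1
            new_grid[r][c] = 0.25 * (
                grid[r - 1][c] + grid[r + 1][c] + grid[r][c - 1] + grid[r][c + 1]
            )
    return new_grid
```

For example, a 3×3 grid whose top edge is held at 4 and whose other boundary cells are 0 gives the single interior cell the value 0.25 × 4 = 1 after one step.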

34 Grid Models: Laplace Relaxation
Other methods are known to converge much faster than Jacobi relaxation, yet I chose to parallelize Jacobi relaxation because of its simplicity, a quality that makes it appropriate as an initial test bed for distributed MR relaxation techniques. My goal was to explore whether it is at all possible to parallelize such methods with MR and eventually to provide an MR implementation that serves as an existence proof.

35 Laplace and Poisson Equations
The Dirichlet problem for the Laplace equation, Δφ = 0, is the problem of finding a function φ that solves the equation in the interior of a given region D and that is equal on the boundary of D to some given function g, the boundary condition. Intuitively, the Laplace equation defines the temperature equilibrium φ on D for a time-stationary heat flow in D (such that the temperature on D's boundary is time-independent and defined by g). The Laplace equation is a special case of the Poisson equation, Δφ = h.

36 Grid Models: Discrete Life
Discrete life is a cellular automaton with two states, false/true (dead/alive), over an infinite 2D grid. Discrete life's most popular instance is Conway's Game of Life. Discrete life is a grid model with this state transition rule:
S(cell, n+1) = (S(cell, n) and Alive-neighbors(cell, n) ∈ A) or (not S(cell, n) and Alive-neighbors(cell, n) ∈ B)
Alive-neighbors(cell, n) generates the total number of alive neighbors in the Moore neighborhood of range 1. A and B are given sets of integers: A lists the neighbor counts with which a live cell survives, and B the counts with which a dead cell becomes alive. A = {2, 3} and B = {3} define Conway's Game of Life. Again, the above is an instance of the generic state transition rule:
S(cell, n+1) = F(S(cell, n), {S(cell', n) | cell' ∈ Neighborhood(cell)})
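The discrete life rule is short enough to run directly. The sketch below uses the standard Conway rule (a live cell survives with 2 or 3 alive neighbors, a dead cell becomes alive with exactly 3) and represents the infinite grid sparsely as a set of alive cells; that representation and the names `SURVIVE`, `BIRTH`, `moore`, and `life_step` are assumptions made for this example.

```python
SURVIVE = frozenset({2, 3})  # set A: neighbor counts that keep a live cell alive
BIRTH = frozenset({3})       # set B: neighbor counts that bring a dead cell to life


def moore(cell):
    """Moore neighborhood of range 1: the eight surrounding cells."""
    row, col = cell
    return [
        (row + dr, col + dc)
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    ]


def life_step(alive):
    """One generation; `alive` is the set of cells whose state is true."""
    # Only alive cells and their neighbors can be alive in the next generation.
    candidates = set(alive)
    for cell in alive:
        candidates.update(moore(cell))
    next_alive = set()
    for cell in candidates:
        count = sum(1 for n in moore(cell) if n in alive)
        if (cell in alive and count in SURVIVE) or (cell not in alive and count in BIRTH):
            next_alive.add(cell)
    return next_alive
```

A horizontal row of three alive cells (a "blinker") turns vertical after one step and returns to its original shape after two.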

37 Grid Models: Continuous Life
Continuous life is a cellular automaton with continuously valued states from the 0..1 range [Peper et al., 2010]. Continuous life generalizes Conway's discrete Game of Life. Continuous life is a grid model with this state transition rule:
S(cell, n+1) = G(E(S(cell, n) + 2*SUM({S(cell', n) | cell' ∈ Neighborhood(cell)})))
Neighborhood(cell) generates the Moore neighborhood of range 1.
G(z) = b / (b + 1), where b = exp(2*z/T) and T is a parameter.
E(x) = E0 - (x - x0)^2, where E0 and x0 are parameters.
Again, the above is an instance of the generic state transition rule:
S(cell, n+1) = F(S(cell, n), {S(cell', n) | cell' ∈ Neighborhood(cell)})
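The continuous life rule can be sketched with two small functions for E and G. The parameter values below (E0 = 1, x0 = 6, T = 0.3) are illustrative assumptions, not values given in the talk, and the function names are hypothetical. G is evaluated through the identity b / (b + 1) = (1 + tanh(z / T)) / 2, which avoids overflow of exp for large arguments.

```python
import math


def energy(x, e0=1.0, x0=6.0):
    """E(x) = E0 - (x - x0)^2; e0 and x0 are assumed example values."""
    return e0 - (x - x0) ** 2


def squash(z, temperature=0.3):
    """G(z) = b / (b + 1) with b = exp(2*z/T), computed stably via tanh."""
    return 0.5 * (1.0 + math.tanh(z / temperature))


def next_state(state, neighbor_states, e0=1.0, x0=6.0, temperature=0.3):
    """S(cell, n+1) = G(E(S(cell, n) + 2 * SUM(neighbor states)))."""
    x = state + 2.0 * sum(neighbor_states)
    return squash(energy(x, e0, x0), temperature)
```

With these example parameters, a dead cell with three fully alive neighbors gets x = 6 and moves close to 1, while an isolated live cell gets x = 1 and moves close to 0.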

38 Grid Models on MR: Data Representation
For MR simulation, cells can be identified with indexes, such as cell = (row, col) in the 2D case. Recall that both standard MR and MR streaming operate on key-value records. A whole grid can be represented as a set of records {(key=cell, value=state)}, where the cell serves as an MR key and the current state of that cell serves as the MR value.
[Figure: grid cells & states, represented as key-value pairs]

39 Grid Models on MR: Message Passing
A single state transition for the entire grid can be simulated in MR by means of the MR message-passing method: for each input cell, mappers emit sequences of intermediate key-value pairs that are interpreted by a reducer as messages to the cell's neighbors. These messages carry the input cell's contribution to the calculation of its neighbors' next states. I have used message passing to develop basic and optimized MR streaming algorithms for Laplace relaxation, discrete life, and continuous life.

40 MR Laplace Relaxation Example (1 of 2)
For each input record of the form (key=(row, col), value=state), a mapper emits four intermediate records of the form (key=neighbor, value=0.25*state). These four intermediate records are interpreted as messages from the input cell to its neighbors.

41 MR Laplace Relaxation Example (2 of 2)
In the shuffle phase, messages to the same cell are dispatched to the same reducer, and each reducer receives all of its messages in sorted order. This enables the reducer to sum up all of the cell's neighbor contributions and then emit the cell and its newly calculated state.

42 Message-Passing Pattern for Grid Models
class Mapper:
    method Map ():
        for (input-cell, state-of-input-cell) in stdin:
            Emit (input-cell, state-of-input-cell)                  # message to self
            for neighbor-cell in Neighborhood (input-cell):
                Emit (neighbor-cell, contribution-of-input-cell)    # message to neighbor

class Reducer:
    method Reduce ():
        for (input-cell, input-value) in stdin:
            if Current-Input-Cell-Is-Different-From-Previous-Input-Cell ():
                Emit (previous-cell, completed-new-state-of-previous-cell)   # skipped for the first cell
                Initialize-Current-Input-Cell-Processing ()
            else:
                Accumulate-Current-Input-State-Into-Partial-New-State ()
        Emit (last-input-cell, completed-new-state-of-last-input-cell)

43 Basic & Optimized Grid Algorithms
While a basic message-passing algorithm can handle data grids in a fault-tolerant distributed execution, it may also generate a large number of messages to be routed from mapper to reducer tasks. The volume of intermediate data in the MR network can become a performance bottleneck for larger-scale grids and thus offset the benefits of distributed MR execution. I have explored three optimization techniques that help reduce the volume of intermediate messages: local aggregation (LA), strip partitioning (SP), and message packing (MP). The optimization techniques are outlined next.

44 Basic & Optimized Laplace Relaxation
[Figure with sample records in five panels; some neighborhood points are omitted in the samples:
Mapper input, all algorithms.
Mapper output / reducer input, basic algorithm: separate messages from (5 5), (4 5), and (6 5).
Mapper output / reducer input, LA algorithm (reduces the number of messages): aggregated messages from (4 5) and (6 5).
Mapper output / reducer input, LA+SP (preserves data locality): key = strip index, row, col; messages sorted by strip index, row, col.
Mapper output / reducer input, LA+SP+MP (packs to reduce the number of messages): key = strip index; messages sorted by strip index alone, containing unsorted hashes, e.g. 0 {(4,5):0.1,(6,5):0.1,(5,5):0.4} and 1 {(29,5):0.7,(27,5):0.2, ...}]

45 Basic MR Relaxation: Mapper Pseudocode
class Mapper:
    method Map ():
        for line in stdin:
            (cell, in-state) = Parse (line)
            if Is-Boundary (cell):
                Emit (cell, in-state)        # boundary cells keep their state
            out-val = 0.25*in-state
            for neighbor in Neighborhood (cell):
                if Is-Interior (neighbor):
                    Emit (neighbor, out-val)

46 Basic MR Relaxation: Reducer Pseudocode
class Reducer:
    method Reduce ():
        for line in stdin:
            (cell, in-val) = Parse (line)
            if cell is same as previous:
                out-state += in-val
            else:
                if previous exists: Emit (previous, out-state)
                out-state = in-val       # start accumulating for the new cell
        Emit (cell, out-state)           # last cell
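Put together, the basic mapper and reducer perform one relaxation step. The Python sketch below imitates that step in memory; it is an illustration under assumptions (a square grid of side `size` with a one-cell boundary, in-memory sorting in place of the Hadoop shuffle, and hypothetical function names), not the implementation that was timed.

```python
def neighbors(cell):
    """Von Neumann neighborhood of range 1."""
    row, col = cell
    return [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]


def is_interior(cell, size):
    """True for cells strictly inside a size x size grid."""
    row, col = cell
    return 0 < row < size - 1 and 0 < col < size - 1


def map_cells(records, size):
    """Mapper: boundary cells re-emit themselves; every cell messages interior neighbors."""
    for cell, state in records:
        if not is_interior(cell, size):
            yield cell, state  # boundary value is preserved
        for neighbor in neighbors(cell):
            if is_interior(neighbor, size):
                yield neighbor, 0.25 * state


def reduce_messages(sorted_messages):
    """Reducer: sum consecutive messages addressed to the same cell."""
    previous, total = None, 0.0
    for cell, value in sorted_messages:
        if cell == previous:
            total += value
        else:
            if previous is not None:
                yield previous, total
            previous, total = cell, value
    if previous is not None:
        yield previous, total  # last cell


def relaxation_step(grid, size):
    """Stand-in for one MR job: map, sort by key (shuffle), reduce."""
    return dict(reduce_messages(sorted(map_cells(grid.items(), size))))
```

Iterating the simulation means feeding the output grid of one step back in as the input of the next, one MR job per step.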

47 Relaxation with LA: Mapper Pseudocode
class Mapper:
    method Map ():
        hash = {}                            # empty hash: neighbor -> partial sum
        for line in stdin:
            (cell, in-state) = Parse (line)
            if Is-Boundary (cell):
                Emit (cell, in-state)
            out-val = 0.25*in-state
            for neighbor in Neighborhood (cell):
                if Is-Interior (neighbor):
                    hash [neighbor] += out-val   # a missing entry counts as 0
        for cell in hash:
            Emit (cell, value = hash[cell])
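Local aggregation changes only the mapper: contributions to the same neighbor are summed in a hash before anything is emitted. The sketch below is self-contained and makes the same assumptions as a basic mapper would (a square grid of side `size` with a one-cell boundary); the function names are hypothetical.

```python
from collections import defaultdict


def neighbors(cell):
    """Von Neumann neighborhood of range 1."""
    row, col = cell
    return [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]


def is_interior(cell, size):
    """True for cells strictly inside a size x size grid."""
    row, col = cell
    return 0 < row < size - 1 and 0 < col < size - 1


def map_with_local_aggregation(records, size):
    """Mapper with in-mapper aggregation of messages per destination cell."""
    partial_sums = defaultdict(float)  # neighbor -> sum of contributions seen so far
    for cell, state in records:
        if not is_interior(cell, size):
            yield cell, state  # boundary value is preserved
        for neighbor in neighbors(cell):
            if is_interior(neighbor, size):
                partial_sums[neighbor] += 0.25 * state
    # Flush one aggregated message per destination once the input split is consumed.
    for cell, value in partial_sums.items():
        yield cell, value
```

On a 4×4 grid processed by a single mapper, the basic mapper would emit 12 boundary records plus 16 neighbor messages, whereas this version emits 12 boundary records plus only 4 aggregated messages. In a real job each mapper sees only its own input split, so the savings depend on how many neighbors share a split.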

48 Strip Partitioning
The default MR partitioner tends to disperse neighborhoods and reduce data locality, which impairs local in-mapper aggregation and is detrimental to performance. With strip partitioning, a mapper sends whole strips of consecutive grid rows to the same reducer. Therefore, strips of adjacent points remain in the same output file as produced by the reducer. This strategy preserves data locality during iterative simulation and promotes performance. Technically, mappers output intermediate records of the form (key = (strip, row, col), value), where strip is an index that identifies individual strips of grid rows. Given a strip-length parameter, the index can be calculated by integer division as strip = row / strip-length.
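The key construction for strip partitioning is a one-liner; the routing function shown with it is a hypothetical stand-in for whatever partitioner configuration the job uses, included only to show that partitioning on the strip index alone keeps a strip together.

```python
def strip_key(cell, strip_length):
    """Build the composite key (strip, row, col) for a grid cell."""
    row, col = cell
    return (row // strip_length, row, col)  # integer division gives the strip index


def route_to_reducer(key, num_reducers):
    """Hypothetical partitioner: uses the strip index only, ignoring row and col."""
    strip, _row, _col = key
    return strip % num_reducers
```

With a strip length of 10, rows 20 through 29 all map to strip 2 and therefore reach the same reducer, and sorting the composite keys orders messages by strip, then row, then column.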

49 Performance: Empirical Evaluation
We empirically evaluated the performance effects of the LA, SP, and MP optimization techniques. To do so, we developed eight algorithms, as outlined in a table.
[Table: columns Basic, LA, LA+SP, LA+SP+MP; rows Laplace, Discrete life, Continuous life. Discrete life and continuous life are marked with two variants each; the column positions of the marks are not recoverable from the transcription.]
We implemented the algorithms in MR streaming with Python, then executed and timed the implementations on Amazon's Elastic MR Cloud with Amazon's Hadoop distribution.

50 Performance: Laplace Relaxation
Experiments were performed on Amazon's Elastic MapReduce Cloud on a cluster of 10 large instances, with grids of 10^8 points.
Optimizations: local aggregation (LA), strip partitioning (SP), and message packing (MP).

51 Performance: Life Simulation
Experiments were performed on Amazon's Elastic MapReduce Cloud on clusters of i large instances (i = 1, 2, 4, 8, 16), over randomly generated square grids of approximately 16*10^7 cells.
[Figure: execution time in minutes versus number of nodes]
Optimizations: local aggregation (LA) and strip partitioning (SP).

52 Hadoop and MPI
In comparison with MPI, Hadoop MR trades speed for convenience, including ease of use and fault-tolerance.
Speed: In general, MPI can perform faster than Hadoop MR. While MPI is oriented towards memory-to-memory operations, Hadoop is oriented to file operations through its redundancy-based DFS. All intermediate key-value records always go to DFS. This difference makes Hadoop more robust and flexible but less efficient than MPI because of Hadoop's use of the DFS as a message-passing medium.

53 Hadoop and MPI
Ease of use: The MPI user needs to express parallelism explicitly. The Hadoop MR user develops only sequential code and leaves parallelization to Hadoop. MR streaming takes usability one step further by enabling the user to select the programming language and development tools, instead of requiring the rigid Java API of standard MR. The price of this flexibility is that MR streaming is slower than standard MR.

54 Hadoop and MPI
Fault tolerance: By default, an MPI application terminates upon a failure of a single process (unless the programmer develops custom fault-tolerance by means of checkpoints and intercommunicators). By contrast, Hadoop automatically detects failed worker nodes and resubmits incomplete work to live worker nodes, thus providing built-in fault-tolerance.

55 Hadoop and MPI
According to Ding et al. [2011], MR streaming is slower than standard MR, and both are considerably slower than MPI. Ding et al. timed word count, grep, sort, and Mandelbrot, and established that:
Standard MR executes with consistently higher overhead compared with MPI (30%, 529%, 227%, and 550% for word count, grep, sort, and Mandelbrot, respectively).
MR streaming has even higher overhead compared with MPI (153%, 734%, 553%, and 791% for word count, grep, sort, and Mandelbrot, respectively).
The inefficiency of MR streaming in comparison to standard MR is attributed to the inefficient use of Unix pipes in the Hadoop streaming implementation.

56 Hadoop and MPI
The significant performance advantage of MPI over standard MR as implemented in Hadoop has also been confirmed by Fox [2009] and by Ekanayake [2010] in empirical studies of a k-means clustering application, and by Ekanayake et al. [2008] for High Energy Physics data analysis. In general, MPI can be expected to significantly outperform Hadoop MR in iterative processing with frequent communication over data sets that fit in the cluster memory.

57 Post-Hadoop MR Frameworks
Post-Hadoop MR frameworks replace the DFS as the medium for inter-task and inter-job communication with alternative, faster communication media, such as a network file system (NFS) or memory. Examples of such frameworks include MARIANE (MApReduce Implementation Adapted for HPC Environments), CGL-MapReduce, Spark, M3R, M3, GridGain, iMapReduce, Twister, and Phoenix. Such frameworks trade the DFS-supported fault-tolerance for speed, in some cases introducing alternative fault-tolerance mechanisms. For example, CGL-MapReduce is reported to perform close to MPI for large data sets in the case of k-means clustering and matrix multiplication.

58 Post-Hadoop: In-Memory MR
In-memory MR implementations such as Spark, M3R, M3, and GridGain bypass the DFS to support in-memory processing. For problems that fit completely in available main memory, this approach can be expected to perform significantly better than standard MR. Spark, for example, is reported to perform word count up to 100 times faster than Hadoop. M3R can run Hadoop jobs unchanged, and run them considerably faster than the Hadoop engine itself, e.g., 45 times faster on several workloads for sparse matrix-vector multiplication.

59 Conclusions
To the best of our knowledge, we are the first to propose and evaluate MR streaming algorithms for grid models in general, and strip partitioning in particular. Our work on distributed MR relaxation and life simulation builds on MR message-passing ideas originally introduced for data-intensive graph algorithms, such as PageRank [Lin & Schatz, 2010]. Local in-mapper aggregation was originally proposed for data-intensive text processing, and MR message packing was first used in word co-occurrence counting [Lin & Dyer, 2010].

60 Conclusions
In MR computing, convenience comes at the cost of performance: MR streaming is easy to use with Hadoop, but Hadoop's streaming performance can be inadequate in scientific computing. As a new generation of MR frameworks is emerging, our MR streaming algorithms can be adapted to higher-performance MR parallelism models. In particular, converting our message-passing algorithms to the Spark model and empirically evaluating them in Spark execution is a plausible project.

61 Conclusions
In conclusion, we agree that there is a clear algorithmic challenge to design more loosely coupled algorithms that are compatible with the "map followed by reduce" MR parallelism model, and more generally, to design algorithms compatible with the structure of clouds [Fox, 2010]. Although the trend may be quiet and distributed across only a relative few supercomputing sites, Hadoop and HPC are already hopping hand-in-hand more frequently [Hemsoth, 2014].

62 Selected References
Radenski, A., B. Norris. MapReduce streaming algorithms for Laplace relaxation on the cloud. In Parallel Computing: Accelerating Computational Science and Engineering, Advances in Parallel Computing 25, IOS Press, 2014 (in print).
Radenski, A. Using MapReduce Streaming for Distributed Life Simulation on the Cloud. In Advances in Artificial Life, ECAL 2013, MIT Press, 2013.
Radenski, A., L. Ehwerhemuepha. Speeding-Up Codon Analysis on the Cloud with Local MapReduce Aggregation. Information Sciences, Elsevier, 263 (2014).



More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information
