General purpose Distributed Computing using a High level Language. Michael Isard

Size: px
Start display at page:

Download "General purpose Distributed Computing using a High level Language. Michael Isard"

Transcription

1 Dryad and DryadLINQ General purpose Distributed Computing using a High level Language Michael Isard Microsoft Research Silicon Valley

2 Distributed Data Parallel Computing Workloads beyond standard SQL, HPC Data mining, graph analysis, Complex, long lived lived application software Cloud (shared clusters) Transparent scaling Resource virtualization Commodity hardware Fault tolerance with good performance

3 Talk overview Part I High level language: LINQ Computational model: DAG Execution layer: Dryad+Quincy Part tii Dryad systems issues Comparison with MapReduce DryadLINQ demo

4 LINQ Microsoft s Language INtegrated Query Operators to manipulate datasets in.net Dataset is a first class abstraction Select, Join, GroupBy, Aggregate, etc. Set at a time, instead of looping over Object at a time Integrated into.net programming languages Programscan call operators Operators can invoke arbitrary.net functions Data model Data elements are strongly typed.net objects Much more expressive than SQL tables Extensible Add new operators and implementations

5 Aggregating partial sums class PartialSum { public int sum; public int count; }; static double MergeSums(PartialSum[] sums) { int totalsum = 0, totalcount = 0; int i; for (i = 0; i < sums.length; ++i) { totalsum += sums[i].sum; totalcount += sums[i].count; } return (double) totalsum / (double) totalcount; }

6 Aggregating partial sums class PartialSum { public int sum; public int count; }; static double MergeSums(PartialSum[] sums) { return (double) sums.select(x => x.sum).sum() xsum)sum() / (double) sums.select(x => x.count).sum(); }

7 Convenient syntax var words = tableoflines.selectmany(l => l.split(' ')).GroupBy(w => w);

8 Convenient syntax var words = tableoflines.selectmany(l => l.split(' ')).GroupBy(w => w); IQueryable<IGrouping<string,string>> words = tableoflines.selectmany(mysplitfunction).groupby(mystringidentity); IEnumerable<string> mysplitfunction(string line) { return line.split(' '); } string mystringidentity(string word) { return word; }

9 K means algorithm

10 K means algorithm

11 K means algorithm

12 K means algorithm

13 K means helper functions class Vector { } Vector Mean(IEnumerable<Vector> set) { } Vector sum = set.aggregate( (x, y) => x + y ); return sum / set.count(); Vector NearestNeighbor(Vector vect, IEnumerable<Vector> set) { return set.min( e => (e vect).l2norm() ); }

14 K means algorithm IEnumerable<Vector> kmeansstep(ienumerable<vector> vectors, } var clusters = vectors.groupby( IEnumerable<Vector> centers) { vector => NearestNeighbor(vector, centers).vectorid); return clusters.select(cluster => Mean(cluster)); IEnumerable<Vector> kmeans(ienumerable<vector> vectors, IEnumerable<Vector> centers) { for (int i = 0; i< iterations; i++) centers = kmeansstep(vectors, centers); return centers; }

15 Data mining, machine learning, Decision tree training SVD Power iteration i (PageRank) Image feature extraction/indexing/clustering Network trace analysis Light field simulation

16 Talk overview Part I High level language: LINQ Computational model: DAG Execution layer: Dryad+Quincy Part tii Dryad systems issues Comparison with MapReduce DryadLINQ demo

17 Computational model: DAG Distributed processing Partition computation across cores/cluster Minimize communication overhead Directed acyclic graph Edge is finite sequence of data items Vertex is computation over input edge sequences

18 DAG abstraction Explicit dataflow Exposes dependencies within computation Absence of cycles Allows re execution for fault tolerance Simplifies scheduling: no deadlock Cycles can often be replaced by unrolling Unsuitable for fine grain inner loops Very popular p Databases, functional languages,

19 Map Independent transformation of dataset for each x in S, output x = f(x) Eg E.g. simple grep for word w output line x only if x contains w

20 Map Independent transformation of dataset for each x in S, output x = f(x) Eg E.g. simple grep for word w output line x only if x contains w S f S

21 Map Independent transformation of dataset for each x in S, output x = f(x) Eg E.g. simple grep for word w output line x only if x contains w S 1 f S 1 S 2 f S 2 S 3 f S 3

22 Reduce Grouping plus aggregation 1) Group x in S according to key selector k(x) 2) For each group g, output r(g) E.g. simple word count group by k(x) () = x for each group g output key (word) and count of g

23 Reduce Grouping plus aggregation 1) Group x in S according to key selector k(x) 2) For each group g, output r(g) E.g. simple word count group by k(x) () = x for each group g output key (word) and count of g S G r S

24 Reduce S G r S

25 Reduce S 1 G r S S 2 S 3

26 Reduce S 1 D G r S 1 S 2 D G r S 2 S 3 D G r S 3 3 D is distribute, eg e.g. by hash or range

27 Reduce S 1 G ir D G r S 1 S 2 G ir D G r S 2 S 3 G ir D 3 ir is initial reduce, eg e.g. compute a partial sum

28 K means C 0 ac cc C 1 P ac cc C 2 ac cc C 3

29 K means C 0 P 1 ac ac ac cc C 1 P 2 P 3 ac 3 ac ac cc C 2 ac ac ac cc C 3

30 PageRank N 0 1 N 0 2 ae ae ae D D D cc cc cc N 1 1 N 1 2 N 1 3 N 0 3 E 1 E 2 ae ae cc D cc N2 1 D ae D cc N 2 2 N 2 3 E 3 ae ae ae D D cc cc N 3 1 N 3 2 D cc N 3 3

31 Distributed Word Count Count word frequency in a set of documents: var words = docs.selectmany(doc => doc.words); var groups = words.groupby(word => word); var counts = groups.select(g => new WordCount(g.Key, g.count())); IN SM GB S OUT doc => word => g => doc.words word new

32 Execution Plan for Word Count SM SelectMany SM Q GB sort groupby pipelined pp GB S (1) C D count distribute MS mergesort GB groupby pipelined Sum Sum

33 Execution Plan for Word Count SM SM SM SM SM Q GB Q GB Q GB Q GB GB S (1) C D (2) C D C D C D MS GB MS GB MS GB Sum Sum Sum

34 Talk overview Part I High level language: LINQ Computational model: DAG Execution layer: Dryad+Quincy Part tii Dryad systems issues Comparison with MapReduce DryadLINQ demo

35 Dryad General purposeexecution execution engine Batch processing on immutable datasets Well tested on largeclusters Automatically handles Fault tolerance Distribution of code and intermediate data Scheduling of work to resources

36 Fault tolerance Buffer data in (some) edges Re execute on failure using buffered data Speculatively l re execute for stragglers

37 Rewrite graph at runtime Loop unrolling with convergence tests Adapt partitioning scheme at run time Choose #partitions based on runtime data dt volume Broadcast Join vs. Hash Join, etc. Adaptive aggregation and distribution trees Based on data skew and network topology Load balancing Data/processing skew (cf work stealing)

38 Dryad System Architecture Scheduler R

39 Dryad System Architecture Scheduler R R

40 Quincy DAG Scheduler Data locality and fairness (SLAs) SOSP 2009

41 Production system Dryadwell tested tested, scalable Daily use supporting Bing for over 3 years Clusters with >10k computers Applicable to large number of computations 250 computer cluster at MSR SVC, Mar >Nov 09 15k jobs (tens of millions of processes executed) Hundreds d of distinct t programs Network trace analysis, privacy preserving inference, lighttransport simulation, decision tree training, deep belief network training, image feature extraction,

42 Conclusion DryadLINQ supports many computations Easy to use, flexible DAG structured jobs scale to large clusters Transient failures common, disk failures daily Publically available for download

43 Talk overview Part I High level language: LINQ Computational model: DAG Execution layer: Dryad+Quincy Part tii Dryad systems issues Comparison with MapReduce DryadLINQ demo

44 Dryad Inputs and Outputs Partitioned data set Records do not cross partition boundaries Data on compute machines: NTFS, SQLServer, Optional semantics Hash partition, range partition, sorted, etc. Loading external data Partitioning automatic File system chooses sensible partition sizes Or known partitioning from user

45 Partitioning driven by data S 1 G S 2 G S 3 G S 4 G S 5 G S 6 G

46 Partitioning driven by data S 1 G ir D S 2 G r S 1 S 3 G ir D S 4 G r S 2 S 5 G ir D S 6

47 Push vs Pull Databases typically pull using iterator model Avoids buffering Canprevent unnecessary computation But DAG must be fully materialized Complicates rewriting Prevents resource virtualization in shared cluster S 1 D G r S 1 S 2 D G r S 2

48 Channel abstraction S 1 G ir D G r S 1 S 2 G ir D G r S 2 S 3 G ir D 3

49 Push vs Pull Channel types define connected component Shared memory or TCP must be gang scheduled Pull within gang, push between gangs

50 Fault tolerance Buffer data in (some) edges Re execute on failure using buffered data Speculatively l re execute for stragglers Push model makes this very simple

51 DryadLINQ Internals Distributed execution plan Static optimizations: pipelining, eager aggregation, etc. Dynamic optimizations: data dependent partitioning, dynamic aggregation, etc. Automatic code generation Vertex code that runs on vertices Channel serialization ili i code Callback code for runtime optimizations Automatically distributed to cluster machines Separate LINQ query from its local context Distribute referenced objects to cluster machines Distribute application DLLs to cluster machines

52 Decomposable Functions Roughly, a function H is decomposable if it can be expressed as composition of two functions IR and C such that IR is commutative C is commutative and associative Some decomposable functions Sum: IR = Sum, C = Sum Count: IR = Count, C = Sum OrderBy.Take: IR = OrderBy.Take, C = SelectMany.OrderBy.Take

53 Two Key Questions How do we decompose a function? Two interfaces: iterator and accumulator Choice of interfaces can have significant impact on performance How do we deal with user defined functions? Try to infer automatically Provide a good annotation tti mechanism

54 Iterator Interface in DryadLINQ M M M [Decomposable("InitialReduce", lr " "Combine")] ")] public static IntPair SumAndCount(IEnumerable<int> g) { return new IntPair(g.Sum(), g.count()); } G1 IR D G1 IR D G1 IR D public static IntPair InitialReduce(IEnumerable<int> g) { return rn new IntPair(g.Sum(), g.count()); } public static IntPair Combine(IEnumerable<IntPair> g) { return new IntPair(g.Select(x => x.first).sum(), g.select(x => x.second).sum()); } MG G3 F MG G3 F MG G2 C MG G2 C X X

55 Accumulator Interface in DryadLINQ [Decomposable("Initialize", "Iterate", "Merge")] public static IntPair SumAndCount(IEnumerable<int> g) { return new IntPair(g.Sum(), g.count()); } public static IntPair Initialize() { return new IntPair(0, 0); } M G1 IR D M G1 IR D M G1 IR D MG MG public static IntPair Iterate(IntPair x, int r) { x.first += r; x.second += 1; return x; } public static IntPair Merge(IntPair x, IntPair o) { x.first += o.first; x.second += o.second; return x; } MG G3 F X MG G3 F X G2 C G2 C

56 G1+IR and G2+C Iterator PartialSort Keep only a fixed number of chunks in memory Chunks are processed in parallel: sorted, grouped, reduced by IR or C, and emitted G3+F Read the entire input into memory, perform a parallel sort, and apply F to each group Observations G1+IR can always a bepipelined with upstream G3+F can often be pipelined with downstream G1+IR may havepoor data reduction PartialSort is the closest to MapReduce

57 Accumulator FullHash G1+IR, G2+C, and G3+F Build an in memory parallel hash table: oneaccumulator object/key Each input record is accumulated into its accumulator object, and then discarded Output the hash table when all records are processed Observations Optimal data reduction for G1+IR Memory usage proportional to the number of unique keys, not records So, we by default enable upstream and downstream pipelining Used by DB2 and Oracle

58 Talk overview Part I High level language: LINQ Computational model: DAG Execution layer: Dryad+Quincy Part tii Dryad systems issues Comparison with MapReduce DryadLINQ demo

59 MapReduce (Hadoop) MapReduce restricts Topology of DAG Semantics of function in compute vertex Sequence of instances for non trivial tasks S 1 f G ir D G r S 1 1 S 2 f G ir D G r S 2 S 3 f G ir D

60 MapReduce language complexity Simple to describe MapReduce model Can be hard to map algorithm to framework cf k means: combine C+P, CPbroadcast tc, iterate, t HIVE, PigLatin etc. mitigate programming issues

61 MapReduce system complexity Simple to describe MapReduce system Implementation not uniform Different fault tolerance t l for mappers, reducers Add more special cases for performance Hd Hadoop introducing i TCP channels, pipelines, Dryad has same state machine everywhere

62 DryadLINQ demo

63 Conclusions High level language is good For ease of use, maintainability, expressiveness Computational abstraction is important Suitable target for compiler, not developer Common patterns should be efficient i Optimization should be easy LINQ is a pretty good language abstraction bt ti DAG is a very good computational model

Distributed Aggregation in Cloud Databases. By: Aparna Tiwari tiwaria@umail.iu.edu

Distributed Aggregation in Cloud Databases. By: Aparna Tiwari tiwaria@umail.iu.edu Distributed Aggregation in Cloud Databases By: Aparna Tiwari tiwaria@umail.iu.edu ABSTRACT Data intensive applications rely heavily on aggregation functions for extraction of data according to user requirements.

More information

A Software Stack for Distributed Data-Parallel Computing. Yuan Yu Microsoft Research Silicon Valley

A Software Stack for Distributed Data-Parallel Computing. Yuan Yu Microsoft Research Silicon Valley A Software Stack for Distributed Data-Parallel Computing Yuan Yu Microsoft Research Silicon Valley The Programming Model Provide a sequential, single machine programming abstraction Same program runs on

More information

Spark. Fast, Interactive, Language- Integrated Cluster Computing

Spark. Fast, Interactive, Language- Integrated Cluster Computing Spark Fast, Interactive, Language- Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica UC

More information

Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations

Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations Yuan Yu Microsoft Research 1065 La Avenida Ave. Mountain View, CA 94043 yuanbyu@microsoft.com Pradeep Kumar Gunda Microsoft

More information

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

More information

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person

More information

16.1 MAPREDUCE. For personal use only, not for distribution. 333

16.1 MAPREDUCE. For personal use only, not for distribution. 333 For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

More information

Systems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de

Systems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de Systems Engineering II Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de About me! Since May 2015 2015 2012 Research Group Leader cfaed, TU Dresden PhD Student MPI- SWS Research Intern Microsoft

More information

Programming Abstractions and Security for Cloud Computing Systems

Programming Abstractions and Security for Cloud Computing Systems Programming Abstractions and Security for Cloud Computing Systems By Sapna Bedi A MASTERS S ESSAY SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty

More information

Cloud Computing. Up until now

Cloud Computing. Up until now Cloud Computing Lecture 19 Cloud Programming 2011-2012 Up until now Introduction, Definition of Cloud Computing Pre-Cloud Large Scale Computing: Grid Computing Content Distribution Networks Cycle-Sharing

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2015/16 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora {mbalassi, gyfora}@apache.org The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

The Stratosphere Big Data Analytics Platform

The Stratosphere Big Data Analytics Platform The Stratosphere Big Data Analytics Platform Amir H. Payberah Swedish Institute of Computer Science amir@sics.se June 4, 2014 Amir H. Payberah (SICS) Stratosphere June 4, 2014 1 / 44 Big Data small data

More information

Big Data Processing with Google s MapReduce. Alexandru Costan

Big Data Processing with Google s MapReduce. Alexandru Costan 1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:

More information

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

Apache Flink. Fast and Reliable Large-Scale Data Processing

Apache Flink. Fast and Reliable Large-Scale Data Processing Apache Flink Fast and Reliable Large-Scale Data Processing Fabian Hueske @fhueske 1 What is Apache Flink? Distributed Data Flow Processing System Focused on large-scale data analytics Real-time stream

More information

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

More information

Big Data looks Tiny from the Stratosphere

Big Data looks Tiny from the Stratosphere Volker Markl http://www.user.tu-berlin.de/marklv volker.markl@tu-berlin.de Big Data looks Tiny from the Stratosphere Data and analyses are becoming increasingly complex! Size Freshness Format/Media Type

More information

Introduction to Parallel Programming and MapReduce

Introduction to Parallel Programming and MapReduce Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant

More information

Apache Hadoop. Alexandru Costan

Apache Hadoop. Alexandru Costan 1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open

More information

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing /35 Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing Zuhair Khayyat 1 Karim Awara 1 Amani Alonazi 1 Hani Jamjoom 2 Dan Williams 2 Panos Kalnis 1 1 King Abdullah University of

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

A Brief Introduction to Apache Tez

A Brief Introduction to Apache Tez A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value

More information

MapReduce (in the cloud)

MapReduce (in the cloud) MapReduce (in the cloud) How to painlessly process terabytes of data by Irina Gordei MapReduce Presentation Outline What is MapReduce? Example How it works MapReduce in the cloud Conclusion Demo Motivation:

More information

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. 2015 The MathWorks, Inc. 1 Challenges of Big Data Any collection of data sets so large and complex that it becomes difficult

More information

Unified Big Data Analytics Pipeline. 连 城 lian@databricks.com

Unified Big Data Analytics Pipeline. 连 城 lian@databricks.com Unified Big Data Analytics Pipeline 连 城 lian@databricks.com What is A fast and general engine for large-scale data processing An open source implementation of Resilient Distributed Datasets (RDD) Has an

More information

Big Data on Microsoft Platform

Big Data on Microsoft Platform Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4

More information

Lecture 10 - Functional programming: Hadoop and MapReduce

Lecture 10 - Functional programming: Hadoop and MapReduce Lecture 10 - Functional programming: Hadoop and MapReduce Sohan Dharmaraja Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 1 / 41 For today Big Data and Text analytics Functional

More information

Bringing Big Data Modelling into the Hands of Domain Experts

Bringing Big Data Modelling into the Hands of Domain Experts Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks david.willingham@mathworks.com.au 2015 The MathWorks, Inc. 1 Data is the sword of the

More information

Challenges for Data Driven Systems

Challenges for Data Driven Systems Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2

More information

Spark: Cluster Computing with Working Sets

Spark: Cluster Computing with Working Sets Spark: Cluster Computing with Working Sets Outline Why? Mesos Resilient Distributed Dataset Spark & Scala Examples Uses Why? MapReduce deficiencies: Standard Dataflows are Acyclic Prevents Iterative Jobs

More information

MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context

MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context MapReduce Jeffrey Dean and Sanjay Ghemawat Background context BIG DATA!! o Large-scale services generate huge volumes of data: logs, crawls, user databases, web site content, etc. o Very useful to be able

More information

Mining Large Datasets: Case of Mining Graph Data in the Cloud

Mining Large Datasets: Case of Mining Graph Data in the Cloud Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large

More information

Cosmos. Big Data and Big Challenges. Ed Harris - Microsoft Online Services Division HTPS 2011

Cosmos. Big Data and Big Challenges. Ed Harris - Microsoft Online Services Division HTPS 2011 Cosmos Big Data and Big Challenges Ed Harris - Microsoft Online Services Division HTPS 2011 1 Outline Introduction Cosmos Overview The Structured s Project Conclusion 2 What Is COSMOS? Petabyte Store and

More information

Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce

Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Mohammad Farhan Husain, Pankil Doshi, Latifur Khan, and Bhavani Thuraisingham University of Texas at Dallas, Dallas TX 75080, USA Abstract.

More information

Map-Reduce for Machine Learning on Multicore

Map-Reduce for Machine Learning on Multicore Map-Reduce for Machine Learning on Multicore Chu, et al. Problem The world is going multicore New computers - dual core to 12+-core Shift to more concurrent programming paradigms and languages Erlang,

More information

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas Apache Flink Next-gen data analysis Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research

More information

Lecture Data Warehouse Systems

Lecture Data Warehouse Systems Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores

More information

Using distributed technologies to analyze Big Data

Using distributed technologies to analyze Big Data Using distributed technologies to analyze Big Data Abhijit Sharma Innovation Lab BMC Software 1 Data Explosion in Data Center Performance / Time Series Data Incoming data rates ~Millions of data points/

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

A Comparison of Approaches to Large-Scale Data Analysis

A Comparison of Approaches to Large-Scale Data Analysis A Comparison of Approaches to Large-Scale Data Analysis Sam Madden MIT CSAIL with Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, and Michael Stonebraker In SIGMOD 2009 MapReduce

More information

Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

More information

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University 2014-05-12

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University 2014-05-12 Introduction to NoSQL Databases and MapReduce Tore Risch Information Technology Uppsala University 2014-05-12 What is a NoSQL Database? 1. A key/value store Basic index manager, no complete query language

More information

http://www.wordle.net/

http://www.wordle.net/ Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely

More information

Hadoop-BAM and SeqPig

Hadoop-BAM and SeqPig Hadoop-BAM and SeqPig Keijo Heljanko 1, André Schumacher 1,2, Ridvan Döngelci 1, Luca Pireddu 3, Matti Niemenmaa 1, Aleksi Kallio 4, Eija Korpelainen 4, and Gianluigi Zanetti 3 1 Department of Computer

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Matteo Migliavacca (mm53@kent) School of Computing Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Simple past - Traditional

More information

Information Processing, Big Data, and the Cloud

Information Processing, Big Data, and the Cloud Information Processing, Big Data, and the Cloud James Horey Computational Sciences & Engineering Oak Ridge National Laboratory Fall Creek Falls 2010 Information Processing Systems Model Parameters Data-intensive

More information

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763 International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

Cosmos. Big Data and Big Challenges. Pat Helland July 2011

Cosmos. Big Data and Big Challenges. Pat Helland July 2011 Cosmos Big Data and Big Challenges Pat Helland July 2011 1 Outline Introduction Cosmos Overview The Structured s Project Some Other Exciting Projects Conclusion 2 What Is COSMOS? Petabyte Store and Computation

More information

GraySort on Apache Spark by Databricks

GraySort on Apache Spark by Databricks GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner

More information

Hadoop & Spark Using Amazon EMR

Hadoop & Spark Using Amazon EMR Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?

More information

Big Data and Scripting map/reduce in Hadoop

Big Data and Scripting map/reduce in Hadoop Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information

Scaling Out With Apache Spark. DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf

Scaling Out With Apache Spark. DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf Scaling Out With Apache Spark DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf Your hosts Mathijs Kattenberg Technical consultant Jeroen Schot Technical consultant

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Google Cloud Data Platform & Services. Gregor Hohpe

Google Cloud Data Platform & Services. Gregor Hohpe Google Cloud Data Platform & Services Gregor Hohpe All About Data We Have More of It Internet data more easily available Logs user & system behavior Cheap Storage keep more of it 3 Beyond just Relational

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

More information

Large-Scale Data Processing

Large-Scale Data Processing Large-Scale Data Processing Eiko Yoneki eiko.yoneki@cl.cam.ac.uk http://www.cl.cam.ac.uk/~ey204 Systems Research Group University of Cambridge Computer Laboratory 2010s: Big Data Why Big Data now? Increase

More information

Data Management in the Cloud

Data Management in the Cloud Data Management in the Cloud Ryan Stern stern@cs.colostate.edu : Advanced Topics in Distributed Systems Department of Computer Science Colorado State University Outline Today Microsoft Cloud SQL Server

More information

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

More information

Cloud Computing at Google. Architecture

Cloud Computing at Google. Architecture Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale

More information

09336863931 : provid.ir

09336863931 : provid.ir provid.ir 09336863931 : NET Architecture Core CSharp o Variable o Variable Scope o Type Inference o Namespaces o Preprocessor Directives Statements and Flow of Execution o If Statement o Switch Statement

More information

Click Stream Data Analysis Using Hadoop

Click Stream Data Analysis Using Hadoop Governors State University OPUS Open Portal to University Scholarship Capstone Projects Spring 2015 Click Stream Data Analysis Using Hadoop Krishna Chand Reddy Gaddam Governors State University Sivakrishna

More information

What Is Datacenter (Warehouse) Computing. Distributed and Parallel Technology. Datacenter Computing Application Programming

What Is Datacenter (Warehouse) Computing. Distributed and Parallel Technology. Datacenter Computing Application Programming Distributed and Parallel Technology Datacenter and Warehouse Computing Hans-Wolfgang Loidl School of Mathematical and Computer Sciences Heriot-Watt University, Edinburgh 0 Based on earlier versions by

More information

Introduction to MapReduce and Hadoop

Introduction to MapReduce and Hadoop Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology jie.tao@kit.edu Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process

More information

From GWS to MapReduce: Google s Cloud Technology in the Early Days

From GWS to MapReduce: Google s Cloud Technology in the Early Days Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop

More information

Resource Scalability for Efficient Parallel Processing in Cloud

Resource Scalability for Efficient Parallel Processing in Cloud Resource Scalability for Efficient Parallel Processing in Cloud ABSTRACT Govinda.K #1, Abirami.M #2, Divya Mercy Silva.J #3 #1 SCSE, VIT University #2 SITE, VIT University #3 SITE, VIT University In the

More information

Massive Cloud Auditing using Data Mining on Hadoop

Massive Cloud Auditing using Data Mining on Hadoop Massive Cloud Auditing using Data Mining on Hadoop Prof. Sachin Shetty CyberBAT Team, AFRL/RIGD AFRL VFRP Tennessee State University Outline Massive Cloud Auditing Traffic Characterization Distributed

More information

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

Map Reduce / Hadoop / HDFS

Map Reduce / Hadoop / HDFS Chapter 3: Map Reduce / Hadoop / HDFS 97 Overview Outline Distributed File Systems (re-visited) Motivation Programming Model Example Applications Big Data in Apache Hadoop HDFS in Hadoop YARN 98 Overview

More information

Big Data Data-intensive Computing Methods, Tools, and Applications (CMSC 34900)

Big Data Data-intensive Computing Methods, Tools, and Applications (CMSC 34900) Big Data Data-intensive Computing Methods, Tools, and Applications (CMSC 34900) Ian Foster Computation Institute Argonne National Lab & University of Chicago 2 3 SQL Overview Structured Query Language

More information

MAPREDUCE Programming Model

MAPREDUCE Programming Model CS 2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application MapReduce

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com Data Warehousing and Analytics Infrastructure at Facebook Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com Overview Challenges in a Fast Growing & Dynamic Environment Data Flow Architecture,

More information

Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010

Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010 Hadoop s Entry into the Traditional Analytical DBMS Market Daniel Abadi Yale University August 3 rd, 2010 Data, Data, Everywhere Data explosion Web 2.0 more user data More devices that sense data More

More information

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15 Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 15 Big Data Management V (Big-data Analytics / Map-Reduce) Chapter 16 and 19: Abideboul et. Al. Demetris

More information

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical Identify a problem Review approaches to the problem Propose a novel approach to the problem Define, design, prototype an implementation to evaluate your approach Could be a real system, simulation and/or

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With

More information

Introduction to NoSQL Databases. Tore Risch Information Technology Uppsala University 2013-03-05

Introduction to NoSQL Databases. Tore Risch Information Technology Uppsala University 2013-03-05 Introduction to NoSQL Databases Tore Risch Information Technology Uppsala University 2013-03-05 UDBL Tore Risch Uppsala University, Sweden Evolution of DBMS technology Distributed databases SQL 1960 1970

More information

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:

More information

Outline. Motivation. Motivation. MapReduce & GraphLab: Programming Models for Large-Scale Parallel/Distributed Computing 2/28/2013

Outline. Motivation. Motivation. MapReduce & GraphLab: Programming Models for Large-Scale Parallel/Distributed Computing 2/28/2013 MapReduce & GraphLab: Programming Models for Large-Scale Parallel/Distributed Computing Iftekhar Naim Outline Motivation MapReduce Overview Design Issues & Abstractions Examples and Results Pros and Cons

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

DRIVING INNOVATION THROUGH DATA ACCELERATING BIG DATA APPLICATION DEVELOPMENT WITH CASCADING

DRIVING INNOVATION THROUGH DATA ACCELERATING BIG DATA APPLICATION DEVELOPMENT WITH CASCADING DRIVING INNOVATION THROUGH DATA ACCELERATING BIG DATA APPLICATION DEVELOPMENT WITH CASCADING Supreet Oberoi VP Field Engineering, Concurrent Inc GET TO KNOW CONCURRENT Leader in Application Infrastructure

More information

Building Scalable Applications Using Microsoft Technologies

Building Scalable Applications Using Microsoft Technologies Building Scalable Applications Using Microsoft Technologies Padma Krishnan Senior Manager Introduction CIOs lay great emphasis on application scalability and performance and rightly so. As business grows,

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design

More information

Performance Evaluation of LINQ to HPC and Hadoop for Big Data

Performance Evaluation of LINQ to HPC and Hadoop for Big Data UNF Digital Commons UNF Theses and Dissertations Student Scholarship 2013 Performance Evaluation of LINQ to HPC and Hadoop for Big Data Ravishankar Sivasubramaniam University of North Florida Suggested

More information

HiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group

HiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group HiBench Introduction Carson Wang (carson.wang@intel.com) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

JetScope: Reliable and Interactive Analytics at Cloud Scale

JetScope: Reliable and Interactive Analytics at Cloud Scale JetScope: Reliable and Interactive Analytics at Cloud Scale Eric Boutin, Paul Brett, Xiaoyu Chen, Jaliya Ekanayake Tao Guan, Anna Korsun, Zhicheng Yin, Nan Zhang, Jingren Zhou Microsoft ABSTRACT Interactive,

More information