web-scale data processing Christopher Olston and many others Yahoo! Research


1 web-scale data processing Christopher Olston and many others Yahoo! Research

2 Motivation
Projects increasingly revolve around analysis of big data sets:
  extracting structured data, e.g. face detection
  understanding complex large-scale phenomena:
    social systems (e.g. user-generated web content)
    economic systems (e.g. advertising markets)
    computer systems (e.g. web search engines)
Data analysis is the inner loop at Yahoo! et al.
Big data necessitates parallel processing.

3 Examples
1. Detect faces. You have a function detectfaces(), and you want to run it over n images, where n is big.
2. Study web usage. You have a web crawl and a click log; find sessions that end with the best page.

4 Existing Work
Parallel architectures: cluster computing, multi-core processors
Data-parallel software: parallel DBMSs, Map-Reduce, Dryad
Data-parallel languages: SQL, NESL

5 Pig Project
Data-parallel language ("Pig Latin"):
  relational data manipulation primitives
  imperative programming style
  plug in code to customize processing
Various crazy ideas:
  multi-program optimization
  adaptive data placement
  automatic example data generator

6 Pig Latin Language [SIGMOD 08]

7 Example 1
Detect faces in many images.

  I = load '/mydata/images' using ImageParser() as (id, image);
  F = foreach I generate id, detectfaces(image);
  store F into '/mydata/faces';

8 Example 2
Find sessions that end with the best page.

Visits (user, url, time):
  Amy    8:00
  Amy    8:05
  Amy   10:00
  Amy   10:05
  Fred  cnn.com/index.htm  12:00

Pages (url, pagerank)

9 Efficient Evaluation Method
Visits, Pages
  repartition by url
  join
  repartition by user
  group-wise processing (identify sessions; examine pageranks)
  the answer

10 In Pig Latin

  Visits = load '/data/visits' as (user, url, time);
  Visits = foreach Visits generate user, Canonicalize(url), time;
  Pages = load '/data/pages' as (url, pagerank);
  VP = join Visits by url, Pages by url;
  UserVisits = group VP by user;
  Sessions = foreach UserVisits generate flatten(findsessions(*));
  HappyEndings = filter Sessions by BestIsLast(*);
  store HappyEndings into '/data/happy_endings';
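What the script above computes can be sketched as a small in-memory Python simulation. The toy data, the 30-minute session gap, and the helper names (canonicalize, find_sessions, best_is_last) are illustrative assumptions standing in for the UDFs in the talk, not their actual implementations:

```python
from collections import defaultdict

def canonicalize(url):
    # Hypothetical stand-in for the Canonicalize UDF: lowercase, strip "www."
    url = url.lower()
    return url[4:] if url.startswith("www.") else url

def find_sessions(visits, gap=30):
    # Split one user's visits (time, pagerank) into sessions whenever
    # consecutive visits are more than `gap` minutes apart.
    visits = sorted(visits, key=lambda v: v[0])
    sessions, current = [], [visits[0]]
    for prev, cur in zip(visits, visits[1:]):
        if cur[0] - prev[0] > gap:
            sessions.append(current)
            current = []
        current.append(cur)
    sessions.append(current)
    return sessions

def best_is_last(session):
    # True iff the highest-pagerank page in the session is the final one.
    return max(s[1] for s in session) == session[-1][1]

# Toy data: (user, url, time-in-minutes) and (url, pagerank).
visits = [("Amy", "WWW.cnn.com", 480), ("Amy", "frogs.com", 485),
          ("Amy", "snails.com", 600), ("Amy", "www.toads.com", 605)]
pages = {"cnn.com": 0.9, "frogs.com": 0.4, "snails.com": 0.2, "toads.com": 0.8}

# join Visits by url, Pages by url; group VP by user
by_user = defaultdict(list)
for user, url, t in visits:
    u = canonicalize(url)
    if u in pages:
        by_user[user].append((t, pages[u]))

# foreach group: flatten(findsessions(*)), then filter by BestIsLast(*)
happy = [(user, s) for user, vs in by_user.items()
         for s in find_sessions(vs) if best_is_last(s)]
print(happy)
```

Amy's morning session ends on a low-rank page and is filtered out; her later session ends on its best page and survives.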

11 Pig Latin, in general
Transformations on sets of records.
Easy for users: high-level, extensible data processing primitives.
Easy for the system: exposes opportunities for parallelism and reuse.
Operators: FILTER, FOREACH ... GENERATE, GROUP
Binary operators: JOIN, COGROUP, UNION
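The relationship between the binary operators can be made concrete with a toy Python sketch: COGROUP collects, per key, a bag of matching records from each input, and JOIN is then just COGROUP followed by flattening the cross-product of the bags. The field-index calling convention here is an illustrative simplification, not Pig's API:

```python
from collections import defaultdict

def cogroup(left, right, lkey, rkey):
    # COGROUP: for each key, emit (bag of left matches, bag of right matches).
    groups = defaultdict(lambda: ([], []))
    for rec in left:
        groups[rec[lkey]][0].append(rec)
    for rec in right:
        groups[rec[rkey]][1].append(rec)
    return dict(groups)

visits = [("Amy", "cnn.com"), ("Fred", "cnn.com"), ("Amy", "frogs.com")]
pages = [("cnn.com", 0.9), ("frogs.com", 0.4)]

grouped = cogroup(visits, pages, lkey=1, rkey=0)
# JOIN Visits by url, Pages by url, derived from the cogroup:
joined = [v + p for bags in grouped.values() for v in bags[0] for p in bags[1]]
```

Exposing the grouped intermediate form (rather than only the flattened join) is what lets users plug custom per-group processing into the pipeline.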

12 Related Languages
SQL: declarative; all-in-one blocks
NESL: lacks join, cogroup
Map-Reduce: a special case of Pig Latin:

  a = foreach input generate flatten(Map(*));
  b = group a by $0;
  c = foreach b generate Reduce(*);

Sawzall: rigid map-then-reduce structure
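The three-statement encoding above can be checked with a small Python sketch: any map-reduce job is a flatten-map, a group on the first field, and a per-group reduce. Word count is used here as the usual illustrative example:

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    # a = foreach input generate flatten(Map(*));
    a = [kv for rec in records for kv in map_fn(rec)]
    # b = group a by $0;
    b = defaultdict(list)
    for k, v in a:
        b[k].append(v)
    # c = foreach b generate Reduce(*);
    return {k: reduce_fn(k, vs) for k, vs in b.items()}

# Word count expressed in this scheme:
lines = ["pig latin", "latin"]
counts = map_reduce(lines,
                    map_fn=lambda line: [(w, 1) for w in line.split()],
                    reduce_fn=lambda k, vs: sum(vs))
# counts == {"pig": 1, "latin": 2}
```

The converse does not hold: Pig Latin programs with joins, cogroups, or multi-step flows do not fit a single map-then-reduce pair, which is the sense in which Map-Reduce is the special case.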

13 Pig Latin = Sweet Spot Between SQL & Map-Reduce

                                 SQL                                Map-Reduce
Programming style                large blocks of declarative        plug together pipes (but
                                 constraints                        restricted to linear, 2-step
                                                                    data flow)
Built-in data manipulations      group-by, sort, join, filter,      group-by, sort
                                 aggregate, top-k, etc.
Execution model                  fancy; trust the query optimizer   simple, transparent
Opportunities for automatic      many                               few (logic buried in map()
optimization                                                        and reduce())

"I much prefer writing in Pig [Latin] versus SQL. The step-by-step method of creating a program in Pig [Latin] is much cleaner and simpler to use than the single-block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data." -- Jasmine Novak, Engineer, Yahoo!

"Pig seems to give the necessary parallel programming constructs (FOREACH, FLATTEN, COGROUP, etc.) and also give sufficient control back to the programmer (which a purely declarative approach like [SQL on top of Map-Reduce] doesn't)." -- Ricky Ho, Adobe Software

14 Map-Reduce as Backend
user writes (SQL or) Pig; automatic rewrite + optimize compiles it to Map-Reduce, which runs on the cluster.
Details in [VLDB 09].

15 Pig Latin vs. Map-Reduce: Code Reduction

  Users = load 'users' as (name, age);
  Fltrd = filter Users by age >= 18 and age <= 25;
  Views = load 'views' as (user, url);
  Jnd = join Fltrd by name, Views by user;
  Grpd = group Jnd by url;
  Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
  Srtd = order Smmd by clicks desc;
  Top5 = limit Srtd 5;
  store Top5 into 'top5sites';

(The slide shows the equivalent hand-written Hadoop program in Java alongside: class MRExample, with LoadPages, LoadAndFilterUsers, Join, LoadJoined, ReduceUrls, LoadClicks, and LimitClicks mapper/reducer classes, plus the JobConf and JobControl wiring in main -- well over a hundred lines in all.)

16 Comparison
1/20 the lines of code (Hadoop vs. Pig)
1/16 the development time, in minutes (Hadoop vs. Pig)
Performance: within ~1.5x of Hadoop

17 Ways to Run Pig
Interactive shell
Script file
Embed in host language (e.g., Java)
Soon: graphical editor

18 Status
Open-source implementation; runs on Hadoop or a local machine.
Active project; many refinements in the works.
Wide adoption inside Yahoo!:
  100s of users
  1000s of Pig jobs/day
  60% of ad-hoc Hadoop jobs are via Pig
  40% of production jobs via Pig

19 Status
Gaining traction externally:
  log processing & aggregation
  building text indexes
  collaborative filtering, applied to image & video recommendation systems

"The [Hofmann PLSA E/M] algorithm was implemented in pig in lines of pig-latin statements. Took a lot less compared to what it took in implementing the algorithm in Map-Reduce Java. Exactly that's the reason I wanted to try it out in Pig. It took 3-4 days for me to write it, starting from learning pig." -- Prasenjit Mukherjee, Mahout project

20 Crazy Ideas [USENIX 08] [VLDB 08] [SIGMOD 09]

21 Crazy Idea #1 Multi-Program Optimization

22 Motivation
User programs repeatedly scan the same data files (web crawl, search log).
Goal: reduce redundant IOs, and hence improve overall system throughput.
Approach: introduce a shared-scans capability, with careful scheduling of jobs to maximize the benefit of shared scans.
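The IO saving can be illustrated with a toy Python accounting sketch. The queue contents and file sizes are invented for illustration; the point is only that a shared scan charges each distinct file once per batch instead of once per job:

```python
def scan_io(jobs, shared=False):
    # jobs: list of (input_file, file_size). Without sharing, every job
    # scans its own input; with a shared scan, each distinct file in the
    # batch is read once and its bytes are streamed to all waiting jobs.
    if shared:
        return sum(dict(jobs).values())   # one scan per distinct file
    return sum(size for _, size in jobs)  # one scan per job

# Three queued jobs over the (large) crawl, one over the (small) click data:
queue = [("crawl", 100), ("crawl", 100), ("crawl", 100), ("clicks", 10)]
print(scan_io(queue), scan_io(queue, shared=True))  # 310 vs 110
```

This is also why scheduling matters: batching pays off most for popular, large files, so the scheduler should hold back jobs whose input is likely to attract more arrivals soon.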

23 Scheduling Shared Scans
The scheduler looks at the queues and anticipates arrivals.
(Diagram: an idle executor in front of per-file job queues for the search log, web crawl, and click data; one file is not popular.)

24 Crazy Idea #2 Adaptive Data Placement

25 Motivation
Hadoop is good at localizing computation to data for conventional map-reduce scenarios. However, advanced query processing operations change this:
  pre-hashed join of A & B: co-locate A & B?
  fragment-replicate join of A & B: more replicas of B?
Our idea: an adaptive, pressure-based mechanism to move data such that better locality arises.
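One possible pressure-based policy can be sketched in a few lines of Python. The threshold rule, the data structures, and the worker/dataset names are all assumptions made for illustration; the talk does not specify the actual mechanism:

```python
def rebalance(placement, nonlocal_reads, threshold=0.5):
    # placement: dataset -> set of workers holding a replica.
    # nonlocal_reads: (dataset, worker) -> count of reads served remotely.
    # When remote traffic for a dataset at some worker dominates the total,
    # add a replica there so future tasks on that worker run locally.
    total = sum(nonlocal_reads.values()) or 1
    for (ds, worker), n in nonlocal_reads.items():
        if n / total >= threshold and worker not in placement[ds]:
            placement[ds].add(worker)
    return placement

# B lives only on w1, but w2 keeps joining against it remotely:
placement = {"B": {"w1"}}
reads = {("B", "w2"): 8, ("B", "w1"): 0}
rebalance(placement, reads)
# "B" gains a replica on w2, so the fragment-replicate join of A & B
# can run locally on both workers.
```

The "pressure" here is just the observed share of remote reads; a real mechanism would also decay old counts and cap replication to bound storage cost.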

26 Adaptive Data Placement
(Diagram: a job localizer sits between incoming jobs and the execution engine; Worker 1 runs "Scan C" while Worker 2 runs "Join A & B", with datasets A, B, C, D placed across the two workers.)

27 Crazy Idea #3 Automatic Example Generator

28 Example Program (abstract view)
Find users who tend to visit good pages.

  Load Visits(user, url, time)
  Transform to (user, Canonicalize(url), time)
  Load Pages(url, pagerank)
  Join url = url
  Group by user
  Transform to (user, Average(pagerank) as avgpr)
  Filter avgpr > 0.5

29 Example Data Flow

Load Visits(user, url, time):
  (Amy, cnn.com, 8am)  (Amy, , 9am)  (Fred, , 11am)
Transform to (user, Canonicalize(url), time):
  (Amy, , 8am)  (Amy, , 9am)  (Fred, , 11am)
Load Pages(url, pagerank):
  ( , 0.9)  ( , 0.4)
Join url = url:
  (Amy, , 8am, 0.9)  (Amy, , 9am, 0.4)  (Fred, , 11am, 0.4)
Group by user:
  (Amy, { (Amy, , 8am, 0.9), (Amy, , 9am, 0.4) })
  (Fred, { (Fred, , 11am, 0.4) })
Transform to (user, Average(pagerank) as avgpr):
  (Amy, 0.65)  (Fred, 0.4)
Filter avgpr > 0.5:
  (Amy, 0.65)
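The worked example above can be replayed as a short Python simulation. The filled-in urls and pageranks beyond those shown on the slide are illustrative assumptions, chosen so the averages come out as in the example:

```python
from statistics import mean

# Toy inputs: Visits(user, url, time) and Pages(url, pagerank).
visits = [("Amy", "www.cnn.com", "8am"), ("Amy", "www.frogs.com", "9am"),
          ("Fred", "www.snails.com", "11am")]
pages = {"cnn.com": 0.9, "frogs.com": 0.4, "snails.com": 0.4}

def canon(url):
    # Stand-in for Canonicalize: strip "www." so visit urls match Pages.
    return url.removeprefix("www.")

# Join canonicalized visits with pages, group by user, average pagerank.
ranks = {}
for user, url, _ in visits:
    ranks.setdefault(user, []).append(pages[canon(url)])
avg = {user: mean(rs) for user, rs in ranks.items()}

# Filter avgpr > 0.5 keeps Amy (0.65) and drops Fred (0.4).
good_users = {u: pr for u, pr in avg.items() if pr > 0.5}
print(good_users)
```

This mirrors the slide's flow exactly: Amy's two visits average to 0.65 and pass the filter; Fred's single 0.4 visit does not.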

30 Automatic Example Data Generator
Objectives: realism, conciseness, completeness.
Challenges: large original data; selective operators (e.g., join, filter); noninvertible operators (e.g., UDFs).

31 Talk Summary
Data-parallel language ("Pig Latin"):
  sequence of data transformation steps
  users can plug in custom code
Research nuggets:
  joint scheduling of related programs, to amortize IOs
  adaptive data placement, to enhance locality
  automatic example data generator, to make the user's life easier

32 Credits
Yahoo! Grid Team project leads: Alan Gates, Olga Natkovich
Yahoo! Research project leads: Chris Olston, Utkarsh Srivastava, Ben Reed

Big Data Technology CS 236620, Technion, Spring 2013 Big Data Technology CS 236620, Technion, Spring 2013 Structured Databases atop Map-Reduce Edward Bortnikov & Ronny Lempel Yahoo! Labs, Haifa Roadmap Previous class MR Implementation This class Query Languages

More information

Extreme Computing. Hadoop MapReduce in more detail. www.inf.ed.ac.uk

Extreme Computing. Hadoop MapReduce in more detail. www.inf.ed.ac.uk Extreme Computing Hadoop MapReduce in more detail How will I actually learn Hadoop? This class session Hadoop: The Definitive Guide RTFM There is a lot of material out there There is also a lot of useless

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

Mrs: MapReduce for Scientific Computing in Python

Mrs: MapReduce for Scientific Computing in Python Mrs: for Scientific Computing in Python Andrew McNabb, Jeff Lund, and Kevin Seppi Brigham Young University November 16, 2012 Large scale problems require parallel processing Communication in parallel processing

More information

Play with Big Data on the Shoulders of Open Source

Play with Big Data on the Shoulders of Open Source OW2 Open Source Corporate Network Meeting Play with Big Data on the Shoulders of Open Source Liu Jie Technology Center of Software Engineering Institute of Software, Chinese Academy of Sciences 2012-10-19

More information

Implement Hadoop jobs to extract business value from large and varied data sets

Implement Hadoop jobs to extract business value from large and varied data sets Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to

More information

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Astrid Rheinländer Wissensmanagement in der Bioinformatik What is Big Data? collection of data sets so large and complex

More information

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design

More information

Biomap Jobs and the Big Picture

Biomap Jobs and the Big Picture Lots of Data, Little Money. A Last.fm perspective Martin Dittus, martind@last.fm London Geek Nights, 2009-04-23 Big Data Little Money You have lots of data You want to process it For your product (Last.fm:

More information

Hadoop Framework. technology basics for data scientists. Spring - 2014. Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN

Hadoop Framework. technology basics for data scientists. Spring - 2014. Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Hadoop Framework technology basics for data scientists Spring - 2014 Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Warning! Slides are only for presenta8on guide We will discuss+debate addi8onal

More information

Peers Techno log ies Pv t. L td. HADOOP

Peers Techno log ies Pv t. L td. HADOOP Page 1 Peers Techno log ies Pv t. L td. Course Brochure Overview Hadoop is a Open Source from Apache, which provides reliable storage and faster process by using the Hadoop distibution file system and

More information

BIG DATA APPLICATIONS

BIG DATA APPLICATIONS BIG DATA ANALYTICS USING HADOOP AND SPARK ON HATHI Boyu Zhang Research Computing ITaP BIG DATA APPLICATIONS Big data has become one of the most important aspects in scientific computing and business analytics

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 3. Apache Hadoop Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Apache Hadoop Open-source

More information

HPCHadoop: MapReduce on Cray X-series

HPCHadoop: MapReduce on Cray X-series HPCHadoop: MapReduce on Cray X-series Scott Michael Research Analytics Indiana University Cray User Group Meeting May 7, 2014 1 Outline Motivation & Design of HPCHadoop HPCHadoop demo Benchmarking Methodology

More information

COURSE CONTENT Big Data and Hadoop Training

COURSE CONTENT Big Data and Hadoop Training COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person

More information

Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan sriram@sdsc.edu

Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan sriram@sdsc.edu Introduc)on to the MapReduce Paradigm and Apache Hadoop Sriram Krishnan sriram@sdsc.edu Programming Model The computa)on takes a set of input key/ value pairs, and Produces a set of output key/value pairs.

More information

Data Science in the Wild

Data Science in the Wild Data Science in the Wild Lecture 3 Some slides are taken from J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 Data Science and Big Data Big Data: the data cannot

More information

Report Vertiefung, Spring 2013 Constant Interval Extraction using Hadoop

Report Vertiefung, Spring 2013 Constant Interval Extraction using Hadoop Report Vertiefung, Spring 2013 Constant Interval Extraction using Hadoop Thomas Brenner, 08-928-434 1 Introduction+and+Task+ Temporal databases are databases expanded with a time dimension in order to

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing MapReduce and Hadoop 15 319, spring 2010 17 th Lecture, Mar 16 th Majd F. Sakr Lecture Goals Transition to MapReduce from Functional Programming Understand the origins of

More information

Moving from CS 61A Scheme to CS 61B Java

Moving from CS 61A Scheme to CS 61B Java Moving from CS 61A Scheme to CS 61B Java Introduction Java is an object-oriented language. This document describes some of the differences between object-oriented programming in Scheme (which we hope you

More information

Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010

Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010 Hadoop s Entry into the Traditional Analytical DBMS Market Daniel Abadi Yale University August 3 rd, 2010 Data, Data, Everywhere Data explosion Web 2.0 more user data More devices that sense data More

More information

Istanbul Şehir University Big Data Camp 14. Hadoop Map Reduce. Aslan Bakirov Kevser Nur Çoğalmış

Istanbul Şehir University Big Data Camp 14. Hadoop Map Reduce. Aslan Bakirov Kevser Nur Çoğalmış Istanbul Şehir University Big Data Camp 14 Hadoop Map Reduce Aslan Bakirov Kevser Nur Çoğalmış Agenda Map Reduce Concepts System Overview Hadoop MR Hadoop MR Internal Job Execution Workflow Map Side Details

More information

Relational Processing on MapReduce

Relational Processing on MapReduce Relational Processing on MapReduce Jerome Simeon IBM Watson Research Content obtained from many sources, notably: Jimmy Lin course on MapReduce. Our Plan Today 1. Recap: Key relational DBMS notes Key Hadoop

More information

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015 COSC 6397 Big Data Analytics 2 nd homework assignment Pig and Hive Edgar Gabriel Spring 2015 2 nd Homework Rules Each student should deliver Source code (.java files) Documentation (.pdf,.doc,.tex or.txt

More information

Big Data for the JVM developer. Costin Leau, Elasticsearch @costinl

Big Data for the JVM developer. Costin Leau, Elasticsearch @costinl Big Data for the JVM developer Costin Leau, Elasticsearch @costinl Agenda Data Trends Data Pipelines JVM and Big Data Tool Eco-system Data Landscape Data Trends http://www.emc.com/leadership/programs/digital-universe.htm

More information

Chameleon: The Performance Tuning Tool for MapReduce Query Processing Systems

Chameleon: The Performance Tuning Tool for MapReduce Query Processing Systems paper:38 Chameleon: The Performance Tuning Tool for MapReduce Query Processing Systems Edson Ramiro Lucsa Filho 1, Ivan Luiz Picoli 2, Eduardo Cunha de Almeida 2, Yves Le Traon 1 1 University of Luxembourg

More information

Introduc)on to Map- Reduce. Vincent Leroy

Introduc)on to Map- Reduce. Vincent Leroy Introduc)on to Map- Reduce Vincent Leroy Sources Apache Hadoop Yahoo! Developer Network Hortonworks Cloudera Prac)cal Problem Solving with Hadoop and Pig Slides will be available at hgp://lig- membres.imag.fr/leroyv/

More information

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb

More information

BIG DATA - HADOOP PROFESSIONAL amron

BIG DATA - HADOOP PROFESSIONAL amron 0 Training Details Course Duration: 30-35 hours training + assignments + actual project based case studies Training Materials: All attendees will receive: Assignment after each module, video recording

More information

Introduction to Parallel Programming and MapReduce

Introduction to Parallel Programming and MapReduce Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant

More information

Health Care Claims System Prototype

Health Care Claims System Prototype SGT WHITE PAPER Health Care Claims System Prototype MongoDB and Hadoop 2015 SGT, Inc. All Rights Reserved 7701 Greenbelt Road, Suite 400, Greenbelt, MD 20770 Tel: (301) 614-8600 Fax: (301) 614-8601 www.sgt-inc.com

More information

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University 2014-05-12

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University 2014-05-12 Introduction to NoSQL Databases and MapReduce Tore Risch Information Technology Uppsala University 2014-05-12 What is a NoSQL Database? 1. A key/value store Basic index manager, no complete query language

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

Elastic Map Reduce. Shadi Khalifa Database Systems Laboratory (DSL) khalifa@cs.queensu.ca

Elastic Map Reduce. Shadi Khalifa Database Systems Laboratory (DSL) khalifa@cs.queensu.ca Elastic Map Reduce Shadi Khalifa Database Systems Laboratory (DSL) khalifa@cs.queensu.ca The Amazon Web Services Universe Cross Service Features Management Interface Platform Services Infrastructure Services

More information

Halloween Costume Ideas For Xbox 360

Halloween Costume Ideas For Xbox 360 Semantics with Failures Practical Considerations If map and reduce are deterministic, then output identical to non-faulting sequential execution For non-deterministic operators, different reduce tasks

More information

Big Data Storage, Management and challenges. Ahmed Ali-Eldin

Big Data Storage, Management and challenges. Ahmed Ali-Eldin Big Data Storage, Management and challenges Ahmed Ali-Eldin (Ambitious) Plan What is Big Data? And Why talk about Big Data? How to store Big Data? BigTables (Google) Dynamo (Amazon) How to process Big

More information

Hadoop: Understanding the Big Data Processing Method

Hadoop: Understanding the Big Data Processing Method Hadoop: Understanding the Big Data Processing Method Deepak Chandra Upreti 1, Pawan Sharma 2, Dr. Yaduvir Singh 3 1 PG Student, Department of Computer Science & Engineering, Ideal Institute of Technology

More information

Workshop on Hadoop with Big Data

Workshop on Hadoop with Big Data Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly

More information

Other Map-Reduce (ish) Frameworks. William Cohen

Other Map-Reduce (ish) Frameworks. William Cohen Other Map-Reduce (ish) Frameworks William Cohen 1 Outline More concise languages for map- reduce pipelines Abstractions built on top of map- reduce General comments Speci

More information

Lambda Architecture. CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014

Lambda Architecture. CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014 Lambda Architecture CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014 1 Goals Cover the material in Chapter 8 of the Concurrency Textbook The Lambda Architecture Batch Layer MapReduce

More information

Distributed Calculus with Hadoop MapReduce inside Orange Search Engine. mardi 3 juillet 12

Distributed Calculus with Hadoop MapReduce inside Orange Search Engine. mardi 3 juillet 12 Distributed Calculus with Hadoop MapReduce inside Orange Search Engine What is Big Data? $ 5 billions (2012) to $ 50 billions (by 2017) Forbes «Big Data is the new definitive source of competitive advantage

More information

and HDFS for Big Data Applications Serge Blazhievsky Nice Systems

and HDFS for Big Data Applications Serge Blazhievsky Nice Systems Introduction PRESENTATION to Hadoop, TITLE GOES MapReduce HERE and HDFS for Big Data Applications Serge Blazhievsky Nice Systems SNIA Legal Notice The material contained in this tutorial is copyrighted

More information

Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems

Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems Processing of massive data: MapReduce 2. Hadoop 1 MapReduce Implementations Google were the first that applied MapReduce for big data analysis Their idea was introduced in their seminal paper MapReduce:

More information

Data processing goes big

Data processing goes big Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,

More information

CIEL A universal execution engine for distributed data-flow computing

CIEL A universal execution engine for distributed data-flow computing Reviewing: CIEL A universal execution engine for distributed data-flow computing Presented by Niko Stahl for R202 Outline 1. Motivation 2. Goals 3. Design 4. Fault Tolerance 5. Performance 6. Related Work

More information

Hadoop Streaming. 2012 coreservlets.com and Dima May. 2012 coreservlets.com and Dima May

Hadoop Streaming. 2012 coreservlets.com and Dima May. 2012 coreservlets.com and Dima May 2012 coreservlets.com and Dima May Hadoop Streaming Originals of slides and source code for examples: http://www.coreservlets.com/hadoop-tutorial/ Also see the customized Hadoop training courses (onsite

More information