1 web-scale data processing Christopher Olston and many others Yahoo! Research

2 Motivation
- Projects increasingly revolve around analysis of big data sets
  - Extracting structured data, e.g. face detection
  - Understanding complex large-scale phenomena:
    - social systems (e.g. user-generated web content)
    - economic systems (e.g. advertising markets)
    - computer systems (e.g. web search engines)
- Data analysis is the inner loop at Yahoo! et al.
- Big data necessitates parallel processing

3 Examples
1. Detect faces
   - You have a function detectFaces()
   - You want to run it over n images
   - n is big
2. Study web usage
   - You have a web crawl and a click log
   - Find sessions that end with the best page

4 Existing Work
- Parallel architectures: cluster computing; multi-core processors
- Data-parallel software: parallel DBMS; Map-Reduce, Dryad
- Data-parallel languages: SQL; NESL

5 Pig Project
- Data-parallel language ("Pig Latin")
  - Relational data manipulation primitives
  - Imperative programming style
  - Plug in code to customize processing
- Various crazy ideas
  - Multi-program optimization
  - Adaptive data placement
  - Automatic example data generator

6 Pig Latin Language [SIGMOD 08]

7 Example 1
Detect faces in many images.

  I = load '/mydata/images' using ImageParser() as (id, image);
  F = foreach I generate id, detectFaces(image);
  store F into '/mydata/faces';
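Here detectFaces() is the user's own code. As a hedged sketch of how such a plug-in gets wired up (the jar and Java class names are hypothetical), a script registers the jar containing the UDF and binds a short alias to it with define:

  register visionudfs.jar;                              -- hypothetical jar holding the UDF
  define detectFaces com.example.vision.DetectFaces();  -- hypothetical UDF class
  I = load '/mydata/images' using ImageParser() as (id, image);
  F = foreach I generate id, detectFaces(image);
  store F into '/mydata/faces';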

8 Example 2
Find sessions that end with the best page.

Visits:
  user  url                time
  Amy   …                  8:00
  Amy   …                  8:05
  Amy   …                  10:00
  Amy   …                  10:05
  Fred  cnn.com/index.htm  12:00

Pages:
  url   pagerank
  …     …

9 Efficient Evaluation Method

  Visits, Pages
    → repartition by url
    → join
    → repartition by user
    → group-wise processing (identify sessions; examine pageranks)
    → the answer

10 In Pig Latin

  Visits = load '/data/visits' as (user, url, time);
  Visits = foreach Visits generate user, Canonicalize(url), time;
  Pages = load '/data/pages' as (url, pagerank);
  VP = join Visits by url, Pages by url;
  UserVisits = group VP by user;
  Sessions = foreach UserVisits generate flatten(FindSessions(*));
  HappyEndings = filter Sessions by BestIsLast(*);
  store HappyEndings into '/data/happy_endings';

11 Pig Latin, in general
- Transformations on sets of records
- Easy for users: high-level, extensible data processing primitives
- Easy for the system: exposes opportunities for parallelism and reuse
- Operators: FILTER, FOREACH ... GENERATE, GROUP
- Binary operators: JOIN, COGROUP, UNION (see the sketch below)
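To make the operator list concrete, a brief hedged sketch (all relations, fields, and paths are hypothetical) that touches each operator named above:

  -- hypothetical inputs, for illustration only
  clicks_jan = load '/data/clicks_jan' as (user, url);
  clicks_feb = load '/data/clicks_feb' as (user, url);
  profiles   = load '/data/profiles' as (user, age);

  clicks = union clicks_jan, clicks_feb;                 -- UNION
  adults = filter profiles by age >= 18;                 -- FILTER
  byurl  = group clicks by url;                          -- GROUP
  counts = foreach byurl generate group, COUNT(clicks);  -- FOREACH ... GENERATE
  joined = join clicks by user, adults by user;          -- JOIN (binary)
  cg     = cogroup clicks by user, adults by user;       -- COGROUP (binary)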

12 Related Languages
- SQL: declarative, all-in-one blocks
- NESL: lacks join, cogroup
- Map-Reduce: special case of Pig Latin:

    a = FOREACH input GENERATE flatten(Map(*));
    b = GROUP a BY $0;
    c = FOREACH b GENERATE Reduce(*);

- Sawzall: rigid map-then-reduce structure
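As a concrete instance of this encoding, here is word count, the canonical map-reduce program, in Pig Latin; a hedged sketch in which the input path is hypothetical and TOKENIZE/COUNT are Pig built-ins. The first foreach plays the role of the map, the group supplies the shuffle on the key, and the final foreach plays the role of the reduce:

  lines  = load '/data/docs' as (line:chararray);
  words  = foreach lines generate flatten(TOKENIZE(line)) as word;  -- the "map"
  grpd   = group words by word;                                     -- shuffle on key
  counts = foreach grpd generate group, COUNT(words);               -- the "reduce"
  store counts into '/data/wordcounts';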

13 Pig Latin = Sweet Spot Between SQL & Map-Reduce

                             SQL                              Map-Reduce
  Programming style          Large blocks of declarative      Plug together pipes (but
                             constraints                      restricted to linear,
                                                              2-step data flow)
  Built-in data              Group-by, Sort, Join, Filter,    Group-by, Sort
  manipulations              Aggregate, Top-k, etc.
  Execution model            Fancy; trust the query           Simple, transparent
                             optimizer
  Opportunities for          Many                             Few (logic buried in
  automatic optimization                                      map() and reduce())

"I much prefer writing in Pig [Latin] versus SQL. The step-by-step method of creating a program in Pig [Latin] is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data." -- Jasmine Novak, Engineer, Yahoo!

"PIG seems to give the necessary parallel programming constructs (FOREACH, FLATTEN, COGROUP, etc.) and also gives sufficient control back to the programmer (which a purely declarative approach like [SQL on top of Map-Reduce] doesn't)." -- Ricky Ho, Adobe Software
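The step-by-step style Novak's quote praises is easiest to see on a small query. Take "for each category with over a million high-pagerank urls, compute the average pagerank of those urls": SQL states it in one block (SELECT category, AVG(pagerank) FROM urls WHERE pagerank > 0.2 GROUP BY category HAVING COUNT(*) > 1000000), while a hedged Pig Latin sketch (the urls relation and the thresholds are illustrative) names each intermediate result:

  urls       = load '/data/urls' as (url, category, pagerank);
  good_urls  = filter urls by pagerank > 0.2;
  groups     = group good_urls by category;
  big_groups = filter groups by COUNT(good_urls) > 1000000;
  answer     = foreach big_groups generate group, AVG(good_urls.pagerank);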

14 Map-Reduce as Backend

  user → Pig Latin (or SQL) → automatic rewrite + optimize → Map-Reduce → cluster

  (details in [VLDB 09])

15 Pig Latin vs. Map-Reduce: Code Reduction

In Pig Latin, the whole task fits in nine statements:

  Users = load 'users' as (name, age);
  Fltrd = filter Users by age >= 18 and age <= 25;
  Views = load 'views' as (user, url);
  Jnd   = join Fltrd by name, Views by user;
  Grpd  = group Jnd by url;
  Smmd  = foreach Grpd generate group, COUNT(Jnd) as clicks;
  Srtd  = order Smmd by clicks desc;
  Top5  = limit Srtd 5;
  store Top5 into 'top5sites';

[The slide overlays this script on the equivalent hand-written Hadoop program: a full-page Java class (MRExample) defining LoadPages, LoadAndFilterUsers, Join, LoadJoined, ReduceUrls, LoadClicks, and LimitClicks mapper/reducer classes, plus a main() that wires five jobs (Load Pages; Load and Filter Users; Join Users and Pages; Group URLs; Top 100 sites) together with JobControl.]

16 Comparison
- 1/20 the lines of code (bar chart: Hadoop vs. Pig)
- 1/16 the development time, in minutes (bar chart: Hadoop vs. Pig)
- Performance: within 1.5x of raw Hadoop

17 Ways to Run Pig
- Interactive shell (example below)
- Script file
- Embed in host language (e.g., Java)
- Soon: graphical editor
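For instance, the interactive shell (Grunt) might look like this; a hedged sketch in which the data path is hypothetical, and -x local (running against the local filesystem instead of a cluster) is a standard Pig flag:

  $ pig -x local
  grunt> Users = load 'users' as (name, age);
  grunt> Teens = filter Users by age < 20;
  grunt> dump Teens;    -- print the relation to the console
  grunt> quit

A script file runs the same statements in batch, e.g. pig myscript.pig, which submits the compiled jobs to the Hadoop cluster.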

18 Status
- Open-source implementation
- Runs on Hadoop or a local machine
- Active project; many refinements in the works
- Wide adoption in Yahoo!
  - 100s of users
  - 1000s of Pig jobs/day
  - 60% of ad-hoc Hadoop jobs are via Pig
  - 40% of production jobs via Pig

19 Status
- Gaining traction externally:
  - log processing & aggregation
  - building text indexes
  - collaborative filtering, applied to image & video recommendation systems

"The [Hofmann PLSA E/M] algorithm was implemented in pig in … lines of pig-latin statements. Took a lot less compared to what it took in implementing the algorithm in Map-Reduce Java. Exactly that's the reason I wanted to try it out in Pig. It took 3-4 days for me to write it, starting from learning pig." -- Prasenjit Mukherjee, Mahout project

20 Crazy Ideas [USENIX 08] [VLDB 08] [SIGMOD 09]

21 Crazy Idea #1 Multi-Program Optimization

22 Motivation
- User programs repeatedly scan the same data files (web crawl, search log)
- Goal: reduce redundant IOs, and hence improve overall system throughput
- Approach: introduce a shared-scans capability, with careful scheduling of jobs to maximize the benefit of shared scans

23 Scheduling Shared Scans
- Scheduler looks at queues and anticipates arrivals
[Diagram: an idle executor and queued jobs over three files (Search Log, Web Crawl, Click Data); one file's queue is marked "not popular".]

24 Crazy Idea #2 Adaptive Data Placement

25 Motivation
- Hadoop is good at localizing computation to data, for conventional map-reduce scenarios
- However, advanced query processing operations change this:
  - Pre-hashed join of A & B: co-locate A & B?
  - Fragment-replicate join of A & B: more replicas of B? (see the sketch below)
- Our idea: an adaptive, pressure-based mechanism to move data such that better locality arises
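On the second bullet: a fragment-replicate join leaves the large input partitioned in place and copies the small input to every worker, which is why extra replicas of B help. In Pig this strategy is requested with a join hint; a hedged sketch with hypothetical relations (the small relation must fit in memory and is listed last):

  big   = load '/data/A' as (url, crawldate);
  small = load '/data/B' as (url, pagerank);
  J     = join big by url, small by url using 'replicated';  -- small is replicated to each worker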

26 Adaptive Data Placement
[Diagram: jobs flow through a Job Localizer into the execution engine; Worker 1 and Worker 2 hold blocks of files A, B, C, and D, and run a "Scan C" job and a "Join A & B" job.]

27 Crazy Idea #3 Automatic Example Generator

28 Example Program (abstract view)
Find users who tend to visit good pages.

  Load Visits(user, url, time)
  → Transform to (user, Canonicalize(url), time)
  → Join url = url with Load Pages(url, pagerank)
  → Group by user
  → Transform to (user, Average(pagerank) as avgpr)
  → Filter avgpr > 0.5
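In Pig Latin, this abstract pipeline might read as follows (a hedged sketch: Canonicalize is a user-supplied UDF, and the built-in AVG stands in for the Average step):

  Visits    = load '/data/visits' as (user, url, time);
  VisitsC   = foreach Visits generate user, Canonicalize(url) as url, time;
  Pages     = load '/data/pages' as (url, pagerank);
  VP        = join VisitsC by url, Pages by url;
  ByUser    = group VP by user;
  AvgPR     = foreach ByUser generate group as user, AVG(VP.pagerank) as avgpr;
  GoodUsers = filter AvgPR by avgpr > 0.5;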

29 (the example program applied to sample data)

  Load Visits(user, url, time):
    (Amy, cnn.com, 8am)  (Amy, …, 9am)  (Fred, …, 11am)

  Transform to (user, Canonicalize(url), time):
    (Amy, www.cnn.com, 8am)  (Amy, www.snails.com, 9am)  (Fred, www.snails.com, 11am)

  Load Pages(url, pagerank):
    (www.cnn.com, 0.9)  (www.snails.com, 0.4)

  Join url = url:
    (Amy, www.cnn.com, 8am, 0.9)  (Amy, www.snails.com, 9am, 0.4)  (Fred, www.snails.com, 11am, 0.4)

  Group by user:
    (Amy, { (Amy, www.cnn.com, 8am, 0.9), (Amy, www.snails.com, 9am, 0.4) })
    (Fred, { (Fred, www.snails.com, 11am, 0.4) })

  Transform to (user, Average(pagerank) as avgpr):
    (Amy, 0.65)  (Fred, 0.4)

  Filter avgpr > 0.5:
    (Amy, 0.65)

30 Automatic Example Data Generator
- Objectives: realism; conciseness; completeness
- Challenges:
  - Large original data
  - Selective operators (e.g., join, filter)
  - Noninvertible operators (e.g., UDFs)
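This idea corresponds to Pig's illustrate command, which derives a small example dataset for each step of a script. As a hedged sketch, assuming the session script from slide 10 has been entered:

  grunt> illustrate HappyEndings;

Running it prints a concise example table for each intermediate relation (here, Visits through HappyEndings), synthesizing or pruning records as needed so that every operator's effect is visible on at least one row.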

31 Talk Summary
- Data-parallel language ("Pig Latin")
  - Sequence of data transformation steps
  - Users can plug in custom code
- Research nuggets:
  - Joint scheduling of related programs, to amortize IOs
  - Adaptive data placement, to enhance locality
  - Automatic example data generator, to make the user's life easier

32 Credits
- Yahoo! Grid team project leads: Alan Gates, Olga Natkovich
- Yahoo! Research project leads: Chris Olston, Utkarsh Srivastava, Ben Reed
