web-scale data processing Christopher Olston and many others Yahoo! Research
|
|
- Jemima Underwood
- 8 years ago
- Views:
Transcription
1 web-scale data processing Christopher Olston and many others Yahoo! Research
2 Motivation Projects increasingly revolve around analysis of big data sets Extracting structured data, e.g. face detection Understanding complex large-scale phenomena social systems (e.g. user-generated web content) economic systems (e.g. advertising markets) computer systems (e.g. web search engines) Data analysis is inner loop at Yahoo! et al. Big data necessitates parallel processing
3 Examples 1. Detect faces You have a function detectfaces() You want to run it over n images n is big 2. Study web usage You have a web crawl and click log Find sessions that end with the best page
4 Existing Work Parallel architectures cluster computing multi-core processors Data-parallel software parallel DBMS Map-Reduce, Dryad Data-parallel languages SQL NESL
5 Pig Project Data-parallel language ( Pig Latin ) Relational data manipulation primitives Imperative programming style Plug in code to customize processing Various crazy ideas Multi-program optimization Adaptive data placement Automatic example data generator
6 Pig Latin Language [SIGMOD 08]
7 Example 1 Detect faces in many images. I = load /mydata/images using ImageParser() as (id, image); F = foreach I generate id, detectfaces(image); store F into /mydata/faces ;
8 Example 2 Find sessions that end with the best page. Visits user url time Amy 8:00 Amy 8:05 Amy 10:00 Amy 10:05 Fred cnn.com/index.htm 12:00 Pages url pagerank
9 Efficient Evaluation Method the answer repartition by user repartition by url group-wise processing (identify sessions; examine pageranks) join Visits Pages
10 In Pig Latin Visits = load /data/visits as (user, url, time); Visits = foreach Visits generate user, Canonicalize(url), time; Pages = load /data/pages as (url, pagerank); VP = join Visits by url, Pages by url; UserVisits = group VP by user; Sessions = foreach UserVisits generate flatten(findsessions(*)); HappyEndings = filter Sessions by BestIsLast(*); store HappyEndings into '/data/happy_endings';
11 Pig Latin, in general transformations on sets of records easy for users high-level, extensible data processing primitives easy for the system exposes opportunities for parallelism and reuse operators: FILTER FOREACH GENERATE GROUP binary operators: JOIN COGROUP UNION
12 Related Languages SQL: declarative all-in-one blocks NESL: lacks join, cogroup Map-Reduce: special case of Pig Latin a = FOREACH input GENERATE flatten(map(*)); b = GROUP a BY $0; c = FOREACH b GENERATE Reduce(*); Sawzall: rigid map-then-reduce structure
13 Pig Latin = Sweet Spot Between SQL & Map-Reduce SQL Map-Reduce Programming style Large blocks of declarative Plug together pipes (but "I much prefer writing in Pig [Latin] versus SQL. The step-by-step method of constraints restricted to linear, 2-step creating a program in Pig [Latin] is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of data what flow) your variables Built-in are, data and where you Group-by, are in the process Sort, Join, of analyzing Filter, your Group-by, data. Sort manipulations Aggregate, Top-k, etc Jasmine Novak, Engineer, Yahoo! Execution model Fancy; trust the query Simple, transparent "PIG seems to give optimizer the necessary parallel programming construct Opportunities (FOREACH, for FLATTEN, Many COGROUP.. etc) and also give Few sufficient (logic control buried in map() automatic back optimization to the programmer (which purely declarative approach and reduce()) like [SQL on top of Map-Reduce] doesn t). -- Ricky Ho, Adobe Software
14 Map-Reduce as Backend ( SQL ) user automatic rewrite + optimize Pig or or Map-Reduce cluster details in [VLDB09]
15 Pig Latin vs. Map-Reduce: Code Reduction import java.io.ioexception; reporter.setstatus("ok"); lp.setoutputkeyclass(text.class); import java.util.arraylist; lp.setoutputvalueclass(text.class); import java.util.iterator; lp.setmapperclass(loadpages.class); Users = load users as (name, age); import java.util.list; // Do the cross product and collect the values FileInputFormat.addInputPath(lp, new for (String s1 : first) { Path("/user/gates/pages")); import org.apache.hadoop.fs.path; for (String s2 : second) { FileOutputFormat.setOutputPath(lp, import org.apache.hadoop.io.longwritable; String outval = key + "," + s1 + "," + s2; new Path("/user/gates/tmp/indexed_pages")); import org.apache.hadoop.io.text; oc.collect(null, new Text(outval)); lp.setnumreducetasks(0); import org.apache.hadoop.io.writable; reporter.setstatus("ok"); Job loadpages = new Job(lp); import org.apache.hadoop.io.writablecomparable; import Fltrd org.apache.hadoop.mapred.fileinputformat; = filter Users by JobConf lfu = new JobConf(MRExample.class); import org.apache.hadoop.mapred.fileoutputformat; lfu.setjobname("load and Filter Users"); import org.apache.hadoop.mapred.jobconf; lfu.setinputformat(textinputformat.class); import org.apache.hadoop.mapred.keyvaluetextinputformat; public static class LoadJoined extends MapReduceBase lfu.setoutputkeyclass(text.class); import org.apache.hadoop.mapred.mapper; implements Mapper<Text, Text, Text, LongWritable> { lfu.setoutputvalueclass(text.class); import org.apache.hadoop.mapred.mapreducebase; lfu.setmapperclass(loadandfilterusers.class); import org.apache.hadoop.mapred.outputcollector; public void map( FileInputFormat.addInputPath(lfu, new import org.apache.hadoop.mapred.recordreader; age >= 18 Text and k, age <= 25; Path("/user/gates/users")); import org.apache.hadoop.mapred.reducer; Text val, FileOutputFormat.setOutputPath(lfu, import org.apache.hadoop.mapred.reporter; OutputCollector<Text, LongWritable> oc, new Path("/user/gates/tmp/filtered_users")); import org.apache.hadoop.mapred.sequencefileinputformat; Reporter reporter) throws IOException { lfu.setnumreducetasks(0); import org.apache.hadoop.mapred.sequencefileoutputformat; // Find the url Job loadusers = new Job(lfu); import org.apache.hadoop.mapred.textinputformat; String line = val.tostring(); import org.apache.hadoop.mapred.jobcontrol.job; int firstcomma = line.indexof(','); JobConf join = new JobConf(MRExample.class); import Views org.apache.hadoop.mapred.jobcontrol.jobcontrol; = load views int secondcomma = line.indexof(',', as (user, firstcomma); url); join.setjobname("join Users and Pages"); import org.apache.hadoop.mapred.lib.identitymapper; String key = line.substring(firstcomma, secondcomma); join.setinputformat(keyvaluetextinputformat.class); // drop the rest of the record, I don't need it anymore, join.setoutputkeyclass(text.class); public class MRExample { // just pass a 1 for the combiner/reducer to sum instead. join.setoutputvalueclass(text.class); public static class LoadPages extends MapReduceBase Text outkey = new Text(key); join.setmapperclass(identitymapper.class); implements Mapper<LongWritable, Text, Text, Text> { oc.collect(outkey, new LongWritable(1L)); join.setreducerclass(join.class); FileInputFormat.addInputPath(join, new Jnd public void map(longwritable = k, Text join val, Fltrd by name, Views Path("/user/gates/tmp/indexed_pages")); by user; OutputCollector<Text, Text> oc, public static class ReduceUrls extends MapReduceBase FileInputFormat.addInputPath(join, new Reporter reporter) throws IOException { implements Reducer<Text, LongWritable, WritableComparable, Path("/user/gates/tmp/filtered_users")); // Pull the key out Writable> { FileOutputFormat.setOutputPath(join, new String line = val.tostring(); Path("/user/gates/tmp/joined")); int firstcomma = line.indexof(','); public void reduce( join.setnumreducetasks(50); String key = line.substring(0, firstcomma); Text key, Job joinjob = new Job(join); Grpd String value = line.substring(firstcomma = group + 1); Jnd Iterator<LongWritable> by url; iter, joinjob.adddependingjob(loadpages); Text outkey = new Text(key); OutputCollector<WritableComparable, Writable> oc, joinjob.adddependingjob(loadusers); // Prepend an index to the value so we know which file Reporter reporter) throws IOException { // it came from. // Add up all the values we see JobConf group = new JobConf(MRExample.class); Text outval = new Text("1" + value); group.setjobname("group URLs"); oc.collect(outkey, outval); long sum = 0; group.setinputformat(keyvaluetextinputformat.class); while (iter.hasnext()) { group.setoutputkeyclass(text.class); sum += iter.next().get(); group.setoutputvalueclass(longwritable.class); Smmd public static class LoadAndFilterUsers = extends foreach MapReduceBase Grpd reporter.setstatus("ok"); generate group, group.setoutputformat(sequencefileoutputformat.class); implements Mapper<LongWritable, Text, Text, Text> { group.setmapperclass(loadjoined.class); group.setcombinerclass(reduceurls.class); public void map(longwritable k, Text val, oc.collect(key, new LongWritable(sum)); group.setreducerclass(reduceurls.class); OutputCollector<Text, Text> oc, FileInputFormat.addInputPath(group, new Reporter reporter) throws IOException { Path("/user/gates/tmp/joined")); // Pull the key out public static class LoadClicks extends MapReduceBase FileOutputFormat.setOutputPath(group, new String line = val.tostring(); COUNT(Jnd) implements Mapper<WritableComparable, as clicks; Writable, LongWritable, Path("/user/gates/tmp/grouped")); int firstcomma = line.indexof(','); Text> { group.setnumreducetasks(50); String value = line.substring(firstcomma + 1); Job groupjob = new Job(group); int age = Integer.parseInt(value); public void map( groupjob.adddependingjob(joinjob); if (age < 18 age > 25) return; WritableComparable key, String key = line.substring(0, firstcomma); Writable val, JobConf top100 = new JobConf(MRExample.class); Text outkey = new Text(key); OutputCollector<LongWritable, Text> oc, top100.setjobname("top 100 sites"); Srtd // Prepend an index to = the value order so we know which file Smmd Reporter by reporter) throws clicks IOException { desc; top100.setinputformat(sequencefileinputformat.class); // it came from. oc.collect((longwritable)val, (Text)key); top100.setoutputkeyclass(longwritable.class); Text outval = new Text("2" + value); top100.setoutputvalueclass(text.class); oc.collect(outkey, outval); top100.setoutputformat(sequencefileoutputformat.class); public static class LimitClicks extends MapReduceBase top100.setmapperclass(loadclicks.class); implements Reducer<LongWritable, Text, LongWritable, Text> { top100.setcombinerclass(limitclicks.class); public static class Join extends MapReduceBase top100.setreducerclass(limitclicks.class); implements Reducer<Text, Text, Text, Text> { int count = 0; FileInputFormat.addInputPath(top100, new public void reduce( Path("/user/gates/tmp/grouped")); public void reduce(text key, LongWritable key, FileOutputFormat.setOutputPath(top100, new Iterator<Text> iter, Iterator<Text> iter, Path("/user/gates/top100sitesforusers18to25")); OutputCollector<Text, Text> oc, OutputCollector<LongWritable, Text> oc, top100.setnumreducetasks(1); Reporter reporter) throws IOException { Reporter reporter) throws IOException { Job limit = new Job(top100); // For each value, figure out which file it's from and limit.adddependingjob(groupjob); store it Top5 into top5sites ; // Only output the first 100 records // accordingly. while (count < 100 && iter.hasnext()) { JobControl jc = new JobControl("Find top 100 sites for users List<String> first = new ArrayList<String>(); oc.collect(key, iter.next()); 18 to 25"); List<String> second = new ArrayList<String>(); count++; jc.addjob(loadpages); jc.addjob(loadusers); while (iter.hasnext()) { jc.addjob(joinjob); Text t = iter.next(); jc.addjob(groupjob); String value = t.tostring(); public static void main(string[] args) throws IOException { jc.addjob(limit); if (value.charat(0) == '1') JobConf lp = new JobConf(MRExample.class); jc.run(); first.add(value.substring(1)); lp.setjobname("load Pages"); else second.add(value.substring(1)); lp.setinputformat(textinputformat.class); Top5 = limit Srtd 5;
16 Comparison /20 the lines of code Minutes /16 the development time Hadoop Pig Hadoop Pig performance 1.5x Hadoop
17 Ways to Run Pig Interactive shell Script file Embed in host language (e.g., Java) soon: Graphical editor
18 Status Open-source implementation Runs on Hadoop or local machine Active project; many refinements in the works Wide adoption in Yahoo 100s of users 1000s of Pig jobs/day 60% of ad-hoc Hadoop jobs are via Pig 40% of production jobs via Pig
19 Status Gaining traction externally log processing & aggregation building text indexes collaborative filtering, applied to image & video recommendation systems "The [Hofmann PLSA E/M] algorithm was implemented in pig in lines of pig-latin statements. Took a lot less compared to what it took in implementing the algorithm in Map-Reduce Java. Exactly that's the reason I wanted to try it out in Pig. It took 3-4 days for me to write it, starting from learning pig. -- Prasenjit Mukherjee, Mahout project
20 Crazy Ideas [USENIX 08] [VLDB 08] [SIGMOD 09]
21 Crazy Idea #1 Multi-Program Optimization
22 Motivation User programs repeatedly scan the same data files web crawl search log Goal: Reduce redundant IOs, and hence improve overall system throughput Approach: Introduce shared scans capability Careful scheduling of jobs, to maximize benefit of shared scans
23 Scheduling Shared Scans Scheduler looks at queues and anticipates arrivals Executor (Idle) 2 not popular Search Log Web Crawl Click Data
24 Crazy Idea #2 Adaptive Data Placement
25 Motivation Hadoop is good at localizing computation to data, for conventional map-reduce scenarios However, advanced query processing operations change this: Pre-hashed join of A & B: co-locate A & B? Frag-repl join of A & B: more replicas of B? Our idea: Adaptive pressure-based mechanism to move data s.t. better locality arises
26 Adaptive Data Placement jobs Job Localizer execution engine Scan C C A B Join Scan A A & B D A B Worker 1 Worker 2
27 Crazy Idea #3 Automatic Example Generator
28 Example Program (abstract view) Find users who tend to visit good pages. Load Visits(user, url, time) Load Pages(url, pagerank) Transform to (user, Canonicalize(url), time) Join url = url Group by user Transform to (user, Average(pagerank) as avgpr) Filter avgpr > 0.5
29 Load Visits(user, url, time) Transform to (user, Canonicalize(url), time) (Amy, cnn.com, 8am) (Amy, 9am) (Fred, 11am) Join url = url Load Pages(url, pagerank) ( 0.9) ( 0.4) (Amy, 8am) (Amy, 9am) (Fred, 11am) (Amy, 8am, 0.9) (Amy, 9am, 0.4) (Fred, 11am, 0.4) Group by user (Amy, { (Amy, 8am, 0.9), (Amy, 9am, 0.4) ) (Fred, { (Fred, 11am, 0.4) ) Transform to (user, Average(pagerank) as avgpr) (Amy, 0.65) (Fred, 0.4) Filter avgpr > 0.5 (Amy, 0.65)
30 Objectives: Automatic Example Data Generator Realism Conciseness Completeness Challenges: Large original data Selective operators (e.g., join, filter) Noninvertible operators (e.g., UDFs)
31 Talk Summary Data-parallel language ( Pig Latin ) Sequence of data transformation steps Users can plug in custom code Research nuggets Joint scheduling of related programs, to amortize IOs Adaptive data placement, to enhance locality Automatic example data generator, to make user s life easier
32 Credits Yahoo! Grid Team project leads: Alan Gates Olga Natkovich Yahoo! Research project leads: Chris Olston Utkarsh Srivastava Ben Reed
Practice and Applications of Data Management CMPSCI 345. Lecture 19-20: Amazon Web Services
Practice and Applications of Data Management CMPSCI 345 Lecture 19-20: Amazon Web Services Extra credit: project part 3 } Open-ended addi*onal features. } Presenta*ons on Dec 7 } Need to sign up by Nov
More informationHadoop Pig. Introduction Basic. Exercise
Your Name Hadoop Pig Introduction Basic Exercise A set of files A database A single file Modern systems have to deal with far more data than was the case in the past Yahoo : over 170PB of data Facebook
More informationDistributed and Cloud Computing
Distributed and Cloud Computing K. Hwang, G. Fox and J. Dongarra Chapter 6: Cloud Programming and Software Environments Part 1 Adapted from Kai Hwang, University of Southern California with additions from
More informationDatabase design and implementation CMPSCI 645. Lectures 22: Parallel Databases and MapReduce
Database design and implementation CMPSCI 645 Lectures 22: Parallel Databases and MapReduce What is a parallel database? 2 Parallel v.s. distributed databases } Parallel database system: } Improve performance
More informationProgramming and Debugging Large- Scale Data Processing Workflows. Christopher Olston and many others Yahoo! Research
Programming and Debugging Large- Scale Data Processing Workflows Christopher Olston and many others Yahoo! Research Context Elaborate processing of large data sets e.g.: web search pre- processing cross-
More informationProgramming and Debugging Large- Scale Data Processing Workflows
Programming and Debugging Large- Scale Data Processing Workflows Christopher Olston Google Research (work done at Yahoo! Research, with many colleagues) Context Elaborate processing of large data sets
More informationHadoop and Eclipse. Eclipse Hawaii User s Group May 26th, 2009. Seth Ladd http://sethladd.com
Hadoop and Eclipse Eclipse Hawaii User s Group May 26th, 2009 Seth Ladd http://sethladd.com Goal YOU can use the same technologies as The Big Boys Google Yahoo (2000 nodes) Last.FM AOL Facebook (2.5 petabytes
More informationHow To Write A Program On A Microsoft.Com (For Microsoft) With A Microsail (For Linux) And A Microosoft.Com Microsails (For Windows) (For Macintosh) ( For Microsils
Programming and Debugging Large- Scale Data Processing Workflows Christopher Olston Google Research (work done at Yahoo! Research, with many colleagues) What I m Working on at Google We ve got fantasjc
More informationLarge Scale (Machine) Learning at Twitter. Aleksander Kołcz Twitter, Inc.
Large Scale (Machine) Learning at Twitter Aleksander Kołcz Twitter, Inc. What is Twitter Micro-blogging service Open exchange of information/opinions Private communication possible via DMs Content restricted
More informationCSE 544: Principles of Database Systems
CSE 544: Principles of Database Systems MapReduce, PigLatin CSE544 - Spring, 2012 1 Overview of Today s Lecture Cluster computing Map/reduce (paper) Degree sequence (brief discussion) PigLatin (brief discussion,
More informationMR-(Mapreduce Programming Language)
MR-(Mapreduce Programming Language) Siyang Dai Zhi Zhang Shuai Yuan Zeyang Yu Jinxiong Tan sd2694 zz2219 sy2420 zy2156 jt2649 Objective of MR MapReduce is a software framework introduced by Google, aiming
More informationSystems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13
Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Hadoop Ecosystem Overview of this Lecture Module Background Google MapReduce The Hadoop Ecosystem Core components: Hadoop
More informationStep 4: Configure a new Hadoop server This perspective will add a new snap-in to your bottom pane (along with Problems and Tasks), like so:
Codelab 1 Introduction to the Hadoop Environment (version 0.17.0) Goals: 1. Set up and familiarize yourself with the Eclipse plugin 2. Run and understand a word counting program Setting up Eclipse: Step
More informationCopy the.jar file into the plugins/ subfolder of your Eclipse installation. (e.g., C:\Program Files\Eclipse\plugins)
Beijing Codelab 1 Introduction to the Hadoop Environment Spinnaker Labs, Inc. Contains materials Copyright 2007 University of Washington, licensed under the Creative Commons Attribution 3.0 License --
More informationIntroduction to Pig. Content developed and presented by: 2009 Cloudera, Inc.
Introduction to Pig Content developed and presented by: Outline Motivation Background Components How it Works with Map Reduce Pig Latin by Example Wrap up & Conclusions Motivation Map Reduce is very powerful,
More informationBig Data Technology Pig: Query Language atop Map-Reduce
Big Data Technology Pig: Query Language atop Map-Reduce Eshcar Hillel Yahoo! Ronny Lempel Outbrain *Based on slides by Edward Bortnikov & Ronny Lempel Roadmap Previous class MR Implementation This class
More informationConnecting Hadoop with Oracle Database
Connecting Hadoop with Oracle Database Sharon Stephen Senior Curriculum Developer Server Technologies Curriculum The following is intended to outline our general product direction.
More informationHadoop WordCount Explained! IT332 Distributed Systems
Hadoop WordCount Explained! IT332 Distributed Systems Typical problem solved by MapReduce Read a lot of data Map: extract something you care about from each record Shuffle and Sort Reduce: aggregate, summarize,
More informationBig Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich
Big Data Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Google MapReduce The Hadoop
More informationZebra and MapReduce. Table of contents. 1 Overview...2 2 Hadoop MapReduce APIs...2 3 Zebra MapReduce APIs...2 4 Zebra MapReduce Examples...
Table of contents 1 Overview...2 2 Hadoop MapReduce APIs...2 3 Zebra MapReduce APIs...2 4 Zebra MapReduce Examples... 2 1. Overview MapReduce allows you to take full advantage of Zebra's capabilities.
More informationHadoop Introduction. Olivier Renault Solution Engineer - Hortonworks
Hadoop Introduction Olivier Renault Solution Engineer - Hortonworks Hortonworks A Brief History of Apache Hadoop Apache Project Established Yahoo! begins to Operate at scale Hortonworks Data Platform 2013
More informationWord count example Abdalrahman Alsaedi
Word count example Abdalrahman Alsaedi To run word count in AWS you have two different ways; either use the already exist WordCount program, or to write your own file. First: Using AWS word count program
More informationClick Stream Data Analysis Using Hadoop
Governors State University OPUS Open Portal to University Scholarship Capstone Projects Spring 2015 Click Stream Data Analysis Using Hadoop Krishna Chand Reddy Gaddam Governors State University Sivakrishna
More informationIntroduction to Apache Pig Indexing and Search
Large-scale Information Processing, Summer 2014 Introduction to Apache Pig Indexing and Search Emmanouil Tzouridis Knowledge Mining & Assessment Includes slides from Ulf Brefeld: LSIP 2013 Organizational
More informationApache Hive. Big Data 2015
Apache Hive Big Data 2015 Hive Configuration Translates HiveQL statements into a set of MapReduce jobs which are then executed on a Hadoop Cluster Execute on Hadoop Cluster HiveQL Hive Monitor/Report Client
More informationSystems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de
Systems Engineering II Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de About me! Since May 2015 2015 2012 Research Group Leader cfaed, TU Dresden PhD Student MPI- SWS Research Intern Microsoft
More informationScalable Computing with Hadoop
Scalable Computing with Hadoop Doug Cutting cutting@apache.org dcutting@yahoo-inc.com 5/4/06 Seek versus Transfer B-Tree requires seek per access unless to recent, cached page so can buffer & pre-sort
More informationPig vs Hive. Big Data 2014
Pig vs Hive Big Data 2014 Pig Configuration In the bash_profile export all needed environment variables Pig Configuration Download a release of apache pig: pig-0.11.1.tar.gz Pig Configuration Go to the
More informationWorking With Hadoop. Important Terminology. Important Terminology. Anatomy of MapReduce Job Run. Important Terminology
Working With Hadoop Now that we covered the basics of MapReduce, let s look at some Hadoop specifics. Mostly based on Tom White s book Hadoop: The Definitive Guide, 3 rd edition Note: We will use the new
More informationWord Count Code using MR2 Classes and API
EDUREKA Word Count Code using MR2 Classes and API A Guide to Understand the Execution of Word Count edureka! A guide to understand the execution and flow of word count WRITE YOU FIRST MRV2 PROGRAM AND
More informationCase-Based Reasoning Implementation on Hadoop and MapReduce Frameworks Done By: Soufiane Berouel Supervised By: Dr Lily Liang
Case-Based Reasoning Implementation on Hadoop and MapReduce Frameworks Done By: Soufiane Berouel Supervised By: Dr Lily Liang Independent Study Advanced Case-Based Reasoning Department of Computer Science
More informationLANGUAGES FOR HADOOP: PIG & HIVE
Friday, September 27, 13 1 LANGUAGES FOR HADOOP: PIG & HIVE Michail Michailidis & Patrick Maiden Friday, September 27, 13 2 Motivation Native MapReduce Gives fine-grained control over how program interacts
More informationXiaoming Gao Hui Li Thilina Gunarathne
Xiaoming Gao Hui Li Thilina Gunarathne Outline HBase and Bigtable Storage HBase Use Cases HBase vs RDBMS Hands-on: Load CSV file to Hbase table with MapReduce Motivation Lots of Semi structured data Horizontal
More informationCS54100: Database Systems
CS54100: Database Systems Cloud Databases: The Next Post- Relational World 18 April 2012 Prof. Chris Clifton Beyond RDBMS The Relational Model is too limiting! Simple data model doesn t capture semantics
More informationHow To Write A Mapreduce Program In Java.Io 4.4.4 (Orchestra)
MapReduce framework - Operates exclusively on pairs, - that is, the framework views the input to the job as a set of pairs and produces a set of pairs as the output
More informationData Management in the Cloud
Data Management in the Cloud Ryan Stern stern@cs.colostate.edu : Advanced Topics in Distributed Systems Department of Computer Science Colorado State University Outline Today Microsoft Cloud SQL Server
More informationBig Data and Analytics by Seema Acharya and Subhashini Chellappan Copyright 2015, WILEY INDIA PVT. LTD. Introduction to Pig
Introduction to Pig Agenda What is Pig? Key Features of Pig The Anatomy of Pig Pig on Hadoop Pig Philosophy Pig Latin Overview Pig Latin Statements Pig Latin: Identifiers Pig Latin: Comments Data Types
More informationThe Hadoop Eco System Shanghai Data Science Meetup
The Hadoop Eco System Shanghai Data Science Meetup Karthik Rajasethupathy, Christian Kuka 03.11.2015 @Agora Space Overview What is this talk about? Giving an overview of the Hadoop Ecosystem and related
More informationBIG DATA, MAPREDUCE & HADOOP
BIG, MAPREDUCE & HADOOP LARGE SCALE DISTRIBUTED SYSTEMS By Jean-Pierre Lozi A tutorial for the LSDS class LARGE SCALE DISTRIBUTED SYSTEMS BIG, MAPREDUCE & HADOOP 1 OBJECTIVES OF THIS LAB SESSION The LSDS
More informationCloudera Certified Developer for Apache Hadoop
Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number
More informationHow To Write A Map In Java (Java) On A Microsoft Powerbook 2.5 (Ahem) On An Ipa (Aeso) Or Ipa 2.4 (Aseo) On Your Computer Or Your Computer
Lab 0 - Introduction to Hadoop/Eclipse/Map/Reduce CSE 490h - Winter 2007 To Do 1. Eclipse plug in introduction Dennis Quan, IBM 2. Read this hand out. 3. Get Eclipse set up on your machine. 4. Load the
More informationPig Latin: A Not-So-Foreign Language for Data Processing
Pig Latin: A Not-So-Foreign Language for Data Processing Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins Yahoo! Research {olston, breed, utkarsh, ravikumar, atomkins}@yahoo-inc.com
More informationHow To Analyze Log Files In A Web Application On A Hadoop Mapreduce System
Analyzing Web Application Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environment Sayalee Narkhede Department of Information Technology Maharashtra Institute
More informationIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology jie.tao@kit.edu Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process
More informationParallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data
Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin
More informationRecap. CSE 486/586 Distributed Systems Data Analytics. Example 1: Scientific Data. Two Questions We ll Answer. Data Analytics. Example 2: Web Data C 1
ecap Distributed Systems Data Analytics Steve Ko Computer Sciences and Engineering University at Buffalo PC enables programmers to call functions in remote processes. IDL (Interface Definition Language)
More informationDistributed Aggregation in Cloud Databases. By: Aparna Tiwari tiwaria@umail.iu.edu
Distributed Aggregation in Cloud Databases By: Aparna Tiwari tiwaria@umail.iu.edu ABSTRACT Data intensive applications rely heavily on aggregation functions for extraction of data according to user requirements.
More informationBIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig
BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig Contents Acknowledgements... 1 Introduction to Hive and Pig... 2 Setup... 2 Exercise 1 Load Avro data into HDFS... 2 Exercise 2 Define an
More information16.1 MAPREDUCE. For personal use only, not for distribution. 333
For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several
More informationCloud Computing. Up until now
Cloud Computing Lecture 19 Cloud Programming 2011-2012 Up until now Introduction, Definition of Cloud Computing Pre-Cloud Large Scale Computing: Grid Computing Content Distribution Networks Cycle-Sharing
More informationA Comparison of Approaches to Large-Scale Data Analysis
A Comparison of Approaches to Large-Scale Data Analysis Sam Madden MIT CSAIL with Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, and Michael Stonebraker In SIGMOD 2009 MapReduce
More informationBig Data Technology CS 236620, Technion, Spring 2013
Big Data Technology CS 236620, Technion, Spring 2013 Structured Databases atop Map-Reduce Edward Bortnikov & Ronny Lempel Yahoo! Labs, Haifa Roadmap Previous class MR Implementation This class Query Languages
More informationExtreme Computing. Hadoop MapReduce in more detail. www.inf.ed.ac.uk
Extreme Computing Hadoop MapReduce in more detail How will I actually learn Hadoop? This class session Hadoop: The Definitive Guide RTFM There is a lot of material out there There is also a lot of useless
More informationDATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
More informationMrs: MapReduce for Scientific Computing in Python
Mrs: for Scientific Computing in Python Andrew McNabb, Jeff Lund, and Kevin Seppi Brigham Young University November 16, 2012 Large scale problems require parallel processing Communication in parallel processing
More informationPlay with Big Data on the Shoulders of Open Source
OW2 Open Source Corporate Network Meeting Play with Big Data on the Shoulders of Open Source Liu Jie Technology Center of Software Engineering Institute of Software, Chinese Academy of Sciences 2012-10-19
More informationImplement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
More informationBig Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13
Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Astrid Rheinländer Wissensmanagement in der Bioinformatik What is Big Data? collection of data sets so large and complex
More informationMapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example
MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design
More informationBiomap Jobs and the Big Picture
Lots of Data, Little Money. A Last.fm perspective Martin Dittus, martind@last.fm London Geek Nights, 2009-04-23 Big Data Little Money You have lots of data You want to process it For your product (Last.fm:
More informationHadoop Framework. technology basics for data scientists. Spring - 2014. Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN
Hadoop Framework technology basics for data scientists Spring - 2014 Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Warning! Slides are only for presenta8on guide We will discuss+debate addi8onal
More informationPeers Techno log ies Pv t. L td. HADOOP
Page 1 Peers Techno log ies Pv t. L td. Course Brochure Overview Hadoop is a Open Source from Apache, which provides reliable storage and faster process by using the Hadoop distibution file system and
More informationBIG DATA APPLICATIONS
BIG DATA ANALYTICS USING HADOOP AND SPARK ON HATHI Boyu Zhang Research Computing ITaP BIG DATA APPLICATIONS Big data has become one of the most important aspects in scientific computing and business analytics
More informationBig Data Management and NoSQL Databases
NDBI040 Big Data Management and NoSQL Databases Lecture 3. Apache Hadoop Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Apache Hadoop Open-source
More informationHPCHadoop: MapReduce on Cray X-series
HPCHadoop: MapReduce on Cray X-series Scott Michael Research Analytics Indiana University Cray User Group Meeting May 7, 2014 1 Outline Motivation & Design of HPCHadoop HPCHadoop demo Benchmarking Methodology
More informationCOURSE CONTENT Big Data and Hadoop Training
COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop
More informationINTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
More informationSpark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY
Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person
More informationIntroduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan sriram@sdsc.edu
Introduc)on to the MapReduce Paradigm and Apache Hadoop Sriram Krishnan sriram@sdsc.edu Programming Model The computa)on takes a set of input key/ value pairs, and Produces a set of output key/value pairs.
More informationData Science in the Wild
Data Science in the Wild Lecture 3 Some slides are taken from J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 Data Science and Big Data Big Data: the data cannot
More informationReport Vertiefung, Spring 2013 Constant Interval Extraction using Hadoop
Report Vertiefung, Spring 2013 Constant Interval Extraction using Hadoop Thomas Brenner, 08-928-434 1 Introduction+and+Task+ Temporal databases are databases expanded with a time dimension in order to
More informationA REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information
More informationIntroduction to Cloud Computing
Introduction to Cloud Computing MapReduce and Hadoop 15 319, spring 2010 17 th Lecture, Mar 16 th Majd F. Sakr Lecture Goals Transition to MapReduce from Functional Programming Understand the origins of
More informationMoving from CS 61A Scheme to CS 61B Java
Moving from CS 61A Scheme to CS 61B Java Introduction Java is an object-oriented language. This document describes some of the differences between object-oriented programming in Scheme (which we hope you
More informationHadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010
Hadoop s Entry into the Traditional Analytical DBMS Market Daniel Abadi Yale University August 3 rd, 2010 Data, Data, Everywhere Data explosion Web 2.0 more user data More devices that sense data More
More informationIstanbul Şehir University Big Data Camp 14. Hadoop Map Reduce. Aslan Bakirov Kevser Nur Çoğalmış
Istanbul Şehir University Big Data Camp 14 Hadoop Map Reduce Aslan Bakirov Kevser Nur Çoğalmış Agenda Map Reduce Concepts System Overview Hadoop MR Hadoop MR Internal Job Execution Workflow Map Side Details
More informationRelational Processing on MapReduce
Relational Processing on MapReduce Jerome Simeon IBM Watson Research Content obtained from many sources, notably: Jimmy Lin course on MapReduce. Our Plan Today 1. Recap: Key relational DBMS notes Key Hadoop
More informationCOSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015
COSC 6397 Big Data Analytics 2 nd homework assignment Pig and Hive Edgar Gabriel Spring 2015 2 nd Homework Rules Each student should deliver Source code (.java files) Documentation (.pdf,.doc,.tex or.txt
More informationBig Data for the JVM developer. Costin Leau, Elasticsearch @costinl
Big Data for the JVM developer Costin Leau, Elasticsearch @costinl Agenda Data Trends Data Pipelines JVM and Big Data Tool Eco-system Data Landscape Data Trends http://www.emc.com/leadership/programs/digital-universe.htm
More informationChameleon: The Performance Tuning Tool for MapReduce Query Processing Systems
paper:38 Chameleon: The Performance Tuning Tool for MapReduce Query Processing Systems Edson Ramiro Lucsa Filho 1, Ivan Luiz Picoli 2, Eduardo Cunha de Almeida 2, Yves Le Traon 1 1 University of Luxembourg
More informationIntroduc)on to Map- Reduce. Vincent Leroy
Introduc)on to Map- Reduce Vincent Leroy Sources Apache Hadoop Yahoo! Developer Network Hortonworks Cloudera Prac)cal Problem Solving with Hadoop and Pig Slides will be available at hgp://lig- membres.imag.fr/leroyv/
More informationHadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb
More informationBIG DATA - HADOOP PROFESSIONAL amron
0 Training Details Course Duration: 30-35 hours training + assignments + actual project based case studies Training Materials: All attendees will receive: Assignment after each module, video recording
More informationIntroduction to Parallel Programming and MapReduce
Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant
More informationHealth Care Claims System Prototype
SGT WHITE PAPER Health Care Claims System Prototype MongoDB and Hadoop 2015 SGT, Inc. All Rights Reserved 7701 Greenbelt Road, Suite 400, Greenbelt, MD 20770 Tel: (301) 614-8600 Fax: (301) 614-8601 www.sgt-inc.com
More informationIntroduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University 2014-05-12
Introduction to NoSQL Databases and MapReduce Tore Risch Information Technology Uppsala University 2014-05-12 What is a NoSQL Database? 1. A key/value store Basic index manager, no complete query language
More informationChapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
More informationElastic Map Reduce. Shadi Khalifa Database Systems Laboratory (DSL) khalifa@cs.queensu.ca
Elastic Map Reduce Shadi Khalifa Database Systems Laboratory (DSL) khalifa@cs.queensu.ca The Amazon Web Services Universe Cross Service Features Management Interface Platform Services Infrastructure Services
More informationHalloween Costume Ideas For Xbox 360
Semantics with Failures Practical Considerations If map and reduce are deterministic, then output identical to non-faulting sequential execution For non-deterministic operators, different reduce tasks
More informationBig Data Storage, Management and challenges. Ahmed Ali-Eldin
Big Data Storage, Management and challenges Ahmed Ali-Eldin (Ambitious) Plan What is Big Data? And Why talk about Big Data? How to store Big Data? BigTables (Google) Dynamo (Amazon) How to process Big
More informationHadoop: Understanding the Big Data Processing Method
Hadoop: Understanding the Big Data Processing Method Deepak Chandra Upreti 1, Pawan Sharma 2, Dr. Yaduvir Singh 3 1 PG Student, Department of Computer Science & Engineering, Ideal Institute of Technology
More informationWorkshop on Hadoop with Big Data
Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly
More informationOther Map-Reduce (ish) Frameworks. William Cohen
Other Map-Reduce (ish) Frameworks William Cohen 1 Outline More concise languages for map- reduce pipelines Abstractions built on top of map- reduce General comments Speci
More informationLambda Architecture. CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014
Lambda Architecture CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014 1 Goals Cover the material in Chapter 8 of the Concurrency Textbook The Lambda Architecture Batch Layer MapReduce
More informationDistributed Calculus with Hadoop MapReduce inside Orange Search Engine. mardi 3 juillet 12
Distributed Calculus with Hadoop MapReduce inside Orange Search Engine What is Big Data? $ 5 billions (2012) to $ 50 billions (by 2017) Forbes «Big Data is the new definitive source of competitive advantage
More informationand HDFS for Big Data Applications Serge Blazhievsky Nice Systems
Introduction PRESENTATION to Hadoop, TITLE GOES MapReduce HERE and HDFS for Big Data Applications Serge Blazhievsky Nice Systems SNIA Legal Notice The material contained in this tutorial is copyrighted
More informationProcessing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems
Processing of massive data: MapReduce 2. Hadoop 1 MapReduce Implementations Google were the first that applied MapReduce for big data analysis Their idea was introduced in their seminal paper MapReduce:
More informationData processing goes big
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
More informationCIEL A universal execution engine for distributed data-flow computing
Reviewing: CIEL A universal execution engine for distributed data-flow computing Presented by Niko Stahl for R202 Outline 1. Motivation 2. Goals 3. Design 4. Fault Tolerance 5. Performance 6. Related Work
More informationHadoop Streaming. 2012 coreservlets.com and Dima May. 2012 coreservlets.com and Dima May
2012 coreservlets.com and Dima May Hadoop Streaming Originals of slides and source code for examples: http://www.coreservlets.com/hadoop-tutorial/ Also see the customized Hadoop training courses (onsite
More information