Programming and Debugging Large- Scale Data Processing Workflows. Christopher Olston and many others Yahoo! Research
|
|
- Melvyn Allison
- 8 years ago
- Views:
Transcription
1 Programming and Debugging Large- Scale Data Processing Workflows Christopher Olston and many others Yahoo! Research
2 Context Elaborate processing of large data sets e.g.: web search pre- processing cross- dataset linkage web informa=on extrac=on inges?on storage & processing serving
3 Context storage & processing workflow manager e.g. Nova dataflow programming framework e.g. Pig distributed sor=ng & hashing e.g. Map- Reduce Overview scalable file system e.g. GFS Detail: Inspector Gadget, RubySky Debugging aides: Before: example data generator During: instrumenta=on framework ABer: provenance metadata manager
4 Pig: A High- Level Dataflow Language and Run=me for Hadoop Web browsing sessions with happy endings. Visits = load /data/visits as (user, url, time);! Visits = foreach Visits generate user, Canonicalize(url), time;!! Pages = load /data/pages as (url, pagerank);!! VP = join Visits by url, Pages by url;! UserVisits = group VP by user;! Sessions = foreach UserVisits generate flatten(findsessions(*));! HappyEndings = filter Sessions by BestIsLast(*);!! store HappyEndings into '/data/happy_endings';!
5 vs. map- reduce: less code! "The [Hofmann PLSA E/M] algorithm was implemented in pig in lines of pig-latin statements. Took a lot less compared to what it took in implementing the algorithm in Map-Reduce Java. Exactly that's the reason I wanted to try it out in Pig. It took 3-4 days for me to write it, starting from learning pig. " " -- Prasenjit Mukherjee, Mahout project" /20 the lines of code Minutes /16 the development =me Hadoop Pig Hadoop Pig performs on par with raw Hadoop
6 vs. SQL: step- by- step style; lower- level control "I much prefer writing in Pig [Latin] versus SQL. The step-by-step method of" creating a program in Pig [Latin] is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data. " " -- Jasmine Novak, Engineer, Yahoo!" "PIG seems to give the necessary parallel programming construct (FOREACH, FLATTEN, COGROUP.. etc) and also give sufficient control back to the programmer (which purely declarative approach like [SQL on top of Map- Reduce] doesnʼt). " " -- Ricky Ho, Adobe Software"
7 Conceptually: A Graph of Data Transforma=ons Find users who tend to visit good pages. Load Visits(user, url, =me) Load Pages(url, pagerank) Transform to (user, Canonicalize(url), =me) Join url = url Group by user Transform to (user, Average(pagerank) as avgpr) Filter avgpr > 0.5
8 Load Visits(user, url, =me) Load Pages(url, pagerank) Illustrated! Transform to (user, Canonicalize(url), =me) (Amy, cnn.com, 8am) (Amy, hhp:// 9am) (Fred, 11am) (Amy, 8am) (Amy, 9am) (Fred, 11am) Join url = url Group by user (Amy, 8am, 0.9) (Amy, 9am, 0.4) (Fred, 11am, 0.4) ( 0.9) ( 0.4) (Amy, { (Amy, 8am, 0.9), (Amy, 9am, 0.4) }) (Fred, { (Fred, 11am, 0.4) }) Transform to (user, Average(pagerank) as avgpr) ILLUSTRATE lets me check the output of my lengthy (Amy, 0.65) batch jobs and their (Fred, 0.4) custom functions without having to do a lengthy run of a long pipeline. [This feature] enables me to be productive. Filter " " avgpr > Russell Jurney, LinkedIn" (Amy, 0.65)
9 (Naïve Algorithm) Load Visits(user, url, =me) Transform to (user, Canonicalize(url), =me) (Amy, cnn.com, 8am) (Amy, hhp:// 9am) (Fred, 11am) (Amy, 8am) (Amy, 9am) (Fred, 11am) Join url = url Group by user Transform to (user, Average(pagerank) as avgpr) Load Pages(url, pagerank) ( 0.9) ( 0.4) Filter avgpr > 0.5
10 Pig Project Status Produc=zed at Yahoo (~12- person team) 1000s of jobs/day 70% of Hadoop jobs Open- source (the Apache Pig Project) Offered on Amazon Elas=c Map- Reduce Used by LinkedIn, Twiher, Yahoo,...
11 Next: NOVA storage & processing workflow manager e.g. Nova dataflow programming framework e.g. Pig distributed sor=ng & hashing e.g. Map- Reduce scalable file system e.g. GFS Debugging aides: Before: example data generator During: instrumenta=on framework ABer: provenance metadata manager
12 Why a Workflow Manager? Con=nuous data processing (simulated on top of Pig/Hadoop sta=c processing layer) Independent scheduling of workflow modules
13 Example RSS feed NEW Workflow template detec=on ALL news ar=cles ALL news site templates ALL NEW template tagging NEW NEW shingling NEW shingle hashes seen ALL NEW NEW de- duping NEW unique ar=cles
14 Data Passes Through Many Sub- Systems Nova datum X inges=on Pig Map- Reduce low- latency processor serving datum Y GFS provenance of X? metadata queries
15 Ibis Project metadata Ibis metadata queries integrated metadata answers data processing sub- systems metadata manager users Benefits: Provide uniform view to users Factor out metadata management code Decouple metadata life=me from data/subsystem life=me Challenges: Overhead of shipping metadata Disparate data/processing granulari=es
16 What s Hard About Mul=- Granularity Provenance? Inference: Given rela=onships expressed at one granularity, answer queries about other granulari=es (the seman;cs are tricky here!) Efficiency: Implement inference without resor=ng to materializing everything in terms of finest granularity (e.g. cells)
17 Next: INSPECTOR GADGET storage & processing workflow manager e.g. Nova dataflow programming framework e.g. Pig distributed sor=ng & hashing e.g. Map- Reduce scalable file system e.g. GFS Debugging aides: Before: example data generator During: instrumenta=on framework ABer: provenance metadata manager
18 Mo=vated by User Interviews Interviewed 10 Yahoo dataflow programmers (mostly Pig users; some users of other dataflow environments) Asked them how they (wish they could) debug
19 Summary of User Interviews # of requests feature 7 crash culprit determina=on 5 row- level integrity alerts 4 table- level integrity alerts 4 data samples 3 data summaries 3 memory use monitoring 3 backward tracing (provenance) 2 forward tracing 2 golden data/logic tes=ng 2 step- through debugging 2 latency alerts 1 latency profiling 1 overhead profiling 1 trial runs
20 Our Approach Goal: a programming framework for adding these behaviors, and others, to Pig Precept: avoid modifying Pig or tampering with data flowing through Pig Approach: perform Pig script rewri=ng insert special UDFs that look like no- ops to Pig
21 Pig w/ Inspector Gadget load filter load join IG coordinator group count store
22 Example: Crash Culprit Determina;on Phases 1 to n- 1: record counts Phase n: records load filter load join IG coordinator group Phases 1 to n- 1: maintain count lower bounds Phase n: maintain last- seen records count store
23 Example: Forward Tracing load load filter IG coordinator traced records join group tracing instruc=ons report traced records to user count store
24 Flow end user result dataflow program + app. parameters applica=on IG driver library launch instrumented dataflow run(s) raw result(s) load load IG coordinator filter join dataflow engine run;me store
25 Agent & Coordinator APIs Agent Class init(args) tags = observerecord(record, tags) receivemessage(source, message) finish() Agent Messaging sendtocoordinator(message) sendtoagent(agentid, message) senddownstream(message) sendupstream(message) Coordinator Class init(args) receivemessage(source, message) output = finish() Coordinator Messaging sendtoagent(agentid, message)
26 Applica=ons Developed Using IG # of requests feature lines of code (Java) 7 crash culprit determina=on row- level integrity alerts 89 4 table- level integrity alerts 99 4 data samples 97 3 data summaries memory use monitoring N/A 3 backward tracing (provenance) forward tracing golden data/logic tes=ng step- through debugging N/A 2 latency alerts latency profiling overhead profiling trial runs 93
27 RubySky: Unified Scrip=ng Framework for Processing and Debugging Scenario: ad- hoc data analysis e.g. we just did a crawl; is it too biased toward Yahoo content? In this scenario, some=mes using new data; always running new code à Bugs are the norm à Debugging should be a first class ci=zen!
28 RubySky Example :Crawl_data <= load(context, "/crawl_data", schema {string :url;...})!! # quick-and-dirty determination of "domain" from "url" (e.g. -> yahoo.com)! :With_domains <= Crawl_data.foreach(as {string :url; string :domain}) do r! url = r[:url]! prefix = if (url[0, prefix.length] == prefix)! domain = url[prefix.length, url.index('/', prefix.length) - prefix.length]! else! throw Don t know how to parse this url: + url! end! Tuple.new([url, domain])! end!! :Yahoo_only <= With_domains.filter do r! r[:domain].match(/^yahoo/)! end!! :Grouped <= Yahoo_only.group!! :Grouped.each { r puts NUMBER OF YAHOO URLs: + r[:yahoo_only].size }!
29 RubySky Example (cont.) :Crawl_data <= load(context, "/crawl_data", schema {string :url;...})!! # quick-and-dirty determination of "domain" from "url" (e.g. -> yahoo.com)! :With_domains <= Crawl_data.foreach(as {string :url; string :domain}) do r! url = r[:url]! prefix = if (url[0, prefix.length] == prefix)! domain = url[prefix.length, url.index('/', prefix.length) - prefix.length]! else! domain = punt url! end! Tuple.new([url, domain])! end!! # SPY: examine a few extracted domains, as sanity check! :Domain_sample <= With_domains.foreach(as {string :domain}) do r! $count += 1! ($count < SAMPLE_SIZE)? [Tuple.new([r[:domain]])] : []! end! show Domain_sample!! # SPY: alert if any extracted domain is empty or null! :Domain_missing <= With_domains.foreach(as {string :alert}) do r! (r[:domain] == "" or r[:domain] == nil)? [Tuple.new(["ALERT! Missing domain from: " + r[:url]])] : []! end! show Domain_missing!! :Yahoo_only <= With_domains.filter do r! r[:domain].match(/^yahoo/)! end!! :Grouped <= Yahoo_only.group!! :Grouped.each { r puts NUMBER OF YAHOO URLs: + r[:yahoo_only].size }!
30 RubySky Execu=on launch load client punt request response (code and/or data) domain samples empty domain alerts extract domain spy agent filter group answer (# yahoo URLs) count
31 Summary storage & processing workflow manager e.g. Nova dataflow programming framework e.g. Pig distributed sor=ng & hashing e.g. Map- Reduce scalable file system e.g. GFS Nova Pig Debugging aides: Before: example data generator During: instrumenta=on framework ABer: provenance metadata manager Dataflow Illustrator Inspector Gadget, RubySky Ibis
32 Related Work Pig: DryadLINQ, Hive, Jaql, Scope, rela;onal query languages Nova: BigTable, CBP, Oozie, Percolator, scien;fic workflow, incremental view maintenance Dataflow illustrator: [Mannila/Raiha, PODS 86], reverse query processing, constraint databases, hardware verifica;on & model checking Inspector gadget, RubySky: XTrace, taint tracking, aspect- oriented programming Ibis: Kepler COMAD, ZOOM user views, provenance management for databases & scien;fic workflows
33 Collaborators Shubham Chopra Tyson Condie Anish Das Sarma Alan Gates Pradeep Kamath Ravi Kumar Shravan Narayanamurthy Olga Natkovich Benjamin Reed Santhosh Srinivasan Utkarsh Srivastava Andrew Tomkins
Programming and Debugging Large- Scale Data Processing Workflows
Programming and Debugging Large- Scale Data Processing Workflows Christopher Olston Google Research (work done at Yahoo! Research, with many colleagues) Context Elaborate processing of large data sets
More informationHow To Write A Program On A Microsoft.Com (For Microsoft) With A Microsail (For Linux) And A Microosoft.Com Microsails (For Windows) (For Macintosh) ( For Microsils
Programming and Debugging Large- Scale Data Processing Workflows Christopher Olston Google Research (work done at Yahoo! Research, with many colleagues) What I m Working on at Google We ve got fantasjc
More informationweb-scale data processing Christopher Olston and many others Yahoo! Research
web-scale data processing Christopher Olston and many others Yahoo! Research Motivation Projects increasingly revolve around analysis of big data sets Extracting structured data, e.g. face detection Understanding
More informationSystems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13
Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Hadoop Ecosystem Overview of this Lecture Module Background Google MapReduce The Hadoop Ecosystem Core components: Hadoop
More informationSystems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de
Systems Engineering II Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de About me! Since May 2015 2015 2012 Research Group Leader cfaed, TU Dresden PhD Student MPI- SWS Research Intern Microsoft
More informationDistributed Calculus with Hadoop MapReduce inside Orange Search Engine. mardi 3 juillet 12
Distributed Calculus with Hadoop MapReduce inside Orange Search Engine What is Big Data? $ 5 billions (2012) to $ 50 billions (by 2017) Forbes «Big Data is the new definitive source of competitive advantage
More informationData processing goes big
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
More informationBig Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich
Big Data Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Google MapReduce The Hadoop
More informationBig Data Technology Pig: Query Language atop Map-Reduce
Big Data Technology Pig: Query Language atop Map-Reduce Eshcar Hillel Yahoo! Ronny Lempel Outbrain *Based on slides by Edward Bortnikov & Ronny Lempel Roadmap Previous class MR Implementation This class
More informationInspector Gadget: A Framework for Custom Monitoring and Debugging of Distributed Dataflows
Inspector Gadget: A Framework for Custom Monitoring and Debugging of Distributed Dataflows Christopher Olston and Benjamin Reed Yahoo! Research ABSTRACT We consider how to monitor and debug query processing
More informationHadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
More informationChameleon: The Performance Tuning Tool for MapReduce Query Processing Systems
paper:38 Chameleon: The Performance Tuning Tool for MapReduce Query Processing Systems Edson Ramiro Lucsa Filho 1, Ivan Luiz Picoli 2, Eduardo Cunha de Almeida 2, Yves Le Traon 1 1 University of Luxembourg
More informationHadoop Job Oriented Training Agenda
1 Hadoop Job Oriented Training Agenda Kapil CK hdpguru@gmail.com Module 1 M o d u l e 1 Understanding Hadoop This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform. 1.1 Module
More informationData and Algorithms of the Web: MapReduce
Data and Algorithms of the Web: MapReduce Mauro Sozio May 13, 2014 Mauro Sozio (Telecom Paristech) Data and Algorithms of the Web: MapReduce May 13, 2014 1 / 39 Outline 1 MapReduce Introduction MapReduce
More informationA Brief Introduction to Apache Tez
A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value
More informationCloud Computing. Up until now
Cloud Computing Lecture 19 Cloud Programming 2011-2012 Up until now Introduction, Definition of Cloud Computing Pre-Cloud Large Scale Computing: Grid Computing Content Distribution Networks Cycle-Sharing
More informationPerformance Management in Big Data Applica6ons. Michael Kopp, Technology Strategist @mikopp
Performance Management in Big Data Applica6ons Michael Kopp, Technology Strategist NoSQL: High Volume/Low Latency DBs Web Java Key Challenges 1) Even Distribu6on 2) Correct Schema and Access paperns 3)
More informationProgramming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview
Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce
More informationRecap. CSE 486/586 Distributed Systems Data Analytics. Example 1: Scientific Data. Two Questions We ll Answer. Data Analytics. Example 2: Web Data C 1
ecap Distributed Systems Data Analytics Steve Ko Computer Sciences and Engineering University at Buffalo PC enables programmers to call functions in remote processes. IDL (Interface Definition Language)
More informationHive Interview Questions
HADOOPEXAM LEARNING RESOURCES Hive Interview Questions www.hadoopexam.com Please visit www.hadoopexam.com for various resources for BigData/Hadoop/Cassandra/MongoDB/Node.js/Scala etc. 1 Professional Training
More informationWorkshop on Hadoop with Big Data
Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly
More informationAnalysis of Web Archives. Vinay Goel Senior Data Engineer
Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner
More informationBIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig
BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig Contents Acknowledgements... 1 Introduction to Hive and Pig... 2 Setup... 2 Exercise 1 Load Avro data into HDFS... 2 Exercise 2 Define an
More informationApache Hadoop: The Pla/orm for Big Data. Amr Awadallah CTO, Founder, Cloudera, Inc. aaa@cloudera.com, twicer: @awadallah
Apache Hadoop: The Pla/orm for Big Data Amr Awadallah CTO, Founder, Cloudera, Inc. aaa@cloudera.com, twicer: @awadallah 1 The Problems with Current Data Systems BI Reports + Interac7ve Apps RDBMS (aggregated
More informationCSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing
More informationCloud Computing at Google. Architecture
Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale
More informationBig Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
More informationBig Data Too Big To Ignore
Big Data Too Big To Ignore Geert! Big Data Consultant and Manager! Currently finishing a 3 rd Big Data project! IBM & Cloudera Certified! IBM & Microsoft Big Data Partner 2 Agenda! Defining Big Data! Introduction
More informationBig Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park
Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable
More informationData Management in the Cloud
Data Management in the Cloud Ryan Stern stern@cs.colostate.edu : Advanced Topics in Distributed Systems Department of Computer Science Colorado State University Outline Today Microsoft Cloud SQL Server
More informationCOSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015
COSC 6397 Big Data Analytics 2 nd homework assignment Pig and Hive Edgar Gabriel Spring 2015 2 nd Homework Rules Each student should deliver Source code (.java files) Documentation (.pdf,.doc,.tex or.txt
More informationIbis: Scaling Python Analy=cs on Hadoop and Impala
Ibis: Scaling Python Analy=cs on Hadoop and Impala Wes McKinney, Budapest BI Forum 2015-10- 14 @wesmckinn 1 Me R&D at Cloudera Serial creator of structured data tools / user interfaces Mathema=cian MIT
More informationSOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera
SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP Eva Andreasson Cloudera Most FAQ: Super-Quick Overview! The Apache Hadoop Ecosystem a Zoo! Oozie ZooKeeper Hue Impala Solr Hive Pig Mahout HBase MapReduce
More informationStratoSphere Above the Clouds
Stratosphere Parallel Analytics in the Cloud beyond Map/Reduce 14th International Workshop on High Performance Transaction Systems (HPTS) Poster Sessions, Mon Oct 24 2011 Thomas Bodner T.U. Berlin StratoSphere
More informationHadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
More informationIntroduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University 2014-05-12
Introduction to NoSQL Databases and MapReduce Tore Risch Information Technology Uppsala University 2014-05-12 What is a NoSQL Database? 1. A key/value store Basic index manager, no complete query language
More informationInternals of Hadoop Application Framework and Distributed File System
International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop
More informationBIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
More informationHadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter : @carbone
Hadoop2, Spark Big Data, real time, machine learning & use cases Cédric Carbone Twitter : @carbone Agenda Map Reduce Hadoop v1 limits Hadoop v2 and YARN Apache Spark Streaming : Spark vs Storm Machine
More informationHBase Schema Design. NoSQL Ma4ers, Cologne, April 2013. Lars George Director EMEA Services
HBase Schema Design NoSQL Ma4ers, Cologne, April 2013 Lars George Director EMEA Services About Me Director EMEA Services @ Cloudera ConsulFng on Hadoop projects (everywhere) Apache Commi4er HBase and Whirr
More informationW H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract
W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the
More informationImplement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
More informationLecture 10: HBase! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl
Big Data Processing, 2014/15 Lecture 10: HBase!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the
More informationHadoop Introduction. Olivier Renault Solution Engineer - Hortonworks
Hadoop Introduction Olivier Renault Solution Engineer - Hortonworks Hortonworks A Brief History of Apache Hadoop Apache Project Established Yahoo! begins to Operate at scale Hortonworks Data Platform 2013
More informationInvestigating Hadoop for Large Spatiotemporal Processing Tasks
Investigating Hadoop for Large Spatiotemporal Processing Tasks David Strohschein dstrohschein@cga.harvard.edu Stephen Mcdonald stephenmcdonald@cga.harvard.edu Benjamin Lewis blewis@cga.harvard.edu Weihe
More informationThe Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org
The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache
More informationHow To Analyze Log Files In A Web Application On A Hadoop Mapreduce System
Analyzing Web Application Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environment Sayalee Narkhede Department of Information Technology Maharashtra Institute
More informationESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
More informationHadoop Pig. Introduction Basic. Exercise
Your Name Hadoop Pig Introduction Basic Exercise A set of files A database A single file Modern systems have to deal with far more data than was the case in the past Yahoo : over 170PB of data Facebook
More informationA STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)
More informationIntroduc)on to. Eric Nagler 11/15/11
Introduc)on to Eric Nagler 11/15/11 What is Oozie? Oozie is a workflow scheduler for Hadoop Originally, designed at Yahoo! for their complex search engine workflows Now it is an open- source Apache incubator
More informationHow To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI
More informationbrief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS...253 PART 4 BEYOND MAPREDUCE...385
brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 1 Hadoop in a heartbeat 3 2 Introduction to YARN 22 PART 2 DATA LOGISTICS...59 3 Data serialization working with text and beyond 61 4 Organizing and
More informationBig Data Technology CS 236620, Technion, Spring 2013
Big Data Technology CS 236620, Technion, Spring 2013 Structured Databases atop Map-Reduce Edward Bortnikov & Ronny Lempel Yahoo! Labs, Haifa Roadmap Previous class MR Implementation This class Query Languages
More informationChapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
More informationMoving From Hadoop to Spark
+ Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com sujee@elephantscale.com Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee
More informationBig Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies
Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies Big Data: Global Digital Data Growth Growing leaps and bounds by 40+% Year over Year! 2009 =.8 Zetabytes =.08
More informationOnline Content Optimization Using Hadoop. Jyoti Ahuja Dec 20 2011
Online Content Optimization Using Hadoop Jyoti Ahuja Dec 20 2011 What do we do? Deliver right CONTENT to the right USER at the right TIME o Effectively and pro-actively learn from user interactions with
More informationComparing High Level MapReduce Query Languages
Comparing High Level MapReduce Query Languages R.J. Stewart, P.W. Trinder, and H-W. Loidl Mathematical And Computer Sciences Heriot Watt University *DRAFT VERSION* Abstract. The MapReduce parallel computational
More informationUnified Batch & Stream Processing Platform
Unified Batch & Stream Processing Platform Himanshu Bari Director Product Management Most Big Data Use Cases Are About Improving/Re-write EXISTING solutions To KNOWN problems Current Solutions Were Built
More informationMapReduce everywhere. Carsten Hufe & Michael Hausenblas
MapReduce everywhere Carsten Hufe & Michael Hausenblas About Carsten Hufe Big Data Consultant at comsysto Vodafone Telefonica / o2 Payback Hadoop Ecosystem, Distributed systems Committer of JumboDB Twitter
More informationApache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack
Apache Spark Document Analysis Course (Fall 2015 - Scott Sanner) Zahra Iman Some slides from (Matei Zaharia, UC Berkeley / MIT& Harold Liu) Reminder SparkConf JavaSpark RDD: Resilient Distributed Datasets
More informationBig Data and Scripting Systems build on top of Hadoop
Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform Pig is the name of the system Pig Latin is the provided programming language Pig Latin is
More informationGraph Mining on Big Data System. Presented by Hefu Chai, Rui Zhang, Jian Fang
Graph Mining on Big Data System Presented by Hefu Chai, Rui Zhang, Jian Fang Outline * Overview * Approaches & Environment * Results * Observations * Notes * Conclusion Overview * What we have done? *
More informationA Tour of the Zoo the Hadoop Ecosystem Prafulla Wani
A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani Technical Architect - Big Data Syntel Agenda Welcome to the Zoo! Evolution Timeline Traditional BI/DW Architecture Where Hadoop Fits In 2 Welcome to
More informationSpark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY
Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person
More informationHadoop. http://hadoop.apache.org/ Sunday, November 25, 12
Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using
More informationPig vs Hive: Benchmarking High Level Query Languages
Pig vs Hive: Benchmarking High Level Query Languages Benjamin Jakobus IBM, Ireland Dr. Peter McBrien Imperial College London, UK Abstract This article presents benchmarking results 1 of two benchmarking
More informationHadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook
Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future
More informationBIG DATA SERIES: HADOOP DEVELOPER TRAINING PROGRAM. An Overview
BIG DATA SERIES: HADOOP DEVELOPER TRAINING PROGRAM An Overview Contents Contents... 1 BIG DATA SERIES: HADOOP DEVELOPER TRAINING PROGRAM... 1 Program Overview... 4 Curriculum... 5 Module 1: Big Data: Hadoop
More informationAmerican International Journal of Research in Science, Technology, Engineering & Mathematics
American International Journal of Research in Science, Technology, Engineering & Mathematics Available online at http://www.iasir.net ISSN (Print): 2328-3491, ISSN (Online): 2328-3580, ISSN (CD-ROM): 2328-3629
More informationGetting Started with Hadoop. Raanan Dagan Paul Tibaldi
Getting Started with Hadoop Raanan Dagan Paul Tibaldi What is Apache Hadoop? Hadoop is a platform for data storage and processing that is Scalable Fault tolerant Open source CORE HADOOP COMPONENTS Hadoop
More informationHow to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning
How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume
More informationManifest for Big Data Pig, Hive & Jaql
Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,
More informationHow To Use A Webmail On A Pc Or Macodeo.Com
Big data workloads and real-world data sets Gang Lu Institute of Computing Technology, Chinese Academy of Sciences BigDataBench Tutorial MICRO 2014 Cambridge, UK INSTITUTE OF COMPUTING TECHNOLOGY 1 Five
More informationBest Practices for Hadoop Data Analysis with Tableau
Best Practices for Hadoop Data Analysis with Tableau September 2013 2013 Hortonworks Inc. http:// Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Apache Hadoop with Hortonworks
More informationDATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
More informationHadoop & Spark Using Amazon EMR
Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?
More informationHadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
More informationINTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
More informationBSC vision on Big Data and extreme scale computing
BSC vision on Big Data and extreme scale computing Jesus Labarta, Eduard Ayguade,, Fabrizio Gagliardi, Rosa M. Badia, Toni Cortes, Jordi Torres, Adrian Cristal, Osman Unsal, David Carrera, Yolanda Becerra,
More informationArchitectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
More informationBIG DATA - HADOOP PROFESSIONAL amron
0 Training Details Course Duration: 30-35 hours training + assignments + actual project based case studies Training Materials: All attendees will receive: Assignment after each module, video recording
More informationApache Hadoop. Alexandru Costan
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
More informationBig Data Analytics Platform @ Nokia
Big Data Analytics Platform @ Nokia 1 Selecting the Right Tool for the Right Workload Yekesa Kosuru Nokia Location & Commerce Strata + Hadoop World NY - Oct 25, 2012 Agenda Big Data Analytics Platform
More informationIntroduction to Big Data! with Apache Spark" UC#BERKELEY#
Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"
More informationBig Data and Scripting Systems build on top of Hadoop
Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform interactive execution of map reduce jobs Pig is the name of the system Pig Latin is the
More informationApplication Development. A Paradigm Shift
Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the
More informationBig Data and Hadoop for the Executive A Reference Guide
Big Data and Hadoop for the Executive A Reference Guide Overview The amount of information being collected by companies today is incredible. Wal- Mart has 460 terabytes of data, which, according to the
More informationMySQL and Hadoop. Percona Live 2014 Chris Schneider
MySQL and Hadoop Percona Live 2014 Chris Schneider About Me Chris Schneider, Database Architect @ Groupon Spent the last 10 years building MySQL architecture for multiple companies Worked with Hadoop for
More informationHadoop implementation of MapReduce computational model. Ján Vaňo
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
More informationSpark and Shark. High- Speed In- Memory Analytics over Hadoop and Hive Data
Spark and Shark High- Speed In- Memory Analytics over Hadoop and Hive Data Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Cliff Engle, Michael Franklin, Haoyuan Li,
More informationCOURSE CONTENT Big Data and Hadoop Training
COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop
More informationDistributed Aggregation in Cloud Databases. By: Aparna Tiwari tiwaria@umail.iu.edu
Distributed Aggregation in Cloud Databases By: Aparna Tiwari tiwaria@umail.iu.edu ABSTRACT Data intensive applications rely heavily on aggregation functions for extraction of data according to user requirements.
More informationBIG DATA: FROM HYPE TO REALITY. Leandro Ruiz Presales Partner for C&LA Teradata
BIG DATA: FROM HYPE TO REALITY Leandro Ruiz Presales Partner for C&LA Teradata Evolution in The Use of Information Action s ACTIVATING MAKE it happen! Insights OPERATIONALIZING WHAT IS happening now? PREDICTING
More informationCloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu
Lecture 4 Introduction to Hadoop & GAE Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu Outline Introduction to Hadoop The Hadoop ecosystem Related projects
More informationParallel Computing: Strategies and Implications. Dori Exterman CTO IncrediBuild.
Parallel Computing: Strategies and Implications Dori Exterman CTO IncrediBuild. In this session we will discuss Multi-threaded vs. Multi-Process Choosing between Multi-Core or Multi- Threaded development
More informationBig Data With Hadoop
With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
More informationOpen source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
More information