Recap. CSE 486/586 Distributed Systems: Data Analytics. Example 1: Scientific Data. Two Questions We'll Answer. Data Analytics. Example 2: Web Data


Transcription:

CSE 486/586 Distributed Systems: Data Analytics
Steve Ko, Computer Sciences and Engineering, University at Buffalo

Recap
- RPC enables programmers to call functions in remote processes.
- IDL (Interface Definition Language) allows programmers to define remote procedure calls.
- Stubs are used to make it appear that the call is local.
- Semantics
  - Cannot provide exactly once
  - At least once
  - At most once
  - Depends on the application requirements

Two Questions We'll Answer
- What is data analytics?
- What are the programming paradigms for it?

Example 1: Scientific Data
- CERN (European Organization for Nuclear Research) @ Geneva: Large Hadron Collider (LHC) experiment
  - 300 GB of data per second
  - 15 petabytes (15 million gigabytes) of data annually, enough to fill more than 1.7 million dual-layer DVDs a year

Example 2: Web Data
- Google
  - 20+ billion web pages
    - ~20 KB each = 400 TB
  - ~4 months to read the web
  - And growing
    - 1999 vs. 2009: ~100X
- Yahoo!
  - US Library of Congress every day (20 TB/day)
  - 2 billion photos
  - 2 billion mail + messenger sent per day
  - And growing

Data Analytics
- Computations on very large data sets
  - How large? TBs to PBs
  - Much time is spent on data moving/reading/writing
- Shift of focus
  - Used to be: computation (think supercomputers)
  - Now: data
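The web-data figures above (20+ billion pages at roughly 20 KB each, about four months to read) can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope calculation: the slides do not state a read rate, so the 35 MB/s sequential disk throughput used here is an assumption, as are decimal units and 30-day months.

```python
# Back-of-the-envelope check of the web-data estimate.
PAGES = 20e9                 # "20+ billion web pages"
PAGE_BYTES = 20e3            # "~20 KB each", using decimal kilobytes
READ_BYTES_PER_SEC = 35e6    # ASSUMPTION: one disk streaming ~35 MB/s (not in the slides)
SECONDS_PER_MONTH = 30 * 24 * 3600  # ASSUMPTION: 30-day months

total_bytes = PAGES * PAGE_BYTES
total_tb = total_bytes / 1e12
months = total_bytes / READ_BYTES_PER_SEC / SECONDS_PER_MONTH

print(f"corpus size: {total_tb:.0f} TB")                        # corpus size: 400 TB
print(f"time to read on one machine: {months:.1f} months")      # about 4.4 months
```

Under these assumptions a single machine needs a little over four months just to read the data once, which is the motivation for spreading both storage and computation over many machines.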

Popular Environment
- Environment for storing TBs ~ PBs of data
- Cluster of cheap commodity PCs
  - As we have been discussing in class
  - 1000s of servers
  - Data stored as plain files on file systems
  - Data scattered over the servers
  - Failure is the norm
- How do you process all this data?

Turn to History
- Dataflow programming
  - Data sources and operations
  - Data items go through a series of transformations using operations.
- Very popular concept
  - Many examples
  - Even CPU designs back in the '80s and '90s
  - SQL, data streaming, etc.
- Challenges
  - How to efficiently fetch data?
  - When and how to schedule different operations?
  - What if there's a failure (both for data and computation)?
- [Figure: dataflow graph of a small arithmetic expression using + and *]

Dataflow Programming
- This style of programming is now very popular with large clusters.
- Many examples
  - MapReduce, Pig, Hive, Dryad, Spark, etc.
- Two examples we'll look at
  - MapReduce and Pig

What is MapReduce?
- A system for processing large amounts of data
- Introduced by Google in 2004
- Inspired by map & reduce in Lisp
- Open-source implementation: Hadoop by Yahoo!
- Used by many, many companies
  - A9.com, AOL, Facebook, The New York Times, Last.fm, Baidu.com, Joost, Veoh, etc.

Background: Map & Reduce in Lisp
- Sum of squares of a list (in Lisp)
- (map square '(1 2 3 4))
  - Output: (1 4 9 16) [processes each record individually]
- [Figure: f applied separately to each of 1 2 3 4, giving 1 4 9 16]

Background: Map & Reduce in Lisp
- Sum of squares of a list (in Lisp)
- (reduce + '(1 4 9 16))
  - (+ 16 (+ 9 (+ 4 1)))
  - Output: 30 [processes set of all records in a batch]
- [Figure: f folded over 1 4 9 16 starting from an initial value, with running results 5, 14 and 30; 30 is returned]
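The Lisp sum-of-squares example above translates almost directly into other languages. As a minimal sketch, here is the same computation in Python using the built-in `map` and `functools.reduce`; the helper name `square` mirrors the Lisp example and the initial value `0` is a choice made for this sketch.

```python
from functools import reduce

def square(x):
    return x * x

numbers = [1, 2, 3, 4]

# map: each record is processed individually, independent of the others.
squares = list(map(square, numbers))

# reduce: all records are combined in one batch, carrying an accumulator along.
total = reduce(lambda acc, x: acc + x, squares, 0)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```

Because each `square` call depends only on its own input, the map step could run on different machines; only the reduce step needs to see all the results together.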

Background: Map & Reduce in Lisp
- Map processes each record individually
- Reduce processes (combines) set of all records in a batch

What Google People Have Noticed
- Keyword search
  - Map: Find a keyword in each web page individually, and if it is found, return the URL of the web page
  - Reduce: Combine all results (URLs) and return them
- Count of the # of occurrences of each word
  - Map: Count the # of occurrences in each web page individually, and return the list of <word, #>
  - Reduce: For each word, sum up (combine) the count
- Notice the similarities?

What Google People Have Noticed
- Lots of storage + compute cycles nearby
- Opportunity
  - Files are distributed already! (GFS)
  - A machine can process its own web pages (map)
- [Figure: cluster machines, each with its own CPU]

Google MapReduce
- Took the concept from Lisp, and applied it to large-scale data processing
- Takes two functions from a programmer (map and reduce), and performs three steps
- Map
  - Runs map for each file individually in parallel
- Shuffle
  - Collects the output from all map executions
  - Transforms the map output into the reduce input
  - Divides the map output into chunks
- Reduce
  - Runs reduce (using a map output chunk as the input) in parallel

Programmer's Point of View
- Programmer writes two functions: map() and reduce()
- The programming interface is fixed
  - map (in_key, in_value) -> list of (out_key, intermediate_value)
  - reduce (out_key, list of intermediate_value) -> (out_key, out_value)
- Caution: not exactly the same as Lisp

Inverted Indexing Example
- Word -> list of web pages containing the word
  - every -> http://m-w.com/ http://answers..
  - its -> http://itsa.org/. http://youtube
- Input: web pages
- Output: word -> urls
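To make the fixed interface and the three steps concrete, here is a minimal single-process sketch of the word-count case described above. It is not how Google's system or Hadoop is implemented; `run_mapreduce`, `map_fn`, `reduce_fn` and the two sample pages are hypothetical names and data chosen for illustration, and everything runs sequentially rather than in parallel.

```python
from collections import Counter, defaultdict

def map_fn(url, content):
    """map(in_key, in_value) -> list of (out_key, intermediate_value).
    Counts the words of one page, independent of every other page."""
    return list(Counter(content.lower().split()).items())

def reduce_fn(word, counts):
    """reduce(out_key, list of intermediate_value) -> (out_key, out_value).
    Sums the per-page counts for one word."""
    return word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map step: call map_fn once per input record.
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))
    # Shuffle step: group all intermediate values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce step: call reduce_fn once per distinct key.
    return dict(reduce_fn(key, values) for key, values in groups.items())

# Hypothetical sample input: (page name, page text).
pages = [
    ("page-a", "the cat saw the dog"),
    ("page-b", "the dog barked"),
]
word_counts = run_mapreduce(pages, map_fn, reduce_fn)
print(word_counts)
```

The programmer supplies only `map_fn` and `reduce_fn`; swapping in a different pair (for example, one that emits `(word, url)`) reuses the same driver unchanged, which is the point of keeping the interface fixed.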

Map
- Interface
  - Input: <in_key, in_value> pair => <url, content>
  - Output: list of intermediate <key, value> pairs => list of <word, url>
- Map input: <url, content>
  - key = http://url0.com, value = every happy family is alike.
  - key = http://url1.com, value = every unhappy family is unhappy in its own way.
- Map output: list of <word, url>
  - <every, http://url0.com>
  - <happy, http://url0.com>
  - <family, http://url0.com>
  - <every, http://url1.com>
  - <unhappy, http://url1.com>
  - <family, http://url1.com>

Shuffle
- MapReduce system
  - Collects outputs from all map executions
  - Groups all intermediate values by the same key
- Map output: list of <word, url>
  - <every, http://url0.com>
  - <happy, http://url0.com>
  - <family, http://url0.com>
  - <every, http://url1.com>
  - <unhappy, http://url1.com>
  - <family, http://url1.com>
- Reduce input: <word, list of urls>
  - every -> http://url0.com http://url1.com
  - happy -> http://url0.com
  - unhappy -> http://url1.com
  - family -> http://url0.com http://url1.com

Reduce
- Interface
  - Input: <out_key, list of intermediate_value>
  - Output: <out_key, out_value>
- Reduce input: <word, list of urls>
  - every -> http://url0.com http://url1.com
  - happy -> http://url0.com
  - unhappy -> http://url1.com
  - family -> http://url0.com http://url1.com
- Reduce output: <word, string of urls>
  - <every, http://url0.com, http://url1.com>
  - <happy, http://url0.com>
  - <unhappy, http://url1.com>
  - <family, http://url0.com, http://url1.com>

Execution Overview
- Map phase
- Shuffle phase
- Reduce phase

Implementing MapReduce
- Externally for user
  - Write a map function, and a reduce function
  - Submit a job; wait for result
  - No need to know anything about the environment (Google: 4000 servers + 48000 disks, many failures)
- Internally for MapReduce system designer
  - Run map in parallel
  - Shuffle: combine map results to produce reduce input
  - Run reduce in parallel
  - Deal with failures

Execution Overview
- [Figure: Master; Input Files -> Map workers -> Reduce workers -> Output. Input files are sent to map tasks; intermediate keys are partitioned into reduce tasks.]
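The inverted-indexing walk-through and the note that intermediate keys are partitioned into reduce tasks can be combined in one small sketch. The slides do not say how keys are assigned to reduce tasks; hashing the key modulo the number of reduce tasks is assumed here, and `NUM_REDUCE_TASKS`, `partition`, the second page's URL `http://url1.com`, and the use of `zlib.crc32` (chosen only because it is stable across runs) are all illustrative choices. Everything runs in one process.

```python
import zlib
from collections import defaultdict

NUM_REDUCE_TASKS = 2  # ASSUMPTION: arbitrary small number for the sketch

def map_fn(url, content):
    """<url, content> -> list of <word, url>."""
    return [(word.strip(".").lower(), url) for word in content.split()]

def partition(key, num_tasks):
    """ASSUMED scheme: hash the key, take it modulo the number of reduce tasks."""
    return zlib.crc32(key.encode("utf-8")) % num_tasks

def reduce_fn(word, urls):
    """<word, list of urls> -> <word, string of urls> (duplicates removed)."""
    return word, " ".join(sorted(set(urls)))

pages = [
    ("http://url0.com", "every happy family is alike."),
    ("http://url1.com", "every unhappy family is unhappy in its own way."),
]

# Map + shuffle: every <word, url> pair goes to exactly one reduce task,
# and pairs with the same word always land in the same task.
buckets = [defaultdict(list) for _ in range(NUM_REDUCE_TASKS)]
for url, content in pages:
    for word, page_url in map_fn(url, content):
        buckets[partition(word, NUM_REDUCE_TASKS)][word].append(page_url)

# Reduce: each bucket is an independent task and could run on its own worker.
index = {}
for bucket in buckets:
    for word, urls in bucket.items():
        key, value = reduce_fn(word, urls)
        index[key] = value

print(index["every"])    # http://url0.com http://url1.com
print(index["unhappy"])  # http://url1.com
```

Because all values for a given word end up in the same bucket, the reduce tasks never need to talk to each other, which is what allows the reduce phase to run in parallel.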

Task Assignment
- Worker pull
  1. Worker signals idle
  2. Master assigns task
  3. Task retrieves data
  4. Task executes
- [Figure: Master; Input Splits -> Map workers -> Reduce workers -> Output]

Fault-tolerance: Re-execution
- Re-execute on failure
- [Figure: Master; Input Splits -> Map workers -> Reduce workers -> Output]

Machines Share Roles
- So far, logical view of cluster
- In reality
  - Each cluster machine stores data
  - And runs MapReduce workers
- Lots of storage + compute cycles nearby

Problems of MapReduce
- Any you can think of?
- There are only two functions you can work with (not expressive enough sometimes).
- Functional style (a barrier for some people)
- Turing completeness (or computationally universal)
  - A language is Turing complete if it can simulate a single-taped Turing machine.
  - Most general languages (C/C++, Java, Lisp, etc.) are.
  - SQL is.
  - MapReduce is not.

Pig

Why Pig?
- MapReduce has limitations: only two functions
  - Many tasks require more than one MapReduce
  - Functional thinking: barrier for some
- Pig
  - Defines a set of high-level simple commands
  - Compiles the commands and generates multiple MapReduce jobs
  - Runs them in parallel

Pig Example
  visits = load '/data/visits';
  gvisits = group visits by url;
  visitcounts = foreach gvisits generate url, count(visits);
  urlinfo = load '/data/urlinfo';
  visitcounts = join visitcounts by url, urlinfo by url;
  gcategories = group visitcounts by category;
  foreach gcategories generate top(visitcounts,10);
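To show what the Pig example computes, here is a small in-memory Python sketch of the same pipeline: count visits per URL, join the counts with URL categories, group by category, and keep the top entries per category. It says nothing about how Pig compiles or executes the script. The sample `visits` and `urlinfo` data, the function name `top_urls_by_category`, and the tie-breaking rule (alphabetical by URL) are all made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical stand-ins for the contents of /data/visits and /data/urlinfo.
visits = [
    ("amy", "cnn.com"), ("bob", "cnn.com"), ("amy", "bbc.com"),
    ("cat", "espn.com"), ("dan", "espn.com"), ("eve", "espn.com"),
    ("fay", "nba.com"),
]
urlinfo = {
    "cnn.com": "news", "bbc.com": "news",
    "espn.com": "sports", "nba.com": "sports",
}

def top_urls_by_category(visits, urlinfo, n):
    # group visits by url; foreach group generate url, count(visits)
    visitcounts = Counter(url for _user, url in visits)
    # join visitcounts by url, urlinfo by url (inner join), then group by category
    gcategories = defaultdict(list)
    for url, count in visitcounts.items():
        if url in urlinfo:
            gcategories[urlinfo[url]].append((url, count))
    # foreach category generate top(visitcounts, n): highest counts first
    return {
        category: sorted(rows, key=lambda row: (-row[1], row[0]))[:n]
        for category, rows in gcategories.items()
    }

top = top_urls_by_category(visits, urlinfo, n=1)
print(top)  # {'news': [('cnn.com', 2)], 'sports': [('espn.com', 3)]}
```

Written directly against map and reduce, the same logic needs several chained jobs (a count, a join, and a per-category top-n), which is the gap the high-level commands are meant to close.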

Pig Example
- [Figure: the Pig example compiled into three MapReduce jobs (Map 1 / Reduce 1, Map 2 / Reduce 2, Map 3 / Reduce 3) covering the steps Load Visits, Group by url, Foreach url generate count, Load Url Info, Join on url, Group by category, Foreach category generate top10(urls)]

Summary
- Data analytics shifts the focus from computation to data.
- Many programming paradigms are emerging.
  - MapReduce
  - Pig
  - Many others

More Details
- Papers
  - J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004
  - C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, Pig Latin: A Not-So-Foreign Language For Data Processing, SIGMOD 2008
- URLs
  - http://hadoop.apache.org/core/
  - http://wiki.apache.org/hadoop/
  - http://hadoop.apache.org/pig/
  - http://wiki.apache.org/pig/
- Slides
  - http://labs.google.com/papers/mapreduce-osdi04-slides/index.html
  - http://www.systems.ethz.ch/education/past-courses/hs08/mapreduce/slides/intro.pdf
  - http://www.cs.uiuc.edu/class/sp09/cs525/l4tmp.b.ppt
  - http://infolab.stanford.edu/~usriv/talks/sigmod08-pig-latin.ppt

Acknowledgements
- These slides contain material developed and copyrighted by Indranil Gupta (UIUC).