Big Data Technology Pig: Query Language atop Map-Reduce


Big Data Technology Pig: Query Language atop Map-Reduce
Eshcar Hillel (Yahoo!), Ronny Lempel (Outbrain)
*Based on slides by Edward Bortnikov & Ronny Lempel

Roadmap
Previous class: MR implementation
This class: query languages atop MR, and beyond MR
SQL primitives and their implementation
Pig: a procedural query language

Pig for Hadoop http://hortonworks.com/hdp/

Map-Reduce Critique
Too low-level: only engineers can program it
Many common operations must be hand-coded
Data flows are extremely rigid (linear)
This is exactly what SQL databases were designed for! Can we reuse some of the good stuff?

Relational Databases in a Nutshell
The data is structured: it reflects entities and relationships (ERD = Entity-Relationship Diagram), captured by a schema
Data set = set of tables
Table = set of tuples (rows)
Row = set of items (columns, attributes), typically uniquely addressable by a primary key
Relational algebra is the theoretical foundation for relational databases

SQL (Structured Query Language)
Declarative programming language: focus on what you want, not how to retrieve it
Flexible, great for non-engineers
Designed for relational databases: SELECT (read), INSERT/UPDATE/DELETE (write)
Costs: administration (schema management), query optimization (by the compiler or manual)

SQL Primitives
Project: SELECT url FROM pages
Filter (select): SELECT url FROM pages WHERE pagerank > 0.95

SQL Primitives
Join (⋈):
SELECT visits.user, pages.category FROM visits, pages WHERE visits.url = pages.url
With table aliases:
SELECT p1.header, p2.header FROM pages p1, pages p2, links l1, links l2 WHERE p1.url = l1.from AND p2.url = l2.to AND l1.id = l2.id

SQL Primitives
Aggregation: SELECT url, COUNT(url) FROM visits GROUP BY url HAVING COUNT(url) > 1000
Aggregate functions: COUNT, SUM, AVG, STDDEV, MIN, MAX

SQL Primitives
Sort: SELECT cookie, query FROM querylog WHERE date < '1/1/2013' AND date > '1/12/2012' ORDER BY cookie, date

SQL Execution
Query plan = compiler-generated program: operators + flow of control
Operators receive and return tuple sets

Query Plan - Aggregation
SELECT url, COUNT(url) FROM visits GROUP BY url HAVING COUNT(url) > 1000
Plan (bottom-up): Load visits -> Group by url -> Aggregate by count -> Filter by count
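To make the operator pipeline concrete, here is a minimal sketch (not a real query engine) that chains the plan's four operators as Python generators over hypothetical visit tuples; each operator consumes and produces a tuple stream, mirroring Load -> Group by url -> Aggregate by count -> Filter by count:

```python
# Sketch: the aggregation plan as a chain of operators over hypothetical data.
from itertools import groupby

def load(table):
    yield from table                       # Load: emit raw tuples

def group_by_url(rows):
    # Group: sort on the grouping key, then emit (key, group) pairs.
    for url, grp in groupby(sorted(rows, key=lambda r: r[1]), key=lambda r: r[1]):
        yield url, list(grp)

def aggregate_count(groups):
    for url, grp in groups:                # Aggregate: count per group
        yield url, len(grp)

def filter_by_count(counts, threshold):
    for url, n in counts:                  # Filter: the HAVING clause
        if n > threshold:
            yield url, n

visits = [("amy", "cnn.com"), ("amy", "bbc.com"), ("fred", "cnn.com")]
plan = filter_by_count(aggregate_count(group_by_url(load(visits))), 1)
print(list(plan))  # [('cnn.com', 2)]
```

Because each stage is a generator, tuples flow through the chain on demand, which is the pipelining idea the dataflow slides describe.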

Query Plan - Join
SELECT p1.header, p2.header FROM pages p1, pages p2, links l1, links l2 WHERE p1.url = l1.from AND p2.url = l2.to AND l1.id = l2.id
Plan (bottom-up): Load pages and links (twice each) -> Join l1.from = p1.url, Join l2.to = p2.url -> Join l1.id = l2.id

Dataflow Architectures
Query plan = DAG: nodes = relational operators, links = queues
Performance boosted through data parallelism, compute parallelism, and pipelining

Query Optimization
Multiple plans can be logically equivalent: relational algebra allows operator re-ordering, and multiple operator implementations are possible
Goal: pick a plan that minimizes query latency (minimize I/O, communication, and computation)
No-brainer optimizations: push the filters deep
Nontrivial optimizations: order of joins
Challenge: exponential-size search space
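A toy illustration of the "push the filters deep" rewrite, using hypothetical pages/visits tables and a naive nested-loop join: both plans are logically equivalent, but the pushed-down plan joins against a smaller relation:

```python
# Hypothetical tables: (url, pagerank) and (user, url).
pages  = [("cnn.com", 0.9), ("bbc.com", 0.8), ("x.com", 0.1)]
visits = [("amy", "cnn.com"), ("fred", "x.com")]

def join(left, right):
    # Naive nested-loop equi-join on the url attribute.
    return [(u, url, pr) for (u, url) in left for (url2, pr) in right if url == url2]

# Plan A: join first, then filter on pagerank.
plan_a = [t for t in join(visits, pages) if t[2] > 0.5]

# Plan B: filter pages first, then join the smaller relation.
hot_pages = [p for p in pages if p[1] > 0.5]
plan_b = join(visits, hot_pages)

print(plan_a == plan_b)  # True: same result, but plan B joins fewer tuples
```

The optimizer's job is to recognize such equivalences and pick the cheaper plan; with many joins, the number of equivalent orderings grows exponentially.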

SQL versus Map-Reduce
SQL: good for structured data; rich declarative API (a tool for analysts); compiler-driven optimization, opaque to users
Map-Reduce: good for structured and unstructured data; simple programmatic API (a tool for engineers); users can optimize their jobs manually
Can we enjoy the best of both worlds?

Implementation atop Map-Reduce
Project, Filter: easy (how?)
Sort: for free (though not always required)
Aggregation: mostly easy (how?) But how to handle TOP-K, AVG, STDDEV?
Join? Hint: use multiple Map inputs
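As a sketch of why AVG is trickier than COUNT, here is a minimal in-memory map/reduce harness (not Hadoop) over hypothetical per-url latency records. An average of partial averages is wrong in general, so the mapper must emit (sum, count) partials that the reducer combines:

```python
# Minimal in-memory map/reduce sketch over hypothetical data.
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    groups = defaultdict(list)
    for rec in records:                      # map phase: emit (key, value) pairs
        for key, value in mapper(rec):
            groups[key].append(value)
    # shuffle is the grouping above; reduce phase combines each group
    return {k: reducer(k, vs) for k, vs in groups.items()}

latencies = [("cnn.com", 120), ("cnn.com", 80), ("bbc.com", 50)]

# AVG: mapper emits (sum, count) partials; reducer divides total sum by total count.
avg = map_reduce(latencies,
                 mapper=lambda r: [(r[0], (r[1], 1))],
                 reducer=lambda k, vs: sum(s for s, _ in vs) / sum(c for _, c in vs))
print(avg)  # {'cnn.com': 100.0, 'bbc.com': 50.0}
```

The same trick generalizes: STDDEV needs (sum, sum of squares, count) partials, and TOP-K needs each mapper to emit its local top K so the reducer can merge them.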

Join over MR (1)
SELECT L.x, R.y FROM L, R WHERE L.key = R.key
Left mappers (M-Left) read L; right mappers (M-Right) read R
Each mapper tags its records with the source relation (L or R) and partitions them by the join key; the framework sorts by key
Each reducer then receives, per key, the matching L and R records and pairs them
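The reduce-side join above can be sketched in a few lines of Python (no Hadoop, hypothetical data): records from both inputs are bucketed by the join key, standing in for partition + sort, and the "reducer" pairs left and right records per key:

```python
# Sketch of a reduce-side (repartition) join over hypothetical inputs.
from collections import defaultdict

def reduce_side_join(left, right, l_key, r_key):
    buckets = defaultdict(lambda: ([], []))
    for rec in left:                     # "M-Left": route L records by join key
        buckets[rec[l_key]][0].append(rec)
    for rec in right:                    # "M-Right": route R records by join key
        buckets[rec[r_key]][1].append(rec)
    for key in sorted(buckets):          # reduce: cross-product of L and R per key
        ls, rs = buckets[key]
        for l in ls:
            for r in rs:
                yield l + r

visits = [("amy", "cnn.com"), ("fred", "cnn.com")]
pages  = [("cnn.com", "News")]
print(list(reduce_side_join(visits, pages, 1, 0)))
# [('amy', 'cnn.com', 'cnn.com', 'News'), ('fred', 'cnn.com', 'cnn.com', 'News')]
```

In real MR the two source-tagged streams arrive at the reducer already sorted by key; the separate left/right lists here play the role of that source tag.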

Join over MR (2)
Typical situation: Big ⋈ Small, e.g., Page Accesses (B) ⋈ User Details (S)
Idea: avoid the reduce phase altogether
Approach: replicate S to all mappers, using shared files (distributed through the MR cache)
At each mapper: store S in RAM, hashed by the join key; scan the B-partition and compute matches per record
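A single mapper's work in this map-side scheme can be sketched as follows (hypothetical data; in real MR each mapper would run this against its own partition of the big input):

```python
# Sketch of a map-side (broadcast) join: hash the small relation in RAM,
# stream the big relation past it, no reduce phase needed.
def broadcast_join(big, small, big_key, small_key):
    # Build the in-RAM hash table of the small side (replicated to every mapper).
    lookup = {}
    for rec in small:
        lookup.setdefault(rec[small_key], []).append(rec)
    # Scan the big side's partition; emit one output per matching record.
    for rec in big:
        for match in lookup.get(rec[big_key], []):
            yield rec + match

accesses = [("amy", "cnn.com"), ("bob", "espn.com")]   # Big: page accesses
users    = [("amy", 30), ("bob", 25)]                   # Small: user details
print(list(broadcast_join(accesses, users, 0, 0)))
# [('amy', 'cnn.com', 'amy', 30), ('bob', 'espn.com', 'bob', 25)]
```

This only works while the small relation fits in each mapper's RAM; otherwise the repartition join from the previous slide is required.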

Pig Latin (7 April 2016)
Started at Yahoo! Research (2008)
Procedural (control) yet high-level (simplicity)
Operates directly on flat files; schemas are optional (defined on the fly)
Nested (non-normalized) data and operators
Operators are customizable via UDFs
Amenable to optimization

Example Analysis Task
Find the top 10 most visited pages in each category

Visits:
User | Url        | Time
Amy  | cnn.com    | 8:00
Amy  | bbc.com    | 10:00
Amy  | flickr.com | 10:05
Fred | cnn.com    | 12:00

Urls:
Url        | Category | PageRank
cnn.com    | News     | 0.9
bbc.com    | News     | 0.8
flickr.com | Photos   | 0.7
espn.com   | Sports   | 0.9

SQL Data Flow
Plan (bottom-up): Load Visits, Load Urls -> Join by url -> Group by url -> Foreach url generate count -> Group by category -> Foreach category generate top(10) (a UDF)

In Pig Latin
visits = load '/data/visits' as (user, url, time);
gvisits = group visits by url;
visitcounts = foreach gvisits generate url, count(visits);
urlinfo = load '/data/urlinfo' as (url, category, prank);
visitcounts = join visitcounts by url, urlinfo by url;
gcategories = group visitcounts by category;
topurls = foreach gcategories generate top(visitcounts,10);
store topurls into '/data/topurls';

MR Implementation
The plan compiles into a sequence of three MR jobs:
Job 1: M1 loads Visits and groups by url; R1 computes the count per url
Job 2: M2 loads Urls; R2 joins by url
Job 3: M3 groups by category; R3 generates top(10) per category
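For reference, the end-to-end result of this pipeline can be checked against a plain Python sketch of the same logic (toy data from the example slide, top 2 instead of top 10 for brevity; `top_urls_per_category` is a hypothetical helper, not part of Pig):

```python
# Sketch: the whole group/count/join/group/top pipeline in plain Python.
from collections import Counter, defaultdict

def top_urls_per_category(visits, urlinfo, n):
    counts = Counter(url for _, url, _ in visits)          # group by url + count
    category = {url: cat for url, cat, _ in urlinfo}       # join key -> category
    by_cat = defaultdict(list)
    for url, cnt in counts.items():                        # join + regroup by category
        by_cat[category[url]].append((url, cnt))
    return {cat: sorted(urls, key=lambda t: -t[1])[:n]     # top(n) per category
            for cat, urls in by_cat.items()}

visits = [("amy", "cnn.com", "8:00"), ("amy", "bbc.com", "10:00"),
          ("amy", "flickr.com", "10:05"), ("fred", "cnn.com", "12:00")]
urlinfo = [("cnn.com", "News", 0.9), ("bbc.com", "News", 0.8),
           ("flickr.com", "Photos", 0.7)]
print(top_urls_per_category(visits, urlinfo, 2))
# {'News': [('cnn.com', 2), ('bbc.com', 1)], 'Photos': [('flickr.com', 1)]}
```

What takes a dozen lines in one process costs three full MR jobs here, each writing its intermediate results to the DFS, which is exactly the overhead the next slides criticize.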

Anything Else to Do?
High-level languages improve MR's usability, but they cannot address its built-in performance problems
MR was designed for batch processing (e.g., daily production jobs, focused on throughput)
Not great for low-latency applications: high job setup overhead
A no-op 1K-task job runs for dozens of seconds even in the absence of contention!

Summary: SQL atop MR
Good for data warehouses and batch queries; the less legacy semantics, the better the scalability
Bad for interactive ad-hoc queries:
Batch-oriented (high launch overhead)
Scan-oriented; lookups are expensive
Intermediate results are materialized on the DFS
Dataflow is underexploited (limited pipelining)

Next Class Key-value store use case: user profile store

Further Reading
"A Comparison of Approaches to Large-Scale Data Analysis"
"MapReduce: A Major Step Backwards"
"Pig Latin: A Not-So-Foreign Language for Data Processing"