LANGUAGES FOR HADOOP: PIG & HIVE


Slide 1: LANGUAGES FOR HADOOP: PIG & HIVE
Michail Michailidis & Patrick Maiden

Slide 2: Motivation
- Native MapReduce
  - Gives fine-grained control over how the program interacts with the data
  - Not very reusable
  - Can be arduous for simple tasks
- Last week: the general Hadoop framework using AWS
  - Does not allow for easy data manipulation
  - Everything must be handled in the map() function
- Some use cases are best handled by a system that sits between these two

Slide 3: Why Declarative Languages?
- In most database systems, a declarative language is used (i.e. SQL)
- Data independence: user applications cannot change the organization of the data
- Schema: the structure of the data
- Allows code for queries to be much more concise; the user only cares about the part of the data he wants

SQL:
SELECT count(*) FROM word_frequency WHERE word = 'the';

Java-esque:
int countThe = 0;
for (String word : words) {
    if (word.equals("the")) {
        countThe++;
    }
}
return countThe;
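For comparison with the one-line SQL query, here is a minimal, self-contained version of the Java-esque fragment above; the sample word list is hypothetical and only illustrates the verbosity contrast.

import java.util.Arrays;
import java.util.List;

public class CountThe {
    // Count occurrences of "the" in a list of words.
    static int countThe(List<String> words) {
        int count = 0;
        for (String word : words) {
            if (word.equals("the")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("the", "quick", "fox", "jumps", "over", "the", "dog");
        System.out.println(countThe(words)); // prints 2
    }
}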

Slide 4: Native MapReduce - Wordcount
In native MapReduce, simple tasks can be a hassle to code:

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}

source: http://wiki.apache.org/hadoop/WordCount

Slide 5: Pig (Latin)
- Developed by Yahoo! around 2006
- Now maintained by Apache
- Pig Latin: the language; Pig: the system, implemented on Hadoop
(Image source: ebiquity.umbc.edu)
Citation note: many of the examples are pulled from the research paper on Pig Latin

Slide 6: Pig Latin - Language Overview
- Pig Latin is the language of the Pig framework
- Data-flow language: stylistically between declarative and procedural
- Allows for easy integration of user-defined functions (UDFs)
  - First-class citizens
- All primitives can be parallelized
- Debugger

Slide 7: Pig Latin - Wordcount
Wordcount in Pig Latin is significantly simpler:

input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';

source: http://en.wikipedia.org/wiki/Pig_(programming_tool)

Slide 8: Pig Latin - Data Model
- Atom: 'providence'
- Tuple: ('providence', 'apple')
- Bag: { ('providence', 'apple'), ('providence', ('javascript', 'red')) }
- Map: [ 'goes to' → { ('school'), ('new york') }, 'gender' → 'female' ]
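These same kinds of values are what a Java UDF sees at runtime. The following is a minimal, hypothetical sketch (not from the slides) that builds the slide's example values with the org.apache.pig.data classes; it assumes the Pig jar is on the classpath and exists only to show how atoms, tuples, bags, and maps surface in the Java API.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class DataModelSketch {
    public static void main(String[] args) throws Exception {
        TupleFactory tf = TupleFactory.getInstance();
        BagFactory bf = BagFactory.getInstance();

        // Atom: a simple value such as a string.
        String atom = "providence";

        // Tuple: an ordered sequence of fields, possibly nested.
        Tuple flat = tf.newTuple(2);
        flat.set(0, "providence");
        flat.set(1, "apple");

        Tuple nested = tf.newTuple(2);
        nested.set(0, "providence");
        nested.set(1, tf.newTuple(Arrays.asList("javascript", "red")));

        // Bag: an unordered collection of tuples.
        DataBag bag = bf.newDefaultBag();
        bag.add(flat);
        bag.add(nested);

        // Map: string keys mapped to arbitrary values (here a bag and an atom).
        Map<String, Object> map = new HashMap<>();
        map.put("goes to", bag);
        map.put("gender", "female");

        System.out.println(atom + " | " + flat + " | " + bag + " | " + map);
    }
}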

Slide 9: Pig - Under the Hood
- Parsing/validation
- Logical planning
  - Optimizes data storage: bags are only made when necessary
- Compiled into MapReduce jobs
  - The current implementation of Pig uses Hadoop
[Figure: a chain of Pig operators (load, filter, group, cogroup, ..., load) mapped onto a sequence of map/reduce stages (map_1/reduce_1, ..., map_i/reduce_i, map_i+1/reduce_i+1)]

Slide 10: Pig Latin - Key Commands
- LOAD
  - Specifies the input format; does not actually load data into tables
  - Can specify a schema or use positional notation ($0 for the first field, and so on)
    queries = LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp);
- FOREACH
  - Allows iteration over tuples
    expanded_queries = FOREACH queries GENERATE userId, expandQuery(queryString);
- FILTER
  - Returns only data that passes the condition
    real_queries = FILTER queries BY userId neq 'bot';

Slide 11: Pig Latin - Key Commands (2)
- COGROUP/GROUP: preserves the nested structure
- JOIN: normal equi-join; flattens the data

grouped_data = COGROUP results BY queryString, revenue BY queryString;
join_result = JOIN results BY queryString, revenue BY queryString;

results: (queryString, url, rank)
(lakers, nba.com, 1)
(lakers, espn.com, 2)
(kings, nhl.com, 1)
(kings, nba.com, 2)

revenue: (queryString, adSlot, amount)
(lakers, top, 50)
(lakers, side, 20)
(kings, top, 30)
(kings, side, 10)

COGROUP grouped_data:
(lakers, { (lakers, nba.com, 1), (lakers, espn.com, 2) }, { (lakers, top, 50), (lakers, side, 20) })
(kings, { (kings, nhl.com, 1), (kings, nba.com, 2) }, { (kings, top, 30), (kings, side, 10) })

JOIN join_result:
(lakers, nba.com, 1, top, 50)
(lakers, espn.com, 2, side, 20)
(kings, nhl.com, 1, top, 30)
(kings, nba.com, 2, side, 10)

Slide 12: Pig Latin - Key Commands (3)
Some other commands:
- UNION: union of 2+ bags
- CROSS: cross product of 2+ bags
- ORDER: sorts by a certain field
- DISTINCT: removes duplicates
- STORE: outputs data

Commands can be nested:

grouped_revenue = GROUP revenue BY queryString;
query_revenues = FOREACH grouped_revenue {
    top_slot = FILTER revenue BY adSlot eq 'top';
    GENERATE queryString, SUM(top_slot.amount), SUM(revenue.amount);
};

Slide 13: Pig Latin - MapReduce Example
- The current implementation of Pig compiles into MapReduce jobs
- However, if the workflow itself requires a MapReduce structure, it is extremely easy to express in Pig Latin:

map_result = FOREACH input GENERATE FLATTEN(map(*));
key_groups = GROUP map_result BY $0;
output = FOREACH key_groups GENERATE reduce(*);

- MapReduce in 3 lines
- Any UDF can be used for map() and reduce() here

Slide 14: Pig Latin - UDF Support
- Pig Latin currently supports UDFs in the following languages:
  - Java
  - Python
  - JavaScript
  - Ruby
- Piggy Bank: a repository of Java UDFs (OSS)
- Possibility for more support
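To make the Java case concrete, here is a minimal sketch of an EvalFunc-based UDF; the class name ExpandQuery and its trivial "expansion" logic are hypothetical stand-ins for the expandQuery(queryString) call used on slide 10, not code from the presentation.

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A toy eval function: takes a chararray field and returns a normalized copy.
public class ExpandQuery extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        // Placeholder "expansion": trim and lower-case the query string.
        return ((String) input.get(0)).trim().toLowerCase();
    }
}

Once the jar containing the class is made available with REGISTER, the function is called from FOREACH ... GENERATE like any built-in.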

Slide 15: Pig Pen - Debugger
- Debugging big data can be tricky: programs take a long time to run
- Pig provides debugging tools that generate a sandboxed data set
  - Generates a small dataset that is representative of the full one
(Image source: http://infolab.stanford.edu/~usriv/papers/pig-latin.pdf)

Slide 16: Hive
- Development by Facebook started in early 2007; open-sourced in 2008
- Uses HiveQL as its language
- Why was it built? Hadoop lacked the expressiveness of languages like SQL
- Important: it is not a database! Queries take many minutes

Slide 17: HiveQL - Language Overview
Subset of SQL with some deviations:
- Only equality predicates are supported for joins
- Inserts overwrite existing data (INSERT OVERWRITE TABLE t1 ...)
- INSERT INTO, UPDATE, DELETE are not supported → allows simpler locking mechanisms
- FROM can come before SELECT, for multiple insertions
- Very expressive for complex logic in terms of map-reduce (MAP / REDUCE clauses)
- Arbitrary data format insertion is provided through extensions in many languages
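One such extension point is a user-defined function written in Java. The sketch below is hypothetical and uses the classic org.apache.hadoop.hive.ql.exec.UDF API with an evaluate() method (it simply mirrors the built-in lower()); custom SerDes or TRANSFORM scripts are the route for genuinely arbitrary data formats.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A toy HiveQL function; usable as to_lower(col) after CREATE TEMPORARY FUNCTION.
public final class ToLower extends UDF {
    public Text evaluate(final Text s) {
        if (s == null) {
            return null;
        }
        return new Text(s.toString().toLowerCase());
    }
}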

Slide 18: Hive - Wordcount

CREATE EXTERNAL TABLE lines (line STRING);

LOAD DATA INPATH 'books' OVERWRITE INTO TABLE lines;

SELECT word, count(*) FROM lines
LATERAL VIEW explode(split(line, ' ')) lTable AS word
GROUP BY word;

Source: http://blog.gopivotal.com/products/hadoop-101-programming-mapreduce-with-native-libraries-hive-pig-and-cascading

Slide 19: Hive - Type System
- Basic types: integers, floats, strings
- Complex types:
  - Associative arrays: map<key-type, value-type>
  - Lists: list<element-type>
  - Structs: struct<field-name: field-type, ...>
- Arbitrarily complex types can be composed:
  list< map< string, struct< p1:int, p2:int > > >
  represents a list of associative arrays that map strings to structs that contain two ints
- Access with the dot operator

Slide 20: Hive - Data Model and Storage
- Tables (directories)
- Partitions (directories), e.g. partitioned by date, time
- Buckets (files), used for subsampling

Directory structure:
/root-path
  /table1
    /partition1 (2011-11)
      /bucket1 (1/3)
      /bucket2 (2/3)
      /bucket3 (3/3)
    /partition2 (2011-12)
      /bucket1 (1/3)
      /bucket2 (2/3)
      /bucket3 (3/3)
  /table2
    /bucket1 (1/2)
    /bucket2 (2/2)

Slide 21: Hive - System Architecture
- Metastore (MySQL): stores metadata about tables (columns, partitions, etc.)
- Driver: manages the lifecycle of a HiveQL statement as it moves through Hive
- Query Compiler: compiles HiveQL → DAG of map/reduce tasks
- Optimizer (part of the compiler): optimizes the DAG
- Execution Engine: executes the tasks produced by the compiler in proper dependency order
- JDBC, ODBC, Thrift: provide cross-language support
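As a small illustration of the JDBC path, here is a hypothetical Java client sketch; it assumes a HiveServer2 instance listening on localhost:10000 and the hive-jdbc driver on the classpath, and it reuses the lines table from the wordcount slide.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement();
             // Hive compiles the query into map/reduce tasks behind the scenes.
             ResultSet rs = stmt.executeQuery(
                 "SELECT word, count(*) FROM lines "
                 + "LATERAL VIEW explode(split(line, ' ')) lTable AS word GROUP BY word")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}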

Slide 22: Hive - Optimization Steps
A lot of similarities with SQL optimization. Running example: Student(id, name, year) with 5,000 rows, Student_Course(sid, cid) with 23,000 rows, and the query

SELECT name FROM Student JOIN Student_Course ON (Student.id = Student_Course.sid) WHERE year = 3;

- Column pruning: only the columns actually needed (here id and name from Student) are kept, minimizing data transmission between mappers and reducers.
- Predicate pushdown: if the 3rd-year students (roughly 900 rows) are selected first and only then joined, the join operates on about 900 x 23,000 = 20,700,000 row combinations instead of 5,000 x 23,000 = 115,000,000.
- Partition pruning: with CREATE TABLE Student (...) PARTITIONED BY class_year, a query such as SELECT * FROM Student WHERE class_year >= 2012 touches only the data in the 2012 and 2013 partitions.
- Map-side joins: smaller tables are replicated to all mappers and kept in memory while the large table is streamed, so the resulting map/reduce jobs need only mappers.
- Join reordering: larger tables are streamed and smaller tables are kept in memory.

Slide 23: Pig & Hive - Comparison

Pig:
- Schema-optional
- Great at ETL (Extract-Transform-Load) jobs
- Dataflow-style language
- Nested data in bags
- Debugging tools

Hive:
- Schema-aware
- Great at ad-hoc analytics
- Leverages SQL expertise
- Supports sampling and partitioning of data
- Optimizes queries

Slide 24: Hive & Pig - Limitations
- Neither of these systems is a database in and of itself; both rely on Hadoop and MapReduce
- Can be slow, especially when compared to other systems, for simple tasks
- Neither system is well-suited for hierarchically structured data

Slide 25: References

Pig:
- Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins: "Pig Latin: A Not-So-Foreign Language for Data Processing"
- Cloudera Pig Tutorial: http://blog.cloudera.com/wp-content/uploads/2010/01/introtopig.pdf
- http://pig.apache.org/docs/

Hive:
- Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu, Raghotham Murthy: "Hive - A Petabyte Scale Data Warehouse Using Hadoop"
- Big Data Warehousing: Pig vs Hive comparison: http://www.slideshare.net/casertaconcepts/bdw-meetup-feb-11-2013-v3
- Pig vs Hive vs Native MapReduce: http://stackoverflow.com/questions/17950248/pig-vs-hive-vs-native-map-reduce
- http://hive.apache.org/docs/