Big Data, Donald Kossmann & Nesime Tatbul, Systems Group, ETH Zurich
MapReduce & Hadoop: the new world of Big Data (programming model)
Overview of this Lecture Module: Background; Google MapReduce; The Hadoop Ecosystem. Core components: Hadoop MapReduce and the Hadoop Distributed File System (HDFS). Other selected Hadoop projects: Pig, Hive, HBase.
http://pig.apache.org/
Pig & Pig Latin. The MapReduce model is too low-level and rigid: a one-input, two-stage data flow. Even common operations require custom code, which is hard to maintain and reuse. Pig Latin is a high-level data flow language; Pig is a system that compiles Pig Latin into physical MapReduce plans that are executed over Hadoop.
Pig & Pig Latin. A dataflow program written in the Pig Latin language is compiled by the Pig system into a physical dataflow job that runs on Hadoop. A high-level language provides: more transparent program structure, easier program development and maintenance, and automatic optimization opportunities.
Example: find the top 10 most visited pages in each category.

Visits:
  User  Url         Time
  Amy   cnn.com     8:00
  Amy   bbc.com     10:00
  Amy   flickr.com  10:05
  Fred  cnn.com     12:00

Url Info:
  Url         Category  PageRank
  cnn.com     News      0.9
  bbc.com     News      0.8
  flickr.com  Photos    0.7
  espn.com    Sports    0.9
Example Data Flow Diagram:
  Load Visits -> Group by url -> Foreach url generate count
  Load Url Info
  Join on url -> Group by category -> Foreach category generate top10 urls
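The dataflow above can be sketched in plain Python over the sample tables from the previous slide. This is an illustrative simulation only, not Pig; `top_urls_per_category` is a hypothetical helper name.

```python
# Sketch of the slide's dataflow in plain Python (not Pig):
# load visits, group by url, count, join with url info,
# group by category, take the top-n urls per category.
from collections import Counter, defaultdict

visits = [("Amy", "cnn.com", "8:00"), ("Amy", "bbc.com", "10:00"),
          ("Amy", "flickr.com", "10:05"), ("Fred", "cnn.com", "12:00")]
urlinfo = [("cnn.com", "News", 0.9), ("bbc.com", "News", 0.8),
           ("flickr.com", "Photos", 0.7), ("espn.com", "Sports", 0.9)]

def top_urls_per_category(visits, urlinfo, n=10):
    counts = Counter(url for _, url, _ in visits)       # group by url + count
    category = {url: cat for url, cat, _ in urlinfo}
    by_cat = defaultdict(list)
    for url, c in counts.items():                       # join on url
        by_cat[category[url]].append((url, c))
    return {cat: sorted(urls, key=lambda x: -x[1])[:n]  # top-n per category
            for cat, urls in by_cat.items()}

result = top_urls_per_category(visits, urlinfo)
```

On the sample data, the "News" category ends up led by cnn.com with two visits.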
Example in Pig Latin:

  visits      = load '/data/visits' as (user, url, time);
  gvisits     = group visits by url;
  visitcounts = foreach gvisits generate url, count(visits);
  urlinfo     = load '/data/urlinfo' as (url, category, prank);
  visitcounts = join visitcounts by url, urlinfo by url;
  gcategories = group visitcounts by category;
  topurls     = foreach gcategories generate top(visitcounts,10);
  store topurls into '/data/topurls';
Quick Start and Interoperability: Pig operates directly over files. (Same script as on the previous slide.)
Quick Start and Interoperability: schemas are optional and can be assigned dynamically. (Same script as on the previous slide.)
User-Code as a First-Class Citizen: User-Defined Functions (UDFs) can be used in every construct: Load, Store, Group, Filter, Foreach. (Same script as on the previous slide.)
Nested Data Model. Pig Latin has a fully nested data model with four types:
  Atom: simple atomic value (int, long, float, double, chararray, bytearray). Example: 'alice'
  Tuple: sequence of fields, each of which can be of any type. Example: ('alice', 'lakers')
  Bag: collection of tuples, possibly with duplicates. Example: {('alice', 'lakers'), ('alice', 'iPod')}
  Map: collection of data items, where each item can be looked up through a key. Example: ['age' -> 20]
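The four types map naturally onto plain Python values. A minimal sketch, with one deliberate mismatch called out in the comments:

```python
# Pig's nested data model rendered as Python values (illustrative only).
# Note: Pig bags allow duplicates, so a list is used here, not a set.
atom = "alice"                                    # Atom: simple atomic value
tup = ("alice", "lakers")                         # Tuple: sequence of fields
bag = [("alice", "lakers"), ("alice", "lakers")]  # Bag: tuples, duplicates kept
mp = {"age": 20, "fan of": [("lakers",), ("iPod",)]}  # Map: key -> any value
```

Because every field of a tuple can itself be a tuple, bag, or map, nesting to arbitrary depth falls out for free.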
Expressions in Pig Latin (table of expression examples not reproduced here).
Commands in Pig Latin:

  Command            Description
  LOAD               Read data from file system.
  STORE              Write data to file system.
  FOREACH..GENERATE  Apply an expression to each record and output one or more records.
  FILTER             Apply a predicate and remove records that do not return true.
  GROUP/COGROUP      Collect records with the same key from one or more inputs.
  JOIN               Join two or more inputs based on a key.
  CROSS              Cross product of two or more inputs.
Commands in Pig Latin (cont'd):

  Command   Description
  UNION     Merge two or more data sets.
  SPLIT     Split data into two or more sets, based on filter conditions.
  ORDER     Sort records based on a key.
  DISTINCT  Remove duplicate tuples.
  STREAM    Send all records through a user-provided binary.
  DUMP      Write output to stdout.
  LIMIT     Limit the number of records.
LOAD: reads a file as a bag of tuples. A deserializer and a tuple schema are optional; the result is bound to a logical bag handle.
STORE: writes a bag of tuples to an output file; a serializer is optional. The STORE command triggers the actual input reading and processing in Pig.
FOREACH..GENERATE: applies an expression (possibly a UDF) to each tuple of a bag; in the slide's example, each output tuple has two fields.
FILTER: removes tuples from a bag that do not satisfy a filtering condition; the condition can be a comparison or a UDF.
COGROUP vs. JOIN: COGROUP collects tuples from the inputs under a group identifier, while JOIN matches tuples on an equi-join field. JOIN ~ COGROUP + FLATTEN.
COGROUP vs. GROUP: GROUP ~ COGROUP with only one input data set. Example: group-by-aggregate.
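The relationships between COGROUP, GROUP, and JOIN on the last two slides can be made concrete with a small Python simulation. This is a sketch of the semantics, not Pig's implementation; the function names `cogroup`, `group`, and `join` are illustrative.

```python
# COGROUP gathers, per key, one bag of tuples from each input;
# GROUP is COGROUP on a single input; JOIN is COGROUP plus FLATTEN.
from collections import defaultdict

def cogroup(inputs, key_index=0):
    """Return {key: [bag_from_input_0, bag_from_input_1, ...]}."""
    out = defaultdict(lambda: [[] for _ in inputs])
    for i, rel in enumerate(inputs):
        for tup in rel:
            out[tup[key_index]][i].append(tup)
    return dict(out)

def group(rel, key_index=0):            # GROUP ~ COGROUP with one input
    return {k: bags[0] for k, bags in cogroup([rel], key_index).items()}

def join(left, right, key_index=0):     # JOIN ~ COGROUP + FLATTEN
    rows = []
    for bags in cogroup([left, right], key_index).values():
        for l in bags[0]:
            for r in bags[1]:
                rows.append(l + r)      # cross product within each group
    return rows
```

A group-by-aggregate is then just `group` followed by an aggregate over each bag.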
Nested Operations in Pig Latin: FILTER, ORDER, and DISTINCT can be nested within a FOREACH command.
MapReduce in Pig Latin: a MapReduce program can be expressed in Pig Latin. The map UDF produces a bag of key-value pairs, where the key is the first field; the reduce UDF processes one group at a time.
Pig System Overview: a user's dataflow program (Pig Latin, or SQL) is automatically rewritten and optimized by the Pig system and executed as Hadoop Map-Reduce jobs on a cluster.
Compilation into MapReduce: every (co)group or join operation forms a map-reduce boundary; other operations are pipelined into the map and reduce phases. In the example dataflow:
  Map 1 / Reduce 1: Load Visits; Group by url
  Map 2 / Reduce 2: Foreach url generate count; Load Url Info; Join on url
  Map 3 / Reduce 3: Group by category; Foreach category generate top10(urls)
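What such a map-reduce boundary means can be sketched for the first stage of the example (group by url, with the count pipelined into the reduce phase). The phase functions below are a toy simulation under assumed names, not Hadoop's API:

```python
# One map-reduce stage: "group by url" becomes a shuffle on the map output
# key; the following "count" runs inside the same reduce phase (pipelined,
# so no extra MapReduce job is needed).
from collections import defaultdict

def map_phase(records):
    for user, url, time in records:
        yield (url, 1)                  # map: emit key-value pairs

def shuffle(pairs):
    groups = defaultdict(list)
    for k, v in pairs:                  # group by key = the MR boundary
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    return {url: sum(vs) for url, vs in groups.items()}  # pipelined count

visits = [("Amy", "cnn.com", "8:00"), ("Fred", "cnn.com", "12:00")]
visitcounts = reduce_phase(shuffle(map_phase(visits)))
```

The join in the next stage would force another shuffle, which is why it starts a new Map 2 / Reduce 2 pair.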
Pig vs. MapReduce. MapReduce welds together three primitives: process records, create groups, process groups. In Pig, these primitives are explicit, independent, and fully composable. Pig adds primitives for common operations: filtering data sets, projecting data sets, and combining two or more data sets.
Pig vs. DBMS:
  Workload: a DBMS supports bulk and random reads & writes, indexes, and transactions; Pig supports bulk reads & writes only, with no indexes or transactions.
  Data representation: a DBMS controls the data format and requires a pre-declared schema (flat data model, 1NF); "Pigs eat anything" (nested data model).
  Programming style: DBMS queries are a system of constraints (declarative); Pig programs are a sequence of steps (procedural).
  Customizable processing: in a DBMS, custom functions are second-class to logic expressions; in Pig, custom functions are easy to incorporate.
http://hive.apache.org/
Hive: What? A system for managing and querying structured data that is built on top of Hadoop: it uses MapReduce for execution and HDFS for storage, and maintains structural metadata in a system catalog. Key building principles: an SQL-like declarative query language (HiveQL), support for nested data types, extensibility (types, functions, formats, scripts), and performance.
Hive: Why? Big data (Facebook: 100s of TBs of new data every day). Traditional data warehousing systems have limitations: they are proprietary, expensive, and limited in availability and scalability. Hadoop removes these limitations, but it has a low-level programming model, and custom programs are hard to maintain and reuse. Hive brings traditional warehousing tools and techniques to the Hadoop ecosystem: it puts structure on top of the data in Hadoop and provides an SQL-like language to query that data.
Example: HiveQL vs. Hadoop MapReduce.

  hive> SELECT key, COUNT(1) FROM kv1 WHERE key > 100 GROUP BY key;

instead of:

  $ cat > /tmp/reducer.sh
  uniq -c | awk '{print $2"\t"$1}'
  $ cat > /tmp/map.sh
  awk -F '\001' '{if ($1 > 100) print $1}'
  $ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar \
      -input /user/hive/warehouse/kv1 \
      -file /tmp/map.sh -file /tmp/reducer.sh \
      -mapper map.sh -reducer reducer.sh \
      -output /tmp/largekey -numReduceTasks 1
  $ bin/hadoop dfs -cat /tmp/largekey/part*
Hive Data Model and Organization: Tables. Data is logically organized into tables. Each table has a corresponding directory under a particular warehouse directory in HDFS, and the data in a table is serialized and stored in files under that directory. The serialization format of each table is stored in the system catalog, called the Metastore. Table schemas are checked during querying, not during loading ("schema on read" vs. "schema on write").
Hive Data Model and Organization: Partitions. Each table can be further split into partitions, based on the values of one or more of its columns. Data for each partition is stored under a subdirectory of the table directory. Example: table T lives under /user/hive/warehouse/t/; if T is partitioned on columns A and B, the data for A=a and B=b is stored in files under /user/hive/warehouse/t/a=a/b=b/
Hive Data Model and Organization: Buckets. Data in each partition can be further divided into buckets, based on the hash of a column in the table. Each bucket is stored as a file in the partition directory. Example: bucketing on column C (hash on C) yields files such as /user/hive/warehouse/t/a=a/b=b/part-0000 and /user/hive/warehouse/t/a=a/b=b/part-1000
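The directory layout described in the last two slides can be sketched as a small path-construction function. This is illustrative only: `bucket_path` is a hypothetical name, and real Hive uses its own hash function and file-naming scheme, not Python's `hash`.

```python
# Sketch of Hive's partition/bucket file layout: partition column values
# become nested directory names, and a hash of the bucketing column picks
# the bucket file inside the partition directory.
def bucket_path(warehouse, table, partitions, bucket_col_value, num_buckets):
    part_dirs = "/".join(f"{col}={val}" for col, val in partitions)
    bucket = hash(bucket_col_value) % num_buckets  # Hive uses its own hash
    return f"{warehouse}/{table}/{part_dirs}/part-{bucket:05d}"

path = bucket_path("/user/hive/warehouse", "t",
                   [("a", "a"), ("b", "b")], "some-key", 32)
```

Partition pruning then amounts to skipping whole directories whose column values cannot match the query predicate.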
Hive Column Types. Primitive types: integers (tinyint, smallint, int, bigint), floating point numbers (float, double), boolean, string, timestamp. Complex types: array<any-type>, map<primitive-type, any-type>, struct<field-name: any-type, ..>, with an arbitrary level of nesting.
Hive Query Model. DDL: data definition statements to create tables with specific serialization formats and partitioning/bucketing columns (CREATE TABLE ..). DML: data manipulation statements to load and insert data, with no updates or deletes (LOAD .., INSERT OVERWRITE ..). HiveQL: SQL-like query statements (SELECT .. FROM .. WHERE .., a subset of SQL).
Example. Status updates table:

  CREATE TABLE status_updates (userid int, status string, ds string)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Load the data daily from log files:

  LOAD DATA LOCAL INPATH '/logs/status_updates'
  INTO TABLE status_updates PARTITION (ds='2009-03-20');
Example Query (Filter). Filter status updates containing 'michael jackson':

  SELECT * FROM status_updates
  WHERE status LIKE '%michael jackson%';
Example Query (Aggregation). Find the total number of status_updates on a given day:

  SELECT COUNT(1) FROM status_updates
  WHERE ds = '2009-08-01';
Hive Architecture (architecture diagrams not reproduced here).
Metastore: the system catalog that contains metadata about Hive tables: namespace; list of columns and their types; owner, storage, and serialization information; partition and bucketing information; statistics. It is not stored in HDFS: the catalog should be optimized for online transactions with random accesses and updates, so a traditional relational database (e.g., MySQL) is used. Hive manages the consistency between metadata and data explicitly.
Query Compiler: converts query language strings into plans: DDL -> metadata operations; DML/LOAD -> HDFS operations; DML/INSERT and HiveQL -> a DAG of MapReduce jobs. Compilation consists of several steps: parsing, semantic analysis, logical plan generation, query optimization and rewriting, and physical plan generation.
Example Optimizations: column pruning; predicate pushdown; partition pruning; combining multiple joins with the same join key into a single multi-way join, which can be handled by a single MapReduce job; adding repartition operators for join and group-by operators to mark the boundary between map and reduce phases.
Hive Extensibility: define new column types; define new functions written in Java (UDF: user-defined functions; UDA: user-defined aggregation functions); add support for new data formats by defining custom serialize/de-serialize methods ("SerDe"); embed custom map/reduce scripts written in any language using a simple streaming interface.
http://hbase.apache.org/
HBase: a distributed column-oriented database built on top of HDFS, modeled after Google's BigTable. It is not relational and has no support for SQL; it provides real-time read/write random access to very large data sets. Main design goal: scale linearly by adding nodes: sparsely populated tables with billions of rows and millions of columns, with automatic horizontal partitioning and replication across a cluster.
Data Model. Tables consist of rows and column families. Rows are sorted by key and horizontally partitioned; the partitions form the basic unit of data distribution and load balancing. Columns are grouped into families, which form the basic unit of access control; columns that belong to a given column family can be added on the fly. Cells are versioned by timestamp.
Data Model. A multi-dimensional sorted map, indexed by a row key, column key, and a timestamp. Each value in the map is an uninterpreted array of bytes: (row: string, column: string, time: int64) -> string
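The (row, column, timestamp) -> value signature can be sketched as a Python dict. This is a toy model under assumed names (`put`, `get`): real HBase/BigTable keeps rows sorted on disk and versions pre-ordered rather than scanning, and stores bytes, not strings.

```python
# Toy model of the BigTable/HBase map: keys are (row, column, timestamp)
# triples; a read returns the most recent version of the addressed cell.
table = {}

def put(row, column, ts, value):
    table[(row, column, ts)] = value

def get(row, column):
    """Return the value with the highest timestamp for this cell, if any."""
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

put("com.cnn.www", "contents:", 1, "<html>v1</html>")
put("com.cnn.www", "contents:", 2, "<html>v2</html>")
```

Note the reversed-URL row key ("com.cnn.www"), matching the Webtable convention on the next slides.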
BigTable Webtable Example: a large collection of web pages, with reversed URLs as row keys, other info about web pages as column names, and web page contents in the 'contents' column family, versioned with the timestamps at which they were fetched.
Rows: arbitrary strings serve as row keys, kept in lexicographic order. Reads and writes to each row are atomic (single-row transactions). Row ranges are dynamically partitioned into tablets ("regions" in HBase), which form the unit of distribution and load balancing. Row keys affect locality of data access: for example, the reversed URLs in the Webtable example group pages of the same domain together into contiguous rows.
Column Families: column families define units of access control, and different sets of columns may have different properties and access patterns. Column families are stored separately on disk, and tables are configurable by family (e.g., version retention policy, compression, cache priority). There can be an unbounded number of columns, but a smaller number of rarely changing column families. Column keys are named as family:qualifier; in the Webtable example: anchor:<name of referring site>.
Timestamps: each cell in a table can contain multiple versions of the same data, indexed by timestamp. Timestamps can be assigned by the system or by client apps. Versions are stored in decreasing timestamp order. Garbage collection: keep the last n versions of a cell, or keep only the versions written within the last n days. In the Webtable example: keep only the most recent 3 versions of every page's content.
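The keep-last-n garbage collection policy can be sketched per cell. A toy simulation under an assumed name (`put_versioned`); real HBase applies retention lazily during compactions, not on every write as done here for clarity.

```python
# Per-cell version GC: versions are kept in decreasing timestamp order
# and trimmed to the most recent n (Webtable's policy uses n = 3).
def put_versioned(cell, ts, value, max_versions=3):
    cell.append((ts, value))
    cell.sort(key=lambda tv: -tv[0])  # decreasing timestamp order
    del cell[max_versions:]           # GC: drop everything past version n

cell = []
for ts in range(1, 6):                # write five versions at ts = 1..5
    put_versioned(cell, ts, f"content-v{ts}")
```

After the five writes, only the three newest versions (timestamps 5, 4, 3) survive.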
Summary: programming models (with execution and storage support) for big data: Pig / Pig Latin and Hive / HiveQL (and Google Sawzall); Google MapReduce and Hadoop MapReduce; Google BigTable and HBase; Google GFS and Hadoop HDFS.
References:
  Pig Latin: A Not-So-Foreign Language for Data Processing, C. Olston et al., SIGMOD 2008.
  Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience, A. F. Gates et al., VLDB 2009.
  Hive: A Warehousing Solution Over a Map-Reduce Framework, A. Thusoo et al., VLDB 2009.
  Hive: A Petabyte Scale Data Warehouse Using Hadoop, A. Thusoo et al., ICDE 2010.
  BigTable: A Distributed Storage System for Structured Data, F. Chang et al., OSDI 2006.