
Big Data Technology CS 236620, Technion, Spring 2013 Structured Databases atop Map-Reduce Edward Bortnikov & Ronny Lempel Yahoo! Labs, Haifa

Roadmap. Previous class: MR implementation. This class: query languages atop MR, and beyond MR. Database theory in a nutshell; query language implementation; Apache Hive and HCatalog.

Map-Reduce Critique. Too low-level (only engineers can program it). Hand-coding needed for many common operations. Data flows are extremely rigid (linear). Hard to maintain, extend, and optimize. This is exactly what SQL databases were designed for! Can we reuse some of the good stuff?

Database Applications Dichotomy Online Transaction Processing (OLTP) Example: online e-commerce Write-intensive workloads Concurrency control, transactions Optimization goal: latency Data Warehousing/Data Analytics (DW/DA) Example: fraud analysis Read-dominated workloads Optimization goal: throughput

'80s-'90s: One-SQL-Fits-All. OLTP: read-write workload; real-time, latency-oriented; transaction-oriented; benchmark: TPC-C. Analytics: BI tools, data cubes; read-only workload; batch processing; non-transactional; fed by ETL; benchmark: TPC-H. Both ran on the same SQL DBMS (Oracle, DB2, SQL Server, MySQL, …) with ACID transactions, at moderate scale (TBs, not PBs).

The NoSQL Revolution. Split between analytics and OLTP: at Google (early 2000s), Map-Reduce vs Bigtable; in open source, Hadoop MR vs HBase. Started by breaking many things that added overhead: simplified semantics, eliminated transactions. Nowadays, returning to basics in many areas. Focus for the next 2 weeks: analytics systems.

Relational Databases in a Nutshell The data is structured Reflects Entities and Relationships ERD = Entity-Relationship Diagram ERD captured by schema Data set = set of tables Table = set of tuples (rows) Row = set of items (columns, attributes) Typically, uniquely addressable by primary key Relational Algebra Theoretical foundation for relational databases

Working with Relational Data Create the schema (1 or more tables) DDL Data Definition Language Load data into the tables Often requires ETL (Extract/Transform/Load) tools Query the data DML Data Manipulation Language SQL (Structured Query Language) Implementation of DDL and DML
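To make the create/load/query flow concrete, here is a minimal sketch using Python's built-in sqlite3 module; the pages table and its pagerank column echo the query examples later in this deck, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database

# DDL: define the schema.
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, pagerank REAL)")

# Load: in practice often an ETL pipeline; here, plain INSERTs.
conn.executemany("INSERT INTO pages VALUES (?, ?)",
                 [("a.com", 0.97), ("b.com", 0.42), ("c.com", 0.99)])

# DML: query the data declaratively.
for (url,) in conn.execute("SELECT url FROM pages WHERE pagerank > 0.95"):
    print(url)                       # a.com, c.com
```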

Capturing Relationships. De-normalized design: nested data, a single table. Normalized design: flat data, multiple tables with cross-references.

De-normalized (one table, nested Transactions):

  Customer   Birthday   Transactions (Id, Date, Amount)
  Jones      1/1/1980   (12890, 14/10/2003, 87), (12904, 15/10/2003, 50)
  Wilkinson  1/2/1980   (12898, 14/10/2003, 21)
  Stevens    1/3/1980   (12907, 15/10/2003, 18), (14920, 20/11/2003, 70)

Normalized (two flat tables, cross-referenced by Customer):

  Transactions: Customer, Id, Date, Amount
  Jones      12890  14/10/2003  87
  Jones      12904  15/10/2003  50
  Wilkinson  12898  14/10/2003  21
  Stevens    12907  15/10/2003  18
  Stevens    14920  20/11/2003  70

  Customers: Customer, Birthday
  Jones      1/1/1980
  Wilkinson  1/2/1980
  Stevens    1/3/1980

Database Designer's Dilemma. Normalized design: transparent, easy to maintain and extend; amenable to horizontal scalability; easy to verify referential integrity; not biased towards specific query workloads; runtime cost: cross-table joins. De-normalized design: constrains application semantics; might suffer from modification anomalies; can be optimized for specific workloads, with no joins.

Example: TPC-H Schema

SQL (Structured Query Language). Declarative programming language: focus on what you want, not how to retrieve it. Flexible; great for non-engineers. Designed for relational databases: Select (read); Insert/Update/Delete (write). Costs: administration (schema management) and query optimization (by the compiler, or manual).

SQL Primitives Project SELECT url FROM pages Filter (Select) SELECT url FROM pages WHERE pagerank > 0.95

SQL Primitives Join (⋈) SELECT visits.user, pages.category FROM visits, pages WHERE visits.url = pages.url SELECT p1.header, p2.header FROM pages p1, pages p2, links l1, links l2 WHERE p1.url=l1.from AND p2.url=l2.to AND l1.id = l2.id (here p1, p2, l1, l2 are table aliases)

SQL Primitives Aggregation SELECT url, COUNT(url) FROM visits GROUP BY url HAVING COUNT(url) > 1000 COUNT, SUM, AVG, STDDEV, MIN, MAX

SQL Primitives Sort SELECT cookie, query FROM querylog WHERE date < 1/1/2013 AND date > 1/12/2012 ORDER BY cookie, date

SQL Implementation Specific programming language type Machine = [distributed] runtime environment Instruction set = relational operators Receive and return tuple sets Query plan = compiler-generated program Operators + flow of control

Query Plan - Aggregation. SELECT url, COUNT(url) FROM visits GROUP BY url HAVING COUNT(url) > 1000. Plan, bottom-up: Load visits → Group by url → Aggregate by count → Filter by count.
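One common way to run such a plan is the iterator ("Volcano") model, where each operator pulls tuples from its child; whether any given engine does exactly this is beside the point. A toy Python sketch of the plan above, with invented visit data:

```python
from collections import Counter

def scan(table):                       # leaf operator: emit raw tuples
    yield from table

def group_count(rows, key):            # blocking operator: group + aggregate
    counts = Counter(row[key] for row in rows)
    yield from counts.items()

def having(rows, predicate):           # streaming operator: filter the groups
    return (row for row in rows if predicate(row))

visits = [{"url": "a.com"}] * 1500 + [{"url": "b.com"}] * 10
plan = having(group_count(scan(visits), "url"), lambda r: r[1] > 1000)
print(list(plan))                      # [('a.com', 1500)]
```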

Query Plan - Join. SELECT p1.header, p2.header FROM pages p1, pages p2, links l1, links l2 WHERE p1.url=l1.from AND p2.url=l2.to AND l1.id = l2.id. Plan, bottom-up: load pages (p1) and links (l1), join on l1.from = p1.url; load pages (p2) and links (l2), join on l2.to = p2.url; finally join the two results on l1.id = l2.id.

Dataflow Architectures Query plan = DAG Nodes = relational operators Links = queues Performance boosted through Data parallelism Compute parallelism Pipelining

Query Optimization. Multiple plans can be logically equivalent: relational algebra allows operator re-ordering, and multiple operator implementations are possible. Goal: pick a plan that minimizes query latency, i.e., minimizes I/O, communication, and computation. No-brainer optimizations: push the filters deep. Nontrivial optimizations: the order of joins. Challenge: an exponential-size search space.
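A toy illustration of the filter-pushdown point; the relations, the selectivity, and the nested-loop joins are all invented stand-ins for what a real optimizer and engine would choose.

```python
import random
random.seed(0)

pages  = [(f"u{i}", random.random()) for i in range(1000)]                # (url, pagerank)
visits = [(f"c{i}", f"u{random.randrange(1000)}") for i in range(10000)]  # (cookie, url)

# Plan A: join first, filter later.
joined = [(v, p) for v in visits for p in pages if v[1] == p[0]]
plan_a = [t for t in joined if t[1][1] > 0.95]

# Plan B: push the filter below the join.
hot    = [p for p in pages if p[1] > 0.95]
plan_b = [(v, p) for v in visits for p in hot if v[1] == p[0]]

assert sorted(plan_a) == sorted(plan_b)       # logically equivalent plans...
print(len(joined), "intermediate tuples vs", len(plan_b))  # ...far less work in B
```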

SQL versus Map-Reduce SQL Good for structured data Rich declarative API (tool for analysts) Machine optimization, non-transparent to users Map-Reduce Good for structured and unstructured data Simple programmatic API (tool for engineers) Users can optimize their jobs manually Can we enjoy the best of both worlds?

Implementation atop Map-Reduce. Project, Filter: easy (how?). Sort: for free from the shuffle (not always required). Aggregation: mostly easy (how?); stream processing with O(1) intermediate state. How to handle TOP-K, AVG, STDDEV? Join? Hint: use multiple Map inputs.
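A hedged sketch of these cases on a toy, single-process stand-in for MR (all names and data invented). The point of the AVG part: ship (sum, count) partial aggregates rather than averages, so intermediate state stays O(1) and merges associatively.

```python
from itertools import groupby
from operator import itemgetter

def run_mr(records, mapper, reducer):
    pairs = sorted((kv for rec in records for kv in mapper(rec)),
                   key=itemgetter(0))              # "shuffle": sort/group by key
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield from reducer(key, [v for _, v in group])

# SELECT url, COUNT(url) FROM visits GROUP BY url HAVING COUNT(url) > 1
visits = [("alice", "a.com"), ("bob", "a.com"), ("carol", "b.com")]
def map_count(visit):
    yield visit[1], 1                              # key = url, value = 1
def reduce_count(url, ones):
    if sum(ones) > 1:                              # the HAVING clause
        yield url, sum(ones)
print(list(run_mr(visits, map_count, reduce_count)))   # [('a.com', 2)]

# AVG is not itself decomposable; ship (sum, count) and divide at the end.
# STDDEV works the same way with a third running term (sum of squares).
sales = [("books", 10.0), ("books", 30.0), ("toys", 5.0)]
def map_avg(sale):
    yield sale[0], (sale[1], 1)
def reduce_avg(category, partials):
    total, count = (sum(t) for t in zip(*partials))
    yield category, total / count
print(list(run_mr(sales, map_avg, reduce_avg)))    # [('books', 20.0), ('toys', 5.0)]
```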

Join over MR (1). Repartition (reduce-side) join for: SELECT L.x, R.y FROM L, R WHERE L.key = R.key. M-Left mappers read L and tag each record with source L; M-Right mappers read R and tag each record with source R. All map output is partitioned by the join key and sorted by key (then by source tag), so each reducer sees, for every key, the L-records followed by the R-records, and emits the matching pairs.
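A single-process sketch of what this reducer-side join computes, on invented toy relations; the two per-key lists stand in for the tagged, sorted runs a real reducer would receive.

```python
from collections import defaultdict

L = [(1, "x1"), (2, "x2")]               # (key, x)
R = [(1, "y1"), (1, "y1b"), (3, "y3")]   # (key, y)

buckets = defaultdict(lambda: ([], []))  # simulated shuffle: partition by key
for key, x in L:
    buckets[key][0].append(x)            # M-Left output, tagged as source L
for key, y in R:
    buckets[key][1].append(y)            # M-Right output, tagged as source R

for key, (xs, ys) in sorted(buckets.items()):  # one reduce call per key
    for x in xs:
        for y in ys:
            print(key, x, y)             # emits: 1 x1 y1, then 1 x1 y1b
```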

Join over MR (2). Typical situation: Big ⋈ Small, e.g., Page Accesses (B) ⋈ User Details (S). Idea: avoid the reduce phase altogether. Approach: replicate S to all mappers via shared files (distributed through the MR distributed cache). At each mapper: store S in RAM, hashed by the join key; scan the B-partition and compute the matches per record, as sketched below.
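A minimal sketch of the map side of this broadcast join, with invented relations; in a real Hadoop job the small side would arrive via the distributed cache and be loaded during mapper setup.

```python
small_users = {"u1": "Alice", "u2": "Bob"}   # S: user details, hashed in RAM

def mapper(big_partition, s_hash):
    """Scan one partition of B, probing the in-memory hash of S per record."""
    for user, url in big_partition:
        if user in s_hash:                   # match found: emit, no reduce phase
            yield s_hash[user], url

accesses = [("u1", "a.com"), ("u3", "b.com"), ("u2", "c.com")]  # B: page accesses
print(list(mapper(accesses, small_users)))   # [('Alice', 'a.com'), ('Bob', 'c.com')]
```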

Apache Hive. Data warehouse atop Hadoop MR. Structured (schema-based) data. SQL dialect: HiveQL. Select, project, join (2-way), aggregation. Accommodates user-defined functions (UDFs). As of this writing, fairly weak compiler optimization.

Apache HCatalog Table and storage management service Shared schema and data type mechanism Table abstraction Users unaware of where/how their data is stored Interoperability across tools Map-Reduce, Hive, Pig (next class)

Summary: SQL atop MR. Good for data warehouses and batch queries; the less legacy semantics retained, the better the scalability. Bad for interactive ad-hoc queries: batch-oriented (high launch overhead); scan-oriented, so lookups are expensive; intermediate results materialized on DFS; dataflow underexploited (limited pipelining).

Next Class. Pig Latin, a procedural query language. Real-time query processing.

Further Reading. "A Comparison of Approaches to Large-Scale Data Analysis". "MapReduce: A Major Step Backwards".