Big Data Technology CS 236620, Technion, Spring 2013

1 Big Data Technology CS 236620, Technion, Spring 2013
  Structured Databases atop Map-Reduce
  Edward Bortnikov & Ronny Lempel, Yahoo! Labs, Haifa

2 Roadmap
  Previous class
    MR Implementation
  This class
    Query Languages atop MR
    Beyond MR
    Database Theory in a Nutshell
    Query Language Implementation
    Apache Hive and HCatalog

3 Map-Reduce Critique
  Too low-level (only engineers can program)
  Hand-coding for many common operations
  Data flows extremely rigid (linear)
  Hard to maintain, extend, and optimize
  This is exactly what SQL databases have been designed for!
  Can we reuse some of the good stuff?

4 Database Applications Dichotomy
  Online Transaction Processing (OLTP)
    Example: online e-commerce
    Write-intensive workloads
    Concurrency control, transactions
    Optimization goal: latency
  Data Warehousing/Data Analytics (DW/DA)
    Example: fraud analysis
    Read-dominated workloads
    Optimization goal: throughput

5 '80s-'90s: One-SQL-Fits-All
  Online Transaction Processing (OLTP)
    Read-Write Workload
    Real-Time
    Latency-Sensitive
    Transaction-Oriented
    Benchmark: TPC-C
  Analytics
    BI Tools
    Latency-Oriented
    Data Cubes
    Read-Only Workload
    Batch Processing
    Non-Transactional
    ETL
    Benchmark: TPC-H
  SQL DBMS (Oracle, DB2, SQL Server, MySQL, ...)
    ACID transactions
    Moderate scale (TBs not PBs)

6 The NoSQL Revolution
  Split between Analytics and OLTP
    Google (early 2000s): Map-Reduce vs Bigtable
    Open-source: Hadoop MR vs HBase
  Started from breaking many things (overhead)
    Simplified semantics
    Eliminated transactions
  Nowadays, returning to basics in many areas
  Focus for the next 2 weeks: analytics systems

7 Relational Databases in a Nutshell
  The data is structured
    Reflects Entities and Relationships
    ERD = Entity-Relationship Diagram
    ERD captured by schema
  Data set = set of tables
    Table = set of tuples (rows)
    Row = set of items (columns, attributes)
    Typically, uniquely addressable by primary key
  Relational Algebra
    Theoretical foundation for relational databases

8 Working with Relational Data
  Create the schema (1 or more tables)
    DDL: Data Definition Language
  Load data into the tables
    Often requires ETL (Extract/Transform/Load) tools
  Query the data
    DML: Data Manipulation Language
  SQL (Structured Query Language)
    Implementation of DDL and DML

9 Capturing Relationships
  De-normalized design
    Nested data, single table
  Normalized design
    Flat data, multiple tables with cross-references

  De-normalized (one table, transactions nested per customer):
    Customer   Birthday  Transactions (Id, Date, Amount)
    Jones      1/1/1980  12890, 14/10/2003, 87
                         12904, 15/10/2003, (missing)
    Wilkinson  1/2/1980  12898, 14/10/2003, 21
    Stevens    1/3/1980  12907, 15/10/2003, 18
                         14920, 20/11/2003, 70

  Normalized (two tables, cross-referenced by Customer):
    Customer   Id     Date        Amount
    Jones      12890  14/10/2003  87
    Jones      12904  15/10/2003  (missing)
    Wilkinson  12898  14/10/2003  21
    Stevens    12907  15/10/2003  18
    Stevens    14920  20/11/2003  70

    Customer   Birthday
    Jones      1/1/1980
    Wilkinson  1/2/1980
    Stevens    1/3/1980
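
The two designs can be converted into one another mechanically. The following is a minimal Python sketch of that conversion, not part of the original slides: the function names `normalize` and `denormalize`, the dictionary layout, and the sample rows are illustrative assumptions.

```python
# Hypothetical de-normalized data: each customer row nests its transactions.
denormalized = [
    {"customer": "Jones", "birthday": "1/1/1980",
     "transactions": [{"id": 12890, "date": "14/10/2003", "amount": 87}]},
    {"customer": "Wilkinson", "birthday": "1/2/1980",
     "transactions": [{"id": 12898, "date": "14/10/2003", "amount": 21}]},
    {"customer": "Stevens", "birthday": "1/3/1980",
     "transactions": [{"id": 12907, "date": "15/10/2003", "amount": 18},
                      {"id": 14920, "date": "20/11/2003", "amount": 70}]},
]


def normalize(nested):
    """Split nested rows into two flat tables linked by the customer name."""
    customers, transactions = [], []
    for row in nested:
        customers.append((row["customer"], row["birthday"]))
        for t in row["transactions"]:
            # The customer name acts as the cross-reference (foreign key).
            transactions.append((t["id"], row["customer"], t["date"], t["amount"]))
    return customers, transactions


def denormalize(customers, transactions):
    """Rebuild the nested form; this is the join that queries pay for at runtime."""
    by_customer = {}
    for tid, customer, date, amount in transactions:
        by_customer.setdefault(customer, []).append(
            {"id": tid, "date": date, "amount": amount})
    return [{"customer": c, "birthday": b, "transactions": by_customer.get(c, [])}
            for c, b in customers]


customers, transactions = normalize(denormalized)
roundtrip = denormalize(customers, transactions)
```

The round trip returns the original nested rows, which illustrates that the normalized form loses no information but needs a join to answer per-customer questions.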

10 Database Designer's Dilemma
  Normalized design
    Transparent, easy to maintain and extend
    Amenable to horizontal scalability
    Easy to verify referential integrity
    Non-biased towards specific query workloads
    Runtime cost: cross-table joins
  De-normalized design
    Constrains application semantics
    Might suffer from modification anomalies
    Can be optimized for specific workloads, no joins

11 Example: TPC-H Schema

12 SQL (Structured Query Language)
  Declarative programming language
    Focus on what you want, not how to retrieve
    Flexible, great for non-engineers
  Designed for relational databases
    Select (read)
    Insert/Update/Delete (write)
  Costs
    Administration (schema management)
    Query optimization (compiler or manual)

13 SQL Primitives
  Project
    SELECT url FROM pages
  Filter (Select)
    SELECT url FROM pages WHERE pagerank > 0.95

14 SQL Primitives
  Join (⋈)
    SELECT visits.user, pages.category FROM visits, pages WHERE visits.url = pages.url
  Join with table aliases (p1, p2, l1, l2)
    SELECT p1.header, p2.header FROM pages p1, pages p2, links l1, links l2 WHERE p1.url = l1.from AND p2.url = l2.to AND l1.id = l2.id

15 SQL Primitives
  Aggregation
    SELECT url, COUNT(url) FROM visits GROUP BY url HAVING COUNT(url) > 1000
  Aggregate functions: COUNT, SUM, AVG, STDDEV, MIN, MAX

16 SQL Primitives
  Sort
    SELECT cookie, query FROM querylog WHERE date < DATE '2013-01-01' AND date > DATE '2012-12-01' ORDER BY cookie, date
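
To try the primitives above on real data, the following sketch runs them against an in-memory SQLite database through Python's standard `sqlite3` module. The table contents are made-up sample rows, and the HAVING threshold is lowered from 1000 to 1 so that the tiny data set produces a result; neither comes from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pages  (url TEXT PRIMARY KEY, category TEXT, pagerank REAL);
CREATE TABLE visits (user TEXT, url TEXT);
INSERT INTO pages  VALUES ('a.com', 'news', 0.97), ('b.com', 'sports', 0.40);
INSERT INTO visits VALUES ('u1', 'a.com'), ('u2', 'a.com'), ('u1', 'b.com');
""")

# Project: keep a single column.
projected = conn.execute("SELECT url FROM pages ORDER BY url").fetchall()

# Filter (select): keep only rows satisfying a predicate.
filtered = conn.execute(
    "SELECT url FROM pages WHERE pagerank > 0.95").fetchall()

# Join: match visits to pages on the shared url column.
joined = conn.execute(
    "SELECT visits.user, pages.category FROM visits, pages "
    "WHERE visits.url = pages.url "
    "ORDER BY visits.user, pages.category").fetchall()

# Aggregation: count visits per url, keep the popular ones.
aggregated = conn.execute(
    "SELECT url, COUNT(url) FROM visits "
    "GROUP BY url HAVING COUNT(url) > 1").fetchall()

conn.close()
```

Each query states only what is wanted; the engine decides how to scan, join, and group.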

17 SQL Implementation
  Specific programming language type
    Machine = [distributed] runtime environment
    Instruction set = relational operators
      Receive and return tuple sets
  Query plan = compiler-generated program
    Operators + flow of control

18 Query Plan - Aggregation
  SELECT url, COUNT(url) FROM visits GROUP BY url HAVING COUNT(url) > 1000
  Plan, from input to output:
    visits -> Load visits -> Group by url -> Aggregate by count -> Filter by count
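
One way to picture such a plan is as a chain of operators that each consume and produce a stream of tuples. The sketch below uses Python generators for this; the operator names (`load`, `group_and_count`, `filter_op`), the sample rows, and the lowered threshold are assumptions made for illustration, not the slides' implementation.

```python
from collections import Counter


def load(rows):
    """Leaf operator: stream tuples from a stored table."""
    for row in rows:
        yield row


def group_and_count(tuples, key):
    """Group by key and aggregate with COUNT.

    This operator is blocking: it must consume all of its input
    before it can emit the first (key, count) pair.
    """
    counts = Counter(key(t) for t in tuples)
    yield from counts.items()


def filter_op(tuples, predicate):
    """Streaming operator: pass through tuples that satisfy the predicate."""
    return (t for t in tuples if predicate(t))


# Hypothetical visits table: (user, url).
visits = [("u1", "a.com"), ("u2", "a.com"), ("u3", "a.com"), ("u1", "b.com")]

# Plan: Load visits -> Group by url + count -> Filter by count (> 2 here).
plan = filter_op(
    group_and_count(load(visits), key=lambda v: v[1]),
    lambda url_count: url_count[1] > 2,
)
result = list(plan)
```

Nothing runs until `list(plan)` pulls tuples through the chain, which mirrors how a plan is first compiled and then executed.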

19 Query Plan - Join
  SELECT p1.header, p2.header FROM pages p1, pages p2, links l1, links l2 WHERE p1.url = l1.from AND p2.url = l2.to AND l1.id = l2.id
  Plan, from inputs to output:
    Load pages (p1), Load links (l1) -> Join l1.from = p1.url
    Load pages (p2), Load links (l2) -> Join l2.to = p2.url
    Outputs of both joins -> Join l1.id = l2.id

20 Dataflow Architectures
  Query plan = DAG
    Nodes = relational operators
    Links = queues
  Performance boosted through
    Data parallelism
    Compute parallelism
    Pipelining

21 Query Optimization
  Multiple plans can be logically equivalent
    Relational algebra allows operator re-ordering
    Multiple operator implementations possible
  Goal: pick a plan that minimizes query latency
    Minimize I/O, communication and computation
  No-brainer optimizations: push the filters deep
  Nontrivial optimizations: order of joins
    Challenge: exponential-size search space
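
To get a feel for the size of the join-ordering search space, the sketch below counts candidate plans for n relations under two common textbook assumptions that are not stated in the slides: cross products are allowed, and the left and right inputs of a join are distinguishable. Left-deep plans then number n!, and bushy plans number (2(n-1))!/(n-1)!. The function names are illustrative.

```python
from math import factorial


def left_deep_plans(n):
    """Number of left-deep join orders for n relations: n!."""
    return factorial(n)


def bushy_plans(n):
    """Number of bushy join trees for n relations: (2(n-1))! / (n-1)!."""
    return factorial(2 * (n - 1)) // factorial(n - 1)


# Print how quickly the search space grows with the number of relations.
for n in (2, 4, 6, 8, 10):
    print(n, left_deep_plans(n), bushy_plans(n))
```

Even at ten relations an exhaustive search over bushy plans is impractical, which is why optimizers rely on dynamic programming and heuristics.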

22 SQL versus Map-Reduce
  SQL
    Good for structured data
    Rich declarative API (tool for analysts)
    Machine optimization, non-transparent to users
  Map-Reduce
    Good for structured and unstructured data
    Simple programmatic API (tool for engineers)
    Users can optimize their jobs manually
  Can we enjoy the best of both worlds?

23 Implementation atop Map-Reduce
  Project, Filter: easy (how?)
  Sort: for free (not always required)
  Aggregation: mostly easy (how?)
    Stream processing (O(1) intermediate state)
    How to handle TOP-K, AVG, STDDEV?
  Join?
    Hint: use multiple Map inputs
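
One possible answer to the "how?" questions, sketched with a toy in-memory stand-in for the MR runtime (`run_map_reduce` is a hypothetical helper, not a Hadoop API). Project and filter fit in a map-only step; AVG becomes streamable if each mapper emits (sum, count) pairs, which can be combined in O(1) state per key. The field names and sample rows are made up.

```python
from collections import defaultdict


def run_map_reduce(records, mapper, reducer):
    """Toy MR runtime: map, shuffle by key, then reduce each key group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Keys are visited in sorted order, as after a real shuffle.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def filter_project_mapper(page):
    """Map-only filter + project: SELECT url FROM pages WHERE pagerank > 0.95."""
    url, pagerank = page
    if pagerank > 0.95:
        yield url, None


def avg_mapper(record):
    """Emit (sum, count) pairs so partial results can be merged."""
    url, latency_ms = record
    yield url, (latency_ms, 1)


def avg_reducer(key, values):
    """Merge partial (sum, count) pairs and finish with a single division."""
    total = sum(v[0] for v in values)
    count = sum(v[1] for v in values)
    return total / count


pages = [("a.com", 0.97), ("b.com", 0.40)]
selected = [url for page in pages for url, _ in filter_project_mapper(page)]

requests = [("a.com", 10), ("a.com", 30), ("b.com", 5)]
averages = run_map_reduce(requests, avg_mapper, avg_reducer)
```

STDDEV can be handled the same way by also carrying the sum of squares, while TOP-K needs each reducer to keep a bounded heap of K candidates.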

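24 Join over MR (1)
  SELECT L.x, R.y FROM L, R WHERE L.key = R.key
  [Diagram: table L is read by the M-Left mappers and table R by the M-Right mappers. Map output is sorted by key + source tag (L or R) and partitioned by key. Each reducer (R) receives, for every key, the L tuples together with the R tuples that share that key.]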
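
The diagram describes what is often called a reduce-side join. Below is a minimal single-process sketch of the idea, assuming rows of the form (key, value); the function names and sample data are hypothetical, and the shuffle is simulated with a dictionary rather than a real MR runtime.

```python
from collections import defaultdict
from itertools import product


def map_left(row):
    """M-Left: tag each L tuple with its source."""
    key, x = row
    yield key, ("L", x)


def map_right(row):
    """M-Right: tag each R tuple with its source."""
    key, y = row
    yield key, ("R", y)


def reduce_join(key, tagged_values):
    """Reducer: pair every L value with every R value sharing the key."""
    left = [v for tag, v in tagged_values if tag == "L"]
    right = [v for tag, v in tagged_values if tag == "R"]
    for x, y in product(left, right):
        yield x, y


def reduce_side_join(left_rows, right_rows):
    """Simulate SELECT L.x, R.y FROM L, R WHERE L.key = R.key."""
    shuffled = defaultdict(list)  # stands in for partition + sort by key
    for row in left_rows:
        for key, value in map_left(row):
            shuffled[key].append(value)
    for row in right_rows:
        for key, value in map_right(row):
            shuffled[key].append(value)
    output = []
    for key in sorted(shuffled):
        output.extend(reduce_join(key, shuffled[key]))
    return output


L = [(1, "x1"), (2, "x2"), (3, "x3")]
R = [(1, "y1"), (1, "y1b"), (3, "y3")]
join_result = reduce_side_join(L, R)
```

Key 2 appears only in L, so it produces no output, matching inner-join semantics.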

25 Join over MR (2)
  Typical situation: Big ⋈ Small
    E.g., Page Accesses (B) ⋈ User Details (S)
  Idea: avoid reduce altogether
  Approach: replicate S to all mappers
    Use shared files (distributed through the MR cache)
  At each mapper:
    Store S in RAM, hashed by join key
    Scan the B-partition, compute match per record
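
A minimal sketch of this map-side (broadcast) join, assuming the small table fits in memory. The function names, the (key, value) row layout, and the sample data are illustrative; in a real job the small table would reach each mapper through the framework's file cache rather than as a Python argument.

```python
def build_hash_table(small_rows):
    """Load the small table S into RAM, hashed by join key."""
    table = {}
    for key, value in small_rows:
        table.setdefault(key, []).append(value)
    return table


def map_side_join(big_partition, small_table):
    """Mapper: scan one partition of B and probe the in-memory copy of S."""
    for key, big_value in big_partition:
        for small_value in small_table.get(key, ()):
            yield big_value, small_value


# Hypothetical data: S = user details, B = page accesses split into partitions.
user_details = [("u1", "Haifa"), ("u2", "Tel Aviv")]
access_partitions = [
    [("u1", "a.com"), ("u3", "b.com")],
    [("u2", "a.com"), ("u1", "c.com")],
]

small_table = build_hash_table(user_details)  # replicated to every mapper
joined_rows = [pair
               for partition in access_partitions
               for pair in map_side_join(partition, small_table)]
```

Each partition is processed independently and no shuffle or reduce phase is needed, which is the point of the technique.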

26 Apache Hive
  Data warehouse atop Hadoop MR
    Structured (schema-based) data
  SQL dialect: HiveQL
    Select, project, join (2-way), aggregation
    Accommodates user-defined functions (UDF)
    [As of recently] fairly weak compiler optimization

27 Apache HCatalog
  Table and storage management service
    Shared schema and data type mechanism
  Table abstraction
    Users unaware of where/how their data is stored
  Interoperability across tools
    Map-Reduce, Hive, Pig (next class)

28 Summary
  SQL atop MR
    Good for data warehouses, batch queries
    The fewer legacy semantics, the better the scalability
  Bad for interactive ad-hoc queries
    Batch-oriented (high launch overhead)
    Scan-oriented, lookups are expensive
    Intermediate results materialized on DFS
    Dataflow underexploited (limited pipelining)

29 Next Class
  Pig Latin: a procedural query language
  Real-time query processing

30 Further Reading
  A comparison of approaches to large-scale data analysis
  Map Reduce: A Major Step Backwards
