Big Data Technology CS 236620, Technion, Spring 2013 Structured Databases atop Map-Reduce Edward Bortnikov & Ronny Lempel Yahoo! Labs, Haifa
Roadmap Previous class MR Implementation This class Query Languages atop MR Beyond MR Database Theory in a Nutshell Query Language Implementation Apache Hive and HCatalog
Map-Reduce Critique Too low-level (only engineers can program) Hand-coding for many common operations Data flows extremely rigid (linear) Hard to maintain, extend, and optimize This is exactly what SQL databases have been designed for! Can we reuse some of the good stuff?
Database Applications Dichotomy Online Transaction Processing (OLTP) Example: online e-commerce Write-intensive workloads Concurrency control, transactions Optimization goal: latency Data Warehousing/Data Analytics (DW/DA) Example: fraud analysis Read-dominated workloads Optimization goal: throughput
'80s-'90s: One-SQL-Fits-All
OLTP (benchmark: TPC-C): read-write workload, real-time, latency-sensitive, transaction-oriented
Analytics (benchmark: TPC-H): read-only workload, batch processing, non-transactional, ETL, BI tools, data cubes
Both served by a single SQL DBMS (Oracle, DB2, SQL Server, MySQL, ...): ACID transactions, moderate scale (TBs, not PBs)
The NoSQL Revolution Split between Analytics and OLTP Google (early 2000s): Map-Reduce vs Bigtable Open-source: Hadoop MR vs HBase Started by stripping away many features (overhead): simplified semantics, eliminated transactions Nowadays, returning to basics in many areas Focus for the next 2 weeks: analytics systems
Relational Databases in a Nutshell The data is structured Reflects Entities and Relationships ERD = Entity-Relationship Diagram ERD captured by schema Data set = set of tables Table = set of tuples (rows) Row = set of items (columns, attributes) Typically, uniquely addressable by primary key Relational Algebra Theoretical foundation for relational databases
Working with Relational Data Create the schema (1 or more tables) DDL Data Definition Language Load data into the tables Often requires ETL (Extract/Transform/Load) tools Query the data DML Data Manipulation Language SQL (Structured Query Language) Implementation of DDL and DML
Capturing Relationships
De-normalized design: nested data, single table
Normalized design: flat data, multiple tables with cross-references

De-normalized (nested, single table):
  Customer   Birthday   Transactions (Id, Date, Amount)
  Jones      1/1/1980   (12890, 14/10/2003, 87), (12904, 15/10/2003, 50)
  Wilkinson  1/2/1980   (12898, 14/10/2003, 21)
  Stevens    1/3/1980   (12907, 15/10/2003, 18), (14920, 20/11/2003, 70)

Normalized (flat, two tables):
  Customers:    Jones 1/1/1980 | Wilkinson 1/2/1980 | Stevens 1/3/1980
  Transactions: Jones 12890 14/10/2003 87 | Jones 12904 15/10/2003 50 | Wilkinson 12898 14/10/2003 21 | Stevens 12907 15/10/2003 18 | Stevens 14920 20/11/2003 70
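The normalized design above can be sketched in code. This is an illustrative example, not part of the lecture: it loads the slide's two tables into an in-memory SQLite database (table and column names are assumptions) and shows that a join recovers the de-normalized view at query time.

```python
import sqlite3

# Illustrative sketch of the slide's normalized design in SQLite;
# table and column names are assumptions, not part of the lecture.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (name TEXT PRIMARY KEY, birthday TEXT);
CREATE TABLE transactions (id INTEGER PRIMARY KEY,
                           customer TEXT REFERENCES customers(name),
                           date TEXT, amount INTEGER);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("Jones", "1/1/1980"), ("Wilkinson", "1/2/1980"),
                  ("Stevens", "1/3/1980")])
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)",
                 [(12890, "Jones", "14/10/2003", 87),
                  (12904, "Jones", "15/10/2003", 50),
                  (12898, "Wilkinson", "14/10/2003", 21),
                  (12907, "Stevens", "15/10/2003", 18),
                  (14920, "Stevens", "20/11/2003", 70)])

# A cross-table join reconstructs the de-normalized view at query time
rows = conn.execute("""
    SELECT c.name, c.birthday, t.id, t.date, t.amount
    FROM customers c JOIN transactions t ON t.customer = c.name
    ORDER BY t.id
""").fetchall()
for r in rows:
    print(r)
```

The runtime cost of this join per query is exactly the trade-off the next slide discusses.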
Database Designer's Dilemma Normalized Design Transparent, easy to maintain and extend Amenable to horizontal scalability Easy to verify referential integrity Not biased towards specific query workloads Runtime cost: cross-table joins De-normalized design Constrains application semantics Might suffer from modification anomalies Can be optimized for specific workloads, no joins
Example: TPC-H Schema
SQL (Structured Query Language) Declarative programming language Focus on what you want, not how to retrieve Flexible, Great for non-engineers Designed for relational databases Select (read) Insert/Update/Delete (write) Costs Administration (schema management) Query optimization (compiler or manual)
SQL Primitives Project SELECT url FROM pages Filter (Select) SELECT url FROM pages WHERE pagerank > 0.95
SQL Primitives Join (⋈) SELECT visits.user, pages.category FROM visits, pages WHERE visits.url = pages.url SELECT p1.header, p2.header FROM pages p1, pages p2, links l1, links l2 WHERE p1.url=l1.from AND p2.url=l2.to AND l1.id = l2.id
SQL Primitives Aggregation SELECT url, COUNT(url) FROM visits GROUP BY url HAVING COUNT(url) > 1000 COUNT, SUM, AVG, STDDEV, MIN, MAX
SQL Primitives Sort SELECT cookie, query FROM querylog WHERE date < 1/1/2013 AND date > 1/12/2012 ORDER BY cookie, date
SQL Implementation A SQL engine is a particular kind of programming-language implementation: Machine = [distributed] runtime environment Instruction set = relational operators, which receive and return tuple sets Query plan = compiler-generated program: operators + flow of control
Query Plan - Aggregation SELECT url, COUNT(url) FROM visits GROUP BY url HAVING COUNT(url) > 1000 Plan (bottom-up): Load visits -> Group by url -> Aggregate (COUNT) -> Filter by count
Query Plan - Join SELECT p1.header, p2.header FROM pages p1, pages p2, links l1, links l2 WHERE p1.url=l1.from AND p2.url=l2.to AND l1.id = l2.id Plan (bottom-up): Load pages, links (two scans of each) -> Join p1.url = l1.from -> Join p2.url = l2.to -> Join l1.id = l2.id
Dataflow Architectures Query plan = DAG Nodes = relational operators Links = queues Performance boosted through Data parallelism Compute parallelism Pipelining
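The pipelining idea above can be sketched with Python generators. This is an illustrative model, not any real engine's code: each operator pulls one tuple at a time from its child, so no intermediate result is ever materialized (the operator names and sample data are assumptions).

```python
# Pipelined query plan as a chain of generators: each operator pulls
# tuples from its child lazily, so nothing intermediate is materialized.
# Operator names and the sample pages relation are illustrative assumptions.

def scan(table):                      # leaf operator: Load
    yield from table

def filter_op(pred, child):           # Filter: keep tuples matching pred
    return (t for t in child if pred(t))

def project(cols, child):             # Project: keep only the listed columns
    return (tuple(t[c] for c in cols) for t in child)

pages = [("a.com", 0.97), ("b.com", 0.41), ("c.com", 0.99)]

# SELECT url FROM pages WHERE pagerank > 0.95
plan = project([0], filter_op(lambda t: t[1] > 0.95, scan(pages)))
result = list(plan)
print(result)  # the high-pagerank urls
```

Data parallelism would correspond to running copies of this pipeline over disjoint partitions of `pages`.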
Query Optimization Multiple plans can be logically equivalent Relational algebra allows operator re-ordering Multiple operator implementations possible Goal: pick a plan that minimizes query latency Minimize I/O, communication and computation No-brainer optimizations: push the filters deep Nontrivial optimizations: order of joins Challenge: exponential-size search space
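The "push the filters deep" optimization can be demonstrated on toy data (relations and values below are assumptions, not from the lecture): filtering before the join gives the same answer while shrinking the intermediate result.

```python
from itertools import product

# Illustrative sketch: pushing a filter below a join preserves the
# result but reduces the number of intermediate tuples.
# The visits/pages data is an assumption for demonstration only.
visits = [("u1", "a.com"), ("u2", "b.com"), ("u3", "a.com")]
pages  = [("a.com", 0.97), ("b.com", 0.41)]

def join(left, right):
    # naive nested-loop equi-join on the url column
    return [(u, url, pr)
            for (u, url), (purl, pr) in product(left, right) if url == purl]

# Plan 1: join first, then filter on pagerank
plan1 = [t for t in join(visits, pages) if t[2] > 0.9]

# Plan 2: filter pages first (predicate pushdown), then join
plan2 = join(visits, [p for p in pages if p[1] > 0.9])

assert sorted(plan1) == sorted(plan2)        # logically equivalent plans
print(len(join(visits, pages)), len(plan2))  # intermediate size: 3 vs 2
```

The join ordering problem is the hard part: with n relations there are exponentially many join trees to consider, which is why optimizers rely on heuristics and dynamic programming.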
SQL versus Map-Reduce SQL Good for structured data Rich declarative API (tool for analysts) Machine optimization, non-transparent to users Map-Reduce Good for structured and unstructured data Simple programmatic API (tool for engineers) Users can optimize their jobs manually Can we enjoy the best of both worlds?
Implementation atop Map-Reduce Project, Filter easy (how?) Sort for free (not always required) Aggregation mostly easy (how?) Stream processing (O(1) intermediate state) How to handle TOP-K, AVG, STDDEV? Join? Hint: use multiple Map inputs
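The aggregation case can be sketched as a map-reduce job, simulated in-process (in Hadoop the framework performs the shuffle/sort between the two phases; the visits data and the lowered threshold are assumptions for demonstration):

```python
from itertools import groupby
from operator import itemgetter

# Sketch of SELECT url, COUNT(url) ... GROUP BY url HAVING COUNT(url) > K
# as a map-reduce job, simulated locally. Data is an assumption; the
# threshold stands in for the slide's HAVING COUNT(url) > 1000.
visits = [("a.com",), ("b.com",), ("a.com",), ("a.com",), ("b.com",)]
THRESHOLD = 2

def mapper(record):
    url = record[0]
    yield (url, 1)              # project the url, emit a partial count

def reducer(url, counts):
    total = sum(counts)         # streaming aggregation: O(1) state per key
    if total > THRESHOLD:       # the HAVING clause, applied in the reducer
        yield (url, total)

# Simulate map, then shuffle/sort by key, then reduce
mapped = [kv for rec in visits for kv in mapper(rec)]
mapped.sort(key=itemgetter(0))
result = [out for key, group in groupby(mapped, key=itemgetter(0))
          for out in reducer(key, (v for _, v in group))]
print(result)
```

COUNT, SUM, MIN and MAX all fit this streaming pattern; AVG and STDDEV need the reducer to carry a small tuple of partial sums, and TOP-K needs a bounded heap rather than a single accumulator.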
Join over MR (1) SELECT L.x, R.y FROM L, R WHERE L.key = R.key Left mappers (M-Left) read L and right mappers (M-Right) read R; each tags its records with the source relation, then partitions and sorts them by the join key. Every reducer receives all L-records and R-records sharing a key and emits the joined tuples.
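The reduce-side join above can be sketched in a few lines, with the shuffle simulated by a sort (relation contents and column names are illustrative assumptions):

```python
from itertools import groupby
from operator import itemgetter

# Sketch of the reduce-side join: mappers tag each record with its source
# relation, the shuffle groups records by join key, and the reducer
# cross-products the two sides. Data and names are assumptions.
L = [(1, "x1"), (2, "x2")]    # (key, x)
R = [(1, "y1"), (1, "y2")]    # (key, y)

def map_left(rec):
    key, x = rec
    yield (key, ("L", x))     # tag with the source relation

def map_right(rec):
    key, y = rec
    yield (key, ("R", y))

mapped = [kv for rec in L for kv in map_left(rec)] + \
         [kv for rec in R for kv in map_right(rec)]
mapped.sort(key=itemgetter(0))    # shuffle: partition + sort by key

joined = []
for key, group in groupby(mapped, key=itemgetter(0)):
    rows = [v for _, v in group]
    lefts  = [x for tag, x in rows if tag == "L"]
    rights = [y for tag, y in rows if tag == "R"]
    joined += [(x, y) for x in lefts for y in rights]   # SELECT L.x, R.y

print(joined)
```

Note that every tuple of both relations crosses the network during the shuffle, which motivates the map-side variant on the next slide.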
Join over MR (2) Typical situation: Big ⋈ Small E.g., Page Accesses (B) ⋈ User Details (S) Idea: avoid the reduce phase altogether Approach: replicate S to all mappers Use shared files (distributed through the MR distributed cache) At each mapper: store S in RAM, hashed by join key Scan the B-partition, compute the match per record
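A single mapper's view of this map-side (broadcast) join can be sketched as a build-then-probe hash join (relation contents and field names are illustrative assumptions):

```python
# Sketch of the map-side join at one mapper: the small relation S is
# replicated to every mapper (via the distributed cache) and hashed by
# join key in RAM; the mapper then streams its partition of the big
# relation B and probes the hash table. Data and names are assumptions.

S = [("u1", "Alice"), ("u2", "Bob")]                      # small: user details
B = [("u1", "a.com"), ("u3", "c.com"), ("u2", "b.com")]   # big: page accesses

# Build phase: done once per mapper, S fits in RAM
details = {user: name for user, name in S}

# Probe phase: scan the B-partition, emit a match per record; no reduce needed
joined = [(user, url, details[user]) for user, url in B if user in details]
print(joined)
```

Because no reduce phase runs, the big relation is never shuffled; the only replication cost is shipping S to each mapper once.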
Apache Hive Data warehouse atop Hadoop MR Structured (schema-based) data SQL dialect - HiveQL Select, project, join (2-way), aggregation Accommodates user-defined functions (UDF) As of this writing, fairly weak compiler optimization
Apache HCatalog Table and storage management service Shared schema and data type mechanism Table abstraction Users unaware of where/how their data is stored Interoperability across tools Map-Reduce, Hive, Pig (next class)
Summary SQL atop MR Good for data warehouses, batch queries The less legacy semantics, the better scalability Bad for interactive ad-hoc queries Batch-oriented (high launch overhead) Scan-oriented, lookups are expensive Intermediate results materialized on DFS Dataflow underexploited (limited pipelining)
Next Class Pig Latin, a procedural query language Real-time query processing
Further Reading A Comparison of Approaches to Large-Scale Data Analysis MapReduce: A Major Step Backwards