Session# - AaS 2.1 Title SQL On Big Data - Technology, Architecture and Roadmap

Transcription

1 Session# - AaS 2.1 Title SQL On Big Data - Technology, Architecture and Roadmap Sumit Pal Independent Big Data and Data Science Consultant, Boston 1 Data Center World Certified Vendor Neutral Each presenter is required to certify that their presentation will be vendor-neutral. As an attendee you have a right to enforce this policy of having no sales pitch within a session by alerting the speaker if you feel the session is not being presented in a vendor neutral fashion. If the issue continues to be a problem, please alert Data Center World staff after the session is complete. 2 1

2 SQL On Big Data - Technology, Architecture and Roadmap What is Big Data What is SQL SQL on Big Data Why SQL on Hadoop Limitations of SQL on Hadoop, Challenges & Solution Types of SQL (Volume, Velocity & Variety) Architectures Batch, Interactive, Streaming Innovations happening in this space 3 What is Big Data 4 2

3 What is SQL (Structured Query Language) - ANSI Standard - Manipulate and work with Relational Databases - Create Database - Insert / Update / Delete data into Databases - Retrieve data from Databases with filtering criteria 5 What is OLAP (On Line Analytical Processing) - Used in BI Tools - Calculating Trends from Historical Data - Building Business Scenarios - Building Aggregates and View Data Dimensionally ( Time, Geography etc.) - Navigate and Explore Adhoc Analysis - Drill Down 6 3

4 What is Hadoop New Approach to Distributed Data Processing Data is Local, NOT move across Network Shared Nothing Architecture 1. No synchronization requirement among the nodes 2. Designed for failure Multiple copies of data 3. Consistent Architecture individual failures does not fail the job 7 What is Map Reduce 8 4

5 What is HDFS 9 Why SQL on Hadoop More and more data is getting available on Hadoop SQL is an incredibly popular data querying language Is a bridge between business analysts and organizations big data Analysts would like to query data on Hadoop without using MapReduce Seamless Integration with BI Tools Scale that traditional DBMS cannot offer 10 5

6 SQL on Hadoop Goals Distributed, Scale Out Architecture Avoid Expensive Analytic DBs and Appliances Avoid Data Movement from HDFS to Analytic DBs High Concurrency of end users 11 Why SQL on Hadoop Appliances 12 6

7 SQL in Hadoop Landscape 13 Challenges of SQL on Hadoop Map Reduce and HDFS traditionally meant to solve Batch Oriented Data Map Reduce is high latency Map Reduce Not designed for long Data Pipelines (Complex SQL is inefficiently expressed as many MR stages) Disk IO between Map & Reduce do lot of data Shuffling & Sort 14 7

8 Approaches to solve the challenges 15 Approaches to solve the challenges Storage layer optimizations Data retrieval data locality, storage layout / formats & indexing Indexing (JethroData later slide) File Formats Avro, ORC, Parquet, Sequence Files Choosing the optimal file format in Hadoop is one of the most essential drivers of functionality and performance for big data processing and query Data Compression ( Reduce IO) GZIP, BZIP2, LZO, Snappy, LZ4 Workloads are IO Bound Reduce IO Compression Algorithms Compression has a gotcha Must be Splittable for Hadoop Tradeoff Storage / Network Bandwidth / CPU 16 8

9 Analytic Types 17 Batch SQL on Hadoop Hive is designed for batch queries Uses Map Reduce in the background Primarily for queries on large data sets and large ETL jobs for batch workloads Not designed for OLTP OR real time What Hive values most are scalability (scale out add machines to cluster) Extensibility (with Map Reduce framework and UDF/UDAF/UDTF) Fault tolerance and loose coupling with its input formats and Ser/De 18 9

10 Batch SQL on Hadoop Slow multiple stages of Map Reduce Complex SQL Queries need MR job Chaining, incurs Multiple Shuffles and Disk Writes 19 Optimizations Hive ~ Interactive SQL Hints ( /*MAP JOIN */) ( /*STREAMABLE */) Partitioning & Bucketing Done in relational DBs, Bucketing takes care of Skewness Vectorization Reduces the CPU usage, for query scans, filters, aggregates, and joins. Processing a block of 1024 rows at a time Within block, each column is stored as a vector Uses few instructions and finishes each instruction in fewer clock cycles, by using processor pipeline and cache memory

11 Interactive SQL 21 MPP / Full Scan Architecture 22 11

12 MPP / Full Scan Architecture Client: SELECT day, sum(sales) FROM t1 WHERE prod= abc GROUP BY day Query Planner/ Mgr Query Planner /Mgr Query Planner/ Mgr Query Planner/ Mgr Query Planner/ Mgr Query Executor Query Executor Query Executor Query Executor Query Executor Data Node Data Node Data Node Data Node Data Node Performance and resources based on the size of the dataset 23 Impala Architecture Biggest Advantage Dis-Advantage No Data Movement out of the clusters, No SPOF Processes/Deamon on the each Data Node Reasons for High Performance C++ Instead of Java Runtime Code Generation New Execution Engine ( Not Map Reduce ) Work on Optimized Data/File Format Parquet (Columnar Compressed) 24 12

13 Impala Architecture Fast and Efficient IO manager handle large data spread across array of hard drives (rotational, or SSD) Designed to run on modern architecture, recommended chipsets (i.e. Sandy Bridge, Bulldozer), as the LLVM IR compiler will use newer hardware instructions to help maximize IO throughput Impala s execution engine is decoupled from the storage engine, allowing it to plug other storage engines underneath 25 Impala Architecture 26 13

14 Impala Architecture 27 Impala Architecture 28 14

15 Impala Architecture 29 Impala Architecture LLVM & Others Runtime code generation to improve execution times Perform just in time (JIT) compilation to generate machine code Produce query specific versions of functions critical to performance Virtual function calls incur a large performance penalty. If object type is known, use code generation to replace the virtual function call with inline HDFS feature short circuit local reads to bypass the DataNode protocol when reading from local disk 30 15

16 ImpalaToGo New Kid on the Block Thin Layer on top of Impala ( C++ ) Moves out Impala to its own Cluster (DeCouples Storage From Compute ) ImpalaToGo can now reside in its own Cluster ( Separate from Hadoop Cluster ) Challenge deploying the MPP/full scan compute architecture against data that is now separated by network Deals with it by copying relevant data to local cluster Uses Tachyon to optimize it. This is great when all the data used by all current users can fit fully in cache. Once it doesn t the network cost kicks back in 31 Vertica MPP Based Analytic DB Engine MPP Columnar RDBMS 10x 100x compared to RDBMS on Analytic Queries Scales Linearly Commodity Hardware Built in Fault Tolerance Highly Compressed data for low IO 32 16

17 Vertica with Hadoop (2 Use Cases) 33 Vertica with Hadoop (2 Use Cases) 34 17

18 Spark SQL 35 Spark SQL 36 18

19 JethroData Indexes in Hadoop 37 Indexed Architecture Client: SELECT day, sum(sales) FROM t1 WHERE prod= abc GROUP BY day Jethro Query Node Query Node 1. Index Access 2. Read data only for require rows Data Node Data Node Data Node Data Node Data Node Performance and resources based on the size of the result set 38 19

20 Streaming SQL - Architecture 39 Streaming SQL - Architecture Store first, process second are unable to scale for real time Big Data Hadoop unable to offer latency and throughput for real time applications Telecoms, IOT and Cybersecurity Streams are infinite tables Standing query that executes over data Operating continuously on data as they arrive and by updating results incrementally in real time 40 20

21 Streaming SQL - Architecture Streaming Aggregation AVG, COUNT, MAX, MIN, SUM, STDDEV_POP, STDDEV_SAMP, VAR_POP, VAR_SAMP Sliding Windowed analytics Streaming analytics and sliding windows Output record (or row) is produced for each new input record. Each field (or column) in the output record may be calculated using a different window or partition. Windows can be time or row based. 41 Streaming SQL - Architecture In Memory Processing Lock free Data Structures Stateless implementation enables SQL queries to be distributed over Nodes 42 21

22 Innovations SQL Improvements Big Data Analytics (Hive Specific) SQL on Un Structured Data OLAP on Hadoop with SQL Probabilistic Query Engines BlinkDB 43 ACID transactions (HIVE-5317) INSERT INTO tbl SELECT INSERT INTO tbl VALUES UPDATE tbl SET WHERE DELETE FROM tbl WHERE MERGE INTO tbl USING src ON WHEN MATCHED THEN... WHEN NOT MATCHED THEN... SET TRANSACTION LEVEL BEGIN/END TRANSACTION How is it being done (look at the paper in reference) The heart of the approach is the client side merge of the HDFS files and directories 44 22

23 Window/Analytic Query (HIVE-4197) Windowing functions LEAD, LAG, FIRST_VALUE, LAST_VALUE, OVER with standard aggregates: COUNT, SUM, MIN, MAX, AVG OVER with a PARTITION BY OVER with PARTITION BY and ORDER BY OVER with a window specification Window specifications support these standard options: ROWS ((CURRENT ROW) (UNBOUNDED [num]) PRECEDING) AND (UNBOUNDED [num]) FOLLOWING Ranking functions: Rank, NTile, DenseRank, CumeDist, PercentRank, NTILE 45 Innovations SQL on UnStructured Data 46 23

24 OLAP on Hadoop Apache Kylin Extremely Fast OLAP Engine at Scale ANSI SQL Interface on Hadoop Interactive Query Capability MOLAP Cube Seamless Integration with BI Tools 47 OLAP on Hadoop Apache Kylin 48 24

25 Choose Two Fast Response Low Latency SQL Engines Accurate Big Volume 49 Probabilistic SQL Query Engine Blink DB 50 25

26 Probabilistic SQL Query Engine Blink DB 51 Hybrid Transactional Analytical Processing Coined in early 2014 by Gartner New generation of in-memory data platforms OLTP & OLAP No data duplication/movement Hybrid Transaction/Analytical Processing In Memory technology Enabler Analytics over changing data ETL is not used any more SAP with the SAPHANA platform VoltDB, NuoDB, Clustrix, and MemSQLdatabase 52 26

27 Hybrid Transactional Analytical Processing

28 Why SQL on Hadoop is a Bad Idea Stefan Groschupf CEO and Chairman of Datameer When you put data into a structure, like SQL, you limit what you can do with the data in future. A question of Waterfall versus Agile in Data Analytics. SQL requires a waterfall design approach and is limited. Hadoop truly enables an agile analytics approach. You first collect all the data, and then you pull out whatever you want. stefan groschupf sqlhadoop.html#.vcitfxctr9c.linkedin 55 References JethroData Hive Paper icde2010.pdf Hive Transactions SQL to MR Translator state.edu/hpcs/www/html/publications/papers/tr 11 7.pdf Impala in Action Manning I was one of the reviewers Hive Cost Based Optimization 2.pdf Hive on Spark Hive Speed on spark is blazing fast or is itfinal Apache Kylin BlinkDB BlinkDB p212agarwalv2 Presto hadoop conference japan 2014?related=

29 References Hive Vectorization Hive LLAP longlived execution in hive Impala Paper Requiremtns of Stream Processing 57 ( Q& A ) Select Questions from Audience I will try to optimize the Queries 58 29

30 3 Key Things You Have Learned During this Session 1. What is Big Data 2. What is SQL 3. SQL on Big Data 59 Thank you 60 30