Processing Big Data With SQL on Hadoop. Jens Albrecht jens.albrecht@th-nuernberg.de

Why SQL for Big Data?
- Mature technology
- Broad knowledge available
- Powerful query language
- High interactive performance
- Many third-party tools for data analysis and visualization
- Flexible data structures
  - Semi-structured data
  - Changing schemas
- Self-service
  - Data integration on-the-fly
- Scalability
  - Analysis, integration, volume
- (Relatively) low cost
  - Commodity hardware, open source
Prof. Dr. Jens Albrecht SQL on Hadoop 3

Why SQL for Big Data?
[Figure labels: Extended DWH, Data Lake, Agility]

Hadoop-SQL Integration
- Hive (native Hadoop): HiveQL on MR / Tez over HDFS
- Pure Hadoop SQL engines: distributed SQL engine over HDFS
- Format-agnostic SQL engines: distributed SQL engine over HDFS, Hive, NoSQL, RDBMS
- RDBMS with Hadoop access: RDBMS over relational storage and Hadoop

Hive
- General
  - Developed initially by Facebook
  - SQL processing for HDFS and HBase
  - Table definitions in Hive Meta Store
  - Generation of MapReduce code
  - Schema-on-read via SerDe
- Advantages
  - Mature part of every Hadoop distribution
  - Simple setup
  - Java API for UDFs
  - Usage of many data formats via SerDe
- Disadvantages
  - Batch-oriented, slow
[Figure: HiveQL -> Hive (Meta Store, Execution, SerDe) -> MapReduce -> HDFS/HBase]

Schema-on-Write vs. Schema-on-Read
- Relational database (schema-on-write): multi-structured source data -> ETL -> relational DBMS -> SQL
- Big data processing (schema-on-read): multi-structured source data -> load as-is -> Hadoop -> schema mapped to original files -> SQL

Schema-on-Read: Hive & CSV
    CREATE TABLE gps_data (
        userid    INT,
        deviceid  INT,
        longitude STRING,
        latitude  STRING,
        utctime   STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;
    -- load data = copy
    LOAD DATA LOCAL INPATH 'new_data/gps.dat' OVERWRITE INTO TABLE gps_data;
    SELECT COUNT(*) FROM gps_data;
Sample data:
    4711  542815  49.454N  11.077E  10/01/2014@10:00:00UTC
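A minimal Python sketch of the schema-on-read idea behind the table definition above: the raw file is stored unchanged, and column names and types are applied only when a line is read. `GPS_SCHEMA` and `read_with_schema` are hypothetical names for illustration, and the sample record is assumed to be tab-separated. Hive does this through its SerDe layer, not through code like this.

```python
# Schema-on-read sketch: raw lines stay as-is; the schema is applied at read time.
# GPS_SCHEMA and read_with_schema are illustrative names, not part of Hive.

GPS_SCHEMA = [
    ("userid", int),
    ("deviceid", int),
    ("longitude", str),
    ("latitude", str),
    ("utctime", str),
]

def read_with_schema(raw_line, schema, delimiter="\t"):
    """Split one raw line and cast each field according to the schema."""
    fields = raw_line.rstrip("\n").split(delimiter)
    return {name: cast(value) for (name, cast), value in zip(schema, fields)}

# The sample record from the slide, assumed to be tab-separated.
raw_lines = ["4711\t542815\t49.454N\t11.077E\t10/01/2014@10:00:00UTC\n"]

rows = [read_with_schema(line, GPS_SCHEMA) for line in raw_lines]
row_count = len(rows)  # analogous to SELECT COUNT(*) FROM gps_data
print(rows[0]["userid"], row_count)
```

Changing the schema here only changes how the same raw lines are interpreted; no data has to be reloaded.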

Schema-on-Read: Hive & Regexp
    CREATE TABLE weblog (
        host     STRING,
        identity STRING,
        user     STRING,
        time     STRING,
        request  STRING,
        status   STRING,
        size     STRING,
        referer  STRING,
        agent    STRING
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
        "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
    )
    STORED AS TEXTFILE;
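To illustrate how a regex-based SerDe maps one text line to columns, here is a small Python sketch. It assumes the alternation form of the pattern (`-|...`) known from the Apache weblog example in the Hive documentation, rewritten as a Python raw string. The sample log line and the names `LOG_PATTERN`, `COLUMNS` and `parse_log_line` are made up for this example.

```python
import re

# Nine capture groups correspond to the nine table columns.
LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?'
)
COLUMNS = ["host", "identity", "user", "time", "request",
           "status", "size", "referer", "agent"]

def parse_log_line(line):
    """Return a column -> value dict, or None if the whole line does not match."""
    match = LOG_PATTERN.fullmatch(line)
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))

# Hypothetical sample line in Apache combined log format.
sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /index.html HTTP/1.0" 200 2326 '
          '"http://www.example.com/start.html" "Mozilla/4.08"')
record = parse_log_line(sample)
print(record["host"], record["status"], record["request"])
```

All columns arrive as strings, which matches the all-STRING table definition; any casting would happen later in the query.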

Schema-on-Read: Hive & JSON
[Figure: JSON data and its relational mapping in Hive]
Source: http://thornydev.blogspot.de/2013/07/querying-json-records-via-hive.html
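The idea of mapping JSON records to relational columns can be sketched in plain Python. This is only an illustration of the concept, not Hive's implementation: the records, the `mapping` from column names to dotted paths, and the helper `get_path` are all hypothetical.

```python
import json

def get_path(record, path):
    """Follow a dotted path such as 'address.city'; return None if a key is missing."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

# Hypothetical JSON records, one document per line.
raw = [
    '{"name": "alice", "address": {"city": "Nuremberg", "zip": "90489"}}',
    '{"name": "bob"}',
]
# Hypothetical column mapping: column name -> JSON path.
mapping = {"name": "name", "city": "address.city"}

table = [
    {column: get_path(json.loads(line), path) for column, path in mapping.items()}
    for line in raw
]
print(table)
```

Records that lack a mapped attribute simply yield a NULL-like value (`None`) in that column, which is how schema-on-read tolerates varying record structures.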

Hive on Tez (Stinger)
- Stinger Initiative
  - "Make Hive 100x faster"
  - Finished with Hive 0.13 (April 2014)
  - Replace MapReduce with Tez
  - Native columnar data format (ORC)
- Stinger.Next
  - Phase 1: Hive 0.14 (November 2014)
    - ACID transactions
  - Phase 2: (Q2 2015)
    - Subsecond queries with LLAP
    - Machine learning integration
  - Phase 3: (Q4 2015)
    - SQL:2011 analytic functions
    - Materialized views
[Figure: HiveQL -> Hive (Meta Store, Execution, SerDe) -> MapReduce or Tez / YARN -> HDFS/HBase]

Pure Hadoop SQL Engines
- Approach
  - Distributed, parallel SQL engine
  - Often uses Hive metadata
  - Support for optimized data formats
  - Hadoop as mandatory basis
- Advantages and disadvantages
  - Significantly faster than Hive
  - Low latency through dedicated engine
  - Operator pipelining and result caching
- Differentiation of solutions
  - Supported SQL functionality
  - Point querying
  - Cost-based optimizer / performance
  - Transaction support
[Figure: SQL -> distributed SQL engine (query coordination, local agents) -> data files on HDFS/HBase]

Pure Hadoop SQL Engines
[Figure: distributed SQL engine (query coordination, local agents) over data files on HDFS/HBase, with product labels including Big Insights]

Pure Hadoop SQL Engines
Example: IBM Big SQL
[Figure. Source: IBM]

Apache Spark & Spark SQL
- General
  - SQL engine based on Spark
  - Data access via DataFrames (formerly SchemaRDD)
  - In-memory columnar format
  - HDFS / HBase as storage
- Advantages
  - Spark as general-purpose parallel computing framework
  - Support of Hive extensions like UDFs and SerDes and Hive metadata
- Disadvantages
  - Not yet fully mature
  - Not yet as fast as competitors
[Figure: SQL -> Spark SQL (SchemaRDD, Execution, SerDe) -> Apache Spark -> HDFS/HBase]

Apache Spark
- Distributed in-memory computing framework
- Data caching
- General framework for all kinds of SQL and non-SQL analytics
- Support for out-of-the-box libraries as well as Java, Python and Scala in the same engine and for the same data
- New data sources API allows writing plugins for non-Hadoop sources
[Figure: Spark SQL, Spark Streaming, Machine Learning (MLlib), Graph Computation (GraphX) on top of the Spark Execution Engine; below it ZooKeeper, Hadoop YARN (optional), HDFS]

Apache Spark: Resilient Distributed Datasets (RDDs)
    val lines = spark.textFile("hdfs://...")
    val errors = lines.filter(_.startsWith("ERROR"))      // transformation
    errors.persist()
    // Return the time fields of errors mentioning HDFS,
    // assuming time is field number 3 in a TSV file
    val hdfs_errors = errors.filter(_.contains("HDFS"))   // transformation
    val time_fields = hdfs_errors.map(_.split('\t')(3))   // transformation
                                 .collect()               // action
Source: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Spark Transformations
- Transformations: create a new RDD from an existing one
  - Lazy evaluation: results are not materialized
  - Much more than map and reduce
  - map, filter, sample, groupByKey, sortByKey, reduceByKey, union, pipe, repartition, join, leftOuterJoin, rightOuterJoin
- Actions: return a value or dataset to the calling program
  - reduce, collect, count, first, take(n), saveAsTextFile
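The difference between lazy transformations and actions can be shown with a toy class in plain Python. This is an analogy under stated assumptions, not the Spark API: `LazyDataset`, its methods and the sample log lines are hypothetical, and the sketch ignores partitioning, caching and fault tolerance.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source        # any re-iterable collection
        self._steps = tuple(steps)   # recorded (kind, function) pairs

    # Transformations: return a new dataset, compute nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    # Actions: run the recorded pipeline and return a value.
    def _run(self):
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []  # records every predicate evaluation so laziness is observable

def is_error(line):
    calls.append(line)
    return line.startswith("ERROR")

lines = LazyDataset(["ERROR\thdfs\tdisk full", "INFO\tok", "ERROR\tnet\ttimeout"])
errors = lines.filter(is_error)            # transformation: nothing evaluated yet
calls_before_action = len(calls)
messages = errors.map(lambda l: l.split("\t")[2]).collect()   # action triggers work
print(calls_before_action, messages)
```

Because each transformation only extends the recorded pipeline, the engine sees the whole chain before executing it, which is what allows real systems to pipeline operators instead of materializing intermediate results.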

Adding Schema to RDDs
[Figure. Source: http://de.slideshare.net/jeykottalam/spark-sqlamp-camp2014]

SQL Engine with Pluggable Storage
- Approach
  - Distributed, parallel SQL engine
  - Often uses Hive metadata
  - Support for optimized file formats
  - Hadoop mandatory as basis
- General advantages
  - Significantly faster than Hive
  - Low latency by avoiding MapReduce
  - Operator pipelining and caching
  - Scalable
- General disadvantages
  - Usually no transaction support
[Figure: SQL -> distributed SQL engine (query coordination, local agents) -> connector plugins -> HDFS, Hive, JSON, Parquet, Cassandra, MySQL, Oracle]

Google Dremel
[Figure: Dremel]

Apache Drill
- Self-service data integration
  - No metadata repository required
  - Dynamic schema discovery: metadata is extracted automatically for data sources: RDBMS and Hive (comprehensive), HBase (partial) or files (on-the-fly)
  - Utilizes self-describing data formats (Parquet, JSON, Avro)
  - SQL DDL can be used to create metadata explicitly
- ANSI SQL plus flexible data model
  - Fully ANSI-compliant SQL
  - DrQL with SQL extensions for nested data structures (like JSON)
Source: http://www.mapr.com

Apache Drill: SQL for JSON
    select name, flatten(fillings) as f
    from dfs.users.`/donuts.json`
    where f.cal < 300;
Source: https://www.mapr.com/products/apache-drill
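What `flatten` does to a nested array can be sketched in a few lines of Python: each array element becomes its own output row, paired with the remaining fields of the parent record. The donut records below are invented for illustration, and the helper `flatten` is a simplified stand-in, not Drill's implementation.

```python
def flatten(rows, array_field, alias):
    """Yield one output row per element of row[array_field]."""
    for row in rows:
        for element in row.get(array_field, []):
            out = {k: v for k, v in row.items() if k != array_field}
            out[alias] = element
            yield out

# Hypothetical contents of donuts.json.
donuts = [
    {"name": "classic", "fillings": [{"name": "sugar", "cal": 500}]},
    {"name": "filled", "fillings": [{"name": "plain", "cal": 100},
                                     {"name": "custard", "cal": 300},
                                     {"name": "lemon", "cal": 120}]},
]

# Mirrors: select name, flatten(fillings) as f ... where f.cal < 300
result = [(r["name"], r["f"]["name"])
          for r in flatten(donuts, "fillings", "f")
          if r["f"]["cal"] < 300]
print(result)
```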

Apache Drill: SQL for Heterogeneous Data Formats
Formats: JSON, CSV, ORC, Parquet, HBase tables, MongoDB
    select USERS.name, USERS.emails.work
    from dfs.logs.`/data/logs` LOGS,
         dfs.users.`/profiles.json` USERS
    where LOGS.uid = USERS.uid
      and errorlevel > 5
    order by count(*);
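The core of such a query, joining data that lives in different formats, can be imitated in Python with the standard `csv` and `json` modules. The file contents below are hypothetical (the slide does not show them), and the sketch only covers the join and the filter, not the ordering.

```python
import csv
import io
import json

# Hypothetical log data in CSV format: uid, errorlevel, message.
logs_csv = "uid,errorlevel,message\n1,7,disk full\n2,3,slow query\n3,9,node lost\n"
# Hypothetical profiles in JSON (one document per line) with nested emails.
profiles_json = (
    '{"uid": 1, "name": "alice", "emails": {"work": "alice@example.com"}}\n'
    '{"uid": 3, "name": "carol", "emails": {"work": "carol@example.com"}}\n'
)

logs = list(csv.DictReader(io.StringIO(logs_csv)))
users = {u["uid"]: u for u in map(json.loads, profiles_json.splitlines())}

# Join on uid and keep only log entries with errorlevel > 5.
joined = [
    (users[int(log["uid"])]["name"], users[int(log["uid"])]["emails"]["work"])
    for log in logs
    if int(log["errorlevel"]) > 5 and int(log["uid"]) in users
]
print(joined)
```

Each source is read with its own parser and only then combined, which is roughly the role the storage plugins play for a format-agnostic engine.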

External Tables in HDFS
[Figure: RDBMS -> logical mapping -> HDFS]
    CREATE TABLE SCOTT.SALES_HDFS_EXT_TAB (
        PROD_ID       NUMBER(6),
        CUST_ID       NUMBER,
        TIME_ID       DATE,
        CHANNEL_ID    CHAR(1),
        PROMO_ID      NUMBER(6),
        QUANTITY_SOLD NUMBER(3),
        AMOUNT_SOLD   NUMBER(10,2)
    )
    ORGANIZATION EXTERNAL (
        TYPE ORACLE_LOADER
        DEFAULT DIRECTORY SALES_EXT_DIR
        ACCESS PARAMETERS (
            RECORDS DELIMITED BY NEWLINE
            FIELDS TERMINATED BY ','
            ( PROD_ID DECIMAL EXTERNAL, )
            PREPROCESSOR HDFS_BIN_PATH:hdfs_stream
        )
        LOCATION ( 'file_sales_1', 'file_sales_2', 'files_sales_3' )
    );

RDBMS with Hadoop Integration
- Approach
  - Towards genuine integration of Hadoop into RDBMS
  - Utilize Hadoop's computational power
  - Cost-based choice
- Advantages
  - Easiest way to use Hadoop as data source
  - Combined access to traditional and new data sources
- Disadvantages
  - Cost
  - Limited data sources
  - Vendor lock-in
[Figure: SQL -> parallel database (query coordination) -> relational tables, plus external tables and MR loader -> MapReduce -> HDFS/HBase]

RDBMS with Hadoop Integration
- Products
  - Microsoft PolyBase (part of MS Analytics Platform)
  - Oracle Big Data SQL (part of Oracle Big Data Appliance in combination with Exadata)
- Use cases
  - Extension of traditional BI system
  - Data-lake scenario with RDBMS as primary system and Hadoop for mass data
  - Mix of analytic and transactional load

Oracle Big Data SQL
- Integrates Hive metadata
- Allows hybrid queries
- Include Hadoop and NoSQL in relational queries
- Use Exadata Smart Scan technology
Source: http://www.oracle.com

Hadoop-SQL Integration
- Hive (native Hadoop): HiveQL on MR / Tez over HDFS (Stinger)
- Pure Hadoop SQL engines: distributed SQL engine over HDFS (Big Insights)
- Format-agnostic SQL engines: distributed SQL engine over HDFS, Hive, NoSQL, RDBMS
- RDBMS with Hadoop access: RDBMS over relational storage and Hadoop

> Hadoop File Formats

File Formats for Big Data
- Text formats
  - High storage usage, low compression, bad scan performance
- Dedicated formats (e.g. DB-internal)
  - Not open, no interoperability
- Requirements for Big Data
  - Interoperability
  - Low storage / good compression
  - High performance
  - Flexible schema
- A schema for a file format? Query performance for a file format?
- Big Data often has a nested structure and multiple schema variants and versions

Motivation for File Formats
[Figure. Source: http://de.slideshare.net/cloudera/hadoop-summit-36479635]

Considerations for File Formats
- Query tools
  - None
  - Frameworks like MapReduce, Spark, Cascading
  - Query engines like Pig, Hive, Impala
- Schema versioning
  - Schema present? If so, can it change?
- Splittability / partitioning
  - Can files be split for distributed processing?
  - Example: CSV: yes, XML: partial, MP4: no
- Block compression
  - Can blocks be independently compressed and distributed?
  - Block compression is a prerequisite for partition compression!
- File size
  - Size in bytes and number of files?
  - Hadoop likes big, splittable files!
  - Lots of small files cost performance
- Load profile
  - Write performance
  - Filter operations
  - Reading of single columns
  - Full scans
Source: http://inquidia.com/news-and-info/hadoop-file-formats-its-not-just-csv-anymore

Example Row Format: Avro
- Schema specification
  - Internally stored in binary format
  - Self-describing files
    record Person {
        string username;
        union { null, long } favouritenumber;
        array<string> interests;
    }
- Reader vs. writer schema
  - Allows different "views" on files
[Figure: Write: writer schema -> Avro data (stored with writer schema). Read: Avro parser combines reader schema and writer schema via resolution rules]
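The reader-versus-writer idea can be sketched in Python without the Avro library. The sketch models a schema as a plain dictionary of field names to defaults and covers only the field-level rules (match by name, drop writer-only fields, fill reader-only fields from defaults). Type promotion and unions are left out, and the `country` field and the stored record are hypothetical.

```python
# Simplified sketch of Avro-style schema resolution (not the Avro library).
# A schema is modeled as {field_name: default_value}; NO_DEFAULT marks
# fields that have no default.

NO_DEFAULT = object()

def resolve(record, writer_schema, reader_schema):
    """Project a record written with writer_schema onto reader_schema."""
    result = {}
    for field, default in reader_schema.items():
        if field in writer_schema:
            result[field] = record[field]      # present in both: copy value
        elif default is not NO_DEFAULT:
            result[field] = default            # reader-only field: use default
        else:
            raise ValueError(f"no value or default for field {field!r}")
    return result                              # writer-only fields are dropped

writer_schema = {"username": NO_DEFAULT, "favouritenumber": None,
                 "interests": NO_DEFAULT}
# Hypothetical newer reader schema: drops 'interests', adds 'country'.
reader_schema = {"username": NO_DEFAULT, "favouritenumber": None,
                 "country": "unknown"}

stored = {"username": "jens", "favouritenumber": 7, "interests": ["sql"]}
view = resolve(stored, writer_schema, reader_schema)
print(view)
```

The stored data is never rewritten; different reader schemas simply produce different views of the same file.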

Example Column Format: Parquet
- Columnar formats in general
  - Trade faster reads for slower writes
  - Very good compression
- Parquet files
  - Hybrid partitioning: sets of records in blocks, columnar within blocks
  - Zone maps per block as a kind of index (min/max values per column)
Image: http://de.slideshare.net/cloudera/hadoop-summit-36479635
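How per-block min/max values ("zone maps") let a scan skip data can be sketched in Python. The block layout, function names and sample rows are assumptions made for this example; real Parquet files keep such statistics in their metadata and use a far more elaborate layout.

```python
def build_blocks(rows, column, block_size):
    """Split rows into blocks and record min/max of `column` per block."""
    blocks = []
    for start in range(0, len(rows), block_size):
        chunk = rows[start:start + block_size]
        values = [r[column] for r in chunk]
        blocks.append({"min": min(values), "max": max(values), "rows": chunk})
    return blocks

def scan_greater_than(blocks, column, threshold):
    """Return matching rows and the number of blocks actually read."""
    matches, blocks_read = [], 0
    for block in blocks:
        if block["max"] <= threshold:
            continue                  # zone map proves: no match in this block
        blocks_read += 1
        matches.extend(r for r in block["rows"] if r[column] > threshold)
    return matches, blocks_read

# Hypothetical data, already ordered by amount so that block ranges are tight.
rows = [{"id": i, "amount": i * 10} for i in range(12)]   # amounts 0..110
blocks = build_blocks(rows, "amount", block_size=4)
matches, blocks_read = scan_greater_than(blocks, "amount", 75)
print(len(matches), blocks_read, len(blocks))
```

The benefit depends on how well the data is clustered: with randomly ordered values, most blocks would span the whole value range and few could be skipped.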

Databases as Lego Construction Kit!?
- Traditional monolithic RDBMS: SQL processor, data dictionary, distributed execution, storage management
- Hadoop DB building blocks: SQL processor; MapReduce, Spark; CSV, Seq, Avro, JSON, ORC, Parquet
- Generic execution engine
- Metadata sharing in Hive repository or self-describing file formats
- Operator push-down to intelligent file interfaces

> Summary

Considerations for SQL on Hadoop Solutions
- SQL functionality
  - Coverage of SQL standard
  - User-defined functions
  - Transactional safety
- Performance and stability
  - Multi-user workloads
  - Efficiency of joins and aggregations (I/O problems? Size limits?)
- Supported data and storage formats
  - Logical format: relational, JSON, none, ...
  - Physical formats: CSV, Parquet, ORC, Avro, ...
- Intelligent storage plugins / data federation
  - Access to various data sources beyond Hadoop
  - Pushdown predicates, access selected columns only
- Support for your Hadoop distribution

Hadoop vs. SQL
- Hadoop SQL technologies
  - Supplement traditional RDBMS
  - Extend traditional RDBMS
  - Develop new RDBMS
- Hadoop and SQL move closer together.
- SQL universe gets wider.
- Database systems become open and modular.

References
L. Chang et al.: HAWQ: A Massively Parallel Processing SQL Engine in Hadoop. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pages 1223-1234
A. Floratou, U. Minhas, F. Özcan: SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures. Proceedings of the VLDB Endowment, Vol. 7, No. 12, 2014
M. Hausenblas, J. Nadeau: Apache Drill: Interactive Ad-Hoc Analysis at Scale. Big Data Magazine, June 2013
M. Kornacker et al.: Impala: A Modern, Open-Source SQL Engine for Hadoop. 7th Biennial Conference on Innovative Data Systems Research (CIDR 2015)
D. J. DeWitt et al.: Split Query Processing in Polybase. Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 1255-1266
S. Melnik et al.: Dremel: Interactive Analysis of Web-Scale Datasets. PVLDB, 3(1-2):330-339, 2010
M. Zaharia et al.: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012
J. Albrecht, S. Alexander: Hadoop und SQL rücken enger zusammen. Computerwoche, Nov. 2013, http://www.computerwoche.de/a/hadoop-und-sql-ruecken-enger-zusammen,2549475
C. Deptula: Hadoop File Formats: It's Not Just CSV Anymore. Blog post, 2014, http://inquidia.com/news-and-info/hadoop-file-formats-its-not-just-csv-anymore
J. Le Dem: Efficient Data Storage for Analytics with Apache Parquet 2.0. Hadoop Summit 2014, http://de.slideshare.net/cloudera/hadoop-summit-36479635
M. Rathbone: 8 SQL-on-Hadoop frameworks worth checking out. Blog post, 2014, http://blog.matthewrathbone.com/2014/06/08/sql-engines-for-hadoop.html
P. Sripati: Resilient Distributed Datasets (RDD) for the impatient. Blog post, 2014, http://www.thecloudavenue.com/2014/01/resilient-distributed-datasets-rdd.html
S. Yegulalp: 10 ways to query Hadoop with SQL. InfoWorld, 2014, http://www.infoworld.com/article/2683729/hadoop/10-ways-to-query-hadoop-with-sql.html