Ad-hoc Query
Brown Bag Session, 07/11/2014
Julien, Poorna, Andreas

User Story
- Procedures are developer-friendly only and do not support ad-hoc queries
- Open datasets to a broader audience of non-developers
- Introduce schema to datasets
- Use SQL as the query paradigm
- Open to BI tools and analysts

Hive - Introduction
- Manage and query structured data
- SQL-like language: HiveQL
- Built on top of Hadoop
- Create Hive tables based on files from HDFS, HBase, or any custom file format
- Queries are executed as a graph of MapReduce jobs
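
Hive - How it works
[Diagram: a Hive client sends "SELECT a,b FROM mytable;" over Thrift to Hive Server 2. Hive Server 2 obtains the mytable schema over Thrift from the Hive Metastore, which is backed by a DB, and launches MR jobs on the YARN cluster.]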

Hive Integration with External Systems
- External tables vs. managed tables
  - table definition managed externally
  - data storage not handled by Hive
- Non-native tables vs. native tables
  - need an external component, a Storage Handler, to access data
  - set the hive.aux.jars.path configuration with the component's jar

Hive Storage Handlers
- Access data stored and managed by other systems
- Need to implement:
  - Input format: gets data splits for MR jobs and records from splits
  - Output format: writes records to the external storage system
  - SerDe: serializes / deserializes records
  - Object Inspector: analyzes the internal structure of a record
- For Datasets: DatasetInputFormat, DatasetSerDe, reuse Hive ObjectInspector

From Datasets to Hive
[Diagram: inside the MR job launched by the Hive server, dataset data is broken down into dataset splits, dataset records, and record fields. Each step goes through a Hive interface, backed by a RecordScannable method where one exists.]
- DatasetInputFormat getSplits() -> RecordScannable getSplits()
- DatasetInputFormat getRecordReader() -> RecordScannable createSplitRecordScanner()
- DatasetSerDe deserialize()
- DatasetSerDe getObjectInspector() -> RecordScannable getRecordType()
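
The flow on this slide can be sketched in plain Java. This is a minimal illustration, not the Reactor or Hive code: Split, RecordScannable, SketchInputFormat, SketchSerDe, Purchase, and InMemoryPurchases are simplified, hypothetical stand-ins, and the reflection-based column mapping is an assumption about how a record's fields could be exposed.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

/** Hypothetical, simplified stand-in for a dataset split. */
final class Split {
    final int index;
    Split(int index) { this.index = index; }
}

/** Hypothetical, simplified version of the RecordScannable contract. */
interface RecordScannable<R> {
    Class<R> getRecordType();
    List<Split> getSplits();
    Iterator<R> createSplitRecordScanner(Split split);
}

/** Plays the role of DatasetInputFormat: forwards Hive's calls to the dataset. */
final class SketchInputFormat<R> {
    private final RecordScannable<R> dataset;
    SketchInputFormat(RecordScannable<R> dataset) { this.dataset = dataset; }
    List<Split> getSplits() { return dataset.getSplits(); }
    Iterator<R> getRecordReader(Split split) { return dataset.createSplitRecordScanner(split); }
}

/** Plays the role of DatasetSerDe: exposes a record's fields as columns. */
final class SketchSerDe {
    private final List<Field> fields = new ArrayList<>();

    SketchSerDe(Class<?> recordType) {
        for (Field f : recordType.getDeclaredFields()) {
            if (f.isSynthetic() || Modifier.isStatic(f.getModifiers())) continue;
            f.setAccessible(true);
            fields.add(f);
        }
        // Sort by name so the column order does not depend on reflection order.
        fields.sort(Comparator.comparing(Field::getName));
    }

    /** Counterpart of getObjectInspector(): describes the record structure. */
    List<String> columnNames() {
        List<String> names = new ArrayList<>();
        for (Field f : fields) names.add(f.getName());
        return names;
    }

    /** Counterpart of deserialize(): turns one record into a row of field values. */
    List<Object> deserialize(Object record) {
        List<Object> row = new ArrayList<>();
        try {
            for (Field f : fields) row.add(f.get(record));
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
        return row;
    }
}

/** Example record type (hypothetical). */
final class Purchase {
    final String customer;
    final int amount;
    Purchase(String customer, int amount) { this.customer = customer; this.amount = amount; }
}

/** Example dataset holding one list of records per split (hypothetical). */
final class InMemoryPurchases implements RecordScannable<Purchase> {
    private final List<List<Purchase>> partitions;
    InMemoryPurchases(List<List<Purchase>> partitions) { this.partitions = partitions; }
    public Class<Purchase> getRecordType() { return Purchase.class; }
    public List<Split> getSplits() {
        List<Split> splits = new ArrayList<>();
        for (int i = 0; i < partitions.size(); i++) splits.add(new Split(i));
        return splits;
    }
    public Iterator<Purchase> createSplitRecordScanner(Split split) {
        return partitions.get(split.index).iterator();
    }
}

final class DatasetToHiveSketch {
    /** Walks splits -> records -> fields, as the MR job does on the slide. */
    static <R> List<List<Object>> scanAll(RecordScannable<R> dataset) {
        SketchInputFormat<R> inputFormat = new SketchInputFormat<>(dataset);
        SketchSerDe serDe = new SketchSerDe(dataset.getRecordType());
        List<List<Object>> rows = new ArrayList<>();
        for (Split split : inputFormat.getSplits()) {
            Iterator<R> reader = inputFormat.getRecordReader(split);
            while (reader.hasNext()) rows.add(serDe.deserialize(reader.next()));
        }
        return rows;
    }
}
```

In the real system each split would be processed by a separate map task; the single loop here only shows the order of the calls.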

Create a RecordScannable
- Most Datasets already implement BatchReadable
- getSplits() is shared by BatchReadable and RecordScannable
- Utility methods use BatchReadable methods to implement the RecordScannable ones
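
A minimal sketch of such a utility, assuming simplified versions of the interfaces: SplitReader, BatchReadable, RecordScanner, ScannableSketch, and InMemoryTable below are hypothetical stand-ins written for this example, not the Reactor API. The idea is that a key/value split reader can be wrapped so that it yields records.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

/** Hypothetical, simplified stand-in for a dataset split. */
final class Split {
    final int index;
    Split(int index) { this.index = index; }
}

/** Simplified key/value reader over one split. */
interface SplitReader<K, V> {
    boolean nextKeyValue();
    K getCurrentKey();
    V getCurrentValue();
}

/** Simplified BatchReadable: what most datasets are said to implement already. */
interface BatchReadable<K, V> {
    List<Split> getSplits();
    SplitReader<K, V> createSplitReader(Split split);
}

/** Simplified record scanner over one split. */
interface RecordScanner<R> {
    boolean nextRecord();
    R getCurrentRecord();
}

final class ScannableSketch {
    /** Adapts a key/value split reader into a record scanner. */
    static <K, V, R> RecordScanner<R> splitRecordScanner(
            SplitReader<K, V> reader, BiFunction<K, V, R> makeRecord) {
        return new RecordScanner<R>() {
            public boolean nextRecord() { return reader.nextKeyValue(); }
            public R getCurrentRecord() {
                return makeRecord.apply(reader.getCurrentKey(), reader.getCurrentValue());
            }
        };
    }
}

/** Example BatchReadable with a single split over a sorted in-memory map. */
final class InMemoryTable implements BatchReadable<String, Integer> {
    private final TreeMap<String, Integer> data;
    InMemoryTable(Map<String, Integer> data) { this.data = new TreeMap<>(data); }

    public List<Split> getSplits() { return Collections.singletonList(new Split(0)); }

    public SplitReader<String, Integer> createSplitReader(Split split) {
        Iterator<Map.Entry<String, Integer>> it = data.entrySet().iterator();
        return new SplitReader<String, Integer>() {
            private Map.Entry<String, Integer> current;
            public boolean nextKeyValue() {
                if (!it.hasNext()) return false;
                current = it.next();
                return true;
            }
            public String getCurrentKey() { return current.getKey(); }
            public Integer getCurrentValue() { return current.getValue(); }
        };
    }
}
```

With this shape, a dataset that is already BatchReadable only needs to supply the record type and a function from key/value pairs to records; getSplits() can be reused unchanged.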

Explore Module
[Diagram: Explore Executor (async HTTP service) -> Explore Service -> Hive CLI Service]
- Public endpoints to sendquery, getstatus, getschema, getresults, cancel, close
- Uses the Hive CLI Service, not HiveServer2: HiveServer2 only fully supports Thrift
- Wraps execution of a query in a long-running transaction
- Runs in a Twill container
- User code is executed by the Storage Handler
- Different implementations of Explore Service for different versions of Hive
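
The endpoint list suggests an asynchronous lifecycle: submit a query, poll its status, fetch results, then close. The sketch below illustrates that pattern under stated assumptions. ExploreClient, its method names and signatures, QueryStatus, and FakeExploreClient are hypothetical and do not describe the actual HTTP API.

```java
import java.util.List;

/** Hypothetical query states; the real service may report others. */
enum QueryStatus { RUNNING, FINISHED, FAILED }

/** Hypothetical client-side view of the endpoints listed on the slide. */
interface ExploreClient {
    String sendQuery(String sql);            // returns a query handle
    QueryStatus getStatus(String handle);
    List<String> getResults(String handle);
    void close(String handle);
}

final class QueryRunner {
    /** Submits a query, polls until it leaves RUNNING, and always closes the handle. */
    static List<String> run(ExploreClient client, String sql, long pollMillis) {
        String handle = client.sendQuery(sql);
        try {
            QueryStatus status;
            while ((status = client.getStatus(handle)) == QueryStatus.RUNNING) {
                Thread.sleep(pollMillis);
            }
            if (status == QueryStatus.FAILED) {
                throw new IllegalStateException("query failed: " + sql);
            }
            return client.getResults(handle);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while polling", e);
        } finally {
            // Assumption: closing the handle lets the server release the
            // resources held for the query.
            client.close(handle);
        }
    }
}

/** In-memory fake that reports RUNNING for a fixed number of polls. */
final class FakeExploreClient implements ExploreClient {
    private int pollsUntilDone;
    private final List<String> rows;
    boolean closed = false;

    FakeExploreClient(int pollsUntilDone, List<String> rows) {
        this.pollsUntilDone = pollsUntilDone;
        this.rows = rows;
    }
    public String sendQuery(String sql) { return "handle-1"; }
    public QueryStatus getStatus(String handle) {
        return pollsUntilDone-- > 0 ? QueryStatus.RUNNING : QueryStatus.FINISHED;
    }
    public List<String> getResults(String handle) { return rows; }
    public void close(String handle) { closed = true; }
}
```

Closing in a finally block matters here because each query is wrapped in a long-running transaction, so a handle that is never closed would presumably keep server-side state alive.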

Starting Explore
- In the ReactorMaster startup script, create the Explore class path
- In ReactorMaster:
  - Check the reactor.explore.enabled setting
  - Use a separate class loader to check that Hive jars are present with a supported version: Hive12 / Hive13 / Hive distributed with CDH 4.3-4.7, 5.0
  - Create the Explore Twill container and ship Hive classes and related Reactor classes: HBase classes, DatasetFramework.class, DatasetStorageHandler.class
  - Ship Hive conf files as resources to the container
- In the container:
  - Set the hive.aux.jars.path configuration with the related Reactor classes
  - CLI Service connects to the existing Metastore
  - Register the HTTP service with ZooKeeper
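
The separate class loader check could look roughly like the sketch below. It only tests whether a probe class can be loaded from a given set of jars in isolation. HiveClassPathCheck is a hypothetical name, and how Reactor actually maps the loaded classes to a supported Hive version is not covered by the slides.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;

final class HiveClassPathCheck {
    /**
     * Returns true if className can be loaded from the given jars.
     * The loader has no parent, so only the JDK's bootstrap classes and
     * the given jars are visible, not the application class path.
     */
    static boolean isClassAvailable(URL[] jars, String className) {
        try (URLClassLoader loader = new URLClassLoader(jars, null)) {
            loader.loadClass(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        } catch (IOException e) {
            // Thrown only if closing the loader fails.
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would pass the URLs of the jars found on the Explore class path and a class that only exists when Hive is present. org.apache.hive.service.cli.CLIService is one plausible probe class, but which class Reactor actually probes is an assumption.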
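
Deploying a Dataset
[Diagram: a REST client sends DS.jar through the Router to the DS Manager. The DS Manager adds the dataset to MDS, copies DS.jar to HDFS, and enables the dataset in the Explore Executor. The Explore Executor creates a table through the Hive CLI, and the Hive Meta Service adds the schema to MySQL.]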
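
The "create table" step can be illustrated by building the DDL for a non-native Hive table. STORED BY and WITH SERDEPROPERTIES are standard Hive clauses for storage handlers. Everything else in this sketch is an assumption: the CreateTableSketch helper, the package of the storage handler class, the dataset.name property key, and the column list.

```java
import java.util.Map;
import java.util.StringJoiner;

final class CreateTableSketch {
    /**
     * Builds a CREATE EXTERNAL TABLE statement for a table backed by a
     * storage handler. Inputs are not escaped; this is illustrative only.
     */
    static String createTableStatement(String table,
                                       Map<String, String> columns,
                                       String storageHandlerClass,
                                       String datasetName) {
        StringJoiner cols = new StringJoiner(", ", "(", ")");
        for (Map.Entry<String, String> column : columns.entrySet()) {
            cols.add(column.getKey() + " " + column.getValue());
        }
        return "CREATE EXTERNAL TABLE " + table + " " + cols
            + " STORED BY '" + storageHandlerClass + "'"
            // Hypothetical property telling the handler which dataset to read.
            + " WITH SERDEPROPERTIES ('dataset.name' = '" + datasetName + "')";
    }
}
```

For a dataset named mytable with a string column and an int column, this yields a statement of the form `CREATE EXTERNAL TABLE continuuity_user_mytable (customer STRING, amount INT) STORED BY '...' WITH SERDEPROPERTIES (...)`. The column map should preserve insertion order, for example a LinkedHashMap.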
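
Executing a Query
[Diagram: a REST client sends SQL through the Router to the Explore Executor and then polls to get results. The Explore Executor starts a transaction and passes the query to the Hive CLI, which retrieves metadata from the Hive Meta Service (backed by MySQL) and starts MR jobs. In the MR jobs, the Storage Handler calls getsplits and reads the dataset; the DS Manager reads MDS and loads DS.jar from HDFS.]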

JDBC Connector
- Eventually open Reactor Datasets to external systems
- Right now, a limited version for running ad-hoc queries in unit tests

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;

    // Load the Explore JDBC driver and connect to a local Reactor.
    Class.forName("com.continuuity.explore.jdbc.ExploreDriver");
    Connection con = DriverManager.getConnection("jdbc:reactor://localhost:10000");
    ResultSet res = con.prepareStatement("select * from continuuity_user_mytable").executeQuery();
    res.next();
    String firstColumn = res.getString(1);

Troubles Along the Way
- Class loading issues
  - Hive packages libraries with different versions than Reactor
- Dataset class loading issues
  - DatasetFramework does not cache class loaders used to instantiate datasets
  - Instantiating a system dataset that requires user types is not supported
- Different Hive components read Hive conf from different places
- Hive in-memory map reduce logic is broken: patch for hive-exec.jar
- Schemas with recursive data types make the object inspector go into an infinite loop
- CLIService implementation is slightly different for different versions of Hive
- Hive CLI Service does not have a startAndWait method
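
The recursive-schema problem can be illustrated with a small check that detects whether a record type refers back to itself through its fields. This is a sketch of one possible guard, not the fix that was applied in Reactor or Hive. RecursiveTypeCheck, TreeNode, and Point are hypothetical names, and generic type arguments such as the element type of a List are not inspected.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayDeque;
import java.util.Deque;

final class RecursiveTypeCheck {
    /** Returns true if the type reaches itself again through its instance fields. */
    static boolean isRecursive(Class<?> type) {
        return visit(type, new ArrayDeque<>());
    }

    private static boolean visit(Class<?> type, Deque<Class<?>> path) {
        // Look through array types to their element type.
        while (type.isArray()) type = type.getComponentType();
        // Treat primitives and JDK classes as leaves.
        if (type.isPrimitive() || type.getName().startsWith("java.")) return false;
        // Seeing a type that is already on the current path means a cycle.
        if (path.contains(type)) return true;
        path.push(type);
        try {
            for (Field field : type.getDeclaredFields()) {
                if (field.isSynthetic() || Modifier.isStatic(field.getModifiers())) continue;
                if (visit(field.getType(), path)) return true;
            }
            return false;
        } finally {
            path.pop();
        }
    }
}

/** Example of a recursive record type. */
final class TreeNode {
    int value;
    TreeNode left;
    TreeNode right;
}

/** Example of a flat record type. */
final class Point {
    int x;
    int y;
}
```

A schema walker without such a path check would keep descending into TreeNode.left forever, which matches the infinite loop described on the slide.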

Future Work
- JDBC connector with more functionality, to be used by BI tools
- Support secure Hive
- Support for UDFs
- Pick up record-scannable datasets when Explore is switched to enabled
- UI
- Hive on Tez
- Hive on Spark?

Thank you