The Role of PolyBase in the MDW
Brian Mitchell, Microsoft Big Data Center of Expertise
Program
- PolyBase Basics
- PolyBase Scenarios
- Hadoop for Staging
- Ambient Data from Hadoop
- Export Dimensions to Hadoop
- Hadoop as a Data Archive
- Demos Throughout
The traditional data warehouse: "Data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing." (Gartner, The State of Data Warehousing in 2012)
The traditional data warehouse is under pressure from four directions:
1. Increasing data volumes
2. Real-time data
3. New data sources and types (non-relational data)
4. Cloud-born data
Introducing the Microsoft Analytics Platform System, the turnkey modern data warehouse appliance:
- Relational and non-relational data in a single appliance
- Enterprise-ready Hadoop
- Integrated querying across Hadoop and PDW using T-SQL
- Direct integration with Microsoft BI tools such as Microsoft Excel
- Near real-time performance with in-memory columnstore
- Ability to scale out to accommodate growing data
- Removal of data warehouse bottlenecks with MPP SQL Server
- Concurrency that fuels rapid adoption
- Industry's lowest data warehouse appliance price per terabyte
- Value through a single appliance solution
- Value with flexible hardware options using commodity hardware
Hardware and software engineered together: the ease of an appliance. The Analytics Platform System combines SQL Server Parallel Data Warehouse, PolyBase, and Microsoft HDInsight in a pre-built hardware + software appliance co-engineered with Dell, HP, and Quanta:
- Pre-built hardware and pre-installed software
- Plug and play
- Built-in best practices
- Time savings
- Built for Big Data
HDInsight Region
APS delivers enterprise-ready Hadoop with HDInsight: manageable, secured, and highly available Hadoop integrated into the appliance.
- High performance, tuned within the appliance
- End-user authentication with Active Directory
- 100-percent Apache Hadoop
- Managed and monitored using System Center
- Accessible insights for everyone with Microsoft BI tools
APS appliance overview: a region is a logical container within an appliance. The appliance hosts a Parallel Data Warehouse workload, an HDInsight workload, and the fabric; each workload runs within its own security, metering, servicing, and hardware boundaries.
HDInsight Overview
- It's HDI running on an appliance as a workload
- HDInsight is the Microsoft-branded Hortonworks distro (HDP 1.3)
For AU1:
- An integrated appliance for running a PDW region and an HDI region
- PDW is offered as a stand-alone workload on the appliance; HDI is offered only as an add-on to PDW
- Only supported on V2 hardware
- H/A for the Head Node is Failover Clustering; Data Node H/A is HDFS
What's included?
Hardware Topology
- Uses PDW hardware and topology; no new SKUs for the HDI region
- Two additional servers on rack 1 for the HDI Head Node (1 active / 1 failover)
- PDW failover/spare and Hadoop failover/spare
- Passive scale unit for PDW
[Rack diagram: PDW Control Node, HDI Head Node, HDI Data Nodes (1 scale unit), PDW Compute Nodes (1 scale unit), redundant InfiniBand and Ethernet switches, DL360 G8 servers, and D6000 JBODs across rack units u1-u42]
Connecting islands of data with PolyBase: bringing Hadoop point solutions and the data warehouse together for users and IT. A single query can span Microsoft Azure HDInsight, Hortonworks for Windows and Linux, Cloudera, SQL Server Parallel Data Warehouse, and Microsoft HDInsight. PolyBase:
- Provides a single T-SQL query model for PDW and Hadoop with rich features of T-SQL, including joins without ETL
- Uses the power of MPP to enhance query execution performance
- Supports Windows Azure HDInsight to enable new hybrid cloud scenarios
- Provides the ability to query non-Microsoft Hadoop distributions, such as Hortonworks and Cloudera
PolyBase in APS AU1:
- New versions of Hadoop
- New file types
- Multiple Hadoop connections
- Predicate pushdown
How to query any data, in any location, in any format? Three constructs: External Tables, External Data Sources, and External File Formats.
Concept of External Tables, Data Sources & File Formats
PolyBase: enhancing the PDW query engine. [Architecture diagram: data scientists, BI users, and DB admins reach Microsoft APS through your apps and Power BI; PolyBase external tables, external data sources, and external file formats front the APS control and data nodes and the PolyBase/APS query engine; source data flows in from web apps, social apps, mobile apps, and sensor & RFID feeds.]
External tables
- Internal representation of data residing outside of the appliance
- Introduces modified syntax (compared to PolyBase v1), with seamless upgrade of existing v1 external tables
- SQL permissions required for creating external tables: ADMINISTER BULK OPERATIONS, CREATE TABLE, and ALTER ON SCHEMA permission; ALTER ANY EXTERNAL DATA SOURCE and FILE FORMAT permission

Syntax:

    CREATE EXTERNAL TABLE table_name
        ( {<column_definition>} [ ,...n ] )
    WITH (
        LOCATION = '<file_path>',      -- path of the Hadoop file/folder
        DATA_SOURCE = <data_source>,   -- references an external data source
        FILE_FORMAT = <file_format>    -- references an external file format
        [, REJECT_VALUE = <value>]     -- (optional) reject parameters
    )
    [;]
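As a concrete illustration, here is a minimal sketch of an external table over pipe-delimited web logs in HDFS. The table, columns, HDFS path, and object names are hypothetical; the data source and file format are assumed to be created as shown on the following slides:

    CREATE EXTERNAL TABLE WebLogs (
        log_date date,
        user_ip  varchar(50),
        url      varchar(500)
    )
    WITH (
        LOCATION = '/logs/weblogs/',    -- hypothetical HDFS folder
        DATA_SOURCE = MY_HDP2_0,        -- hypothetical external data source
        FILE_FORMAT = MyDelimitedText,  -- hypothetical external file format
        REJECT_VALUE = 100              -- tolerate up to 100 unparsable rows
    );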
External data sources
- Internal representation of an external data source
- Enables and disables split-based query processing
- ALTER ANY EXTERNAL DATA SOURCE permission required
- Supports Hadoop as a data source and Windows Azure Blob Storage (WASB, formerly known as ASV)
- Generates MapReduce jobs on the fly (fully transparent for the end user)

Syntax:

    CREATE EXTERNAL DATA SOURCE datasource_name
    WITH (
        TYPE = <data_source>,                     -- type of external data source
        LOCATION = <location>                     -- location of external data source
        [, JOB_TRACKER_LOCATION = <jt_location>]  -- enables or disables MapReduce job generation
    )
    [;]
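Following the template above, a minimal sketch of a Hadoop data source; the name, IP address, and ports are hypothetical:

    CREATE EXTERNAL DATA SOURCE MY_HDP2_0
    WITH (
        TYPE = HADOOP,                             -- Hadoop (vs. WASB) data source
        LOCATION = 'hdfs://10.10.10.10:8020',      -- hypothetical NameNode address
        JOB_TRACKER_LOCATION = '10.10.10.10:8050'  -- enables MapReduce job generation (pushdown)
    );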
External file format
- Internal representation of an external file format
- Enables and disables split-based query processing
- ALTER ANY EXTERNAL FILE FORMAT permission required
- Supports delimited text files and Hive RCFiles
- Generates MapReduce jobs on the fly

Syntax:

    CREATE EXTERNAL FILE FORMAT fileformat_name
    WITH (
        FORMAT_TYPE = <type>                    -- type of external file format
        [, SERDE_METHOD = <serde_method>]       -- (de)serialization method [Hive RCFile]
        [, DATA_COMPRESSION = <compr_method>]   -- compression method
        [, FORMAT_OPTIONS (<format_options>)]   -- (optional) format options [text files]
    )
    [;]
Support of additional HDFS file formats: Hive RCFiles
- Hadoop/Hive users prefer RCFile due to better compression and performance benefits
- Record Columnar File: binary key/value pairs; RCFile stores the columns of a table in a record columnar way
- The user has to specify the serialization/deserialization method (SERDE_METHOD)

    CREATE EXTERNAL FILE FORMAT MyRCFile
    WITH (
        FORMAT_TYPE = RCFile,
        SERDE_METHOD = LazyBinarySerDe
    );

Some in-house performance observations:
- LazyBinaryColumnarSerDe is significantly faster and more efficient than ColumnarSerDe
- Data compression is not very beneficial in the case of InfiniBand connectivity between Hadoop and PDW (if low-speed networking is used, compression is expected to help)
Format options for delimited text files:

    <format_options> ::=
        [, FIELD_TERMINATOR = 'value']   -- indicates the column delimiter
        [, STRING_DELIMITER = 'value']   -- the delimiter for string data type fields
        [, DATE_FORMAT = 'value']        -- specifies a particular date format
        [, USE_TYPE_DEFAULT = 'value']   -- how missing entries in text files are treated
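A minimal sketch of a delimited text file format using these options; the format name and option values are hypothetical:

    CREATE EXTERNAL FILE FORMAT MyDelimitedText
    WITH (
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS (
            FIELD_TERMINATOR = '|',      -- pipe-delimited columns
            STRING_DELIMITER = '"',      -- strings wrapped in double quotes
            DATE_FORMAT = 'yyyy-MM-dd',  -- ISO-style date fields
            USE_TYPE_DEFAULT = TRUE      -- missing fields take the column type's default
        )
    );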
(HDFS) Bridge: direct and parallelized HDFS access. Enhances the Data Movement Service (DMS) of APS to allow direct communication between HDFS data nodes and PDW compute nodes. [Diagram: non-relational data (social apps, mobile apps, sensor and RFID, web apps) and relational data (traditional schema-based data warehouse applications) are queried with regular T-SQL via external tables, external data sources, and external file formats; the enhanced PDW query engine returns results through the HDFS bridge between Hadoop and PDW.]
Querying external Hadoop data via T-SQL
Predicate Pushdown: reduce data movement (see the sketch after this list).
- Reduce the number of rows moved
- Reduce the number of columns moved
- Supports a subset of expressions and operators
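For instance, a filter query like the following sketch (using the ClickStream external table defined on the next slide) moves only the selected columns and matching rows out of Hadoop; with pushdown, the predicate can be evaluated by a MapReduce job on the Hadoop side. The date literal is hypothetical:

    SELECT url, event_date             -- column reduction: only two columns move
    FROM ClickStream
    WHERE event_date >= '2014-01-01';  -- row reduction: predicate eligible for pushdown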
Querying Hadoop data via T-SQL:
I. Query data in HDFS and display results in table form (via external tables)
II. Join data from HDFS with relational APS/PDW data

Running example: creating the external table ClickStream, referencing an external data source and file format:

    CREATE EXTERNAL TABLE ClickStream (
        url varchar(50),
        event_date date,
        user_ip varchar(50)
    )
    WITH (
        LOCATION = '//Hadoop_files/clickstream.tbl',
        DATA_SOURCE = MY_HDP2_0,
        FILE_FORMAT = MyDelimitedText
    );

PolyBase query examples:

    -- 1. Filter query against data in HDFS
    SELECT TOP 10 url FROM ClickStream WHERE user_ip = '192.168.0.1';

    -- 2. Join data from various files in HDFS (Url_Descr is a second text file)
    SELECT url.description
    FROM ClickStream cs, Url_Descr url
    WHERE cs.url = url.name AND cs.url = 'www.cars.com';

    -- 3. Join data from HDFS with data in PDW ([User] is a distributed PDW table)
    SELECT user_name
    FROM ClickStream cs, [User] u
    WHERE cs.user_ip = u.user_ip AND cs.url = 'www.microsoft.com';
Split-based query execution through PolyBase:
1. (HDFS/WASB) Bridge component: connects to and retrieves/writes data from/to Hadoop's distributed file system or Azure storage (containers).
2. Job Submitter component: generates MapReduce jobs on the fly for in-situ processing; transparent for the end user (no need to learn MapReduce); M/R jobs are executed by Hadoop's job tracker; a cost-based decision (based on statistics) determines when to push computation versus directly importing the data.
3. Optimized storage layer: PPAX hybrid columnar-row storage; all HDFS file formats are transformed into optimized PPAX.
Cost-based Decision I (for split-based query execution)
- Leverages SQL Server on the control node as a query compilation aid (distributed query plan)
- Users can create statistics on an external table (full scan vs. sampling)
- Cost-based decision on push-down: the APS/PolyBase query engine uses the stats to determine the data volume to be transferred
- Cost factors: I/O and data transfer cost, assuming high-speed networking (>10G Ethernet)

PolyBase create statistics example:

    CREATE STATISTICS UserIP_Stats ON ClickStream (user_IP) WITH FULLSCAN;
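Since the slide contrasts full scan with sampling, a sampled variant might look like this sketch; the statistics name, column, and sampling percentage are hypothetical:

    -- Sampled statistics trade some accuracy for a faster scan of the external data
    CREATE STATISTICS Url_Stats ON ClickStream (url) WITH SAMPLE 20 PERCENT;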
Cost-based Decision II (for split-based query execution)
- The major factor in the decision is data volume reduction
- Spin-up time for MapReduce is around 20-30 seconds (varies depending on the Hadoop distribution and underlying OS)
- Cardinality of the predicate matters: creating statistics is crucial for the quality of PolyBase query plans
- No push-down for scenarios where APS can execute in under 20-30 seconds without push-down
- Rough rule of thumb: don't consider pushdown for inputs that result in less than 1 GB per PDW *distribution*; transferring, writing, and processing 1 GB per distribution is faster than spinning up an M/R job
- Example: with 8 distributions per compute node, 2 compute nodes give 16 distributions, so pushdown pays off only for a file size > 16 GB
Cost-based Decision III (for split-based query execution)
- Queries can contain both push-able and non-push-able expressions (see the sketch after this list)
- Push-able expressions are evaluated on the Hadoop side (if possible); non-push-able ones are processed on the PDW side
- Joins in general are always executed on APS
- Predicates may be pushed down (if possible)
- Aggregations (partial or full) are performed in PDW; partial aggregation on Hadoop is envisioned for future APS releases
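A sketch of a query mixing both kinds of work; ClickStream is the external table from the running example, and dim_user is a hypothetical PDW table:

    SELECT u.user_name, COUNT(*) AS hits
    FROM ClickStream cs
    JOIN dim_user u                     -- join always executed on APS
      ON u.user_ip = cs.user_ip
    WHERE cs.url = 'www.microsoft.com'  -- predicate eligible for pushdown to Hadoop
    GROUP BY u.user_name;               -- aggregation performed in PDW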
Supported Configurations for AU1
- HDInsight on Analytics Platform System
- HDInsight's Windows Azure blob storage (WASB[S])
- Hortonworks on Windows Server (HDP 1.3, 2.0)
- Hortonworks on Linux (HDP 1.3, 2.0)
- Cloudera on Linux (CDH 4.3)
A Traditional Approach Under Pressure. [Diagram: business applications draw on RDBMS and EDW repositories fed by existing sources (CRM, ERP, clickstream, logs), while emerging sources (sensor, sentiment, geo, logs, unstructured) sit outside the pipeline.]
Why PolyBase? [Diagram: PDW alone vs. PDW with PolyBase.]
An Emerging Data Architecture
Integrating Big Data with Microsoft Data Warehousing and Business Intelligence. [Diagram: ETL processing flow.]
Using Hadoop for Staging
Traditional ETL. [Diagram: ETL processing (SSIS, etc.) feeding data warehousing and business intelligence.]
Long Term Raw Data Archiving
Transforming Data
New Data Types
Let's get Technical
Create External Table
CTAS: Create Table As Select. Because ClickStream is an external table, this single statement imports the Hadoop data into a distributed PDW table:

    CREATE TABLE mytable
    WITH (
        CLUSTERED COLUMNSTORE INDEX,       -- store the new table as a columnstore
        DISTRIBUTION = HASH (CustomerKey)  -- hash-distribute rows on CustomerKey
    )
    AS SELECT * FROM ClickStream;
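As a variant, a small dimension pulled from Hadoop would often be replicated rather than hash-distributed; a sketch, where dimGeography and the hdfsGeography external table are hypothetical:

    CREATE TABLE dimGeography
    WITH (
        CLUSTERED COLUMNSTORE INDEX,
        DISTRIBUTION = REPLICATE   -- copy the small dimension to every compute node
    )
    AS SELECT * FROM hdfsGeography;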
Demo
Using PolyBase to export from PDW to Hadoop
Exporting Conformed Dimensions to Hadoop
Export your Conformed Dimensions
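A sketch of what that export looks like with CETAS (the next slides show the pattern); the HDFS path is hypothetical, the data source and file format reuse the names from the archiving demo, and dimcustomer is the PDW dimension joined later in the deck:

    CREATE EXTERNAL TABLE hdfsDimCustomer
    WITH (
        LOCATION = '/dims/dimcustomer/',  -- hypothetical HDFS folder
        DATA_SOURCE = f14790hdp,
        FILE_FORMAT = pipedelimited
    )
    AS SELECT * FROM dimcustomer;         -- writes the dimension out to Hadoop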
Data Archiving
Hadoop as a Data Archive. [Diagram: ETL processing.]
CETAS: Create External Table As Select. This archives cold fact rows from PDW out to Hadoop:

    CREATE EXTERNAL TABLE hdfsfactalldataarchive
    WITH (
        LOCATION = 'user/administrator/passbac/all_data/',
        DATA_SOURCE = f14790hdp,
        FILE_FORMAT = pipedelimited
    )
    AS SELECT * FROM FactAllData
    WHERE transaction_year < 2000;  -- archive only the cold, pre-2000 rows
Demo
Join Data on the Fly
Joining Data: store your dimensional data on PDW and your fact data on Hadoop.
Join PDW & External Tables: no different from any other join you do today.

    SELECT c.name, d.year, SUM(sales)
    FROM FactSales s       -- external table (fact data in Hadoop)
    JOIN dimcustomer c     -- internal PDW table
      ON c.customerid = s.customerid
    JOIN dimdate d         -- internal PDW table
      ON s.dateid = d.dateid
    WHERE d.year = 2008
      AND c.name = 'Albertson & Brothers'
    GROUP BY c.name, d.year;
Demo
Wrap-up