The Role of PolyBase in the MDW. Brian Mitchell, Microsoft Big Data Center of Expertise



Agenda: PolyBase Basics; PolyBase Scenarios (Hadoop for Staging, Ambient Data from Hadoop, Exporting Dimensions to Hadoop, Hadoop as a Data Archive); Demos Throughout

The traditional data warehouse. "Data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing." Gartner, The State of Data Warehousing in 2012

The traditional data warehouse is under pressure from its data sources on four fronts: (1) increasing data volumes, (2) real-time data, (3) new data sources and types (non-relational data), and (4) cloud-born data.


Introducing the Microsoft Analytics Platform System, the turnkey modern data warehouse appliance: relational and non-relational data in a single appliance; enterprise-ready Hadoop; integrated querying across Hadoop and PDW using T-SQL; direct integration with Microsoft BI tools such as Microsoft Excel; near real-time performance with in-memory columnstore; the ability to scale out to accommodate growing data; removal of data warehouse bottlenecks with MPP SQL Server; concurrency that fuels rapid adoption; the industry's lowest data warehouse appliance price per terabyte; value through a single-appliance solution; and value through flexible options using commodity hardware.

Hardware and software engineered together, with the ease of an appliance: the Analytics Platform System is a pre-built hardware and software appliance, co-engineered with Dell, HP, and Quanta, combining SQL Server Parallel Data Warehouse, PolyBase, and Microsoft HDInsight. Pre-installed software, plug-and-play setup, and built-in best practices deliver time savings, and the appliance is built for big data.

HDInsight Region

APS delivers enterprise-ready Hadoop with HDInsight: manageable, secured, and highly available Hadoop integrated into the appliance alongside SQL Server Parallel Data Warehouse; high performance, tuned within the appliance; end-user authentication with Active Directory; 100-percent Apache Hadoop; managed and monitored using System Center; and accessible insights for everyone through Microsoft BI tools.

APS appliance overview: a region is a logical container within the appliance. The appliance hosts a Parallel Data Warehouse workload and an HDInsight workload on a shared fabric, with each workload bounded by its own security, metering, servicing, and hardware.

HDInsight Overview. It's HDI running on the appliance as a workload. HDInsight is the Microsoft-branded Hortonworks distribution (HDP 1.3 for AU1). The integrated appliance runs a PDW region and an HDI region: PDW is offered as a stand-alone workload on the appliance, while HDI is offered only as an add-on to PDW and is only supported on V2 hardware. High availability for the Head Node is provided by failover clustering; Data Node H/A is provided by HDFS.

What s included?

Hardware Topology: uses PDW hardware and topology, with no new SKUs for the HDI region. Two additional servers on rack 1 host the HDI Head Node (one active, one failover). The rack combines the PDW Control Node, PDW Compute Nodes and HDI Data Nodes in scale units (each workload with failover/spare servers, plus a passive scale unit for PDW), along with InfiniBand and Ethernet switches and JBOD storage.

Connecting islands of data with PolyBase: bringing Hadoop point solutions (Microsoft Azure HDInsight, Hortonworks for Windows and Linux, Cloudera) and the data warehouse together for users and IT. PolyBase provides a single T-SQL query model for PDW and Hadoop with the rich features of T-SQL, including joins without ETL; uses the power of MPP to enhance query execution performance; supports Windows Azure HDInsight to enable new hybrid cloud scenarios; and can query non-Microsoft Hadoop distributions such as Hortonworks and Cloudera.

PolyBase in APS AU1: new versions of Hadoop, new file types, multiple Hadoop connections, and predicate pushdown.

How do you query any data, in any location, in any format? Through three new objects: External Tables, External Data Sources, and External File Formats.
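A minimal sketch of how the three objects fit together, using hypothetical names throughout (the cluster address, data source, file format, and table are all assumptions for illustration, and exact keywords vary by APS/SQL Server release):

```sql
-- 1. Where the data lives (hypothetical HDP cluster address).
CREATE EXTERNAL DATA SOURCE DemoHadoop
WITH (TYPE = HADOOP, LOCATION = 'hdfs://10.0.0.1:8020');

-- 2. How the files are laid out.
CREATE EXTERNAL FILE FORMAT DemoPipeText
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));

-- 3. The schema projected over the files.
CREATE EXTERNAL TABLE DemoClicks (
    url        varchar(50),
    event_date date
)
WITH (LOCATION    = '/demo/clicks/',
      DATA_SOURCE = DemoHadoop,
      FILE_FORMAT = DemoPipeText);

-- Once defined, the external table queries like any other table:
SELECT TOP 10 url FROM DemoClicks;
```

The data source and file format are reusable: many external tables can point at the same cluster and share the same layout definition.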

Concept of External Tables, Data Sources & File Formats

PolyBase enhances the PDW query engine. Data scientists, BI users, and DB admins reach the data through your apps and Power BI; within Microsoft APS, external tables, external data sources, and external file formats feed the PolyBase/APS query engine on the APS control and data nodes, covering data from web, social, and mobile apps as well as sensor and RFID sources.

External tables are the internal representation of data residing outside the appliance. This release introduces modified syntax (compared to PolyBase v1), with seamless upgrade of existing v1 external tables. SQL permissions required for creating external tables: ADMINISTER BULK OPERATIONS, CREATE TABLE, and ALTER ON SCHEMA, plus ALTER ANY EXTERNAL DATA SOURCE and ALTER ANY EXTERNAL FILE FORMAT.

CREATE EXTERNAL TABLE table_name
    ( { <column_definition> } [ ,...n ] )
WITH (
    LOCATION    = '<file_path>',       -- path of the Hadoop file/folder
    DATA_SOURCE = <data_source>,       -- references an external data source
    FILE_FORMAT = <file_format>        -- references an external file format
    [ , REJECT_VALUE = <value> ]       -- (optional) reject parameters
) [;]

External data sources are the internal representation of an external data source. They enable or disable split-based query processing (generation of MapReduce jobs on the fly, fully transparent to the end user), require the ALTER ANY EXTERNAL DATA SOURCE permission, and support Hadoop and Windows Azure Blob Storage (WASB, formerly known as ASV) as sources.

CREATE EXTERNAL DATA SOURCE datasource_name
WITH (
    TYPE     = <data_source>,                   -- type of external data source
    LOCATION = <location>                       -- location of the external data source
    [ , JOB_TRACKER_LOCATION = <jt_location> ]  -- enables or disables MapReduce job generation
) [;]
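A concrete sketch following this syntax, for a hypothetical HDP 2.0 cluster (the name, address, and ports are illustrative assumptions; supplying JOB_TRACKER_LOCATION is what permits PolyBase to generate MapReduce jobs against this source):

```sql
-- Hypothetical cluster endpoints; omit JOB_TRACKER_LOCATION to force
-- direct HDFS import with no MapReduce pushdown for this source.
CREATE EXTERNAL DATA SOURCE MY_HDP_SOURCE
WITH (
    TYPE                 = HADOOP,
    LOCATION             = 'hdfs://10.10.10.10:8020',
    JOB_TRACKER_LOCATION = '10.10.10.10:8050'
);
```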

External file formats are the internal representation of an external file format. They participate in split-based query processing (MapReduce jobs generated on the fly), require the ALTER ANY EXTERNAL FILE FORMAT permission, and support delimited text files and Hive RCFiles.

CREATE EXTERNAL FILE FORMAT fileformat_name
WITH (
    FORMAT_TYPE = <type>,                      -- type of the external file
    [ SERDE_METHOD = <serde_method> ]          -- (de)serialization method [Hive RCFile]
    [ , DATA_COMPRESSION = <compr_method> ]    -- compression method
    [ , FORMAT_OPTIONS ( <format_options> ) ]  -- (optional) format options [text files]
) [;]

Support for additional HDFS file formats: Hive RCFiles. Hadoop/Hive users prefer RCFile for its compression and performance benefits. A Record Columnar File consists of binary key/value pairs and stores the columns of a table in a record-columnar way. The user has to specify the serialization/deserialization method (SERDE_METHOD):

CREATE EXTERNAL FILE FORMAT MyRCFile
WITH (
    FORMAT_TYPE  = RCFile,
    SERDE_METHOD = 'LazyBinarySerDe'
)

Some in-house performance observations: LazyBinaryColumnarSerDe is significantly faster and more efficient than ColumnarSerDe, and data compression is not very beneficial with InfiniBand connectivity between Hadoop and PDW (with low-speed networking, compression is expected to help).

Format options for delimited text files:

<format_options> ::=
    [ , FIELD_TERMINATOR = 'value' ]   -- the column delimiter
    [ , STRING_DELIMITER = 'value' ]   -- the delimiter for string data type fields
    [ , DATE_FORMAT      = 'value' ]   -- a particular date format
    [ , USE_TYPE_DEFAULT = 'value' ]   -- how missing entries in text files are treated
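A sketch of a concrete format using these options, of the kind that could back the MyDelimitedText format referenced in the ClickStream example later in the deck (the option values here are illustrative assumptions):

```sql
-- Hypothetical pipe-delimited format: quoted strings, an explicit
-- date format, and type defaults substituted for missing fields.
CREATE EXTERNAL FILE FORMAT MyDelimitedText
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',
        STRING_DELIMITER = '"',
        DATE_FORMAT      = 'yyyy-MM-dd',
        USE_TYPE_DEFAULT = TRUE
    )
);
```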

(HDFS) Bridge: direct and parallelized HDFS access. The Data Movement Service (DMS) of APS is enhanced to allow direct communication between HDFS data nodes and PDW compute nodes, so regular T-SQL over external tables, external data sources, and external file formats returns results spanning non-relational data in Hadoop (social apps, mobile apps, sensor and RFID, web apps) and relational data from traditional schema-based data warehouse applications in PDW.

Querying external Hadoop data via T-SQL

Predicate Pushdown reduces data movement: fewer rows moved, fewer columns moved, for a supported subset of expressions and operators.
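For example, against the ClickStream external table defined later in the deck (the IP literal is an illustrative assumption), the filter and projection can be evaluated on the Hadoop side, so only matching rows, and only the referenced columns, cross the wire:

```sql
-- The predicate and projected columns ship to Hadoop as a generated
-- MapReduce job; PDW receives only the filtered, trimmed result.
SELECT url, event_date
FROM ClickStream
WHERE user_ip = '192.168.0.1';
```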

Querying Hadoop data via T-SQL: (I) query data in HDFS and display results in table form via external tables; (II) join data from HDFS with relational APS/PDW data.

Running example, creating the external table ClickStream (referencing an external data source and file format):

CREATE EXTERNAL TABLE ClickStream (
    url        varchar(50),
    event_date date,
    user_ip    varchar(50)
)
WITH (
    LOCATION    = '//Hadoop_files/clickstream.tbl',
    DATA_SOURCE = MY_HDP2.0,
    FILE_FORMAT = MyDelimitedText
)

PolyBase query examples:

1. Filter query against data in HDFS:
   SELECT TOP 10 url FROM ClickStream WHERE user_ip = '192.168.0.1'

2. Join data from two files in HDFS (Url_Descr is a second text file):
   SELECT url.description FROM ClickStream cs, Url_Descr url
   WHERE cs.url = url.name AND cs.url = 'www.cars.com';

3. Join data from HDFS with data in PDW (User is a distributed PDW table):
   SELECT user_name FROM ClickStream cs, [User] u
   WHERE cs.user_ip = u.user_ip AND cs.url = 'www.microsoft.com';

Split-based query execution through PolyBase. (1) The (HDFS/WASB) bridge component connects to Hadoop's distributed file system or Azure storage containers and retrieves/writes data from/to them. (2) The job submitter component generates MapReduce jobs on the fly for in-situ processing, fully transparent to the end user (no need to learn MapReduce); the jobs are executed by Hadoop's job tracker, and a cost-based decision (based on statistics) determines when to push computation down versus importing data directly. (3) The optimized storage layer, PPAX, is a hybrid columnar-row format into which all HDFS file formats are transformed.

Cost-based Decision I (for split-based query execution). The APS/PolyBase query engine builds a distributed query plan, leveraging SQL Server on the control node as a query compilation aid. Users can create statistics on external tables (full scan or sampling), and the engine uses those statistics to estimate the data volume to be transferred when making the push-down decision. The cost factors are IO and data transfer cost, assuming high-speed networking (>10G Ethernet).

PolyBase create statistics example:

CREATE STATISTICS UserIP_Stats ON ClickStream (user_ip) WITH FULLSCAN

Cost-based Decision II (for split-based query execution). The major factor in the decision is data volume reduction. Spin-up time for a MapReduce job is around 20-30 seconds (varying with the Hadoop distribution and underlying OS), so the cardinality of the predicate matters: creating statistics is crucial to the quality of PolyBase query plans, and there is no push-down in scenarios where APS can execute in under 20-30 seconds without it. A rough rule of thumb: don't consider push-down for inputs that result in less than 1 GB per PDW distribution (for example, with 2 compute nodes, file size > 16 GB); transferring, writing, and processing 1 GB per distribution is faster than spinning up a MapReduce job.

Cost-based Decision III (for split-based query execution). Queries can contain both push-able and non-push-able expressions: push-able ones are evaluated on the Hadoop side (if possible), while non-push-able ones are processed on the PDW side. Joins are always executed on APS; predicates may be pushed down where possible; aggregations (partial or full) are performed in PDW, with partial aggregation on Hadoop envisioned for future APS releases.

Supported Configurations for AU1: HDInsight on Analytics Platform System; HDInsight's Windows Azure Blob Storage (WASB[S]); Hortonworks on Windows Server (HDP 1.3, 2.0); Hortonworks on Linux (HDP 1.3, 2.0); Cloudera on Linux (CDH 4.3).

A Traditional Approach Under Pressure: business applications feed RDBMS/EDW repositories from existing sources (CRM, ERP, clickstream, logs), while emerging sources (sensor, sentiment, geo, logs, unstructured) strain that architecture.

Why PolyBase? PDW alone versus PDW with PolyBase.

An Emerging Data Architecture

Integrating Big Data with Microsoft Data Warehousing and Business Intelligence ETL Processing

Using Hadoop for Staging

Traditional ETL for Data Warehousing and Business Intelligence: ETL processing (SSIS, etc.)

Long Term Raw Data Archiving


Transforming Data

New Data Types

Let s get Technical

Create External Table

CTAS (CREATE TABLE AS SELECT):

CREATE TABLE mytable
WITH (
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH (CustomerKey)
)
AS SELECT * FROM ClickStream;

Demo

Using Polybase to export from PDW to Hadoop

Exporting Conformed Dimensions to Hadoop

Export your Conformed Dimensions
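A minimal sketch of such an export via CETAS, reusing the data source and file format names from the archive example later in the deck; DimCustomer and the HDFS path are hypothetical:

```sql
-- Hypothetical export: write the PDW conformed dimension out to HDFS
-- so Hadoop jobs can join their raw facts against the same dimension.
CREATE EXTERNAL TABLE hdfsDimCustomer
WITH (
    LOCATION    = '/warehouse/dimensions/customer/',
    DATA_SOURCE = f14790hdp,
    FILE_FORMAT = pipedelimited
)
AS SELECT * FROM DimCustomer;
```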

Data Archiving

Hadoop as a Data Archive ETL Processing

CETAS (CREATE EXTERNAL TABLE AS SELECT):

CREATE EXTERNAL TABLE hdfsfactalldataarchive
WITH (
    LOCATION    = 'user/administrator/passbac/all_data/',
    DATA_SOURCE = f14790hdp,
    FILE_FORMAT = pipedelimited
)
AS SELECT * FROM FactAllData
WHERE transaction_year < 2000;

Demo

Join Data on the Fly

Joining Data: store your dimensional data on PDW and your fact data on Hadoop.

Join PDW & External Tables: no different from any other join you do today.

SELECT c.name, d.year, SUM(s.sales)
FROM FactSales s                    -- external table
JOIN dimcustomer c                  -- internal table
  ON c.customerid = s.customerid
JOIN dimdate d                      -- internal table
  ON s.dateid = d.dateid
WHERE d.year = 2008
  AND c.name = 'Albertson & Brothers'
GROUP BY c.name, d.year

Demo

Wrap-up