SQL Server Parallel Data Warehouse: Architecture Overview
José Blakeley
Database Systems Group, Microsoft Corporation


Outline
- Motivation
- MPP DBMS system architecture
    - HW and SW
    - Key components
- Query processing example
- PDW and BI demo
- Upcoming capabilities
- Summary

Workload Types
- Online Transaction Processing (OLTP): day-to-day business
    - Balanced read-update ratio (60%-40%)
    - Fine-grained inserts and updates
    - High transaction throughput, e.g., 10s K/s
    - Usually very short transactions, e.g., 1-3 tables
    - Sometimes multi-step, e.g., financial
    - Relatively small data sizes, e.g., few TBs
- Data Warehousing and Business Analysis (DW): analysis over historical data
    - Read-mostly (90%-10%)
    - Few updates in place, high-volume bulk inserts
    - Concurrent query throughput, e.g., 10s K/hr
    - Per-query response time < 2 s
    - Snowflake and star schemas are common, e.g., 5-10 tables
    - Complex queries (filter, join, group-by, aggregation)
    - Very large data sizes, e.g., 10s TB - PB

SQL Server Parallel Data Warehouse
- Shared-nothing, distributed, parallel DBMS
    - Built on Windows Server and SQL Server
    - Built-in data and query partitioning
    - Provides single system view over a cluster of SQL Servers
- Appliance concept
    - Software + hardware solution, low TCO
    - Choice of hardware vendors (e.g., HP, Dell)
- Optimized for DW workloads
    - Bulk loads (~1.2 TB/hr)
    - Sequential scans (700 TB in 3 hr)
- Scale from 10s of TBs to PBs
    - 1 data rack manages ~144 TB (600 GB * 24 LFF * 10 nodes)
    - 1 PB takes ~7 racks
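The sizing figures on this slide can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope calculation; it assumes decimal units (1 TB = 1000 GB, 1 PB = 1000 TB) and raw disk capacity, neither of which the slide states, and the variable names are illustrative.

```python
import math

# Figures taken from the slide.
DISK_GB = 600          # capacity of one LFF disk
DISKS_PER_NODE = 24    # LFF disks per compute node
NODES_PER_RACK = 10    # compute nodes per data rack

# Assumption: decimal units and raw (unformatted, uncompressed) capacity.
rack_tb = DISK_GB * DISKS_PER_NODE * NODES_PER_RACK / 1000
racks_for_1pb = math.ceil(1000 / rack_tb)

# Sequential scan rate implied by "700 TB in 3 hr".
scan_tb_per_hr = 700 / 3
scan_gb_per_s = 700 * 1000 / (3 * 3600)

print(f"capacity per data rack: {rack_tb:.0f} TB")
print(f"racks needed for 1 PB:  {racks_for_1pb}")
print(f"aggregate scan rate:    {scan_tb_per_hr:.1f} TB/hr ({scan_gb_per_s:.1f} GB/s)")
```

Under these assumptions the numbers agree with the slide: 144 TB per rack and 7 racks for 1 PB (1000 / 144 is about 6.9).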

Hardware Architecture
[Diagram: 2-rack appliance. Components shown: control nodes (active/passive); compute nodes, each running SQL Server; spare compute node; ETL load interface; corporate backup solution; data center monitoring; client drivers (ODBC, OLEDB, ADO.NET); dual Fiber Channel; dual InfiniBand.]

Software Architecture
[Diagram. Components shown:
- Clients: query tool, MS BI (AS, RS), DWSQL, 3rd-party tools, Internet Explorer
- Data access (OLEDB, ODBC, ADO.NET, JDBC); IIS admin console
- Control node: PDW Engine, Data Movement Service, SQL Server (DW Authentication, DW Configuration, DW Schema, TempDB)
- Compute nodes: Data Movement Service, SQL Server, user data
- Landing Zone node: Data Movement Service]

Key Software Functionality
- PDW Engine
    - Provides single system image
    - T-SQL compilation
    - Global metadata and appliance configuration
    - Global query optimization and plan generation
    - Global T-SQL execution coordination
    - Global transaction coordination
    - Authentication and authorization
    - Supportability (HW and SW status info via DMVs)
- Data Movement Service
    - Data movement across the appliance
    - Distributed query execution operators
- Parallel Loader
    - Runs from the Landing Zone
    - SSIS or command-line tool
- Parallel Database Copy
    - High-performance data export
    - Enables hub-and-spoke scenarios
- Parallel Backup/Restore
    - Backup files stored on Backup Nodes
    - Backup files may be archived to an external device or system

Query Processing
- SQL statement compilation
    - Parsing, validation, optimization
- Builds an MPP execution plan
    - A sequence of discrete parallel QE steps
    - Steps involve SQL queries to be executed by SQL Server at each compute node
    - As well as data movement steps
- Executes the plan
    - Coordinates workflow among steps
    - Assembles the result set
    - Returns result set to client

Example DW Schema
[Schema diagram with per-table row counts: 18,000,048,306; 4,500,000,000; 30,000,000; 2,400,000,000; 600,000,000; 25; 5; and 450,000,000 rows.]

    SELECT TOP 10
        L_ORDERKEY,
        SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS REVENUE,
        O_ORDERDATE,
        O_SHIPPRIORITY
    FROM CUSTOMER, ORDERS, LINEITEM
    WHERE C_MKTSEGMENT = 'BUILDING'
      AND C_CUSTKEY = O_CUSTKEY
      AND L_ORDERKEY = O_ORDERKEY
      AND O_ORDERDATE < '2010-03-05'
      AND L_SHIPDATE > '2010-03-05'
    GROUP BY L_ORDERKEY, O_ORDERDATE, O_SHIPPRIORITY
    ORDER BY REVENUE DESC, O_ORDERDATE

10/12/2011

Example Schema: TPC-H

    --------------------------------------------------------------------
    -- Customer Table
    -- distributed on c_custkey
    --------------------------------------------------------------------
    CREATE TABLE customer (
        c_custkey    bigint,
        c_name       varchar(25),
        c_address    varchar(40),
        c_nationkey  integer,
        c_phone      char(15),
        c_acctbal    decimal(15,2),
        c_mktsegment char(10),
        c_comment    varchar(117))
    WITH (distribution=hash(c_custkey));

    --------------------------------------------------------------------
    -- Orders Table
    --------------------------------------------------------------------
    CREATE TABLE orders (
        o_orderkey      bigint,
        o_custkey       bigint,
        o_orderstatus   char(1),
        o_totalprice    decimal(15,2),
        o_orderdate     date,
        o_orderpriority char(15),
        o_clerk         char(15),
        o_shippriority  integer,
        o_comment       varchar(79));

    --------------------------------------------------------------------
    -- LineItem Table
    -- distributed on l_orderkey
    --------------------------------------------------------------------
    CREATE TABLE lineitem (
        l_orderkey      bigint,
        l_partkey       bigint,
        l_suppkey       bigint,
        l_linenumber    bigint,
        l_quantity      decimal(15,2),
        l_extendedprice decimal(15,2),
        l_discount      decimal(15,2),
        l_tax           decimal(15,2),
        l_returnflag    char(1),
        l_linestatus    char(1),
        l_shipdate      date,
        l_commitdate    date,
        l_receiptdate   date,
        l_shipinstruct  char(25),
        l_shipmode      char(10),
        l_comment       varchar(44))
    WITH (distribution=hash(l_orderkey));
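To make the effect of `distribution=hash(...)` concrete, here is a minimal Python sketch of hash distribution. It is an illustration only: the modulo function stands in for PDW's internal hash, which the slides do not show, `NUM_DISTRIBUTIONS` is a made-up value, and distributing `orders` on `o_orderkey` is an assumption (the slide gives no distribution clause for that table).

```python
NUM_DISTRIBUTIONS = 4  # hypothetical; chosen small for readability


def distribution_of(key: int, n: int = NUM_DISTRIBUTIONS) -> int:
    """Stand-in hash: maps a distribution-key value to a distribution number."""
    return key % n


def distribute(rows, key_column, n=NUM_DISTRIBUTIONS):
    """Split rows (dicts) into n buckets by hashing the chosen key column."""
    buckets = [[] for _ in range(n)]
    for row in rows:
        buckets[distribution_of(row[key_column], n)].append(row)
    return buckets


# Toy data: 8 orders with 2 line items each.
lineitem = [{"l_orderkey": k, "l_linenumber": i} for k in range(1, 9) for i in (1, 2)]
orders = [{"o_orderkey": k, "o_custkey": 100 + k % 3} for k in range(1, 9)]

lineitem_parts = distribute(lineitem, "l_orderkey")  # as declared on the slide
orders_parts = distribute(orders, "o_orderkey")      # assumed distribution key

# Every line item sits in the same distribution as its order, so the join
# L_ORDERKEY = O_ORDERKEY needs no data movement under this assumption.
colocated = all(
    any(o["o_orderkey"] == li["l_orderkey"] for o in orders_parts[d])
    for d, part in enumerate(lineitem_parts)
    for li in part
)
print([len(p) for p in lineitem_parts], colocated)
```

A join on `o_custkey`, by contrast, would find matching rows in different distributions, which is the situation a shuffle addresses.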

Example - Query
Ten largest building orders shipped since March 5, 2010

    SELECT TOP 10
        L_ORDERKEY,
        SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS REVENUE,
        O_ORDERDATE,
        O_SHIPPRIORITY
    FROM CUSTOMER, ORDERS, LINEITEM
    WHERE C_MKTSEGMENT = 'BUILDING'
      AND C_CUSTKEY = O_CUSTKEY
      AND L_ORDERKEY = O_ORDERKEY
      AND O_ORDERDATE < '2010-03-05'
      AND L_SHIPDATE > '2010-03-05'
    GROUP BY L_ORDERKEY, O_ORDERDATE, O_SHIPPRIORITY
    ORDER BY REVENUE DESC, O_ORDERDATE

Example Execution Plan

    ------------------------------
    -- Step 1: create temp table at control node
    ------------------------------
    CREATE TABLE [tempdb].[dbo].[q_[temp_id_664]] (
        [l_orderkey]     BIGINT,
        [REVENUE]        DECIMAL(38, 4),
        [o_orderdate]    DATE,
        [o_shippriority] INTEGER );

    ------------------------------
    -- Step 2: create temp tables at all compute nodes
    ------------------------------
    CREATE TABLE [tempdb].[dbo].[q_[temp_id_665]_[partition_id]] (
        [l_orderkey]      BIGINT,
        [l_extendedprice] DECIMAL(15, 2),
        [l_discount]      DECIMAL(15, 2),
        [o_orderdate]     DATE,
        [o_shippriority]  INTEGER,
        [o_custkey]       BIGINT,
        [o_orderkey]      BIGINT )
    WITH ( DISTRIBUTION = HASH([o_custkey]) );

    ------------------------------
    -- Step 3: SHUFFLE_MOVE
    ------------------------------
    SELECT [l_orderkey], [l_extendedprice], [l_discount], [o_orderdate],
           [o_shippriority], [o_custkey], [o_orderkey]
    FROM [dwsys].[dbo].[orders]
    JOIN [dwsys].[dbo].[lineitem] ON ([l_orderkey] = [o_orderkey])
    WHERE ([o_orderdate] < '2010-03-05'
           AND [o_orderdate] >= '2010-09-15 00:00:00.000')
    INTO Q_[TEMP_ID_665]_[PARTITION_ID]
    SHUFFLE ON (o_custkey);

    ------------------------------
    -- Step 4: PARTITION_MOVE
    ------------------------------
    SELECT [l_orderkey],
           sum(([l_extendedprice] * (1 - [l_discount]))) AS REVENUE,
           [o_orderdate], [o_shippriority]
    FROM [dwsys].[dbo].[customer]
    JOIN tempdb.q_[temp_id_665]_[partition_id] ON ([c_custkey] = [o_custkey])
    WHERE [c_mktsegment] = 'BUILDING'
    GROUP BY [l_orderkey], [o_orderdate], [o_shippriority]
    INTO Q_[TEMP_ID_664];

    ------------------------------
    -- Step 5: Drop temp tables at all compute nodes
    ------------------------------
    DROP TABLE tempdb.q_[temp_id_665]_[partition_id];

    ------------------------------
    -- Step 6: RETURN result to client
    ------------------------------
    SELECT TOP 10 [l_orderkey], sum([revenue]) AS REVENUE,
           [o_orderdate], [o_shippriority]
    FROM tempdb.q_[temp_id_664]
    GROUP BY [l_orderkey], [o_orderdate], [o_shippriority]
    ORDER BY [REVENUE] DESC, [o_orderdate];

    ------------------------------
    -- Step 7: Drop temp table at control node
    ------------------------------
    DROP TABLE tempdb.q_[temp_id_664];
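The data flow of steps 3, 4 and 6 can be imitated in memory. The following Python sketch is a toy simulation, not PDW code: the node count, the modulo hash, the sample rows, and the placement of `orders` by `o_orderkey` are all assumptions made for illustration, and the date predicates are left out to keep it short.

```python
from collections import defaultdict

N = 3  # hypothetical number of compute nodes


def node_of(key: int) -> int:
    return key % N  # stand-in for PDW's internal hash function


def place(rows, key_index):
    """Hash-distribute tuples across N nodes on the column at key_index."""
    nodes = [[] for _ in range(N)]
    for r in rows:
        nodes[node_of(r[key_index])].append(r)
    return nodes


# Toy rows: (custkey, mktsegment), (orderkey, custkey, orderdate, shippriority),
# (orderkey, extendedprice, discount).
customers = [(1, "BUILDING"), (2, "AUTOMOBILE"), (3, "BUILDING"), (4, "BUILDING")]
orders = [(10, 1, "2010-03-01", 0), (11, 2, "2010-03-02", 0),
          (12, 3, "2010-03-03", 0), (13, 1, "2010-03-04", 0)]
lineitems = [(10, 100.0, 0.5), (10, 40.0, 0.0), (11, 70.0, 0.0),
             (12, 80.0, 0.25), (13, 10.0, 0.0)]

customer_at = place(customers, 0)  # hash(c_custkey), as on the schema slide
orders_at = place(orders, 0)       # assumed hash(o_orderkey)
lineitem_at = place(lineitems, 0)  # hash(l_orderkey), as on the schema slide

# Step 3 (SHUFFLE_MOVE): join orders and lineitem locally, then re-hash the
# joined rows on o_custkey so they meet the matching customer rows.
shuffled = [[] for _ in range(N)]
for n in range(N):
    for okey, ckey, odate, prio in orders_at[n]:
        for lkey, price, disc in lineitem_at[n]:
            if lkey == okey:
                shuffled[node_of(ckey)].append((lkey, price, disc, odate, prio, ckey))

# Step 4 (PARTITION_MOVE): each node joins with its customers, computes a
# partial aggregate, and sends it to the control node.
control_rows = []
for n in range(N):
    building = {c for c, seg in customer_at[n] if seg == "BUILDING"}
    partial = defaultdict(float)
    for lkey, price, disc, odate, prio, ckey in shuffled[n]:
        if ckey in building:
            partial[(lkey, odate, prio)] += price * (1 - disc)
    control_rows.extend(partial.items())

# Step 6: the control node merges the partial aggregates and returns the top 10.
final = defaultdict(float)
for group, revenue in control_rows:
    final[group] += revenue
top10 = sorted(final.items(), key=lambda kv: (-kv[1], kv[0][1]))[:10]
result = [(g[0], rev, g[1], g[2]) for g, rev in top10]
print(result)
```

The second `SUM` at the control node mirrors step 6 of the plan: each compute node contributes partial revenue per group, and the control node adds them up before ordering.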

Data Movement Operations
- SHUFFLE_MOVE (N:N)
    - Distributed-to-distributed data exchange across the appliance
    - Result is a distributed table hashed on some column
- PARTITION_MOVE (N:1)
    - Union of distributed partitions across compute nodes into a table in the control node
- MASTER_MOVE (1:N)
    - Replicates a table from the control node to all compute nodes
- BROADCAST_MOVE (N:N)
    - Distributed-to-replicated data exchange across the appliance
    - Unconditional shuffle to all compute nodes
- TRIM_MOVE
    - Distributes a replicated table by trimming each copy
    - Since every node holds the same copy of a replicated table, each node keeps only the rows that belong to its own distributions
- REPLICATE_MOVE (1:N)
    - Moves a replicated table from 1 to N compute nodes
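The movement operations listed above can be sketched as plain functions over per-node row lists. This is a conceptual model only: the function names mirror the slide, but the signatures, the modulo hash, and the node count are assumptions, and nothing here reflects the actual Data Movement Service interface.

```python
N = 3  # hypothetical number of compute nodes


def node_of(key: int) -> int:
    return key % N  # stand-in for PDW's internal hash function


def shuffle_move(parts, key):
    """N:N, distributed -> distributed: re-hash every row on `key`."""
    out = [[] for _ in range(N)]
    for part in parts:
        for row in part:
            out[node_of(row[key])].append(row)
    return out


def partition_move(parts):
    """N:1: union all compute-node partitions into one control-node table."""
    return [row for part in parts for row in part]


def master_move(control_table):
    """1:N: copy a control-node table to every compute node."""
    return [list(control_table) for _ in range(N)]


def broadcast_move(parts):
    """N:N, distributed -> replicated: every node receives every row."""
    everything = [row for part in parts for row in part]
    return [list(everything) for _ in range(N)]


def trim_move(replicas, key):
    """Replicated -> distributed: each node keeps only the rows it owns."""
    return [[row for row in replica if node_of(row[key]) == n]
            for n, replica in enumerate(replicas)]


def keys(parts):
    return [[row["k"] for row in part] for part in parts]


# A table spread arbitrarily over three nodes.
parts = [[{"k": 1}, {"k": 2}], [{"k": 3}], [{"k": 4}, {"k": 6}]]

shuffled = shuffle_move(parts, "k")
replicated = broadcast_move(parts)
trimmed = trim_move(replicated, "k")
print(keys(shuffled), keys(trimmed), len(partition_move(parts)))
```

In this toy model, broadcasting and then trimming yields the same placement as a direct shuffle, which shows why trimming a replicated table needs no network traffic: each node already holds the rows it must keep.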

PDW Demo

Other Important Functionality
- Fault tolerance
    - All HW components have redundancy: CPUs, disks, networks, power, storage processors
    - All SW components use MS Cluster Services for failover
    - Control, compute and management nodes have A/P (active/passive)
- Integration with Microsoft and 3rd-party BI tools
    - SQL Server Integration Services (ETL) has PDW as a destination
    - SQL Server Analysis Services (OLAP) has PDW as a source
    - SQL Server Reporting Services, Excel PowerPivot
    - SAS, Business Objects, MicroStrategy
    - Hadoop connectors (ETL)
- Appliance health, monitoring, PDW appliance validator

UCI ISG Seminar 1/8/2010

EDW Architecture

BI Demo

Upcoming Capabilities
- Column-store support
    - Data compression and query speed
- Enhanced distributed query processing
    - New execution strategies (DW)
    - New optimization techniques (DW)
- Data movement
    - Faster, less CPU-intensive, more scalable
- Deeper analytics
    - Map-reduce-like functionality inside the cluster
    - Data mining, embedded analytics
- Enhanced HW architecture choices
    - Low-power clusters

Summary
- Motivation
- MPP DBMS system architecture
    - HW and SW
    - Key components
- Query processing example
- PDW and BI demo
- Upcoming capabilities

THANKS!