Query Optimization in Microsoft SQL Server PDW The article done by: Srinath Shankar, Rimma Nehme, Josep Aguilar-Saborit, Andrew Chung, Mostafa



Similar documents
SQL Server Parallel Data Warehouse: Architecture Overview. José Blakeley Database Systems Group, Microsoft Corporation

Modern Data Warehousing

Structured data meets unstructured data in Azure and Hadoop

Microsoft Analytics Platform System. Solution Brief

Split Query Processing in Polybase

James Serra Sr BI Architect

Big Data Processing: Past, Present and Future

Using Attunity Replicate with Greenplum Database Using Attunity Replicate for data migration and Change Data Capture to the Greenplum Database

Rackspace Cloud Databases and Container-based Virtualization

Query Optimization in Microsoft SQL Server PDW

Can the Elephants Handle the NoSQL Onslaught?

Luncheon Webinar Series May 13, 2013

Enterprise Performance Tuning: Best Practices with SQL Server 2008 Analysis Services. By Ajay Goyal Consultant Scalability Experts, Inc.

SQL Server to SQL Server PDW. Migration Guide (AU3)

Execution Strategies for SQL Subqueries

Why Query Optimization? Access Path Selection in a Relational Database Management System. How to come up with the right query plan?

SAP HANA SPS 09 - What s New? SAP HANA Scalability

HP Vertica and MicroStrategy 10: a functional overview including recommendations for performance optimization. Presented by: Ritika Rahate

Understanding SQL Server Execution Plans. Klaus Aschenbrenner Independent SQL Server Consultant SQLpassion.at

SQL Server PDW. Artur Vieira Premier Field Engineer

SQL Server 2014 Performance Tuning and Optimization 55144; 5 Days; Instructor-led

Course Outline. SQL Server 2014 Performance Tuning and Optimization Course 55144: 5 days Instructor Led

SQL Server to SQL Server PDW Migration Guide

SQL Server 2012 Parallel Data Warehouse. Solution Brief

SQL Server 2012 PDW. Ryan Simpson Technical Solution Professional PDW Microsoft. Microsoft SQL Server 2012 Parallel Data Warehouse

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc.

Microsoft SQL Database Administrator Certification

"Charting the Course... MOC AC SQL Server 2014 Performance Tuning and Optimization. Course Summary

SQL Server Integration Services Design Patterns

A Breakthrough Platform for Next-Generation Data Warehousing and Big Data Solutions

Microsoft BI Platform Overview

Where We Are. References. Cloud Computing. Levels of Service. Cloud Computing History. Introduction to Data Management CSE 344

SQL Server Business Intelligence on HP ProLiant DL785 Server

Testing SQL Server s Query Optimizer: Challenges, Techniques and Experiences

White Paper February IBM InfoSphere DataStage Performance and Scalability Benchmark Whitepaper Data Warehousing Scenario

Data Management in the Cloud

QuickSpecs. What's New Support for two InfiniBand 4X QDR 36P Managed Switches

Course 55144: SQL Server 2014 Performance Tuning and Optimization

Big Data, Fast Processing Speeds Kevin McGowan SAS Solutions on Demand, Cary NC

EMC/Greenplum Driving the Future of Data Warehousing and Analytics

Microsoft technológie pre BigData. Ľubomír Goryl Solution Professional

Integrating Apache Spark with an Enterprise Data Warehouse

Planning the Installation and Installing SQL Server

Course 55144B: SQL Server 2014 Performance Tuning and Optimization

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

32-bit and 64-bit BarTender. How to Select the Right Version for Your Needs WHITE PAPER

1 Changes in this release

Please give me your feedback

The HP Neoview data warehousing platform for business intelligence

Universal PMML Plug-in for EMC Greenplum Database

DBAs having to manage DB2 on multiple platforms will find this information essential.

How To Create A Fact Table On Hadoop (Hadoop) On A Microsoft Powerbook (Powerbook) On An Ipa 2.2 (Powerpoint) On Microsoft Microsoft 2.3

SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011

MS SQL Server 2014 New Features and Database Administration

ISSN: (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies

Technical Challenges for Big Health Care Data. Donald Kossmann Systems Group Department of Computer Science ETH Zurich

Preparing Data Sets for the Data Mining Analysis using the Most Efficient Horizontal Aggregation Method in SQL

In-Database Analytics

Solving Business Pains with SQL Server Integration Services. SQL Server 2005 / 2008

SQL Maestro and the ELT Paradigm Shift

Real Application Testing. Fred Louis Oracle Enterprise Architect

I/O Considerations in Big Data Analytics

SWISSBOX REVISITING THE DATA PROCESSING SOFTWARE STACK

Affordable, Scalable, Reliable OLTP in a Cloud and Big Data World: IBM DB2 purescale

DBMS / Business Intelligence, SQL Server

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Beyond Conventional Data Warehousing. Florian Waas Greenplum Inc.

Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010

SQL Server Training Course Content

Objectives. Distributed Databases and Client/Server Architecture. Distributed Database. Data Fragmentation

Performance and Scalability Overview

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Bringing Big Data into the Enterprise

Upgrading to Microsoft SQL Server 2008 R2 from Microsoft SQL Server 2008, SQL Server 2005, and SQL Server 2000

How to Build a High-Performance Data Warehouse By David J. DeWitt, Ph.D.; Samuel Madden, Ph.D.; and Michael Stonebraker, Ph.D.

Hadoop & Spark Using Amazon EMR

CHAPTER 1: CLIENT/SERVER INTEGRATED DEVELOPMENT ENVIRONMENT (C/SIDE)

Information Architecture

Using distributed technologies to analyze Big Data

Many DBA s are being required to support multiple DBMS s on multiple platforms. Many IT shops today are running a combination of Oracle and DB2 which

EMC Avamar 7.2 for IBM DB2

SQL Server and MicroStrategy: Functional Overview Including Recommendations for Performance Optimization. MicroStrategy World 2016

Hadoop and Relational Database The Best of Both Worlds for Analytics Greg Battas Hewlett Packard

2009 Oracle Corporation 1

Summary: This paper examines the performance of an XtremIO All Flash array in an I/O intensive BI environment.

Trafodion Operational SQL-on-Hadoop

Integrating Big Data into the Computing Curricula

Symmetric Multiprocessing

Parallel Data Warehouse

Decomposition into Parts. Software Engineering, Lecture 4. Data and Function Cohesion. Allocation of Functions and Data. Component Interfaces

SUN ORACLE EXADATA STORAGE SERVER

Transcription:

Query Optimization in Microsoft SQL Server PDW The article done by: Srinath Shankar, Rimma Nehme, Josep Aguilar-Saborit, Andrew Chung, Mostafa Elhemali, Alan Halverson, Eric Robinson, Mahadevan Sankara Subramanian, David DeWitt, César Galindo-Legaria Microsoft Corporation

SMP and MPP Symmetric Multi-Processing(SMP): Having Multiple Processors which share single copy of OS. Massively Parallel Processing (MPP): systems, i.e., distributed systems consisting of multiple independent nodes connected by a network. MPP: is used to manage and query vast amounts of data.

Microsoft SQL Server Parallel Data Warehouse Is a shared-nothing parallel database appliance and is one example of an MPP system. Such architecture are especially well suited to data warehouse workloads, where large fact tables can be distributed across nodes

Microsoft SQL Server PDW he control node is responsible for: Query parsing Creating a DSQL(DSQL peration(sql&dms)) Issuing plan steps to the compute nodes. Tracking the execution steps of the plan. Assembling the individual pieces of the nal results. Compute nodes provide : Data storage Query processing

Query Optimization in PDW They store the metadata of the distributed tables in a shell database on a single SQL Server instance. The shell database provides the single system image of the data in the appliance. Using the shell database, with a given query, we generate th space of execution alternatives (MEMO). Using Data movement operations, we make a cost-based decision on the best execution plan for the distributed environment.

MICROSOFT SQL SERVER PDW ARCHITECTURE OVERVIEW Appliance: The PDW appliance is composed of hardware and software architected to function together as one box. Multiple servers are used to implement scale-out query processing in a shared-nothing fashion. Allows cost-effective and incremental growth of the appliance by adding extr servers or storage As the appliance grows over time the server components can be upgraded with more powerful CPU, Memory,storage,etc There are two distinct types of nodes that implement the query processing functionality: 1. Control Node. 2. Compute Nodes.

Control Node Manages distribution of Query execution across compute node accepts client connection to the PDW appliance and manage client authentication Manages DMS(Data Movement Service),is communication lay for transferring data between node A user can connect to PDW Control Node using variety of clie access tools using drivers with connection types, Such as: OLE DB,ODBS,ADO.NET

Compute Nodes Each compute node is the host for a single SQL Server instance. It also runs a DMS process for communication and data transfer wi the other nodes in the appliance. Each compute node stores a portion of the user data.

Tables in a PDW appliance can either be Distributed Tables Replicated Table

MICROSOFT SQL SERVER PDW ARCHITECTURE OVERVIEW(cont..) Shell Database: is a SQL Server database that defines all metadata and statistics about tables, but does not contain any user data. used to store the metadata for the user tables partitioned across the compute node Also store in this database all information regarding users and privileges, so that compilation can check for security and access rights.(no extra cost for PDw) contains global statistics for all the tables in the appliance. Data Movement Service (DMS):is responsible for moving data between all the nodes on the appliance. Once instance of DMS runs on each of the control a compute nodes. PDW utilizes temporary tables Some queries that generates no temp table can be streamed from compute node directly back to client.

The DSQL Plan and its Execution A DSQL plan may include the following types of operations: SQL Operations: that are executed directly on the SQL Server DBMS instances on one more compute nodes. DMS Operations :which move data among the nodes in PDW for further processing. Temp table operations: that set up staging tables for further processing. Return operations: which push data back to the client.

DSQL Plan Example SELECT c_custkey, o_orderdate FROM Orders, Customer WHERE o_custkey = c_custkey AND o_totalprice > 100 Step1:DMS operation: repartitions Order Table on O_custkey which is compatible to join SELECT o_custkey, o_orderdate FROM Orders WHERE o_totalprice > 100

DSQL Plan Example(cont..) Step2:SQL Operation : that selects tuples for the final result set from each compute node and returns them back to the client. SELECT c.c_custkey, tmp.o_orderdate FROM Customer c, Temp_Table tmp WHERE c.c_custkey = tmp.o_custkey

Cost Based Query Optimization in PDW

SQL Server Compilation

Implementation of PDW QO Changes to SQL Server : First change is to support exporting the optimizer search space Second is to extend the query surface to support all constructs of PDW Third is to expand the optimizer search space to include some alternatives that are relevant for distributed query execution

Plan Enumeration aïve enumeration(not successful) bottom-up optimizer starts by ptimizing the smallest expressions in he query, and then uses this nformation to progressively optimize arger expressions until the optimal hysical plan for the full query is found

Cost model Evaluates the performance of specific plan which includes SQL relational operations and data movement operation There is no equivalent to data movement operation inside SQL server, thus we cannot rely on SQL Server optimizer to generate cost for these operation Cost Model Assumptions: Absence of independent parallelism, Isolation, Homogeneity,ect

Some of types of Data operator Shuffle move(m-m): row are moved from each compute node to targeted tab Partition Move(M-1): row are moved from each compute node to targeted table on target node Control Node Move :A table in control is replicated to all compute node Broadcast move: Rows are moved from each compute node to all targeted table on all compute node

Cost of a DMS Operator Creader: Read tuples from the query executed against SQL Server and packing them into a buffer. Cnetwork: Send the data buffers over the network Cwriter: Unpack the tuples fromthe buffers sent by the sending process, and prepare buffers for insertion into a temporary table. CSQLBlkCpy: Bulk copy operation for insertion of the data buffers into a SQL Server temporary table. Csource = max(creader,cnetwork). Ctarget = max(cwriter, CSQLBulkCpy). CDMS = max(csource, Ctarget).

Costing of an Individual Component The number of bytes B to be processed by each individual cost compone depends on the distribution properties of the input and output streams. Y denote global cardinality w the width of the row N- denote the number of nodes in the appliance B=(Y*W)/N for distributed data streams. B=Y*W for replicate data stream

DSQL Generation

conclusion Use of technology developed for SQL server, the PDW QO goes beyond simple predicate pushing and join reordering The cost model of PDW QO is specially crafted to reflect the distributed environment Another important aspect of technology reuse was that it shortened the time to build a cost-based optimizer for PDW The quality and effectiveness of the result validate the approach

REFERENCES [1] Aster Data. http://www.asterdata.com/. [2] Greenplum. http://www.greenplum.com/. [3] M. M. Astrahan, M. W. Blasgen, D. D. Chamberlin, K. P. Eswaran, J. Gray, P. P. Griffiths, W. F. K. III, R. A. Lorie [4] M. Elhemali and L. Giakoumakis. Unit-testing query transformation rules. In DBTest, pages 1 6, 2008. [5] G. Graefe. The Cascades framework for query optimization.. [6] G. Graefe and W. J. McKenna. The Volcano optimizer generator: Extensibility and efficient search. In ICDE, 1993.