Structured data meets unstructured data in Azure and Hadoop
Sameer Parve (sameerpa@microsoft.com) and Blesson John (Blessonj@Microsoft.com)
PFE, SQL Server/Analytics Platform System
October 30th, 2014
Agenda
Data sources
Data sources: Non-Relational Data
SQL Server (SMP) vs. Microsoft MPP (APS/PDW)
SMP vs. MPP
Microsoft data warehouse vision
Make SQL Server the fastest and most affordable database for customers of all sizes:
Massive scalability at low cost
Flexibility and choice
Complete data warehouse solution
Simplified data warehouse management
Appliance for high-end massively parallel processing (MPP) data warehousing
Ideal for high-scale or high-performance data marts and EDWs
InfiniBand and Ethernet networking
A data warehouse appliance (fully integrated software and hardware) that scales from tens of terabytes to 6 PB
Microsoft Analytics Platform System (APS)
A pre-built hardware + software appliance, co-engineered with HP, Dell, and Quanta
Pre-built hardware and pre-installed software; the appliance is installed in 1-2 days
Support: Microsoft provides first-call support, and the hardware partner provides onsite break/fix support
Plug and play, with built-in best practices, to save time
Hardware architecture overview
One standard node type: two 8-core Intel processors, with memory doubled to 256 GB
Updated to the newest InfiniBand (FDR, 56 Gb/sec); both InfiniBand and Ethernet networks
Hosts HST01-HST02 and HSA01-HSA02 connect to JBOD storage over direct-attached SAS
Moving from SAN to JBODs: this reduces costs significantly, moves away from dependency on a handful of key SAN vendors, and uses Windows Server 2012 technologies to achieve the same level of reliability and robustness
Backup and Landing Zone (LZ) are now reference architectures and are not part of the appliance; customers can use their own hardware, and can use more than one BU or LZ for high availability
Scale unit concept:
Base unit: the minimum configuration; populates the rack with networking
Scale unit: adds capacity in increments of 2-3 compute nodes and related storage
Passive unit: increases high-availability (HA) capacity by adding more spares
Virtual machine architecture overview
[Diagram: hosts HST01-HST02 run the CTL, MAD, AD, and VMM virtual machines; hosts HSA01-HSA02 run the Compute 1 and Compute 2 virtual machines, attached to a JBOD over direct-attached SAS, with InfiniBand and Ethernet networks.]
General details:
All hosts run Windows Server 2012 Standard, and all virtual machines run Windows Server 2012 Standard as the guest operating system
All fabric and workload activity happens in Hyper-V virtual machines
The fabric virtual machines, MAD01, and CTL share one server, lowering overhead costs, especially for small topologies
PDW Agent runs on all hosts and all virtual machines, collecting appliance health data on fabric and workload
DWConfig and the Admin Console continue to exist, with minor extensions that expose host-level information
Windows Storage Spaces handles mirroring and spares, enabling the use of lower-cost DAS (JBODs) rather than SAN
PDW workload details:
The control node runs the PDW engine and DMS Manager on SQL Server 2012 Enterprise Edition (PDW build), with shell databases just as in AU3+
Each compute node runs DMS Core and SQL Server 2012 Enterprise Edition (PDW build)
Storage details:
More files per filegroup, and a larger number of spindles used in parallel
Seamlessly add capacity (PDW/HDI)
Start small with a warehouse capacity of several terabytes and add capacity incrementally, growing from roughly 53 TB to the largest warehouses at 6 PB.
PDW table geometries
Replicated: a table structure that exists as a full copy within each PDW node.
CREATE TABLE <TableName> ( <Column Names and Types> )
WITH (DISTRIBUTION = REPLICATE)
Distributed: a table structure that is hashed and distributed as evenly as possible across all PDW nodes on the appliance.
CREATE TABLE <TableName> ( <Column Names and Types> )
WITH (DISTRIBUTION = HASH(<One Column Name>))
A concrete sketch of both geometries follows below.
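As a minimal sketch of the two geometries (the table and column names here are hypothetical, not from the deck), a small dimension table is typically replicated, while a large fact table is hash-distributed on a column with many distinct values:

--Hypothetical example: replicate a small dimension table as a full copy on every node.
CREATE TABLE DimDate
( DateKey int NOT NULL,
  CalendarYear smallint,
  CalendarQuarter tinyint )
WITH (DISTRIBUTION = REPLICATE);

--Hypothetical example: hash-distribute a large fact table so its rows
--spread evenly across all compute nodes.
CREATE TABLE FactSales
( DateKey int NOT NULL,
  ProductKey int NOT NULL,
  QtySold int,
  DollarsSold money )
WITH (DISTRIBUTION = HASH(ProductKey));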
PDW table geometries
[Diagram: a star schema on an SMP system versus PDW compute nodes. The Sales Fact table (Date Dim ID, Store Dim ID, Prod Dim ID, Mktg Camp ID, Qty Sold, Dollars Sold) is distributed across the four compute nodes as SF 1-4, while full copies of the dimension tables are replicated on every node: Date Dim (Date Dim ID, Calendar Year, Calendar Qtr, Calendar Mo, Calendar Day), Item Dim (Prod Dim ID, Prod Category, Prod Sub Cat, Prod Desc), Store Dim (Store Dim ID, Store Name, Store Mgr, Store Size), and Promo Dim (Mktg Camp ID, Camp Name, Camp Mgr, Camp Start, Camp End).]
SQL Server PDW 2012 control architecture
[Diagram: a SELECT statement arrives at the control node, where the engine service's cost-based query optimizer compiles it against shell databases and produces plan steps; the plan steps are then executed by the SQL Server instances on the compute nodes.]
Querying data: execution sequence
Querying data: MPP engine
The MPP engine is designed for high-performance queries against large data sets.
Understanding the query architecture and execution steps of PDW is key to writing good queries.
The control node orchestrates the entire set of operations across all nodes to satisfy a query.
Avoid queries that create hot spots or cause excessive data redistribution; see the sketch below.
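As a hedged illustration of that last point (reusing the hypothetical DimDate and FactSales tables from the geometry sketch earlier, which are not from the deck), joining a hash-distributed fact table to a replicated dimension lets every compute node satisfy its share of the join from local data, with no redistribution:

--Each node joins its local FactSales rows to its full local copy of DimDate,
--so no rows move between nodes during the join.
SELECT d.CalendarYear,
       SUM(f.DollarsSold) AS TotalSales
FROM FactSales AS f
JOIN DimDate AS d
    ON f.DateKey = d.DateKey
GROUP BY d.CalendarYear;

By contrast, joining two large tables distributed on different columns forces the engine to shuffle one of them across the appliance first.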
What is Hadoop?
A solution that allows commodity computers to store data and process it in parallel.
Resilience and fault tolerance come not from hardware such as RAID, but from replication across the cluster.
Types of nodes in HDFS
NameNode: one per cluster; responsible for providing metadata about the blocks within the filesystem, tracking replication, and managing the filesystem namespace.
Backup Node: acts as a backup of the NameNode.
DataNodes: responsible for storing the file blocks; they hold the data requested by clients.
An example of NameNode and DataNodes
[Diagram: a file split into blocks 1-5, with each block replicated on more than one DataNode so that the failure of any single node loses no data.]
Other tasks on a Hadoop cluster
JobTracker: responsible for submitting client application requests to the TaskTrackers on the nodes that contain the data to be processed. It also monitors the TaskTrackers using heartbeats and reschedules a task on another TaskTracker in case of failure. One per cluster.
TaskTracker: responsible for performing the map, reduce, and shuffle operations on data. One per node.
The architecture of Hadoop
[Diagram: a Hadoop cluster connected to BI reporting tools, ETL tools, and an RDBMS.]
MapReduce
MapReduce is a framework that lets users write applications that take advantage of the fault tolerance provided by Hadoop.
MapReduce programs transform lists of input data elements into lists of output data elements, in two passes: first the map function, then the reduce function.
The initial input data is never altered: it is transformed, and the transformed map output becomes the input to the reduce function.
The mapper function
[Diagram: the mapper reads the input numbers 1 2 4 3 1 2 5 4 5 and emits a (number, 1) pair for each occurrence: 1(1), 2(1), 4(1), 3(1), 1(1), 2(1), 4(1), 5(1), 5(1).]
The reduce function
[Diagram: the reducer receives the mapper's (number, 1) pairs grouped by key and sums the values for each key, producing 1(2), 2(2), 3(1), 4(2), 5(2).]
Pseudo code for the MapReduce job
mapper(input_filename, file_data):
    for each number in file_data:
        emit(number, 1)

reducer(number, values):
    sum = 0
    for each value in values:
        sum = sum + value
    emit(number, sum)
Sqoop: load/unload utility
[Diagram: Sqoop transfers data between SQL Server instances and a Hadoop cluster.]
APS and Hadoop Integration
HDFS Bridge in PolyBase
[Diagram: two PDW nodes, each running SQL Server and DMS, connect through the HDFS Bridge to the HDFS nodes of a Hadoop cluster.]
1. DMS is present on all compute nodes and has been extended with an HDFS Bridge.
2. The HDFS Bridge hides the complexity of HDFS; the DMS components are reused for type conversions.
3. All HDFS file types (text, sequence, RCFile) are supported through the use of the appropriate RecordReaders by the HDFS Bridge; the Java class used is InputFormat.
Reading HDFS files in parallel
[Diagram: the DMS instances on the PDW nodes ask the HDFS NameNode for the locations of a file's blocks; the NameNode returns the block locations, and the DMS instances read the blocks into block buffers directly from the HDFS DataNodes in parallel.]
External table commands
There are two different ways to import data from HDFS to PDW:
I. CREATE EXTERNAL TABLE
II. CREATE TABLE AS SELECT (CTAS)
There is only one way to export data from PDW to HDFS:
I. CREATE EXTERNAL TABLE AS SELECT
Finally, there is the DROP EXTERNAL TABLE command.
Example of CREATE EXTERNAL TABLE (temp)
CREATE EXTERNAL TABLE ClickStream
( url varchar(50),
  event_date date,
  user_ip varchar(50) )
WITH
( LOCATION = 'hdfs://10.192.63.147:5000/tpch1gb/clickstream.txt',
  FORMAT_OPTIONS ( FIELD_TERMINATOR = ' ', DATE_FORMAT = 'mm/dd/yyyy' )
);
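Once created, the external table is queried like any other table; a minimal sketch (this query is illustrative and not from the deck):

--Count page hits per URL directly against the HDFS-backed table.
SELECT url, COUNT(*) AS hits
FROM ClickStream
GROUP BY url;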
Example of CREATE TABLE AS SELECT, part 1 (persistent)
--Create the external table called ClickStreamExt.
CREATE EXTERNAL TABLE ClickStreamExt
( url varchar(50),
  event_date date,
  user_ip varchar(50) )
WITH
( LOCATION = 'hdfs://myhadoop:5000/tpch1gb/clickstream.txt',
  FORMAT_OPTIONS ( FIELD_TERMINATOR = ' ' )
);
Example of CREATE TABLE AS SELECT, part 2 (persistent)
--Use your own processes to create the text-delimited files on the Hadoop cluster.
--Use CREATE TABLE AS SELECT to import the Hadoop data into a new
--SQL Server PDW table called ClickStreamPDW.
CREATE TABLE ClickStreamPDW
WITH
( CLUSTERED COLUMNSTORE INDEX,
  DISTRIBUTION = HASH (user_ip) )
AS SELECT * FROM ClickStreamExt;
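A quick sanity check after the import (a hypothetical query, not from the deck):

--Confirm the imported row count in the new distributed PDW table.
SELECT COUNT(*) AS imported_rows FROM ClickStreamPDW;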
Example of CREATE EXTERNAL TABLE AS SELECT (export)
USE AdventureWorksPDW2012;
CREATE EXTERNAL TABLE hdfscustomer
WITH
( LOCATION = 'hdfs://10.192.63.147:5000/files/customer',
  FORMAT_OPTIONS ( FIELD_TERMINATOR = ' ' )
)
AS SELECT * FROM dimcustomer;
The DROP EXTERNAL TABLE command
--Drop an external table from PDW. This does not delete the external data.
DROP EXTERNAL TABLE [ database_name . [ dbo ] . | dbo. ] table_name [;]
Example:
DROP EXTERNAL TABLE ClickStream;