Structured data meets unstructured data in Azure and Hadoop
Sameer Parve (sameerpa@microsoft.com) and Blesson John (Blessonj@Microsoft.com)
PFE, SQL Server/Analytics Platform System
October 30th, 2014
Agenda
Data sources
Data sources: Non-Relational Data
SQL Server (SMP) vs. Microsoft MPP (APS/PDW)
SMP vs. MPP
Microsoft data warehouse vision
Make SQL Server the fastest and most affordable database for customers of all sizes:
Massive scalability at low cost
Flexibility and choice
Complete data warehouse solution
Simplified data warehouse management
Appliance for high-end massively parallel processing (MPP) data warehousing
Ideal for high-scale or high-performance data marts and EDWs
InfiniBand and Ethernet networking
A data warehouse appliance (fully integrated software and hardware) that scales from tens of terabytes to 6 PB
Microsoft Analytics Platform System (APS)
A pre-built hardware + software appliance, co-engineered with HP, Dell, and Quanta
Pre-built hardware and pre-installed software; the appliance is installed in 1-2 days
Support: Microsoft provides first-call support, and the hardware partner provides onsite break/fix support
Plug and play, with built-in best practices, to save time
Hardware architecture overview
One standard node type: two 8-core Intel processors, with memory doubled to 256 GB
Updated to the newest InfiniBand (FDR, 56 Gb/sec); both InfiniBand and Ethernet networks
Hosts HST01-HST02 and HSA01-HSA02 connect to JBOD storage over direct-attached SAS
Moving from SAN to JBODs: this reduces costs significantly, moves away from dependency on a handful of key SAN vendors, and uses Windows Server 2012 technologies to achieve the same level of reliability and robustness
Backup and Landing Zone (LZ) are now reference architectures and are not part of the appliance; customers can use their own hardware, and can use more than one BU or LZ for high availability
Scale unit concept:
Base unit: the minimum configuration; populates the rack with networking
Scale unit: adds capacity in increments of 2-3 compute nodes and related storage
Passive unit: increases high-availability (HA) capacity by adding more spares
Virtual machine architecture overview
[Diagram: hosts HST01-HST02 run the CTL, MAD, AD, and VMM virtual machines; hosts HSA01-HSA02 run the Compute 1 and Compute 2 virtual machines, attached to a JBOD over direct-attached SAS, with InfiniBand and Ethernet networks.]
General details:
All hosts run Windows Server 2012 Standard, and all virtual machines run Windows Server 2012 Standard as the guest operating system
All fabric and workload activity happens in Hyper-V virtual machines
The fabric virtual machines, MAD01, and CTL share one server, lowering overhead costs, especially for small topologies
PDW Agent runs on all hosts and all virtual machines, collecting appliance health data on fabric and workload
DWConfig and the Admin Console continue to exist, with minor extensions that expose host-level information
Windows Storage Spaces handles mirroring and spares, enabling the use of lower-cost DAS (JBODs) rather than SAN
PDW workload details:
The control node runs the PDW engine and DMS Manager on SQL Server 2012 Enterprise Edition (PDW build), with shell databases just as in AU3+
Each compute node runs DMS Core and SQL Server 2012 Enterprise Edition (PDW build)
Storage details:
More files per filegroup, and a larger number of spindles used in parallel
Seamlessly add capacity (PDW/HDI)
Start small with a warehouse capacity of several terabytes and add capacity incrementally, growing from roughly 53 TB to the largest warehouses at 6 PB.
PDW table geometries
Replicated: a table structure that exists as a full copy within each PDW node.
CREATE TABLE <TableName> ( <Column Names and Types> )
WITH (DISTRIBUTION = REPLICATE)
Distributed: a table structure that is hashed and distributed as evenly as possible across all PDW nodes on the appliance.
CREATE TABLE <TableName> ( <Column Names and Types> )
WITH (DISTRIBUTION = HASH(<One Column Name>))
A concrete sketch of both geometries follows below.
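As a minimal sketch of the two geometries (the table and column names here are hypothetical, not from the deck), a small dimension table is typically replicated, while a large fact table is hash-distributed on a column with many distinct values:

--Hypothetical example: replicate a small dimension table as a full copy on every node.
CREATE TABLE DimDate
( DateKey int NOT NULL,
  CalendarYear smallint,
  CalendarQuarter tinyint )
WITH (DISTRIBUTION = REPLICATE);

--Hypothetical example: hash-distribute a large fact table so its rows
--spread evenly across all compute nodes.
CREATE TABLE FactSales
( DateKey int NOT NULL,
  ProductKey int NOT NULL,
  QtySold int,
  DollarsSold money )
WITH (DISTRIBUTION = HASH(ProductKey));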
PDW table geometries
[Diagram: a star schema on an SMP system versus PDW compute nodes. The Sales Fact table (Date Dim ID, Store Dim ID, Prod Dim ID, Mktg Camp ID, Qty Sold, Dollars Sold) is distributed across the four compute nodes as SF 1-4, while full copies of the dimension tables are replicated on every node: Date Dim (Date Dim ID, Calendar Year, Calendar Qtr, Calendar Mo, Calendar Day), Item Dim (Prod Dim ID, Prod Category, Prod Sub Cat, Prod Desc), Store Dim (Store Dim ID, Store Name, Store Mgr, Store Size), and Promo Dim (Mktg Camp ID, Camp Name, Camp Mgr, Camp Start, Camp End).]
SQL Server PDW 2012 control architecture
[Diagram: a SELECT statement arrives at the control node, where the engine service's cost-based query optimizer compiles it against shell databases and produces plan steps; the plan steps are then executed by the SQL Server instances on the compute nodes.]
Querying data: execution sequence
Querying data: MPP engine
The MPP engine is designed for high-performance queries against large data sets.
Understanding the query architecture and execution steps of PDW is key to writing good queries.
The control node orchestrates the entire set of operations across all nodes to satisfy a query.
Avoid queries that create hot spots or cause excessive data redistribution; see the sketch below.
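As a hedged illustration of that last point (reusing the hypothetical DimDate and FactSales tables from the geometry sketch earlier, which are not from the deck), joining a hash-distributed fact table to a replicated dimension lets every compute node satisfy its share of the join from local data, with no redistribution:

--Each node joins its local FactSales rows to its full local copy of DimDate,
--so no rows move between nodes during the join.
SELECT d.CalendarYear,
       SUM(f.DollarsSold) AS TotalSales
FROM FactSales AS f
JOIN DimDate AS d
    ON f.DateKey = d.DateKey
GROUP BY d.CalendarYear;

By contrast, joining two large tables distributed on different columns forces the engine to shuffle one of them across the appliance first.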
What is Hadoop?
A solution that allows commodity computers to store data and process it in parallel.
Resilience and fault tolerance come not from hardware such as RAID, but from replication across the cluster.
Types of nodes in HDFS
NameNode: one per cluster; responsible for providing metadata about the blocks within the filesystem, tracking replication, and managing the filesystem namespace.
Backup Node: acts as a backup of the NameNode.
DataNodes: responsible for storing the file blocks; they hold the data requested by clients.
An example of NameNode and DataNodes
[Diagram: a file split into blocks 1-5, with each block replicated on more than one DataNode so that the failure of any single node loses no data.]
Other tasks on a Hadoop cluster
JobTracker: responsible for submitting client application requests to the TaskTrackers on the nodes that contain the data to be processed. It also monitors the TaskTrackers using heartbeats and reschedules a task on another TaskTracker in case of failure. One per cluster.
TaskTracker: responsible for performing the map, reduce, and shuffle operations on data. One per node.
The architecture of Hadoop
[Diagram: a Hadoop cluster connected to BI reporting tools, ETL tools, and an RDBMS.]
MapReduce
MapReduce is a framework that lets users write applications that take advantage of the fault tolerance provided by Hadoop.
MapReduce programs transform lists of input data elements into lists of output data elements, in two passes: first the map function, then the reduce function.
The initial input data is never altered: it is transformed, and the transformed map output becomes the input to the reduce function.
The mapper function
[Diagram: the mapper reads the input numbers 1 2 4 3 1 2 5 4 5 and emits a (number, 1) pair for each occurrence: 1(1), 2(1), 4(1), 3(1), 1(1), 2(1), 4(1), 5(1), 5(1).]
The reduce function
[Diagram: the reducer receives the mapper's (number, 1) pairs grouped by key and sums the values for each key, producing 1(2), 2(2), 3(1), 4(2), 5(2).]
Pseudo code for the MapReduce job
mapper(input_filename, file_data):
    for each number in file_data:
        emit(number, 1)

reducer(number, values):
    sum = 0
    for each value in values:
        sum = sum + value
    emit(number, sum)
Sqoop: load/unload utility
[Diagram: Sqoop transfers data between SQL Server instances and a Hadoop cluster.]
APS and Hadoop Integration
HDFS Bridge in PolyBase
[Diagram: two PDW nodes, each running SQL Server and DMS, connect through the HDFS Bridge to the HDFS nodes of a Hadoop cluster.]
1. DMS is present on all compute nodes and has been extended with an HDFS Bridge.
2. The HDFS Bridge hides the complexity of HDFS; the DMS components are reused for type conversions.
3. All HDFS file types (text, sequence, RCFile) are supported through the use of the appropriate RecordReaders by the HDFS Bridge; the Java class used is InputFormat.
Reading HDFS files in parallel
[Diagram: the DMS instances on the PDW nodes ask the HDFS NameNode for the locations of a file's blocks; the NameNode returns the block locations, and the DMS instances read the blocks into block buffers directly from the HDFS DataNodes in parallel.]
External table commands
There are two different ways to import data from HDFS to PDW:
I. CREATE EXTERNAL TABLE
II. CREATE TABLE AS SELECT (CTAS)
There is only one way to export data from PDW to HDFS:
I. CREATE EXTERNAL TABLE AS SELECT
Finally, there is the DROP EXTERNAL TABLE command.
Example of CREATE EXTERNAL TABLE (temp)
CREATE EXTERNAL TABLE ClickStream
( url varchar(50),
  event_date date,
  user_ip varchar(50) )
WITH
( LOCATION = 'hdfs://10.192.63.147:5000/tpch1gb/clickstream.txt',
  FORMAT_OPTIONS ( FIELD_TERMINATOR = ' ', DATE_FORMAT = 'mm/dd/yyyy' )
);
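Once created, the external table is queried like any other table; a minimal sketch (this query is illustrative and not from the deck):

--Count page hits per URL directly against the HDFS-backed table.
SELECT url, COUNT(*) AS hits
FROM ClickStream
GROUP BY url;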
Example of CREATE TABLE AS SELECT, part 1 (persistent)
--Create the external table called ClickStreamExt.
CREATE EXTERNAL TABLE ClickStreamExt
( url varchar(50),
  event_date date,
  user_ip varchar(50) )
WITH
( LOCATION = 'hdfs://myhadoop:5000/tpch1gb/clickstream.txt',
  FORMAT_OPTIONS ( FIELD_TERMINATOR = ' ' )
);
Example of CREATE TABLE AS SELECT, part 2 (persistent)
--Use your own processes to create the text-delimited files on the Hadoop cluster.
--Use CREATE TABLE AS SELECT to import the Hadoop data into a new
--SQL Server PDW table called ClickStreamPDW.
CREATE TABLE ClickStreamPDW
WITH
( CLUSTERED COLUMNSTORE INDEX,
  DISTRIBUTION = HASH (user_ip) )
AS SELECT * FROM ClickStreamExt;
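A quick sanity check after the import (a hypothetical query, not from the deck):

--Confirm the imported row count in the new distributed PDW table.
SELECT COUNT(*) AS imported_rows FROM ClickStreamPDW;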
Example of CREATE EXTERNAL TABLE AS SELECT (export)
USE AdventureWorksPDW2012;
CREATE EXTERNAL TABLE hdfscustomer
WITH
( LOCATION = 'hdfs://10.192.63.147:5000/files/customer',
  FORMAT_OPTIONS ( FIELD_TERMINATOR = ' ' )
)
AS SELECT * FROM dimcustomer;
The DROP EXTERNAL TABLE command
--Drop an external table from PDW. This does not delete the external data.
DROP EXTERNAL TABLE [ database_name . [ dbo ] . | dbo. ] table_name [;]
Example:
DROP EXTERNAL TABLE ClickStream;