Microsoft Analytics Platform System The turnkey modern data warehouse appliance Stefan Cronjaeger June 2014
Agenda
- Modern Data Warehouse
- Big Data
- Application examples
- Analytics Platform System
- Architecture
- Hadoop
- Integration of Hadoop and APS
  - APS with external Hadoop clusters
  - APS with Hadoop in the cloud
  - APS with integrated Hadoop
Data sources
Data sources: Non-relational data
Big Data: Variety, Velocity, Volume and Analytics
- Web
- Sensor and machine logs
- Social media
- Business apps
Technologies to drive Big Data
What to do with the data
- Geo analysis
- Forecast
- Customer interaction
- Keywords & sentiment
- Churn
- Customer segmentation
- Shopping basket & recommendation
- Scoring & outliers
Examples of sentiment analysis: not only marketing
- Browse blogs, Twitter, news articles, newsgroups
- Extract keywords, pairs of keywords, sentiments
- Analyze and correlate
Campaign supervision
- Political campaigns and keywords
- Marketing campaigns
- Trend analysis
Quality assurance
- Analyze internal technical discussion groups
- Get early warning of possible technical issues
Supply chain for fashion
- Look in fashion blogs and discussion groups
- Forecast demand for specific fashion articles
Structured data: fraud detection in large amounts of financial data, and where to look
Not all digits are equal!
- About 130 years ago Simon Newcomb noticed that more numbers start with the digit 1 than with any other digit. The effect was later rediscovered by Benford.
- The idea: look into the numbers (e.g., a balance sheet), see how their leading digits are usually distributed, and look for deviations.
Applications:
- Tax fraud in balance sheets (actually used by auditors)
- Manipulated numbers in scientific publications
- Fraud in elections and election campaign financing
An application of Benford's law
Differences in number statistics for EU reporting of social data and deficit data by country
Bernhard Rauch, Max Göttsche, Gernot Brähler & Thomas Kronfeld (2014): Deficit versus social statistics: empirical evidence for the effectiveness of Benford's law, Applied Economics Letters, 21:3, 147-151
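The Benford check described on these slides can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the cited paper: the function names are hypothetical, and the chi-square statistic is one common way to quantify the deviation, chosen here as an assumption. The expected share of leading digit d is log10(1 + 1/d).

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit 1..9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(value):
    """First significant digit of a non-zero number."""
    # Scientific notation puts the first significant digit at position 0.
    return int(f"{abs(value):.17e}"[0])


def benford_deviation(values):
    """Chi-square statistic of observed vs. expected leading-digit counts."""
    digits = [leading_digit(v) for v in values if v != 0]
    n = len(digits)
    counts = Counter(digits)
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in benford_expected().items()
    )


# Powers of 2 are known to follow Benford's law closely.
natural = [2 ** k for k in range(1, 201)]
# Numbers whose leading digits are uniformly spread (illustrative only).
suspicious = list(range(100, 1000))

print(benford_deviation(natural) < benford_deviation(suspicious))
```

A large statistic only says that a data set deserves a closer look; it is a hint on where to look, not proof of manipulation.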
Data sources Non-relational data
About Analytics Platform System
PDW Logical Architecture
Diagram labels: client queries, control node (virtualized), compute/storage nodes (virtualized), database host servers, direct attached storage, control host, virtualization spare node
- All servers are virtualization hosts running Windows Server 2012
- Control and compute nodes are virtual
- All run SQL Server 2012
- The control node spreads data and workload across the compute nodes
- Data loads run in parallel and take advantage of the power of all nodes
- Fast InfiniBand interconnect
Scalability: Massively Parallel and Shared Nothing
Add capacity: from smallest (0 TB) to largest (5 PB)
- Start small with a warehouse of a few terabytes
- Add capacity up to 5 petabytes
- Just grow by adding scale units
- An SMP system would have needed to be completely reconfigured
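The shared-nothing idea can be illustrated with a small Python sketch: rows are assigned to compute nodes by hashing a distribution key, each node aggregates only its own rows, and the control node merges the partial results. The function names and the use of CRC32 are assumptions for illustration; this is not the hash function or the query engine that PDW actually uses.

```python
import zlib
from collections import defaultdict


def distribute(rows, key, node_count):
    """Assign each row to a compute node by hashing its distribution key."""
    nodes = defaultdict(list)
    for row in rows:
        node = zlib.crc32(str(row[key]).encode("utf-8")) % node_count
        nodes[node].append(row)
    return nodes


def parallel_sum(nodes, column):
    """Each node sums its own rows; the control node adds the partial sums."""
    partial_sums = [sum(r[column] for r in rows) for rows in nodes.values()]
    return sum(partial_sums)


sales = [{"id": i, "amount": i * 10} for i in range(100)]

small = distribute(sales, "id", node_count=2)   # e.g. a base unit
larger = distribute(sales, "id", node_count=8)  # after adding scale units

# The answer is the same; only the amount of work per node changes.
print(parallel_sum(small, "amount"), parallel_sum(larger, "amount"))
```

Because no node needs another node's rows to compute its partial result, adding nodes reduces the work per node without changing the query.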
Rack layout (top to bottom):
- 2 InfiniBand FDR 36-port switches
- 2 Ethernet switches (5120-24G)
- Control node (DL360p)
- Failover node (DL360p)
- For customer use
- 3rd scale unit (for 8 nodes): 2 ProLiant DL360p compute nodes, storage block (D6000), 70 drives
- 2nd scale unit (for 6 nodes): 2 ProLiant DL360p compute nodes, storage block (D6000), 70 drives
- 1st scale unit (for 4 nodes): 2 ProLiant DL360p compute nodes, storage block (D6000), 70 drives
- Base unit (for 2 nodes): 2 ProLiant DL360p compute nodes, storage block (D6000), 70 drives
Notes:
- The base unit has an approximate usable storage capacity of 75 TB, based on 5:1 compression.
- 3 additional scale units fit into 1 rack, for up to 300 TB of usable storage.
- Multiple racks can be configured for more usable storage.
- The 1 TB drives can be replaced with 2 TB or 3 TB drives, for double or triple capacity. However, multiple scale units provide better performance than one base unit with larger hard drives. For example, 3 scale units with 1 TB drives will perform much better than 1 base unit with 3 TB drives.
- Backup node and landing zone (ETL storage) are not included. Customers can order whatever they want for backup purposes and install it themselves.
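The capacity figures on this slide follow from simple arithmetic, sketched below in Python. The function name is hypothetical, and the sketch assumes that every unit (base or scale) contributes the same roughly 75 TB with 1 TB drives and 2 compute nodes, which matches the 300 TB quoted for a full rack of four units.

```python
def rack_configuration(units, drive_tb=1, tb_per_unit=75, nodes_per_unit=2):
    """Approximate usable storage and compute nodes for a number of units.

    tb_per_unit is the slide's figure for 1 TB drives at 5:1 compression.
    """
    if units < 1:
        raise ValueError("at least the base unit is required")
    return {
        "compute_nodes": units * nodes_per_unit,
        "usable_tb": units * tb_per_unit * drive_tb,
    }


full_rack = rack_configuration(4)                # base unit + 3 scale units
scaled_out = rack_configuration(3, drive_tb=1)   # three units, 1 TB drives
big_drives = rack_configuration(1, drive_tb=3)   # one unit, 3 TB drives

print(full_rack)    # 8 compute nodes, 300 TB
print(scaled_out)   # 6 compute nodes, 225 TB
print(big_drives)   # 2 compute nodes, 225 TB
```

The last two configurations offer the same usable storage, but the scaled-out one has three times as many compute nodes working on it, which is the slide's argument for preferring scale units over larger drives.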
Software
- Windows Server 2012: control node, management node and compute nodes run in a virtualized environment
- Workload management: workload classes
- System Center 2012: single user interface for management of PDW, OS, BI, custom apps and private cloud
- xVelocity: in-memory execution, clustered columnstore
- SQL Server 2012 inside
- Visual Studio Data Tools
- Power View directly on PDW
- Big Data integration with PolyBase: T-SQL queries to Hadoop, external tables on Hadoop
A multi-region/workload appliance
What is Hadoop?
Hadoop = MapReduce + HDFS
A distributed, scalable system on commodity hardware composed of:
- HDFS: distributed file system
- MapReduce: programming model
- Others: HBase, R, Pig, Hive, Flume, Mahout, Avro, Zookeeper
Ecosystem shown in the diagram: HCatalog, Oozie, HBase (column DB), Cassandra, Couch, MongoDB, Hive, Mahout, R, Cascading, Pig, Flume, Sqoop, Zookeeper, Ambari, Avro
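The MapReduce programming model named on this slide can be sketched with the classic word count. The Python below runs in a single process and only mimics the three phases (map, shuffle, reduce); the function names are hypothetical and no Hadoop API is involved.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


print(word_count(["big data", "Big warehouse"]))
```

On a real cluster the map calls run on the nodes that hold the HDFS blocks, and the shuffle moves data between nodes; the logic per phase stays this simple.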
APS: Parallel Data Warehouse and HDInsight region
Diagram labels: control node, failover node, Hadoop head node, Hadoop redundant head node, for customer use, PDW region with PDW scale units, Hadoop regions
Configurable:
- Minimum 1 PDW region
- Additional PDW scale units
- Additional HDI scale units
HDI region overview
- In a nutshell, it's an HDI instance running on an appliance. HDInsight is the Microsoft-branded Hortonworks distribution.
- An integrated appliance for running a PDW region and an HDI region
- PDW is offered as a stand-alone workload on the appliance. HDI is offered only as an add-on to PDW, as a scale unit.
- Based on V2 hardware
- High availability for the head node is provided via Windows Failover Clustering (WFC); data node high availability is provided via HDFS/MapReduce mechanisms
- Security add-ons address security issues that are not covered by standard Hadoop
- Support for multiple user accounts
Query Hadoop data with T-SQL using PolyBase
Bringing the worlds of big data and the data warehouse together for users and IT
- Single T-SQL query model for PDW and Hadoop with rich features of T-SQL, including joins without ETL
- Leverages the power of MPP to enhance query execution performance
- Supports Windows Azure HDInsight to enable new hybrid cloud scenarios
- Queries non-Microsoft Hadoop distributions such as Hortonworks and Cloudera
Diagram: a T-SQL SELECT goes through PolyBase in SQL Server Parallel Data Warehouse to Microsoft HDInsight, Windows Azure HDInsight, Cloudera or Hortonworks (Windows, Linux), and a result set comes back.
Big data insights for any user
Native Microsoft BI integration to create new insights with familiar tools
- Leverages high adoption of Excel, Power View, Power Pivot, and SSAS
- No IT intervention required
- Allows any user to create new insights with familiar tools
- Analyze PDW and Hadoop data in the same view
Audiences: power users, data scientists, and everyone else using Microsoft BI tools
Differentiation: Freedom of deployment options and hybrid solutions
APS Management Console 1 PDW and Appliance
PolyBase Use Case Category 1: Integration with external Hadoop clusters
Listening to SQL customers: ShinSeGae
Investing in an online shopping website ("Korea's Amazon")
Setup: SQL Server 2012 PDW & HDP 1.3/HDP 2.0 on Linux
What they want:
1. We want to perform complex data mining on customer purchase data (basket analysis).
2. We want to understand the social media data (reviews/Twitter), specifically around our products & stores.
3. We will use Hadoop to keep all of our data, envisioned to be around 480 TB. PDW will be the efficient analysis engine for the hot data.
4. PDW & PolyBase are much faster than Hive.
5. We're interested in using data mining cloud services in Azure (hybrid scenarios).
Microsoft NDA - Material
Listening to SQL customers: Telecom
Understanding network quality
Setup: SQL Server 2012 PDW & Cloudera 4.5 on Linux
What they want:
1. We collect millions of network records for quality assessment and capacity planning on a daily basis.
2. Hadoop will be used for storage and ETL of these network record files.
3. PDW is for more operational analysis, ad hoc analysis and operational reports.
4. We are using PolyBase along with Oozie-based orchestration for a seamless & automated integration.
Solution Architecture: Integration with external Hadoop cluster (1)
PolyBase for integrating with various Hadoop distributions
- Support of Hortonworks' HDP 1.x & 2.x (Windows Server and Linux)
- Support of Cloudera's CDH 4.x (on Linux)
Push-down computation with the AU1 release
- Pushing computation to where the data resides (Hadoop as query execution & processing aid)
- Transparent for users: no need to learn map/reduce
- Seamless query experience through external tables, plus simplified & parallelized ETL through T-SQL (CTAS for import & CETAS for export)
Integration with 3rd-party tools and the Microsoft insights/BI layer
- Existing applications simply work
- External tables populated through the application layer like regular tables
SQL Server security model
- You decide who sees what type of data
- SQL Server permission model adapted for each PolyBase object: external table, data source, and file format
Diagram labels: your apps, PowerPivot & Power View, Microsoft APS with PolyBase (external table, external data source, external file format, PolyBase/APS query engine), APS control & data nodes; sources: web apps, social apps, mobile apps, sensor & RFID
T-SQL Examples: Integration with external Hadoop cluster (2)

Creating the external data source, file format and external table:

    CREATE EXTERNAL DATA SOURCE [HDP2.0]
    WITH (TYPE = HADOOP, LOCATION = 'hdfs://hdp:8020', JOB_TRACKER_LOCATION = 'HDP:50300');

    CREATE EXTERNAL FILE FORMAT MyRCFile
    WITH (FORMAT_TYPE = RCFILE, SERDE_METHOD = 'LazyBinarySerDe');

    CREATE EXTERNAL TABLE Clickstream (url varchar(50), event_date date)
    WITH (DATA_SOURCE = [HDP2.0], LOCATION = '/employees/employee.txt', FILE_FORMAT = MyRCFile);

Querying Hadoop data:

    SELECT user_name
    FROM ClickStream cs, PDW_User u
    WHERE cs.user_ip = u.user_ip AND cs.url = 'www.microsoft.com';

Persistently exporting (CETAS) and importing (CTAS):

    CREATE EXTERNAL TABLE Web_Sales
    WITH (LOCATION = '/TPCDS/web_sales/', DATA_SOURCE = [HDP2.0], FILE_FORMAT = MyRCFile)
    AS SELECT u.* FROM PDW_User u;

    CREATE TABLE PDW_Sales
    WITH (DISTRIBUTION = HASH (id))
    AS SELECT * FROM Web_Sales;

Diagram labels: your apps, PowerPivot & Power View, Microsoft APS with PolyBase (external table, external data source, external file format, PolyBase/APS query engine), APS control & data nodes; sources: web apps, social apps, mobile apps, sensor & RFID
Solution Architecture (Details): ShinSeGae
Data landing in Hadoop:
1. Web log data (160 GB daily), external PolyBase tables A, B, C. Source: SSG.com (renewal) online shopping mall, tracking log servers, message queues (Kafka, open source), complex event processing (Storm)
2. Unstructured/semi-structured text data (board/SNS/internal text, weather, ...), external PolyBase tables D, E, F
3. Company emails (mails, campaigns), external PolyBase tables G, H, I
Hadoop: HDP 1.3 on Linux (5-10 servers), raw/cold data
Connection: PolyBase queries over 10 Gb Ethernet
APS/PDW: operational data store and EDW; recent/hot data stored in PDW
Consumers: EIS, OLAP (Tabular), data mining, visualization (Silverlight), BI analyst
Results: analytic information (right customer targeting), recommendation engine & personalized advertising
Solution Architecture (Details): Telecom
Capturing network logs (>300 GB per day)
- High-frequency event processing (network logs)
- External PolyBase tables A, B, C
- Usage of Hive's metadata store (HCatalog)
Hadoop: Cloudera's CDH 4 on HP (18+ servers), raw/cold data (petabytes of network logs)
Connection: PolyBase queries over InfiniBand
APS/PDW: operational data store and EDW; hot operational PDW data
Orchestration: Oozie workflows; remote procedure calls via stored procedures trigger PolyBase queries
Consumers: network quality analysis, capacity planning, visualization (PowerPivot), BI analyst/planner/decision-maker
PolyBase Use Case Category 2: Integration with Microsoft Azure
Listening to SQL customers (5): Government
Bridging the gap between cloud & on-prem
Setup: current POC, SQL Server 2012 PDW & HDInsight on Azure
What they want:
1. HDInsight/Hadoop in the cloud to store and massage our raw data (XML files) generated by our web application.
2. PDW to keep the data on-prem (legal requirement) and to have an efficient query engine for analysis purposes.
3. PolyBase is a great way of accessing our files in the cloud via simple T-SQL.
4. With this solution, we can allow web users to quickly ask questions while the heavy, more complex business analysis is accomplished by PDW users.
Solution Architecture: Hybrid Scenarios
PolyBase as key integrative feature
- Integration with external Hadoop, HDInsight region & Azure Storage
Data aging strategies
- Aging of cold data to Azure Storage
- APS & HDInsight region for hot & warm data
- Query hot data & cold aged data
APS as modern cloud end-point for Azure
- Seamless querying of hot & cold data through APS
- APS as gateway allowing users to query all on-prem data via Power BI and T-SQL
Diagram labels: Microsoft Azure (your apps, Azure HDInsight, Azure Storage), public Internet or Azure ExpressRoute, on-premises or private cloud (your apps, Microsoft or 3rd-party applications, Microsoft APS with PolyBase, APS control & data nodes)

T-SQL examples:

    CREATE EXTERNAL DATA SOURCE WASB
    WITH (TYPE = HADOOP, LOCATION = 'wasbs://dailylogs@myaccount.blob.core.windows.net');

    CREATE EXTERNAL TABLE clickstream_hdinsights (url varchar(50), event_date date)
    WITH (DATA_SOURCE = WASB, LOCATION = '/input/log1.txt', FILE_FORMAT = MyDelimitedText);

    SELECT * FROM clickstream_hdinsights, PDW_Table;
Solution Architecture (Details): Government
Sources: web application for tax filing (eInvoice) and other web feeds; the web apps generate tons of smaller XML files (~7 KB each)
HDI on Azure: HDI tools for data transformation, transforming the XML files into large text files of ~10 GB each (external WASB tables)
Azure Blob Storage: cheap data store alternative to an on-prem Hadoop solution
Connection: PolyBase queries over the public Internet or Azure ExpressRoute
APS/PDW: operational data store and EDW; PDW/APS for fast query response & data processing of hot data
Consumers: Microsoft BI stack, IBM Cognos
PolyBase Use Case Category 3: Unified Appliance with PDW and HDInsight region
Listening to SQL customers (6): Beverage & Vending Machines
What are you drinking? Why is the machine down?
Setup: POC, SQL Server/APS with PDW & HDI region
What they want:
1. We want a complete solution stack. We do not have Hadoop experts in-house and don't have the money to get them.
2. We want to store all raw data coming from vending machines in Hadoop.
3. A 360-degree view of all our data: structured customer data & unstructured data coming from vending machines.
4. Predictive maintenance of machines.
Solution Architecture: Unified APS appliance
Unified appliance
- Multi-workload support with PDW and HDInsight region
- HDInsight powered by HDP bits
- No need to deal with multiple support teams ("better together")
Seamless & high-performing query experience through PolyBase
- External tables can be used for HDI data
- PDW data nodes connected via high-speed network (InfiniBand) to Hadoop data nodes
Simplified management & monitoring
- One consistent monitoring experience through appliance management tools
Diagram labels: your apps, PowerPivot & Power View, external table, distributed & replicated table, unified Microsoft APS with PDW & HDI region, APS control & data nodes, HDI name & data nodes; sources: web apps, social apps, mobile apps, sensor & RFID

T-SQL examples:

    CREATE EXTERNAL DATA SOURCE HDI_R
    WITH (TYPE = HADOOP, LOCATION = 'hdfs://htukia-c-hhn01:8020', JOB_TRACKER_LOCATION = 'HTUKIA-C-HHN01:50300');

    CREATE EXTERNAL TABLE HDI_Region (url varchar(50), event_date date)
    WITH (DATA_SOURCE = HDI_R, LOCATION = '/input/log1.txt', FILE_FORMAT = MyDelimitedText);

    SELECT * FROM HDI_Region, PDW_Table;
Solution Architecture (Details): Internal Microsoft Data Scientists
Sources: msn.com log files and Microsoft server log files (~3 TB of web traffic to analyze), entering through a secure gateway with AD integration
APS with PDW & HDI region: HDI region (1 scale unit) and PDW region (full rack PDW), connected via InfiniBand for PolyBase queries, managed with System Center & Admin Console
- Data scientist group 1: using chains of Hive queries & Power Query via Hive ODBC
- Data scientist group 2: using PolyBase for existing tooling (T-SQL, BI tools), processing complex analytical queries, with a consistent management experience; Power Query/Power View/Power Map and analytical queries via SSDT
Microsoft Digital Crime Unit
Part of Microsoft LCA (Legal and Corporate Affairs), mandated to help protect the Internet
DCU's challenge: effectively combating digital crime requires the collection of huge amounts of data from multiple sources. DCU needs to be able to:
- Process 10s of TBs daily and house PBs of data historically (accessible as needed)
- House 100s of terabytes from multiple sources in an easily queryable form
- Use leading-edge business intelligence and visualization tools
DCU Big Data Solution
Users: corporate security officers, DCU investigators and analysts
Data sources: sinkholes, passive DNS, files, 3rd-party security information
Storage and processing: 500 TB SAN storage; Hadoop 30-node cluster on Windows; SSIS; HP EDW Appliance (MSFT PDW); MSFT SQL StreamInsight; SQL Azure / Azure
BI layer: HP Business Decision Appliance with SharePoint, SSRS, SSAS, Power View, PowerPoint; Excel with PowerPivot; predictive analytics; embedded BI
Microsoft Digital Crime Unit (currently being implemented)
Pipeline: drop, extract, load, transform. Hadoop, SSIS, PDW and SSAS each serve as a data source for Microsoft BI.
Implementation:
- 30-node Hadoop on Windows Server
- Control rack and 10-node PDW data rack
- HP BDA (Business Decision Appliance) upgraded to SQL 2012
- BI Voyage is currently implementing the PDW and BI portions of the project
Why 2 Storage Platforms?
Storage capacity
  Hadoop: in the petabytes
  PDW:    in the 100s of terabytes
Loading
  Hadoop: simplified load, just drop unstructured or semi-structured files
  PDW:    ETL process is more complex, to transform data into reporting-optimized DB structures
Query optimization
  Hadoop: no optimization of queries
  PDW:    structures can be optimized for common query patterns
Users
  Hadoop: queried by IT professionals
  PDW:    queried by business analysts
Multiple sources
  Hadoop: complex and slow to query multiple sources at once
  PDW:    optimized for fast query against key data from multiple sources
Summary
  Hadoop is DCU's centralized data warehouse. Simple load and high capacity make it optimal for storing huge volumes of data.
  PDW is DCU's data mart platform. Easily accessible, intuitive data structures, and blazing fast for querying data.
APS Differentiators
- Part of a product family: from SQL Server standalone to cloud service offerings
- TCO: very low, especially when looking at the whole bundle: ETL (SSIS), PDW, data marts (SQL Server) and analytics (SSAS, SSRS)
- Appliance: much lower effort for DBAs
- Microsoft product stack integration: SSIS, SSAS, SSRS, PowerPivot, System Center, integration with cloud services
- Linear scaling via shared nothing
- xVelocity: columnstore and in-memory execution
- PolyBase: integration with Big Data and Hadoop
- HDInsight integrated: fast InfiniBand interconnect, management and security
"Microsoft exhibits one of the best value propositions on the market with a low cost and a highly favorable price/performance ratio" (Gartner, February 2012)
Columnstore
- Up to 100x faster queries
- Up to 15x more compression
- Updatable clustered columnstore vs. table with customary indexing
- Parallel query execution
- Store data in columnar format for massive compression
- Load data into or out of memory for next-generation performance
- Updatable and clustered for real-time trickle loading
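Why a columnar layout compresses well can be shown with a small Python sketch. It uses run-length encoding as a stand-in; this is an illustrative assumption, not a description of the compression that xVelocity actually applies, and the function names are hypothetical.

```python
from itertools import groupby


def to_columns(rows):
    """Turn a list of row dicts into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def rle_encode(column):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


rows = (
    [{"country": "DE", "year": 2014}] * 4
    + [{"country": "US", "year": 2014}] * 3
)
columns = to_columns(rows)

print(rle_encode(columns["country"]))  # two runs instead of seven values
print(rle_encode(columns["year"]))     # one run instead of seven values
```

Values of one column sit next to each other and tend to repeat, so they collapse into a few runs. A query that touches only one column also reads only that column's data.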
Concurrency that fuels rapid adoption
Great performance with mixed workloads
Diagram labels: sources (ERP, CRM, LOB apps, SQL Server SMP, Hadoop / Big Data); load paths (ETL/ELT with SSIS, DQS, MDS; ETL/ELT with DWLoader; PolyBase); Analytics Platform System (PDW, HDInsight); consumers (SSRS / SSAS, BI tools, ad hoc queries)
MEC, a global media agency, uses SQL Server PDW with in-memory technology to cut query time helping marketers unlock the value of their data. SQL Server Analytics Platform System gives us massively parallel advantages. Whereas it would take up to four hours to run queries scaling across multiple nodes, now it takes just minutes.
Value through a single flexible appliance solution
Why Analytics Platform System when I have SQL Server?
Single appliance solution (PDW, PolyBase, HDInsight)
- Reduce the data center footprint
- Lower energy costs and usage
- Accelerate time to value and insights with no forklift required for scaling out
- Simplify management with built-in System Center
- Reduce tuning efforts while retaining high performance
Value through a single flexible appliance solution
Why Analytics Platform System when I have SQL Server?
Your choice of hardware (PDW, PolyBase, HDInsight)
- Integrated support plan with a single Microsoft contact
- Co-engineered with HP, Dell and Quanta best practices
- Pre-configured, built and tuned software and hardware
- Leading performance with commodity hardware
CROSSMARK needed faster and more detailed insight into terabytes of information about product supply and demand. They deployed a turnkey business intelligence solution from Microsoft and HP that is based on the Microsoft SQL Server Parallel Data Warehouse. People can instantly create their own reports with SQL Server Power View and PowerPivot for Excel and they can build those reports 50 percent to many times faster compared with the previous system.