Modern Data Warehousing Cem Kubilay Microsoft CEE, Turkey & Israel
Time is FY15
Gartner Survey, April 2014
2014: 5% think Hadoop will replace existing DW solution (2013: 10%), a 50% decline even with the presence of Hadoop 2.0
[Chart: Hadoop adoption by category: piloting on-premises, piloting in the cloud, production on-premises with cluster, production on-premises with appliance; values shown: 15%, 10%, 4%, 14%, 57%]
Data sources
Data sources: Non-Relational Data
The EDW also powers BI and other analytical solutions that offer business insight
[Diagram labels: Fraud Department, Analytics Department, Enterprise Data Warehouse, Reports, Information Workers, Dashboards, Finance Department, Reports]
Reducing Cycle Time in the Enterprise
[Diagram: value over time after a business event]
Business event to data captured: data latency
Data captured to intelligence delivered: analysis latency
Intelligence delivered to action taken: decision latency
Business event to action taken: action time or action distance
Source: TDWI, The Business Case for Real-Time BI. Based on concept developed by Richard Hackathorn, Bolder Technology
Reducing Cycle Time in the Enterprise (with EDW)
[Diagram: the same value-over-time curve as the previous slide, with an EDW label added next to analysis latency]
Source: TDWI, The Business Case for Real-Time BI. Based on concept developed by Richard Hackathorn, Bolder Technology
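The latency breakdown on this slide reduces to simple arithmetic: action time is the sum of data latency, analysis latency, and decision latency. The Python sketch below illustrates that sum. The `CycleTime` class and all figures (in hours) are hypothetical and are not taken from the presentation.

```python
from dataclasses import dataclass


@dataclass
class CycleTime:
    """Latencies between a business event and the action taken, in hours."""
    data_latency: float      # business event -> data captured
    analysis_latency: float  # data captured -> intelligence delivered
    decision_latency: float  # intelligence delivered -> action taken

    @property
    def action_time(self) -> float:
        # Action time (action distance) spans the whole cycle.
        return self.data_latency + self.analysis_latency + self.decision_latency


# Hypothetical figures, for illustration only.
nightly_batch = CycleTime(data_latency=24.0, analysis_latency=8.0, decision_latency=4.0)
near_real_time = CycleTime(data_latency=0.5, analysis_latency=1.0, decision_latency=4.0)

saved = nightly_batch.action_time - near_real_time.action_time
print(f"Action time: {nightly_batch.action_time} h vs {near_real_time.action_time} h "
      f"(saved {saved} h)")
```

In this made-up case the saving comes entirely from the data and analysis latencies. Decision latency is unchanged, because it depends on people rather than on the platform.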
SQL Server Parallel Data Warehouse
Microsoft Analytics Platform System
Storage managed by Windows Storage Spaces
Each server has 256 GB RAM
Can scale up to 6 PB
Starts with ¼ rack (2 servers)
In-memory analytics
Integrated «BigData» analytics
PARALLEL QUERY EXECUTION
Table with 10,000 distinct Cust_ids, distributed on Cust_id
[Diagram: Control Node; Compute Node 1 ... Compute Node 10, each with a SQL Server instance; distributions Dist. 1 ... Dist. 8 on Compute Node 1 through Dist. 73 ... Dist. 80 on Compute Node 10]
Query sent to the Control Node:
    SELECT cust_id, SUM(units) FROM [sales] GROUP BY [cust_id]
Queries run against the distributions:
    SELECT cust_id, SUM(units) FROM sales_1 GROUP BY [cust_id]
    ...
    SELECT cust_id, SUM(units) FROM sales_8 GROUP BY [cust_id]
    ...
    SELECT cust_id, SUM(units) FROM sales_73 GROUP BY [cust_id]
    ...
    SELECT cust_id, SUM(units) FROM sales_80 GROUP BY [cust_id]
DIRECT RESULTS: 125 rows from each distribution
Fully Parallel Query Execution
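The distribution scheme on this slide can be imitated on a laptop. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the SQL Server instances. The modulo-based `distribution_for` function and the sample `units` values are assumptions made for illustration, not the appliance's actual hash function or data. The loop also runs the per-distribution queries one after another, whereas the appliance runs them in parallel.

```python
import sqlite3

NUM_DISTRIBUTIONS = 80   # 10 compute nodes x 8 distributions, as on the slide
NUM_CUSTOMERS = 10_000   # distinct cust_id values, as on the slide

conn = sqlite3.connect(":memory:")

# One physical table per distribution: sales_1 ... sales_80.
for d in range(1, NUM_DISTRIBUTIONS + 1):
    conn.execute(f"CREATE TABLE sales_{d} (cust_id INTEGER, units INTEGER)")


def distribution_for(cust_id: int) -> int:
    # Assumption: modulo stands in for the real hash distribution function.
    return cust_id % NUM_DISTRIBUTIONS + 1


# Two hypothetical sales rows per customer. Every row for a given cust_id
# lands in the same distribution because the table is distributed on cust_id.
for cust_id in range(NUM_CUSTOMERS):
    d = distribution_for(cust_id)
    conn.executemany(
        f"INSERT INTO sales_{d} VALUES (?, ?)", [(cust_id, 1), (cust_id, 2)]
    )


def run_distributed_group_by():
    """Run the rewritten GROUP BY once per distribution and concatenate."""
    results, rows_per_distribution = [], []
    for d in range(1, NUM_DISTRIBUTIONS + 1):
        rows = conn.execute(
            f"SELECT cust_id, SUM(units) FROM sales_{d} GROUP BY cust_id"
        ).fetchall()
        rows_per_distribution.append(len(rows))
        # No second aggregation step: a cust_id never spans two distributions.
        results.extend(rows)
    return results, rows_per_distribution


results, rows_per_distribution = run_distributed_group_by()
print(len(results), set(rows_per_distribution))
```

Because the grouping column is also the distribution column, each partial result is already final. That is why the slide can label the per-distribution output "direct results".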
Access Data Faster: In-memory for Real-Time
Powered by in-memory columnstore in PDW and SQL Server
[Diagram tables: Customer, Products, Sales, Supplier, Country]
Connecting Islands of Data with PolyBase
Bringing Hadoop point solutions and the data warehouse together for users and IT
Single T-SQL query model for PDW and Hadoop
    Rich features of T-SQL, including joins without ETL
    Enhance query execution performance using the power of massively parallel processing
Open and collaborative platform
    Supports Windows Azure HDInsight to enable new hybrid cloud scenarios
    Query non-Microsoft Hadoop distributions such as Hortonworks and Cloudera
[Diagram: a Select issued from SQL Server Parallel Data Warehouse passes through PolyBase and a result set is returned; Hadoop targets shown are Hortonworks (Windows Server), Cloudera and Hortonworks (Linux), and Microsoft / Windows Azure HDInsight]
1. Export COLD DATA to Hadoop
[Diagram: Hadoop and SQL Server PDW]
5. Combine data from different sources
Query: join between HDFS table and PDW table
    select c.*, o.* from pdwcustomer c, hdfsorders o, Cloud_Twitter ct
    where c.c_custkey = o.o_custkey and ct.name = c.name and o_orderdate < '9/1/2010'
Execution plan:
    1. Run Map job on Hadoop: apply filter to hdfsorders and Cloud_Twitter, put data into temp tables
    2. CREATE otemp, cttemp on PDW compute nodes, distributed on o_custkey
    3. DMS SHUFFLE FROM HDFS on o_custkey: read hdfstemp into otemp, partitioned on o_custkey
    4. RETURN OPERATION: select c.*, o.* from Customer c, otemp o, cttemp ct where c.c_custkey = o.o_custkey and ct.name = c.name
[Diagram: Hadoop and SQL Server PDW]
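The four-step plan on this slide (filter on Hadoop, create temp tables, shuffle on the join key, join locally) can be traced in a few lines of Python. This is a sketch of the data flow only, not of PolyBase itself. The sample rows, `NUM_NODES`, the modulo-based `node_for` function, and the choice to copy the Twitter data to every node are all assumptions made for illustration.

```python
from datetime import date

NUM_NODES = 4  # hypothetical number of PDW compute nodes

# Hypothetical sample data standing in for the three sources on the slide.
pdwcustomer = [  # lives in PDW; assumed here to be distributed on c_custkey
    {"c_custkey": 1, "name": "alice"},
    {"c_custkey": 2, "name": "bob"},
    {"c_custkey": 3, "name": "carol"},
    {"c_custkey": 4, "name": "erin"},
]
hdfsorders = [  # lives in HDFS
    {"o_orderkey": 10, "o_custkey": 1, "o_orderdate": date(2010, 8, 15)},
    {"o_orderkey": 11, "o_custkey": 2, "o_orderdate": date(2010, 9, 20)},
    {"o_orderkey": 12, "o_custkey": 3, "o_orderdate": date(2010, 7, 1)},
    {"o_orderkey": 13, "o_custkey": 4, "o_orderdate": date(2010, 6, 1)},
]
cloud_twitter = [{"name": "alice"}, {"name": "carol"}, {"name": "dave"}]


def node_for(key: int) -> int:
    # Assumption: modulo stands in for the real hash distribution function.
    return key % NUM_NODES


# Step 1: the "map job" on Hadoop applies the date filter before data moves.
hdfstemp = [o for o in hdfsorders if o["o_orderdate"] < date(2010, 9, 1)]

# Step 2: create the temp tables on every compute node.
otemp = {n: [] for n in range(NUM_NODES)}
# Assumption: cttemp is copied whole to each node; the slide does not say.
cttemp = {n: list(cloud_twitter) for n in range(NUM_NODES)}

# Step 3: shuffle the filtered orders to compute nodes by o_custkey.
for o in hdfstemp:
    otemp[node_for(o["o_custkey"])].append(o)


# Step 4: each node joins only its local slices; the union is the result.
def local_join(node: int) -> list:
    customers = [c for c in pdwcustomer if node_for(c["c_custkey"]) == node]
    return [
        {**c, **o}  # select c.*, o.*
        for c in customers
        for o in otemp[node]
        for t in cttemp[node]
        if c["c_custkey"] == o["o_custkey"] and t["name"] == c["name"]
    ]


result = [row for n in range(NUM_NODES) for row in local_join(n)]
print(sorted(r["o_orderkey"] for r in result))
```

Shuffling on `o_custkey` places each order on the node that already holds its matching customer, so no node needs another node's rows to finish the join.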
Microsoft Analytics Platform System
Performance
    Queries 10-100x faster than traditional DW systems
    Optimized for mixed workload & near real-time data analysis
    Enhanced loading, 2+ TB/hour
Simplicity
    Ease of installation, one throat to choke
    Ready to go for immediate load and query = fast time to value
    No indexing, tuning, data sorting or materialized view maintenance
Value
    Non-proprietary, standards-based architecture reduces risk & cost
    Minimal implementation and ongoing administration cost
    Lowest full life cycle TCO
[Diagram labels: Source Systems; SSIS/Existing ETL Tool; Microsoft Analytics Platform System; Microsoft SSAS (ROLAP, MOLAP); Microsoft SSRS; Direct Access; SharePoint Dashboards; SharePoint Scorecards; Excel Workbooks; PowerPivot; Applications]
APS & Power BI better together: APS as an on-premises data hub for Power BI
1. APS integrates Data Management Gateway as a region out of the box, enabling a Tier 1 gateway hub for the enterprise.
2. The gateway registers with Azure and discovers on-premises assets, enabling users to query on-premises data via Power BI.
3. APS scales gateway workload performance/concurrency for complex mashups across sources for Power BI, enabling:
    1. Multiple joins
    2. Large amounts of data
    3. Different data formats/sources
    4. Compute pushdown for Hadoop, etc.
4. APS scales the gateway platform, improving resiliency, HA and management.
[Diagram labels: O365 Power BI; Metadata catalog; Public Internet; Intranet; HDI; Azure DB; Azure; Secure Gateway; APS (PDW, HDI); SQL Assets; Hadoop; 3rd-Party]
Some Customer References (Microsoft Internal)
Value of APS