Big Data Processing: Past, Present and Future
Orion Gebremedhin
National Solutions Director, BI & Big Data, Neudesic LLC
VTSP, Microsoft Corp.
Orion.Gebremedhin@Neudesic.COM
B-orgebr@Microsoft.com
@OrionGM
Topics Covered
  History and Fundamentals of Big Data Processing
  SQL Server for Big Data: Past, Present and Future
  Summary
Characteristics of Big Data
Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.
Characteristics of Big Data: The Vs of Big Data
  Volume: 40 zettabytes (43 trillion gigabytes) of data will be created by 2020, a 300-fold increase from 2005. Most companies in the U.S. have at least 100 TB of data.
  Velocity: The NYSE captures 1 TB of trade information every day. The average modern car has over 100 sensors.
  Variety: Nearly 420 million wearable health monitors are in use. Over 4 billion hours of video are watched on YouTube every day.
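The Volume figures can be sanity-checked with a little arithmetic. The sketch below is not part of the original deck; the function and variable names are illustrative. It suggests that the "43 trillion gigabytes" figure is consistent with binary units (1 ZiB = 2^40 GiB), whereas SI units give exactly 40 trillion GB, and that the 300-fold growth claim implies a 2005 baseline of roughly 133 exabytes.

```python
# Unit-conversion check for the Volume figures quoted on the slide.
# All names here are illustrative, not from the original deck.

GB_PER_ZB_DECIMAL = 10**21 // 10**9   # SI units: 1 ZB = 10^21 bytes, 1 GB = 10^9 bytes
GIB_PER_ZIB_BINARY = 2**70 // 2**30   # binary units: 1 ZiB = 2^70 bytes, 1 GiB = 2^30 bytes


def zettabytes_to_gigabytes(zettabytes, binary=False):
    """Convert zettabytes to gigabytes using SI or binary multiples."""
    factor = GIB_PER_ZIB_BINARY if binary else GB_PER_ZB_DECIMAL
    return zettabytes * factor


decimal_gb = zettabytes_to_gigabytes(40)               # 40 trillion GB
binary_gb = zettabytes_to_gigabytes(40, binary=True)   # about 43.98 trillion GiB

# The slide's "300-fold increase from 2005" implies a 2005 baseline of 40 ZB / 300.
implied_2005_exabytes = 40 * 1000 / 300                # 1 ZB = 1000 EB in SI units

print(f"{decimal_gb:,} GB (SI)")
print(f"{binary_gb:,} GiB (binary)")
print(f"{implied_2005_exabytes:.0f} EB implied for 2005")
```

The gap between the two conversions is about 10 percent, which is worth keeping in mind whenever storage figures from different sources are compared.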
History of Big Data
A big data cluster is a highly interconnected platform built from a collection of commodity parts.
(Source: Disruptive Possibilities by Jeffrey Needham, copyright 2013.)
Scale Up vs. Scale Out
  Scale up (SMP): Upgrade components or buy a bigger server each time. A multiprocessor system in which the processors share resources (operating system, memory, I/O devices) and are connected by a common bus.
  Scale out (MPP): Add nodes to the cluster. Multiple processors, each using its own OS and memory, communicate with one another through some form of messaging interface.
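The scale-out idea can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not how any particular MPP product works: the helper names (`partition`, `local_count`, `merge`, `scale_out_count`) are hypothetical, round-robin distribution stands in for whatever distribution scheme a real system uses, and the "nodes" are plain Python lists rather than separate machines. What it shows is the pattern itself: each node works only on its own partition, and only small partial results are exchanged and merged.

```python
from collections import Counter


def partition(records, n_nodes):
    """Distribute records round-robin across n_nodes independent partitions."""
    parts = [[] for _ in range(n_nodes)]
    for i, record in enumerate(records):
        parts[i % n_nodes].append(record)
    return parts


def local_count(part):
    """Work done on one node, touching only that node's own data."""
    return Counter(part)


def merge(partials):
    """Combine the partial results that the nodes send back."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def scale_out_count(records, n_nodes):
    """Count record occurrences by partitioning, counting locally, then merging."""
    return merge(local_count(part) for part in partition(records, n_nodes))


records = ["a", "b", "a", "c", "a", "b"]
print(scale_out_count(records, n_nodes=3))
```

Adding capacity in this model means raising `n_nodes`; the result is the same for any node count, which is the property that lets a cluster grow by adding nodes instead of replacing the server.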
Notable milestones in commodity hardware
  CDC 6600 by Control Data Corporation: "The 6600 CPU had multiple functional units which could operate simultaneously (i.e., in parallel), allowing the CPU to overlap instructions' execution times." http://en.wikipedia.org/wiki/cdc_6600
  Beowulf cluster (1990s): A computer cluster of what are normally identical, commodity-grade computers networked into a small local area network, with libraries and programs installed that allow processing to be shared among them. http://en.wikipedia.org/wiki/beowulf_cluster
Some Applications of Big Data
Big Data supercomputers are pattern explorers.
  Shopping patterns
  Sensor and intelligent-device data analytics
  Social network associations and suggestions
  Predictive analytics
  Crime investigation
SQL Server for Big Data
SQL Server Optimizations
Microsoft Analytics Platform System
  About Analytics Platform System
  SQL Server Parallel Data Warehouse
  PolyBase
  Microsoft HDInsight
APS Growth Topology
  Scale Unit
  Base Unit
  Base Unit Extension
Introducing the Microsoft Analytics Platform System
  Relational and nonrelational data in a single appliance
  Near real-time performance with In-Memory Columnstore
  Industry's lowest data warehouse appliance price per terabyte
  Enterprise-ready Hadoop
  Integrated querying across Hadoop and PDW using T-SQL
  Direct integration with Microsoft BI tools such as Microsoft Excel
  Ability to scale out to accommodate growing data
  Removal of data warehouse bottlenecks with MPP SQL Server
  Concurrency that fuels rapid adoption
  Value through a single appliance solution
  Value with flexible hardware options using commodity hardware
Deployment options and hybrid solutions
Connecting islands of data with PolyBase
  Provides a single T-SQL query model for PDW and Hadoop with rich features of T-SQL, including joins without ETL
  Uses the power of MPP to enhance query execution performance
  Supports Microsoft Azure HDInsight to enable new hybrid cloud scenarios
  Provides the ability to query non-Microsoft Hadoop distributions, such as Hortonworks and Cloudera
  [Diagram: a T-SQL SELECT issued through PolyBase returns a single result set drawn from SQL Server Parallel Data Warehouse and from Hadoop sources: Microsoft HDInsight, Microsoft Azure HDInsight, Hortonworks for Windows and Linux, and Cloudera]
Microsoft's modern data warehouse
  SQL Server 2014
  Analytics Platform System
  Microsoft Azure HDInsight
  Data Platform
Summary
  Understand your data growth to determine when to scale out.
  Determine the right tool for the workload you have.
Questions and Discussion