BITKOM & NIK - Big Data: Where Are the Opportunities for Small and Medium-Sized Businesses?
The Big Data Buzz
Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. (Wikipedia)
Apache Hadoop, the leading big data analytics technology, won the Media Guardian Innovator of the Year Award
Big Data Analytics
Is there a comprehensive strategy for Big Data in your company?
  No: 63%
  Planned: 23%
  Yes: 14%
Please estimate the yearly growth of data for reporting and analysis:
  Year   Negative growth / No growth   1-25% growth   25%-50% growth   Over 50% growth
  2011   9%                            66%            19%              7%
  2012   4%                            54%            35%              8%
  2013   3%                            48%            36%              13%
Survey with 274 participants from DACH, France, Nordics, Netherlands,
In which areas does your company use Big Data analysis?
  Controlling: 24%
  Marketing: 19%
  Sales: 18%
  IT: 18%
  Production: 17%
  Research and development: 14%
  Supply chain: 7%
What problems have you encountered when using Big Data?
  Inadequate technical know-how: 46%
  Inadequate analytical know-how: 44%
  Lack of compelling business case: 36%
  Technical problems: 34%
  Cost: 33%
  Data privacy issues: 25%
  Cannot make Big Data usable for end users: 15%
The database journey continues: Big Data
Scale-up (SMP): up to 256 cores in Windows today
Scale-out (MPP): Parallel Data Warehouse
But is this already Big Data?
The Large Hadron Collider produces 15 PB/year* http://public.web.cern.ch/public/en/lhc/computing-en.html
But what if my customer doesn't own a Large Hadron Collider?
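To put the 15 PB/year figure into perspective, it can be converted into daily and per-second rates. The following is a rough sketch only: it assumes decimal units (1 PB = 10^15 bytes), a 365-day year, and an evenly spread data flow, none of which the slide states. The function name is chosen for illustration.

```python
# Back-of-the-envelope conversion of a yearly data volume into average rates.
# Assumptions (not from the slide): decimal units, 365-day year, constant flow.

PB = 10**15
TB = 10**12
MB = 10**6
DAYS_PER_YEAR = 365
SECONDS_PER_DAY = 24 * 60 * 60


def rates_from_yearly_volume(petabytes_per_year: float) -> dict:
    """Return the average TB/day and MB/s for a given PB/year volume."""
    bytes_per_day = petabytes_per_year * PB / DAYS_PER_YEAR
    bytes_per_second = bytes_per_day / SECONDS_PER_DAY
    return {
        "tb_per_day": bytes_per_day / TB,
        "mb_per_second": bytes_per_second / MB,
    }


lhc = rates_from_yearly_volume(15)
print(f"{lhc['tb_per_day']:.1f} TB/day, {lhc['mb_per_second']:.0f} MB/s")
# prints: 41.1 TB/day, 476 MB/s
```

Under these assumptions the average is roughly 41 TB per day, which is a sustained rate in the hundreds of megabytes per second.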
Large-scale plants, vehicle fleets, smart grids, green energy, stock exchanges, host protocols, computer centers, web farms, Twitter, Facebook, Google Analytics
"Big data" is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
Source: The Importance of 'Big Data': A Definition, Mark Beyer, Douglas Laney, G00235055
200 million feeds, 100 PB, 20 PB
Social analytics: different sources, data structures, sophisticated algorithms
Up to 75 control units in one car
Up to 1,000 possible optional equipment items
About 15 GB of data on board (incl. navigational data)
Up to 12,000 storage entries for on-board diagnostics
More than 50,000 car diagnoses happening each day
How to deal with the 3 Vs?
Hadoop
The Hadoop Ecosystem (simplified)
Source: Tom White's Hadoop: The Definitive Guide
Mahout: scalable machine learning library that leverages the Hadoop infrastructure
Key use cases:
  Recommendation mining: examine user behavior, build a recommendation model
  Clustering: group data into related topics
  Classification: learn from classified documents to assign categories to unlabeled data
Algorithms: K-means Clustering, Naïve Bayes, Decision Tree, Neural Network, Hierarchical Clustering, Positive Matrix Factorization and more
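To make the clustering use case more concrete, here is a minimal single-machine sketch of K-means (Lloyd's algorithm) in plain Python. It is not the library's API and it does not run on Hadoop; the function name, the 2-D sample points, and the hand-picked starting centroids are all assumptions made for illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Plain K-means on (x, y) tuples; returns (centroids, labels).

    `centroids` are the starting positions (chosen by the caller here,
    which is a simplification; real libraries pick them automatically).
    """
    centroids = list(centroids)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, (x, y) in enumerate(points):
            labels[i] = min(
                range(len(centroids)),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old position if a cluster ends up empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels


# Hypothetical sample data: two well-separated groups of points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)  # prints: [0, 0, 0, 1, 1, 1]
```

A distributed implementation would spread the assignment step across many nodes and combine the partial sums for the update step, which is where the Hadoop infrastructure comes in.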
Hadoop = MapReduce + HDFS
Ecosystem: Zookeeper, Ambari, HCatalog, Oozie, HBase/Cassandra/Couch/MongoDB, Hive, Mahout, R, Cascading, Pig, Flume, Sqoop, HBase (Column DB), Avro
Vendors: Hortonworks, HStreaming, Karmasphere, Splunk, Cloudera, Hadapt, MapR, Datameer
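The MapReduce half of "Hadoop = MapReduce + HDFS" can be illustrated with the classic word count. The sketch below imitates the map, shuffle, and reduce phases in a single Python process; it does not use Hadoop itself, and all function names are made up for this example.

```python
from collections import defaultdict


def map_phase(record):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list of lines."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


lines = ["big data is big", "data about data"]
counts = run_job(lines)
print(counts)  # prints: {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

In a real cluster the input lines would be blocks of a file stored in HDFS, and the mappers and reducers would run in parallel on the nodes that hold the data.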
Use of Big Data
180 PB raw data on more than 40,000 computers (polystructured)*
Biggest Hadoop cluster: 4,500 nodes (2x4 CPUs, 4x1 TB disks, 16 GB RAM)
Ad Impressions: cube with 207 measures, 24 dimensions, 247 attributes
Desktop clients (Excel & Tableau): < 6 s ad hoc query time
* http://wiki.apache.org/hadoop/poweredby
Query engine for SQL & Hadoop
Cost-based optimizer decides between:
  rendering operators as MapReduce jobs, or
  moving HDFS data into RDBMS storage
HDFS Bridge for parallelized data transport
Regular T-SQL Results
PDW V2 & HDFS data nodes
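The kind of trade-off a cost-based optimizer weighs here can be sketched with a toy model. This is not the product's actual optimizer: the cost formula, every parameter, and both plan names are hypothetical and chosen only to show why the decision depends on data size and filter selectivity.

```python
from dataclasses import dataclass


@dataclass
class PlanInputs:
    hdfs_rows: int                  # rows stored in HDFS
    selectivity: float              # fraction of rows that pass the filter (0..1)
    transfer_cost_per_row: float    # hypothetical cost of moving one row into the RDBMS
    mapreduce_startup_cost: float   # hypothetical fixed cost of launching a MapReduce job
    mapreduce_cost_per_row: float   # hypothetical cost of filtering one row in Hadoop


def choose_plan(p: PlanInputs) -> str:
    """Pick the cheaper of two plans under the toy cost model above."""
    # Plan A: import all rows and filter inside the RDBMS
    # (the RDBMS-side filter cost is ignored for simplicity).
    import_all = p.hdfs_rows * p.transfer_cost_per_row
    # Plan B: filter in a MapReduce job, then import only the surviving rows.
    pushdown = (
        p.mapreduce_startup_cost
        + p.hdfs_rows * p.mapreduce_cost_per_row
        + p.hdfs_rows * p.selectivity * p.transfer_cost_per_row
    )
    return "pushdown_to_mapreduce" if pushdown < import_all else "import_into_rdbms"


small = PlanInputs(10_000, 0.5, 1.0, 50_000.0, 0.1)
large = PlanInputs(10_000_000, 0.01, 1.0, 50_000.0, 0.1)
print(choose_plan(small))  # prints: import_into_rdbms
print(choose_plan(large))  # prints: pushdown_to_mapreduce
```

With these made-up numbers, the fixed job start-up cost dominates for the small table, while for the large, highly selective query it pays off to filter in Hadoop before moving any data.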
maturing Not every problem questions simple
© 2012 Microsoft Corporation. All rights reserved.