Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing
Wayne W. Eckerson
Director of Research, TechTarget
Founder, BI Leadership Forum
Business Analytics TechTarget
What comes next?
Kilobyte (KB): 10^3 bytes
Megabyte (MB): 10^6 bytes
Gigabyte (GB): 10^9 bytes
Terabyte (TB): 10^12 bytes
Petabyte (PB): 10^15 bytes
Exabyte (EB): 10^18 bytes
Zettabyte (ZB): 10^21 bytes
Yottabyte (YB): 10^24 bytes
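The unit ladder above can be sketched as a small conversion helper. This is an illustration only: the function name `humanize` and the one-decimal formatting are assumptions, not part of the presentation, and the units are the decimal (power-of-ten) ones listed on the slide.

```python
# Decimal byte units from the slide, largest first, keyed by power of ten.
UNITS = [
    ("YB", 24), ("ZB", 21), ("EB", 18), ("PB", 15),
    ("TB", 12), ("GB", 9), ("MB", 6), ("KB", 3),
]


def humanize(num_bytes: int) -> str:
    """Return num_bytes expressed in the largest unit that fits (hypothetical helper)."""
    for symbol, exponent in UNITS:
        if num_bytes >= 10 ** exponent:
            return f"{num_bytes / 10 ** exponent:.1f} {symbol}"
    return f"{num_bytes} bytes"


print(humanize(1500))
print(humanize(10 ** 15))
```

Each step up the ladder is a factor of 1,000, so a petabyte is a million gigabytes.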
Information explosion
Every 18 months, non-rich structured and unstructured enterprise data doubles
[Chart: data volume by year, 2005 to 2012; series: Unstructured & Content Depot, Structured & Replicated]
Source: IDC Digital Universe 2009; White Paper, Sponsored by EMC, May 2009
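To see what an 18-month doubling period implies, here is a minimal sketch of the compounding arithmetic. It assumes the doubling rate holds steadily over the whole period, which the slide states only as a rule of thumb; the function name `growth_factor` is hypothetical.

```python
def growth_factor(months: float, doubling_months: float = 18.0) -> float:
    """Multiplier on data volume after `months`, assuming steady doubling."""
    return 2.0 ** (months / doubling_months)


# Illustrative only: the 84 months spanned by the chart's axis (2005 to 2012).
print(round(growth_factor(84), 1))
```

Under that assumption, three years means a fourfold increase, and the seven years on the chart's axis work out to roughly a 25-fold increase.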
Data deluge
Structured data
- Call detail records
- Point of sale records
- Claims data
Semi-structured data
- Web logs
- Sensor data
- Email, Twitter
Unstructured data
- Video, audio
- Images, text
Source: A Sea of Sensors, The Economist, Nov 4, 2010
Three Big Data revolutions
- Data warehousing (1995+)
- Analytical platforms (2005+)
- Hadoop ecosystem (2010+)
First revolution: data warehousing
[Diagram: Operational Systems -> ETL -> Data Warehouse -> ETL -> Data Mart -> BI Server -> Reports / Dashboards]
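The data warehousing flow on this slide can be sketched in a few lines. Everything here is hypothetical: the sample rows, the field names, and the function names are invented for illustration, and in-memory lists stand in for real source systems and databases.

```python
from collections import defaultdict

# Hypothetical rows from two operational systems with inconsistent formats.
ORDERS_SYSTEM_A = [
    {"region": "east", "amount": "100.50"},
    {"region": "WEST", "amount": "20"},
]
ORDERS_SYSTEM_B = [{"region": "East ", "amount": "9.50"}]


def extract():
    """Pull raw rows from every source system."""
    return ORDERS_SYSTEM_A + ORDERS_SYSTEM_B


def transform(rows):
    """Conform region names and convert amounts from text to numbers."""
    return [
        {"region": row["region"].strip().lower(), "amount": float(row["amount"])}
        for row in rows
    ]


def load(rows, warehouse):
    """Append conformed rows to the warehouse (a plain list in this sketch)."""
    warehouse.extend(rows)
    return warehouse


def build_data_mart(warehouse):
    """Second ETL step: aggregate warehouse rows into a subject-specific summary."""
    totals = defaultdict(float)
    for row in warehouse:
        totals[row["region"]] += row["amount"]
    return dict(totals)


warehouse = load(transform(extract()), [])
sales_by_region = build_data_mart(warehouse)
print(sales_by_region)
```

The point of the sketch is the ordering: data is cleaned and conformed before it is stored, so every report built on the data mart sees the same definitions.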
Second revolution: analytical platforms
Purpose-built database management systems designed explicitly for query processing and analysis that provide dramatically higher price/performance and availability compared to general-purpose solutions.
1010data, Aster Data (Teradata), Calpont, Datallegro (Microsoft), Exasol, Greenplum (EMC), IBM SmartAnalytics, Infobright, Kognitio, Netezza (IBM), Oracle Exadata, Paraccel, Pervasive, Sand Technology, SAP HANA, Sybase IQ (SAP), Teradata, Vertica (HP)
Deployment Options
- Software only (Paraccel, Vertica)
- Appliance (SAP, Exadata, Netezza)
- Hosted (1010data, Kognitio)
Game-changing technology
Purpose built
- For analytics in general
- For specific analytic workloads
Quicker to deploy
- Preconfigured and tuned
- Fast ROI
Faster and more scalable
- Faster query response times
- Linear performance
Built-in analytics
- Libraries of functions
- Extensible SDK
Less costly
- Less power, cooling, space
- Fewer people to maintain
Business value of analytic platforms
Kelley Blue Book
- Consolidates millions of auto transactions each week to calculate car valuations
AT&T Mobility
- Tracks purchasing patterns for 80M customers daily to optimize targeted marketing
CBS Interactive
- Analyzes Web visitor behavior to optimize content/ad placement and revenue
Platforms shown: analytical appliance; MPP analytical database; Hadoop + analytical database
Third revolution: Hadoop
- Open source projects
- Hosted by Apache Foundation
- Initially developed by Google, Yahoo, etc.
- Offers scale-out architecture on commodity servers with direct-attached storage
Hadoop distilled
Key terms: BIG DATA, Open Source, $$, Data scientist, Unstructured data, MapReduce, Distributed File System, Schema at Read
Benefits
- Any data
- Agile
- Expressive
- Affordable
Drawbacks
- Immature
- Batch oriented
- Security, concurrency, metadata, etc.
- Expertise
- TCO?
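Two of the terms on this slide, MapReduce and schema at read, can be illustrated with a single-process sketch. This is not the Hadoop API: the log lines, field layout, and function names are all hypothetical, and a real job would run the map and reduce phases in parallel across a cluster.

```python
from collections import defaultdict

# Hypothetical raw log lines, stored as-is with no schema applied on write.
RAW_LOG_LINES = [
    "2012-04-01 GET /home 200",
    "2012-04-01 GET /cart 500",
    "2012-04-02 POST /cart 200",
]


def map_phase(line):
    """Schema at read: structure is imposed only when the raw line is parsed."""
    date, method, path, status = line.split()
    yield status, 1


def shuffle(pairs):
    """Group every emitted value under its key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


status_counts = run_job(RAW_LOG_LINES)
print(status_counts)
```

Because the parsing lives in the map function rather than in the storage layer, a different question about the same files only needs a different map function, which is the agility the slide lists as a benefit.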
Hadoop hype
Overheard
- Hadoop will replace relational databases.
- Hadoop will replace data warehouses.
- Hadoop has a superior query engine compared to analytical platforms.
- Use Hadoop for any application that requires more than one node.
[Figure: Gartner Group Hype Cycle]
Hadoop adoption rates
- No plans: 38%
- Considering: 32%
- Experimenting: 20%
- Implementing: 5%
- In production: 4%
Based on 158 respondents, BI Leadership Forum, April 2012
Hadoop workloads
Workload                Today   In 18 months
Staging area            83%     92%
Online archive          92%     92%
Transformation engine   92%     92%
Ad hoc queries          58%     67%
Scheduled reports       42%     67%
Visual exploration      25%     67%
Data mining             58%     83%
Based on respondents that have implemented Hadoop. BI Leadership Forum, April 2012
Hadoop's impact on the data warehouse
- Replaces it: 0%
- Offloads existing workloads: 50%
- Handles new workloads: 67%
- Shares existing workloads: 33%
- Shares new workloads: 25%
- Don't know: 8%
Based on respondents that have implemented Hadoop. BI Leadership Forum, April 2012
BI Framework 2020
[Diagram: four types of intelligence, with their users, tools, and architectures]
Business Intelligence
- End-user tools: reports and dashboards
- Design framework: MAD dashboards
- Architecture: data warehousing
Analytics Intelligence
- Power users, exploration
- Ad hoc query, spreadsheets, OLAP, visual analysis, analytic workbenches, Hadoop
- Excel, Access, OLAP, data mining, visual exploration
- Analytic sandboxes, ad hoc SQL
Continuous Intelligence
- Event-driven alerts and dashboards
- CEP, streams
- Event detection and correlation
Content Intelligence
- Keyword search, BI tools, Xquery, Hive, Java, etc.
- MapReduce, XML schema, key-value pairs, graph notation, etc.
- HDFS, NoSQL databases
Other diagram labels: Reporting & Analysis, Data Warehousing, Analytic Sandboxes, Event-driven, Dashboard Alerts
BI Framework
TOP DOWN - Business Intelligence (Data Warehousing Architecture)
- Corporate Objectives and Strategy
- Reporting & Monitoring (Casual Users)
- Predefined Metrics
- Non-volatile Data
Pros:
- Alignment
- Consistency
Cons:
- Hard to build
- Politically charged
- Hard to change
- Expensive
- Schema heavy
Reports Beget Analysis; Analysis Begets Reports
Analytics Architecture
- Ad hoc queries
- Analysis and Prediction (Power Users)
- Processes and Projects
- Volatile Data
Pros:
- Quick to build
- Politically uncharged
- Easy to change
- Low cost
Cons:
- Alignment
- Consistency
- Schema light
The new analytical ecosystem
[Diagram: the analytical ecosystem]
Data sources: operational systems (structured data), machine data, web data, audio/video data, external data, documents and text
Processing: extract, transform, load (batch, near real-time, or real-time); streaming/CEP engine; Hadoop cluster
Top-down architecture: data warehouse, departmental data mart, BI server, casual user
Bottom-up architecture: virtual sandboxes, in-memory sandbox, free-standing sandbox, analytic platform or nonrelational database, power user
www.bileadership.com
Analytical sandboxes
[Diagram: the same analytical ecosystem as the previous slide; sandbox types shown: virtual sandboxes, in-memory sandbox, free-standing sandbox]
www.bileadership.com
Recommendations
- Your BI architecture is now an analytical ecosystem
- Deploy analytical platforms to turbo-charge performance
- Explore Hadoop for big data
- Reconcile top-down and bottom-up BI environments
Questions?
Wayne Eckerson
weckerson@techtarget.com
Hadoop ecosystem
[Figure: Hadoop ecosystem. Courtesy of Hortonworks, 2012.]