Hadoop and Data Warehouse Friends, Enemies or Profiteers? What about Real Time? Kai Wähner kwaehner@tibco.com @KaiWaehner www.kai-waehner.de
Disclaimer! These opinions are my own and do not necessarily represent my employer
Key Messages Big Data is not just Hadoop, concentrate on Business Value! A good Big Data Architecture combines DWH, Hadoop and Real Time! The Integration Layer is getting even more important in the Big Data Era!
Agenda Terminology Data Warehouse and Business Intelligence Big Data Processing with Hadoop Fast Data Processing in Real Time
Agenda Terminology Data Warehouse and Business Intelligence Big Data Processing with Hadoop Fast Data Processing in Real Time
Big Data Architecture: Hadoop + DWH / BI + Real Time
DWH means analyzing structured data http://www.exforsys.com/tutorials/msas/data-warehouse-database-and-oltp-database.html
Big Data means analyzing everything: store everything, even without structure, and use whatever you need (now or later) http://blogs.teradata.com/international/tag/hadoop/
What is Big Data? The combined Vs of Big Data: Volume (terabytes, petabytes) x Velocity (real time) x Variety (social networks, blog posts, logs, sensors, etc.) = Value
Real Time Wikipedia definition: "Real-time programs must guarantee response within strict time constraints, often referred to as 'deadlines'. Real-time responses are often understood to be in the order of milliseconds, and sometimes microseconds. The term 'near real time' refers to the time delay introduced by automated data processing or network transmission. The distinction between the terms 'near real time' and 'real time' is somewhat nebulous and must be defined for the situation at hand." For this talk, I define: Real time == response in nanoseconds / microseconds / milliseconds; Near real time == response time > one second
Agenda Terminology Data Warehouse and Business Intelligence Big Data Processing with Hadoop Fast Data Processing in Real Time
Big Data Architecture: Hadoop + DWH / BI + Real Time
DWH vs. BI Data Warehouse (DWH) == storage; Business Intelligence (BI) == analytics. Both terms are often used as synonyms, i.e. when someone talks about a DWH, this might include analytics. BI can be used without a DWH.
Typical DWH Process A DWH is business case driven: Reporting, Dashboards, Drill Down, Analytics. Different DWH options: Enterprise DWH (== EDW), Department / Project DWH, Embedded BI (into applications) http://wikibon.org/blog/not-your-fathers-data-analytics/
BI == Reporting + Statistics + Data Discovery
BI Visualization
Products DWH SQL: e.g. MySQL; MPP: e.g. Teradata, EMC Greenplum, IBM Netezza (scale very well, almost linearly, with very high performance, but hardware / software costs also increase a lot). BI Microsoft Excel; BI tools: e.g. TIBCO Spotfire, Tableau, MicroStrategy. Hint: good BI tools allow data discovery / visualization using different sources (not just the DWH) and are easy to use.
BI Tool Example: TIBCO Spotfire
DWH - Real World Use Case http://spotfire.tibco.com/assets/bltef8a0cfc133c4cdf/zipcar.pdf
Embedded BI - Real World Use Case https://www.jaspersoft.com/embeddedshowcase/periscope.html
Problems of a DWH No flexibility / agility: just structured data, just some (maybe aggregated) history data, just good for already known business cases. Low speed: ETL is batch and usually takes hours or sometimes even days; no proactive reactions possible ("too late architecture"). High costs (per GB): just selected data; too old data is often outsourced to archives.
DWH vs. Big Data http://martinfowler.com/bliki/datalake.html
Agenda Terminology Data Warehouse and Business Intelligence Big Data Processing with Hadoop Fast Data Processing in Real Time
Big Data Architecture: Hadoop + DWH / BI + Real Time
Why no longer DWH, but Hadoop? Hadoop was built to solve the problems of RDBMS and DWH. Benefits of Hadoop: store and analyze all data (all data == not just selected, maybe aggregated, data; all data == structured + semi-structured + unstructured); be more flexible and adapt to changing business cases; better performance (massively parallel); ad hoc data discovery also for big data volumes; save money (commodity hardware, open source software).
What is Hadoop? Apache Hadoop, an open-source software library, is a framework that allows for the distributed processing of large data sets across clusters of commodity hardware using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
MapReduce Simple example: Input: (very large) text files with lists of strings, such as: 318, 0043012650999991949032412004...0500001N9+01111+99999999999... We are interested in just some of the content: the year and the temperature embedded in each record. The MapReduce job has to compute the maximum temperature for every year.
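The two phases of this job can be sketched in plain Python. This is only an illustration of the map/reduce idea, not real Hadoop code: the record layout is simplified to pre-parsed (year, temperature) pairs, since the exact field offsets of the raw records are not given on the slide.

```python
from collections import defaultdict

# Pre-parsed stand-ins for the raw weather records on the slide
# (temperature values are assumed to be in tenths of a degree).
records = [
    ("1949", 111),
    ("1949", 78),
    ("1950", 22),
    ("1950", 0),
]

def map_phase(records):
    """Map: emit one (year, temperature) key/value pair per record."""
    for year, temp in records:
        yield year, temp

def reduce_phase(pairs):
    """Reduce: for each year (key), keep the maximum temperature."""
    maxima = defaultdict(lambda: float("-inf"))
    for year, temp in pairs:
        maxima[year] = max(maxima[year], temp)
    return dict(maxima)

result = reduce_phase(map_phase(records))
print(result)  # {'1949': 111, '1950': 22}
```

In real Hadoop, the framework shuffles and groups the mapper output by key across the cluster before the reducers run; the single-process sketch above collapses that into one dictionary.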
Hadoop Products Apache Hadoop == MapReduce + HDFS + Ecosystem (few features included)
Hadoop Ecosystem
Hadoop Products Apache Hadoop (MapReduce, HDFS, Ecosystem) + Hadoop Distribution (Packaging, Deployment Tooling, Support) (more features included)
Hadoop Distributions: e.g. Cloudera, Hortonworks, MapR, Amazon EMR (more available)
Hadoop Products Apache Hadoop (MapReduce, HDFS, Ecosystem) + Hadoop Distribution (Packaging, Deployment Tooling, Support) + Big Data Suite (Tooling / Modeling, Code Generation, Scheduling, Integration) (many features included)
Big Data Integration Suite: TIBCO BusinessWorks
Hadoop Real World Use Case: Replace ETL to improve Performance "The advantage of their new system is that they can now look at their data [from their log processing system] in any way they want: Nightly MapReduce jobs collect statistics about their mail system such as spam counts by domain, bytes transferred and number of logins." Benefit: improved speed compared to typical ETL. "When they wanted to find out which part of the world their customers logged in from, a quick [ad hoc] MapReduce job was created and they had the answer within a few hours. Not really possible in your typical ETL system." http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data (no TIBCO reference)
Hadoop Real World Use Case: Storage to reduce Costs Global parcel service: a lot of data must be stored forever, and the volumes increase exponentially. Goal: store it as cheaply as possible. Problem: queries must still be possible (compliance!). Solution: commodity servers and Hadoop for querying. http://archive.org/stream/bigdataimpraxiseinsatz-szenarienbeispieleeffekte/big_data_bitkom-leitfaden_sept.2012#page/n0/mode/2up (no TIBCO reference)
DWH or Hadoop?
          DWH                                        Hadoop
Data      Structured                                 All data
Maturity  Established in the enterprise              New concepts
Tooling   Installed, good knowledge and experience   New tools, coding required; business can still use SQL-like queries or the same BI tool
Costs     High (per GB)                              Low (per GB)
DWH plus Hadoop? DWH and Hadoop complement each other very well Store all data in Hadoop (cheap per GB) ETL from Hadoop to DWH (expensive per GB) Create specific reports / dashboards in DWH (leverage existing products and knowledge) Do ad hoc (big) data discovery directly in Hadoop, no DWH needed Good BI tools support both DWH and Hadoop! For example, TIBCO Spotfire has connectors to: RDBMS (e.g. MySQL) MPP (e.g. Teradata, IBM Netezza, Greenplum) Hadoop (e.g. Hive, Impala) In-Memory (e.g. TIBCO ActiveSpaces, SAP HANA)...
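The "store everything cheaply, ETL only summaries into the expensive DWH" split can be sketched in a few lines. This is a toy illustration only: the `raw_events` list stands in for files in HDFS, an in-memory SQLite database stands in for the DWH, and all names and sample data are made up.

```python
import sqlite3
from collections import Counter

# Stand-in for raw events kept in the cheap storage layer (Hadoop role).
raw_events = [
    ("2014-05-01", "login", "DE"),
    ("2014-05-01", "login", "US"),
    ("2014-05-01", "purchase", "DE"),
    ("2014-05-02", "login", "DE"),
]

# Aggregate in the cheap layer: logins per day and country.
counts = Counter((day, country)
                 for day, event, country in raw_events if event == "login")

# ETL: load only the small summary into the DWH (SQLite as stand-in)
# and run the reporting query there.
dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE daily_logins (day TEXT, country TEXT, logins INT)")
dwh.executemany("INSERT INTO daily_logins VALUES (?, ?, ?)",
                [(day, country, n) for (day, country), n in counts.items()])

rows = dwh.execute(
    "SELECT day, country, logins FROM daily_logins ORDER BY day, country"
).fetchall()
print(rows)
# [('2014-05-01', 'DE', 1), ('2014-05-01', 'US', 1), ('2014-05-02', 'DE', 1)]
```

The point of the design: the raw, per-event data never enters the expensive-per-GB DWH; only the aggregate does, while ad hoc questions against the raw data stay possible in the cheap layer.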
Recommendation DWH vs. Hadoop vs. NoSQL Short term: use Hadoop (only) when you can save (a lot of) money or when you cannot solve your business problem without Hadoop. A lot of things still have to be improved, e.g. governance, security, performance, and tool support. Long term: Hadoop can replace the DWH (as of today, you can already create a DWH on top of Hadoop with a SQL interface)! Be aware: a lot of other options have emerged for analyzing big data besides Hadoop, e.g. analytical databases with SQL interface (MemSQL, Citus Data), log analytics (Splunk, TIBCO LogLogic), graph databases (Neo4j, InfiniteGraph), Cassandra, MongoDB, you name it...
Vendor Strategy... Hadoop vendors push Hadoop as a DWH replacement, called e.g. Enterprise Data Hub (Cloudera) or Data Lake (Hortonworks) http://gigaom.com/2013/10/29/clouderas-plan-to-become-the-center-of-your-data-universe/ http://hortonworks.com/wpcontent/uploads/downloads/2013/04/hortonworks.apachehadooppatternsofuse.v1.0.p
Vendor Strategy... MPP / DWH vendors add Hadoop support as a complementary add-on to their DWH. Reason (probably): market pressure! Benefit: one platform (including tooling and support) for DWH and Hadoop ("SQL-for-everything")
Example: EMC combines DWH and Hadoop http://wikibon.org/wiki/v/emc_integrates_greenplum_db_and_hadoop_with_pivotal_hd http://www.gopivotal.com/big-data/pivotal-hd
Example: Teradata combines DWH and Hadoop http://www.teradata.com/teradata-enterprise-access-for-hadoop/ http://gigaom.com/2014/04/07/teradata-says-hadoop-is-good-for-business-but-for-how-long/
Hadoop evolving from Batch to Near Real Time
Hadoop is MapReduce == batch (hours, minutes, seconds): good for complex transformations / computations of big data volumes, not so good for ad hoc data exploration. Improvements: Hive Stinger (Hortonworks) etc.
Non-MapReduce processing engines added in the meantime (YARN makes it possible): ad hoc data discovery (== seconds); Hive / Pig with Apache Tez replacing MapReduce under the hood for data processing; new query engines, e.g. Impala (Cloudera) or Apache Drill (MapR); MPP vendors (e.g. Teradata, EMC Greenplum) also add their own query engines. They offer fast data exploration (without MapReduce): "SQL-for-everything".
Some Hadoop problems remain: no good, easy tooling in the Hadoop ecosystem (might be solved in the next years); missing maturity, alpha / beta versions (might be solved in the next years); commodity hardware no longer sufficient with these new emerging technologies (for instance, SQL-on-Hadoop solutions require a lot of memory); no real time (== ms, ns), but near real time (> 1 sec): still a "too late architecture".
Agenda Terminology Data Warehouse and Business Intelligence Big Data Processing with Hadoop Fast Data Processing in Real Time
Big Data Architecture: Hadoop + DWH / BI + Real Time
Real Time: The Two-Second Advantage "A little bit of the right information, just a little bit beforehand, whether it is a couple of seconds, minutes or hours, is more valuable than all of the information in the world six months later. This is the two-second advantage." Vivek Ranadivé, Founder and CEO of TIBCO
The Value of Data decreases over Time From Business Event to Data Ready for Analysis, Analysis Completed, Decision Made and Action Taken, the value of acting drops from $$$$ to $. Event processing speeds action and increases business value by seizing opportunities while they matter.
What is Big Data? The combined Vs of Big Data: Volume (terabytes, petabytes) x Velocity (real time) == "Fast Data"; plus Variety (social networks, blog posts, logs, sensors, etc.)
Complex Event / Stream Processing / In-Memory Concepts: Streams (monitoring millions of events in a specific time window to react proactively); Stateful (collect, filter and correlate events with state to anticipate outcomes and react proactively); Transactional (highly performant transactional event processing). Products vs. Frameworks: products are mature, mission-critical and in production, e.g. TIBCO StreamBase, IBM InfoSphere Streams; open source frameworks, e.g. Apache Spark and Apache Storm (the future will tell us about performance, tooling, support, etc.), can be combined with Hadoop and are complementary to products such as TIBCO StreamBase. In-Memory: can also be used for big data (terabytes possible!); usually complementary, i.e. it can, or even has to, be combined with stream processing / complex event processing.
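The core "streams" concept above, evaluating a condition over a sliding time window as events arrive, can be sketched in a few lines of plain Python. This is a minimal illustration of the windowing idea, not the API of StreamBase, Storm or Spark; the class and parameter names are made up.

```python
from collections import deque

class SlidingWindow:
    """Keep events from the last `window_seconds` and check a simple
    condition (at least `threshold` events in the window) on each arrival,
    continuous-query style."""

    def __init__(self, window_seconds, threshold):
        self.window = window_seconds
        self.threshold = threshold
        self.events = deque()  # (timestamp, payload) pairs, oldest first

    def on_event(self, ts, payload):
        self.events.append((ts, payload))
        # Evict events that fell out of the time window.
        while self.events and ts - self.events[0][0] > self.window:
            self.events.popleft()
        # Fire while the data is still fresh (the proactive reaction).
        return len(self.events) >= self.threshold

w = SlidingWindow(window_seconds=5, threshold=3)
print(w.on_event(0.0, "a"))   # False
print(w.on_event(1.0, "b"))   # False
print(w.on_event(2.0, "c"))   # True: 3 events within 5 seconds
print(w.on_event(10.0, "d"))  # False: the earlier events have expired
```

Real engines generalize this in two directions the sketch ignores: windows are evaluated against many keys (per customer, per instrument) at once, and the condition is an arbitrary continuous query rather than a count.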
Stream Processing Architecture (Example: TIBCO StreamBase) Connect to streams (orders / executions, market data); the StreamBase continuous query processor computes trading signals and transaction costs; TIBCO Live Datamart provides active tables with snapshot AND always-live updates, continuous queries, ad hoc queries, alerts and alert settings. Goal: anticipate opportunities, take proactive action.
Example: TIBCO StreamBase Tooling StreamBase Development Studio Visual Development Visual Debugging Feed Simulation Unit Testing StreamBase Live Datamart Real Time Analytics and Visualization Ad hoc queries Alerts and Notifications Web, Mobile and API Integration
Some Fast Data Use Cases Algorithmic trading (trading), fraud detection (finance), predictive sensor analytics (manufacturing), continuous network analytics (telecom), omni-channel sales (retail). Fast Data use cases show up everywhere, not just in trading! Let's take a closer look at one example.
"The future of retail technology is real-time and event driven." CIO of a leading retailer
[Diagram: a "psychological router" matches real-time customer context (inventory, location, spend, last experience, browser type, app version, each with a match probability) to the best next interaction: "Nice to see you again!"] Copyright 2000-2013 TIBCO Software Inc.
The Event-Driven Retail Reference Architecture Components: real-time customer interaction, event-driven payments, sentiment analytics & alerting, live promotions & pricing, program / campaign & offer management, wallet, loyalty points, event-driven virtual customer image, and an event-driven inventory fabric, connected to external systems, CRM, inventory, warehouse and store.
Retailers want to treat their stores like warehouses... Demand (from the ESB), Inventory (from In-Memory) and Cross-Sell Aggression (from correlation rules) are combined into Action (dynamic rules).
Real Time plus Hadoop? Hadoop: storage, complex computing (MapReduce). Real Time: immediate (proactive) reactions, monitoring streaming data in real time. Example: TIBCO StreamBase and its Apache Flume connector for reading streaming data from, or sending streaming data to, Hadoop / HDFS.
Real Time plus Hadoop Real World Use Case Use case: predict pricing movements in live bets http://www.casestudyu.com/news/2014/04/04/7762652.htm Hadoop: store all historical information about all past bets; use MapReduce to precompute odds for new matches, based on all historical data. TIBCO StreamBase: compute new odds in real time to react to events within a live game (e.g. when a team scores a goal); monitor stream data in real-time dashboards http://vimeo.com/91461315
Streaming Algorithm (example rule): WHEN 5 KEY BOOKIES RAISE THE SAME ODDS IN A 5-SECOND WINDOW, BET LESS
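That rule maps directly onto the sliding-window pattern. The sketch below shows it in plain Python; the window size, bookie count and signal are taken from the slide, while the function names and odds encoding are invented for illustration and have nothing to do with the actual StreamBase implementation.

```python
from collections import deque

WINDOW_SECONDS = 5.0  # "in a 5-second window"
KEY_BOOKIES = 5       # "5 key bookies"

def make_detector():
    events = deque()  # (timestamp, bookie, odds) inside the time window

    def on_odds_raise(ts, bookie, odds):
        events.append((ts, bookie, odds))
        # Evict events older than the 5-second window.
        while events and ts - events[0][0] > WINDOW_SECONDS:
            events.popleft()
        # Count distinct bookies that raised to the same odds.
        raising = {b for _, b, o in events if o == odds}
        return "BET LESS" if len(raising) >= KEY_BOOKIES else None

    return on_odds_raise

detect = make_detector()
for i, bookie in enumerate(["b1", "b2", "b3", "b4"]):
    assert detect(i * 0.5, bookie, 2.10) is None
print(detect(2.5, "b5", 2.10))  # BET LESS: 5 bookies within 5 seconds
```

Using a set of bookie names means the same bookie raising twice does not trigger the rule early, and evicting by timestamp means a fifth raise arriving after the window has passed stays silent.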
Reference Architecture: Streaming Betting Analytics A global, distributed infrastructure: betting lines, scores and news arrive via a bus at the event processing layer (monitor, aggregate, correlate), backed by caches and Hadoop (context: historical betting data, odds, outcomes). Real-time analytics (predictive odds analytics, historical odds deviations, historical comparison) are delivered via TIBCO StreamBase LiveView: zero-latency betting analytics.
Recap: Big Data Architecture: Hadoop + DWH / BI + Real Time
Off Topic What about Integration?
Off Topic Integration is not a topic of this session. However: it gets even more important in the future! The number of different data sources and technologies increases even more than in the past. CRM, ERP, Host, B2B, etc. will not disappear. DWH, Hadoop cluster, event / streaming server and In-Memory DB have to communicate. Cloud, Mobile and the Internet of Things are not an option, they are our future!
Recap: Key Messages Big Data is not just Hadoop, concentrate on Business Value! A good Big Data Architecture combines DWH, Hadoop and Real Time! The Integration Layer is getting even more important in the Big Data Era!
Questions? Kai Wähner kwaehner@tibco.com, @KaiWaehner, www.kai-waehner.de