BIG DATA AND THE ENTERPRISE DATA WAREHOUSE WORKSHOP Business Analytics for All Amsterdam - 2015
Value of Big Data is Being Recognized Executives beginning to see the path from data insights to revenue and profit Big data often illuminates behavior Analytic sandbox results taken directly to management Data is becoming an asset on the balance sheet Valued based on Cost to produce the data Cost to replace the data if it is lost Revenue or profit opportunity provided by the data Revenue or profit loss if data falls into competitors hands Legal exposure from fines and lawsuits if data is exposed to the wrong parties
Big Data Analytic Use Cases Behavior tracking Search ranking Ad tracking Location and proximity tracking Causal factor discovery Social Media Share of voice, audience engagement, conversation reach, active advocates, advocate influence, advocacy impact, resolution rate, resolution time, satisfaction score, topic trends, sentiment ratio, and idea impact Financial account fraud detection/intervention System hacking detection/intervention On line game gesture tracking
More Sample Big Data Use Cases Non-numeric data and unique algorithms Document similarity testing Genomics analysis Cohort group discovery Satellite image comparison CAT scan comparisons Big science data collection Complex numeric data Smart utility meters Building sensors In flight aircraft status
The Challenge Legacy RDBMS s have been hugely successful And we will continue to use them The traditional pure relational data warehouse architecture can t handle some of the use cases New strategic data types Unstructured, semi-structured, machine data Evolving, just-in-time schema Links, images, geo-positions, genomes, log files
The Larger Picture Why Hadoop as Part of EDW? EDW mission unchanged: Publish the Right Data Tactical Lowered operational costs Linear scalability Reliable, redundantly store data archive Improve ETL SLA s Strategic Support variety of new data sources New analytics near impossible in RDBMS Schema on read for exploratory analysis Archive - Keep granular data forever
Hadoop Data Warehouse Configurations Hadoop as an ETL engine Text and numbers loaded into a conventional RDBMS Leverage Hadoop and data lake/data hub to feed EDW Extended RDBMS environment Extend the current formidable RDBMS legacy Add features/functions to address big data analytics Hadoop as a destination environment Direct querying of text and numbers and other data types with augmented SQL in Hive, Impala or equivalent Simultaneous co-existence of same data sets used by NoSQL analytic applications Active archive of all data, forever Complementary analytic sandbox Stand up Hadoop environment next to EDW environment
Extended Relational Data Warehouse with Big Data Additions
Hadoop Can Support a Multi-Modal Destination EDW: Enterprise Data Hub Hadoop Distributed File System HDFS Sources: Transactions Free text Images Machines/ Sensors Links/ Networks Commodity hardware Profoundly distributed Scalable to infinity PB+ Agnostic content Implemented in standard Java High speed streaming in/out Fault tolerant Write once, always replicated HDFS is NOT a database, it is a file system Flume, Sqoop, ETL Tools: Files: Metadata: HCatalog SQL Apps: Hive Impala BI Tools: Tableau, Cognos, Bus Obj,
10 Complementary Analytic Sandbox Credit: Bill Schmarzo, EMC, https://infocus.emc.com/william_schmarzo/schmarzos-2015-bigdata-predictions/
Implement Caches of Increasing Latency Each cache supports different class of applications Data quality improves left to right High value facts and dimensions pushed right to left
Focus on the Integration Challenge Data warehouse integration is drilling across: Establish conformed attributes (e.g., Customer Category) in each database Fetch separate answer sets from different platforms grouped on the conformed attributes Sort-merge the answer sets at the BI layer
Comparing the Two Architectures Relational DBMSs Proprietary, mostly Open source Expensive Less expensive MapReduce/Hadoop Data requires structuring Data does not require structuring Great for speedy indexed lookups Great for massive full data scans Deep support for relational Indirect support for relational semantics semantics, e.g. Hive Indirect support for Deep support for complex data structures complex data structures Indirect support for iteration, Deep support for iteration, complex branching complex branching Deep support for Little or no support for transaction processing transaction processing
Whither the Data Warehouse? Corral and embrace the big data analysts Analysts and IT must meet each other half way Insist on using shared data warehouse resources Conformed dimensions - integrate now, avoid data silos! Virtualized data sets - quick prototypes, migrate to prod. Build a cross department analytic community with IT partnership Big Data and the EDW Hadoop becomes partnered with the EDW Hadoop to support new data types and analytics BI tools must step up to support the final integration
End