Big Data Can Drive the Business and IT to Evolve and Adapt
Ralph Kimball
Bussum, 2014
Ralph Kimball Associates 2014
The Data Warehouse Mission
- Identify all possible enterprise data assets
- Select those assets that have actionable content and can be accessed
- Bring the data assets into a logically centralized enterprise data warehouse
- Expose those data assets most effectively for decision making
Enormous RDBMS Legacy
- Legacy RDBMSs have been spectacularly successful, and we will continue to use them
- Too successful: if all you have is a hammer, everything looks like a nail
- The RDBMS dilemma: an ocean of new data types is being monetized for strategic advantage
  - Unstructured, semi-structured, and machine data
  - Evolving schemas, just-in-time schemas
  - Links, images, genomes, geo-positions, log data
Houston, We Have a Problem
- Traditional RDBMSs cannot handle:
  - The new data types
  - Extended analytic processing
  - Terabytes-per-hour loading with immediate query access
- We want to use SQL and SQL-like languages, but we don't want the RDBMS storage constraints
- The disruptive solution: Hadoop
The Data Warehouse Stack in Hadoop
- Hadoop is an open-source distributed storage and processing framework
- To understand how data warehousing is different in Hadoop, start with this powerful architecture difference:
Hadoop for Exploratory DW/BI
- Sources: transactions, free text, images, machines/sensors, links/networks, EDW overflow
  - Purpose-built for EXTREME I/O speeds; use an ETL tool or Sqoop
- HDFS files: industry-standard HW; fault tolerant; replicated; write once(!); agnostic content; scalable to infinity
- Metadata (system table): HCatalog; all clients can use this to read files
- Query engines: Hive SQL, Impala SQL, others
  - These are query engines, not databases! Query engines can access HDFS files before ETL
- BI tools: Tableau, Business Objects, Cognos, QlikView, others
  - BI tools are the ultimate glue integrating EDW resources
Data Load to Query in One Step
1. Copy into HDFS with an ETL tool, Sqoop, or Flume into standard HDFS files (write once), registering metadata with HCatalog
2. Declare the query schema in Hive or Impala (no data copying or reloading)
3. Immediately launch familiar SQL queries: exploratory BI
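The schema-on-read idea behind these three steps can be sketched in plain Python (a minimal illustration of the pattern, not Hive or Impala themselves; the file contents and column names are hypothetical): the raw data is written once and never reshaped, and a schema is applied only at read time.

```python
import csv
import io

# Raw data lands in the "file" exactly as it arrived: write once, no schema yet.
raw_file = io.StringIO("101,2014-03-01,49.95\n102,2014-03-02,15.00\n")

# The schema is declared at query time, not at load time (schema on read).
schema = [("order_id", int), ("order_date", str), ("amount", float)]

def read_with_schema(f, schema):
    """Apply the declared schema to each raw row as it is read."""
    for row in csv.reader(f):
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}

rows = list(read_with_schema(raw_file, schema))
print(rows[0]["amount"])  # 49.95
```

Changing the schema here means changing only the `schema` declaration; the raw file is never rewritten, which is exactly why exploratory BI can start immediately after the copy step.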
Typical Large Hadoop Cluster
- 100 nodes (5 racks)
- Each node:
  - Dual hex-core CPU running at 3 GHz
  - 64-378 GB of RAM
  - 24-36 TB disk storage (6-10 TB effective storage with default redundancy of 3x)
- Overall cluster(!):
  - 6.4-37.8 TB of RAM (wow, think about this)
  - Up to a PB of effective storage
  - Approximate fully loaded cost per TB: $1000 +/-
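The cluster-level numbers above follow directly from the per-node numbers; a back-of-the-envelope check using only the slide's own figures (note the slide's 6-10 TB effective per node is a bit below raw-disk/3, leaving working headroom beyond pure 3x replication):

```python
nodes = 100
ram_per_node_gb = (64, 378)      # per-node RAM range from the slide
effective_per_node_tb = (6, 10)  # slide's effective storage after 3x
                                 # replication plus working headroom

# Total cluster RAM in TB: matches the slide's 6.4-37.8 TB.
total_ram_tb = tuple(g * nodes / 1000 for g in ram_per_node_gb)
print(total_ram_tb)     # (6.4, 37.8)

# Effective cluster storage: "up to a PB".
cluster_tb = tuple(t * nodes for t in effective_per_node_tb)
print(cluster_tb)       # (600, 1000)
```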
Committing to High Performance: HDFS Files with Embedded Schemas
- Sources: transactions, free text, images, machines/sensors, links/networks, EDW overflow
  - Purpose-built for EXTREME I/O speeds; use an ETL tool or Sqoop
- HDFS raw files: commodity HW; fault tolerant; replicated; append only(!); agnostic content; scalable to infinity
- Parquet columnar files: read-optimized, schema-defined column store
- Metadata (system table): HCatalog; all clients can use this to read files
- Query engines: Hive SQL, Impala SQL, others
  - These are query engines, not databases!
- BI tools: Tableau, Business Objects, Cognos, QlikView, others
High Performance Data Warehouse Thread in Hadoop
- Copy data from the raw HDFS file into a Parquet columnar file
  - Parquet is not a database: it's a file accessible to multiple query and analysis apps
  - Parquet data can be updated and the schema modified
- Query Parquet data with Hive or Impala
  - At least 10x performance gain over the simple raw file
- Hive launches MapReduce jobs: relation scan
  - Ideal for ETL and transfer to a conventional EDW
- Impala launches in-memory individual queries
  - Ideal for interactive query in a Hadoop destination DW
  - Impala: at least 10x additional performance gain over Hive
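The columnar layout that makes Parquet fast can be sketched in a few lines of plain Python (an illustration of the storage idea only, not the Parquet format itself; the table contents are hypothetical): a column store lets a query touch only the columns it names, instead of scanning whole records.

```python
# Row layout: every query must read whole records.
rows = [
    {"order_id": 1, "customer": "A", "amount": 10.0},
    {"order_id": 2, "customer": "B", "amount": 20.0},
    {"order_id": 3, "customer": "A", "amount": 30.0},
]

# Columnar layout: each column is stored contiguously, as in a Parquet file.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# SELECT sum(amount): the columnar scan reads one list and skips the rest.
total = sum(columns["amount"])
print(total)  # 60.0
```

Reading one contiguous column instead of three interleaved fields is the essence of the "at least 10x over the raw file" claim, and it also compresses far better because each column holds values of a single type.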
Use Hadoop as a Platform for Direct Analysis or ETL to a Text/Number DB
- Huge array of special analysis apps for:
  - Unstructured text
  - Hyper-structured text/numbers (machine data)
  - Positional data from GPS
  - Images
  - Audio, video
- Consume results with increasing SQL support from these individual apps
- Or, write text/number data into Hadoop from an unstructured source or an external EDW relational DBMS
The Larger Picture: Why Use Hadoop as Part of Your EDW?
Strategic:
- Open the floodgates to new kinds of data
- New kinds of analysis impossible in an RDBMS
- Schema on read for exploratory BI
- Attack the same data from multiple perspectives
  - Choose SQL and non-SQL approaches at query time
- Keep hyper-granular data in an active archive forever
  - No-compromise data analysis
  - Compliance
- Simultaneous incompatible analysis modes on the same data files
- Enterprise data hub: one location for all data resources
Tactical:
- Dramatically lowered operational costs
- Linear scaling across response time, concurrency, and data size, well beyond petabytes
- Highly reliable write-once, redundantly stored data
- Meet ETL SLAs
It's Not That Difficult
- Important existing tools already work in Hadoop:
  - ETL tool suites: familiar data flows and user interfaces
  - BI query tools: identical user interfaces, integration
  - Standard job schedulers, sort packages (e.g., SyncSort)
- Skills you need anyway: Java, Python or Ruby, C, SQL, Sqoop data transfer, Linux admin
  - But MapReduce programming is no longer needed
- Investigate and add incrementally:
  - Analytic tools: MADlib extensions to RDBMSs, SAS, R
  - Specialty data tools, e.g., Splunk (machine data)
Integration is Crucial
- Integration is MORE than bringing separate data sources onto a common platform
- Suppose you have two customer-facing data sources in your DW producing the following results. Is this integration?
Doing Integration the Right Way (a teaspoon sip of EDW 101 for Hadoop professionals!)
- Build a conformed dimension library
- Plan to download dimensions from the EDW
- Attach conformed dimensions to every possible source
- Join dimensions at query time to fact tables in SQL-capable files
- Embed dimension content as columns in NoSQL structures, including HBase
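Attaching a conformed dimension at query time can be sketched in plain Python (a minimal illustration of the join, not any particular engine; the dimension rows, keys, and attribute names are hypothetical): conformed attributes downloaded from the EDW are joined onto each fact row when the query runs.

```python
# Conformed customer dimension, downloaded from the EDW (hypothetical rows).
customer_dim = {
    "C1": {"customer_name": "Acme", "customer_category": "Retail"},
    "C2": {"customer_name": "Globex", "customer_category": "Wholesale"},
}

# Fact rows as they might land in an HDFS-backed table.
facts = [
    {"customer_key": "C1", "amount": 100.0},
    {"customer_key": "C2", "amount": 250.0},
]

# Join at query time: attach the conformed attributes to every fact row.
joined = [{**fact, **customer_dim[fact["customer_key"]]} for fact in facts]
print(joined[0]["customer_category"])  # Retail
```

Because every source carries the same conformed attributes after this join, answer sets from different platforms can later be grouped on identical labels, which is the precondition for drilling across.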
Integrating Big Data
Remember: data warehouse integration is drilling across:
1. Establish conformed attributes (e.g., Customer Category) in each database
2. Fetch separate answer sets from different platforms, grouped on the same conformed attributes
3. Sort-merge the answer sets at the BI layer (or is this post-BI? Depends on the way you fetch)
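The three drill-across steps can be sketched in plain Python (a BI-layer merge in miniature; the measure names and the numbers in the two answer sets are illustrative): two answer sets, each already grouped on the conformed attribute, are merged on that attribute.

```python
from collections import defaultdict

# Answer set from platform 1 (e.g., the Hadoop DW), grouped on the
# conformed attribute Customer Category.
web_visits = {"Retail": 500, "Wholesale": 120}

# Answer set from platform 2 (e.g., the relational EDW), same grouping.
revenue = {"Retail": 10000.0, "Wholesale": 42000.0}

# Drill across: merge the two answer sets on the conformed attribute.
merged = defaultdict(dict)
for category, visits in web_visits.items():
    merged[category]["visits"] = visits
for category, rev in revenue.items():
    merged[category]["revenue"] = rev

for category in sorted(merged):
    print(category, merged[category])
```

Note that neither platform ever sees the other's data; the only thing they must share is the conformed attribute itself, which is why the conformed dimension library is the backbone of integration.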
Out-of-the-Box Possibility: Billions of Rows, Millions of Columns
- A tough problem for all current relational platforms: huge name-value data sources (e.g., customer observations)
- Think about HBase(!)
  - Intended for impossibly wide schemas
  - Fully general binary data content
  - Fire-hose SCD1 and SCD2 updates of individual records
  - Continuously growing rows and columns
- Only simple SQL direct access is possible now: no joins
- Not yet ready for full EDW membership. Stay tuned!
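The name-value shape that defeats fixed relational schemas can be sketched in plain Python (a toy picture of a sparse wide-column layout in the spirit of HBase, not HBase's actual API; the row keys and column names are hypothetical): each row stores only the observations it actually has, so the column set can grow without limit.

```python
# Sparse name-value observations per customer, as a wide-column store
# would hold them: each row carries only the columns it needs.
observations = {
    "cust_1": {"clicked_ad_17": 1, "visited_page_home": 3},
    "cust_2": {"returned_item_42": 1},
}

# "Millions of columns" is just the union of names actually observed;
# a missing cell costs nothing to store.
all_columns = set()
for row in observations.values():
    all_columns.update(row)

# Point read: the simple direct access available today (no joins).
print(observations["cust_1"].get("clicked_ad_17", 0))  # 1
print(observations["cust_2"].get("clicked_ad_17", 0))  # 0
```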
Summing Up: The Data Warehouse Renaissance
- The Hadoop DW becomes an equal partner with the enterprise DW
- Hadoop will be the strategic environment of choice for new data types and new analysis modes
- Hadoop offers:
  - Extreme data type diversity
  - A huge library of specialty analysis tools with SQL extensions
  - A starting point for exploratory BI and ETL-to-EDW processing
  - A destination point for serious BI
  - A permanent active archive of hyper-granular data
- BI tools implement Hadoop-to-EDW integration
  - BI tools must step up to deliver the final integration payload
The Kimball Group Resource: www.kimballgroup.com
- Best-selling data warehouse books
  - NEW BOOK! The Classic Toolkit, 3rd Ed.
- In-depth data warehouse classes taught by the primary authors
  - Dimensional modeling (Ralph/Margy)
  - ETL architecture (Ralph/Bob)
- Dimensional design reviews and consulting by Kimball Group principals
- White papers on integration, data quality, and big data analytics