Tiber Solutions
Understanding the Current & Future Landscape of BI and Data Storage
Jim Hadley
Tiber Solutions
- Founded in 2005 to provide Business Intelligence / Data Warehousing / Big Data thought leadership to corporations and government agencies.
- Deeply skilled in all facets of BI/DW/Big Data solutions: star schema, ETL, BI, data visualization, data analytics, data architecture, information architecture, BI agile development methodology, and MDM/governance.
- Provides hands-on architecture, implementation, and coaching expertise within IT organizations, from the CIO to the developers.
- Partners with business executives to co-invent optimal BI/DW applications that dramatically improve their business.
Tiber Solutions Customers
Amethyst Technologies, Amtrak, Census Bureau, Cognosante, Defense Logistics Agency, Department of Health and Human Services, Department of the Treasury, Fannie Mae, Federal Deposit Insurance Corporation, Frontpoint Security, Freddie Mac, Graduate Management Admission Council, Internal Revenue Service, Military Health System, National Institutes of Health, Occupational Safety and Health Administration, Office of the Comptroller of the Currency, SAP Business Objects, Securities and Exchange Commission
Agenda
Business Intelligence Landscape
- Concepts/Architectures
- BI Tool vs. Data Visualization Tool Comparison
Data Storage Landscape
- Concepts/Architectures
- Product Group Comparison
Business Intelligence Landscape
Data Retrieval
Success Factors:
- Retrieval Speed
- Ease of Access
Data Presentation
Success Factors:
- Visualization Richness and Diversity
- Delivery Options (e.g., Mobile, Push)
Business Intelligence Landscape
Characteristics: Business Intelligence Tools | Data Visualization Tools
Product Examples: SAP Web Intelligence, Cognos, MicroStrategy | Tableau, Qliktech Qlikview, TIBCO Spotfire, Microsoft BI Stack
Strengths: Data Retrieval (Dynamic, Complex Ad Hoc Queries) | Data Presentation (Rich and Diverse Visualizations)
Limitations: Limited Visualizations | Limited Ad Hoc Capabilities
Primary Use: Ad Hoc Query, Canned Reports | Data Visualization, Data Exploration
Ad Hoc Query Capabilities: Yes | No (must be in cube)
Leverages Semantic Layer For Data Retrieval: Yes | Partially
Queries Data In Database Real-Time: Yes | No
Requires Persisting Data Set In Cubes or Files: No | Yes
Requires Developer Skills: Semantic Layer (Universe): Yes; Reports: Some | Cubes: Yes; Reports/Dashboards: No
SAP Products: SAP Web Intelligence | SAP Dashboards (requires developer), SAP Lumira (not nearly as mature), SAP Explorer (limited visualizations)
Business Intelligence Tool Architecture
[Diagram: Business Terms -> Semantic Layer (Universe) -> SQL -> Fact tables]
Semantic Layer (Universe)
Business Layer
- Folders: Used to organize objects into logical groups (e.g., Customer Dim, Sales Measures).
- Objects: Business terms are used to represent database columns (e.g., CUST_NM) or SQL formulas (e.g., SUM(REVENUE_AMT) - SUM(COST_AMT)).
Technical Layer
- Connections: Database connection parameters.
- Tables/Columns: Fact and Dimension tables and columns.
- Joins: Predefined joins between fact tables and dimension tables.
- Contexts: A group of joins. Each fact table should have a context.
Assumptions: A data warehouse/data mart exists in which ETL processing has harmonized and combined data from multiple data sources.
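As a rough illustration of the layers above, the sketch below maps business terms to columns or SQL formulas and assembles a query from whatever the end user picks. It is a minimal sketch, not how any particular BI product implements its universe: the table names `d_customer` and `f_sales`, the join key `cust_key`, the object names, and the `build_query` helper are all hypothetical.

```python
# Business layer (hypothetical objects): business term -> column or SQL formula.
OBJECTS = {
    "Customer Name": {"sql": "c.cust_nm", "kind": "dimension"},
    "Net Profit": {"sql": "SUM(f.revenue_amt) - SUM(f.cost_amt)", "kind": "measure"},
}

# Technical layer (hypothetical): tables and the predefined join for one context.
TABLES = "d_customer c, f_sales f"
JOINS = ["f.cust_key = c.cust_key"]


def build_query(selected_terms):
    """Translate the business terms picked by an end user into a SQL string."""
    dims = [OBJECTS[t]["sql"] for t in selected_terms if OBJECTS[t]["kind"] == "dimension"]
    measures = [OBJECTS[t]["sql"] for t in selected_terms if OBJECTS[t]["kind"] == "measure"]
    sql = (
        f"SELECT {', '.join(dims + measures)} "
        f"FROM {TABLES} "
        f"WHERE {' AND '.join(JOINS)}"
    )
    # Measures are aggregates, so the selected dimensions become the GROUP BY.
    if dims and measures:
        sql += " GROUP BY " + ", ".join(dims)
    return sql


query = build_query(["Customer Name", "Net Profit"])
print(query)
```

The end user only ever sees "Customer Name" and "Net Profit"; the column names, joins, and grouping come from the layer definitions.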
Business Intelligence Tool Architecture
Assumptions:
- Fact tables are at different levels of granularity (detail).
- 1-to-N fact tables can be queried with common dimensions.
Objects Selected by End User:
- Dims: Fiscal Year, Fiscal Quarter, Product Group
- Measures: Net Sales Amount, Forecast Amount
Related Tables and Columns:
- Fiscal Year: d_date.fiscal_yr
- Fiscal Quarter: d_date.fiscal_qtr
- Product Group: d_product.product_grp
- Net Sales Amount: f_sales.net_sales_amt (Sales Context)
- Forecast Amount: f_forecast.forecast_amt (Forecast Context)
Sales Query:
    SELECT d.fiscal_yr, d.fiscal_qtr, p.product_grp, SUM(f.net_sales_amt)
    FROM d_date d, d_product p, f_sales f
    WHERE f.date_key = d.date_key AND f.product_key = p.product_key
    GROUP BY d.fiscal_yr, d.fiscal_qtr, p.product_grp
Forecast Query:
    SELECT d.fiscal_yr, d.fiscal_qtr, p.product_grp, SUM(f.forecast_amt)
    FROM d_date d, d_product p, f_forecast f
    WHERE f.date_key = d.date_key AND f.product_key = p.product_key
    GROUP BY d.fiscal_yr, d.fiscal_qtr, p.product_grp
The two result sets are combined with a Full Outer Join on the common dimensions.
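The two-context pattern on this slide can be tried end to end with Python's built-in sqlite3 module. This is a sketch under stated assumptions: the table and column names follow the slide, but the sample rows are invented, and the full outer join is done in Python over the two result sets rather than by a BI tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE d_date (date_key INTEGER, fiscal_yr INTEGER, fiscal_qtr TEXT);
CREATE TABLE d_product (product_key INTEGER, product_grp TEXT);
CREATE TABLE f_sales (date_key INTEGER, product_key INTEGER, net_sales_amt REAL);
CREATE TABLE f_forecast (date_key INTEGER, product_key INTEGER, forecast_amt REAL);
-- Invented sample rows, for illustration only.
INSERT INTO d_date VALUES (1, 2014, 'Q1'), (2, 2014, 'Q2');
INSERT INTO d_product VALUES (10, 'Widgets'), (20, 'Gadgets');
INSERT INTO f_sales VALUES (1, 10, 100.0), (1, 10, 50.0), (2, 20, 75.0);
INSERT INTO f_forecast VALUES (1, 10, 120.0), (2, 10, 90.0);
""")

# One query per context: same dimensions, different fact table and measure.
QUERY = """
SELECT d.fiscal_yr, d.fiscal_qtr, p.product_grp, SUM(f.{measure})
FROM d_date d, d_product p, {fact} f
WHERE f.date_key = d.date_key AND f.product_key = p.product_key
GROUP BY d.fiscal_yr, d.fiscal_qtr, p.product_grp
"""


def run_context(fact, measure):
    """Run one context's query; key each result row by its dimension values."""
    rows = conn.execute(QUERY.format(fact=fact, measure=measure)).fetchall()
    return {row[:3]: row[3] for row in rows}


sales = run_context("f_sales", "net_sales_amt")
forecast = run_context("f_forecast", "forecast_amt")

# Full outer join: keep every dimension combination found in either result set.
merged = {
    key: (sales.get(key), forecast.get(key))
    for key in sorted(set(sales) | set(forecast))
}
for key, (net_sales, forecast_amt) in merged.items():
    print(key, net_sales, forecast_amt)
```

Aggregating each fact table separately before joining avoids double counting when the fact tables sit at different levels of granularity, which is the situation the slide's assumptions describe.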
Data Visualization Tool Architecture
[Diagram: OLTP and DW/DM sources feed OLAP cubes/files through nightly SQL loads.]
Data Visualization Experience:
- OLAP/File column names can be renamed to business terms.
- Easy for end users to drag/drop/visualize data using multiple visualization styles.
- Data across cubes can be combined.
Data Retrieval Observations:
- There is an assumption that the data is available, combinable, and clean (without any ETL or DQ).
- Data can be sourced from any database or file.
- Most products use OLAP cube technology to improve performance.
- OLAP cubes can be linked (joined) together, but they must share common dimensions and granularity.
- Data retrieval across OLAP cubes can be difficult.
- OLAP cubes are refreshed at night.
- Does not support dynamic ad hoc queries.
- IT is usually required to set up OLAP cubes on servers.
- OLAP cubes have practical size limits.
Data Presentation Observations:
- Data visualization products support 100s of visualization styles.
- Tools are good at recommending visualizations based on the data result set.
- Tools are very interactive.
- Easy to integrate visualizations together.
- Business users can largely use the client tools successfully without IT.
Federated BI Architecture
Use Case: How many passengers made refundable reservations and never traveled in 2014?
Step: Traditional BI/EDW | Federated BI
1. Query 2014 refundable reservation rows (25 million): Batch | Real-time
2. Query 2014 travel rows (15 million): Batch | Real-time
3. Left outer join the reservation query result set with the travel query result set on common dimension data (travel date, customer information, originating city, destination city, and flight number): Batch | Real-time
4. Aggregate the joined result set, counting all rows where travel information is null: Real-time | Real-time
[Diagram, Traditional BI/DW: Reservations and Travel are loaded by nightly batch ETL into the Data Warehouse, which the Semantic Layer (Universe) queries in real time.]
[Diagram, Federated BI: the Semantic Layer (Universe) uses a Federated Architecture to query Reservations and Travel in real time.]
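The four steps of the use case boil down to a left outer join followed by a count of unmatched rows. The sketch below runs that logic with Python's built-in sqlite3 module. A single in-memory database stands in for both sources, so it shows the query logic only, not federation itself. The table layouts, column names, and sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reservations (customer_id INTEGER, travel_date TEXT, origin TEXT,
                           destination TEXT, flight_no TEXT, refundable INTEGER);
CREATE TABLE travel (customer_id INTEGER, travel_date TEXT, origin TEXT,
                     destination TEXT, flight_no TEXT);
-- Invented sample rows: customers 1, 2 and 4 hold refundable reservations.
INSERT INTO reservations VALUES
  (1, '2014-03-01', 'DCA', 'BOS', 'F100', 1),
  (2, '2014-03-01', 'DCA', 'BOS', 'F100', 1),
  (3, '2014-06-15', 'IAD', 'ORD', 'F200', 0),
  (4, '2014-09-09', 'BWI', 'ATL', 'F300', 1);
-- Only customers 1 and 3 actually traveled.
INSERT INTO travel VALUES
  (1, '2014-03-01', 'DCA', 'BOS', 'F100'),
  (3, '2014-06-15', 'IAD', 'ORD', 'F200');
""")

# Steps 1-3: left outer join reservations to travel on the common dimensions.
# Step 4: count the rows where no matching travel row was found.
NO_SHOW_SQL = """
SELECT COUNT(*)
FROM reservations r
LEFT OUTER JOIN travel t
  ON  t.customer_id = r.customer_id
  AND t.travel_date = r.travel_date
  AND t.origin      = r.origin
  AND t.destination = r.destination
  AND t.flight_no   = r.flight_no
WHERE r.refundable = 1
  AND r.travel_date BETWEEN '2014-01-01' AND '2014-12-31'
  AND t.customer_id IS NULL
"""

no_shows = conn.execute(NO_SHOW_SQL).fetchone()[0]
print(no_shows)
```

In the traditional architecture this join runs against warehouse tables loaded by nightly ETL. In the federated architecture the same logic would have to be executed across the live Reservations and Travel systems.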
Data Storage Concepts/Architectures
- Columnar Data Storage
- Compression/Tokenization
- Parallelization
- In-Memory
Performance Bottleneck: Reading data off of disk.
Columnar Data Storage
Example query against a 10-column table: SELECT col1, col2, col3 FROM table
Traditional RDBMS:
- Data is stored row-oriented on disk.
- All columns are read off of disk even if only a subset of columns is selected.
- Unselected columns are pruned after the disk read.
- Optimized for row inserts.
Columnar Data Storage:
- Data is stored column-oriented on disk.
- Only selected columns are read off of disk.
- Unselected columns are not read off of disk.
- Optimized for data retrieval.
Results: Fewer columns to read = Less disk to read = Faster data retrieval speeds
Quantitative Results: 3 times faster
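The difference between the two layouts can be sketched with plain Python lists. The table here is synthetic (10 columns, 1,000 rows), and "values read" is a stand-in for disk I/O, so this is an illustrative model, not a benchmark of any real database.

```python
NUM_ROWS = 1000
COLUMNS = [f"col{i}" for i in range(1, 11)]  # col1 .. col10, as on the slide

# Row-oriented layout: one tuple per row, all 10 column values stored together.
row_store = [tuple(r * 10 + c for c in range(10)) for r in range(NUM_ROWS)]

# Column-oriented layout: one list per column.
column_store = {name: [row[i] for row in row_store] for i, name in enumerate(COLUMNS)}


def select_from_rows(selected):
    """Read every full row, then prune the unselected columns afterwards."""
    idx = [COLUMNS.index(name) for name in selected]
    values_read = 0
    result = []
    for row in row_store:
        values_read += len(row)  # the whole row comes off "disk"
        result.append(tuple(row[i] for i in idx))
    return result, values_read


def select_from_columns(selected):
    """Read only the selected columns; the others are never touched."""
    values_read = 0
    cols = []
    for name in selected:
        col = column_store[name]
        values_read += len(col)
        cols.append(col)
    return list(zip(*cols)), values_read


wanted = ["col1", "col2", "col3"]
row_result, row_reads = select_from_rows(wanted)
col_result, col_reads = select_from_columns(wanted)
print(row_reads, col_reads, row_reads / col_reads)
```

Both functions return the same rows, but the row store touches 10,000 values and the column store 3,000. That ratio of roughly 3.3 is in line with the "3 times faster" figure on the slide for a 3-of-10 column query.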
Compression/Tokenization
Traditional RDBMS:
- State column (10 million rows, 50 bytes wide): Alabama, Alabama, Alabama, Alabama, Alabama, Alaska, Alaska, ... Wyoming
- Data is stored on disk as it appears to the end user.
- Columns are byte-bound.
- Example: 50 bytes x 10 million rows = 500MB to read from disk.
Compressed Databases:
- State column (10 million rows, 6 bits wide): 1, 1, 1, 1, 1, 2, 2, ... 50
- V-List: 1 = Alabama, 2 = Alaska, 3 = Arizona, 4 = Arkansas, 5 = California, 6 = Colorado, 7 = Connecticut, ... 50 = Wyoming
- All distinct values are given a token representation.
- Tokens are stored on disk and not the actual data values.
- Columns are not byte-bound.
- Example: 2^6 = 64 values (50 values required), so 6 bits or 0.75 bytes are required. 0.75 bytes x 10M rows = 7.5MB of disk read.
Results: Narrower columns = Less disk to read = Faster data retrieval speeds
Quantitative Results: 66 times faster (500MB / 7.5MB)
Total Quantitative Results: 3 (columnar) x 66 (compression) = approximately 200 times faster
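The tokenization idea and the slide's arithmetic can be reproduced in a few lines of Python. The `tokenize` helper is a hypothetical, simplified dictionary encoder written for illustration. Real products pack tokens at the bit level and use their own formats. The row count, column width, and distinct-value count are taken from the slide.

```python
import math


def tokenize(values):
    """Replace each value with a small integer token; return (tokens, v_list)."""
    v_list = sorted(set(values))
    token_of = {value: i + 1 for i, value in enumerate(v_list)}
    return [token_of[value] for value in values], v_list


# Tiny sample column: with only 3 distinct values here, Wyoming gets token 3
# (it would be token 50 with all 50 states present).
column = ["Alabama"] * 5 + ["Alaska"] * 2 + ["Wyoming"]
tokens, v_list = tokenize(column)
decoded = [v_list[t - 1] for t in tokens]  # tokens map back to the original values

# Slide arithmetic: 10 million rows, 50 distinct states, 50-byte raw column.
ROWS = 10_000_000
DISTINCT_VALUES = 50
RAW_WIDTH_BYTES = 50

bits_per_token = math.ceil(math.log2(DISTINCT_VALUES))  # smallest n with 2**n >= 50
raw_mb = RAW_WIDTH_BYTES * ROWS / 1_000_000
tokenized_mb = bits_per_token * ROWS / 8 / 1_000_000
ratio = raw_mb / tokenized_mb

print(tokens)
print(bits_per_token, raw_mb, tokenized_mb, round(ratio, 1))
```

The ratio works out to about 66.7, which the slide rounds down to 66. Multiplying by the columnar factor of 3 gives roughly 200.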
Parallelization
Full-Table Scan:
- Sales Table: 2005 through 2014, 20 million rows.
- The entire table is read sequentially.
- Example: 20 million rows are read sequentially in 200 seconds.
Parallelized Full-Table Scan:
- Sales Partition - 1 through Sales Partition - 10.
- The table's 10 partitions are read in parallel.
- Example: 20 million rows are read in 10 parallel processes (2 million rows each) in 20 seconds.
Parallelized Partition Scan:
- Sales Partition - 2005 through Sales Partition - 2014.
- One partition is read (WHERE Year = 2012).
- Example: 2 million rows are read by one process in 20 seconds.
Results: Parallel partition reads = Faster data retrieval speeds
Quantitative Results: 10 times faster
Total Quantitative Results: 3 (columnar) x 66 (compression) x 10 (parallel) = approximately 2,000 times faster
Total quantitative results are rarely this significant and are for illustrative purposes only.
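The three scan strategies can be sketched as follows. The partition contents are synthetic, and the scan rate of 100,000 rows per second is simply what the slide's "20 million rows in 200 seconds" implies. Python threads here only show the structure of a partition-parallel scan and will not deliver a real 10x speedup.

```python
from concurrent.futures import ThreadPoolExecutor

YEARS = range(2005, 2015)
# Synthetic data: one partition per year, 1,000 (year, amount) rows each.
partitions = {year: [(year, 1.0)] * 1000 for year in YEARS}


def scan(rows, year=None):
    """Sum the amounts in one chunk of rows, optionally filtered by year."""
    return sum(amount for y, amount in rows if year is None or y == year)


def full_table_scan(year=None):
    """Read every partition, one after another."""
    return sum(scan(partitions[y], year) for y in YEARS)


def parallel_scan(year=None):
    """Read all partitions concurrently, one worker per partition."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return sum(pool.map(lambda rows: scan(rows, year), partitions.values()))


def partition_scan(year):
    """Partition pruning: read only the partition that can hold the year."""
    return scan(partitions[year], year)


# Timing model using the slide's figures.
ROWS = 20_000_000
ROWS_PER_SECOND = 100_000  # implied by 20 million rows in 200 seconds
sequential_seconds = ROWS / ROWS_PER_SECOND
parallel_seconds = (ROWS / 10) / ROWS_PER_SECOND
combined_factor = 3 * 66 * 10  # columnar x compression x parallel

print(full_table_scan(2012), parallel_scan(2012), partition_scan(2012))
print(sequential_seconds, parallel_seconds, combined_factor)
```

All three strategies return the same answer for the 2012 query. They differ only in how much data is read and how many workers read it. The exact product of the three factors is 1,980, which the slide rounds to 2,000.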
In-Memory
- In-memory processing is the trump card. However, in-memory processing is not cheap.
- Column-oriented data storage and compression/tokenization techniques allow significantly more data to fit into memory.
- Don't assume in-memory is the only solution.
Example:
- Perceived Problem: My Honda is too slow.
- Actual Problem: The driver only drives the car in first gear.
- Solution 1: Buy a Ferrari and drive it in first gear.
- Solution 2: Keep your Honda and learn how to use a clutch.
Data Storage Product Group Comparison
Characteristics: Traditional RDBMS | Columnar | In-Memory | Hadoop Ecosystem
Columnar Data Storage: No | Yes | Sometimes | No
Compression/Tokenization: No | Yes | Sometimes | No
Parallelization: Yes | Yes | Yes | Yes
In-Memory: No | No | Yes | No
Product Examples (in slide order): Oracle, IBM DB2, SQL Server, Amazon Redshift, Vertica, HBase, EMC GreenPlum, IBM DB2 BLU, SAP HANA, MemSQL, HDFS/MapReduce, HCatalog, Cassandra
Data Storage Final Thoughts
Columnar data storage, compression, parallelization, and in-memory processing ONLY address data retrieval performance.
These techniques DO NOT address:
- Harmonization of data sources (e.g., VA = Virginia = VIRGINIA, missing DC and Guam)
- Data quality issues
- Complexity of different data sets (e.g., many-to-many relationships, ratios, timing of data capture, etc.)
- End users' ability to intuitively and easily access, present, and understand information.
Questions
Jim Hadley, President
Email: jhadley@tibersolutions.com
Phone: 703.593.2833