Eliminating Required Waste and Non-Value Added Processing in Data Warehousing: A Six Sigma LEAN Perspective
John G. McManus, Bank of America
September 1, 2009
[Photo slides from the event: "Last Night's Festivities", "Continued", "And More", "Definitely a first for the CIO Finance Summit"]
John G. McManus, Bank of America
Senior Vice President, Global Wealth and Investment Management Technology
Responsible for: data warehouse/datamart, sales reporting, workflow, document management, and incentive compensation applications
Six Sigma Green Belt certified
Formerly director of data warehouse and business intelligence systems at AT&T Broadband
Served as an officer in the US Army and Army Reserve for 22 years
Currently supporting transition integration of the Merrill Lynch and Bank of America brokerage businesses and technology
Overview of the Presentation
A quick look at Bank of America
Six Sigma and LEAN fundamentals
A Six Sigma and LEAN perspective on data warehousing lifecycle processes
Summary
Questions
Bank of America Annual Report 2008 (does not include Merrill Lynch)
Annual revenue of $72B
Annual net income of $4B
6,100 retail banking outlets
18,700 ATMs
59 million consumer and small business banking relationships
29 million active online banking users
Global Wealth and Investment Management: more than 2 million individual and institutional clients worldwide
Six Sigma at Bank of America
Focuses on the reduction/elimination of waste in our processes
Focuses on the reduction/elimination of defects
Reduces variation using statistical tools
Provides ongoing process control and continuous improvement
Start with the customer perspective: "Begin with the End in Mind" (Stephen Covey)
The end for a data warehouse is data loaded and available for use
Sigma Level: a Metric That Measures Defects per Million Opportunities (DPMO)

Sigma Level | Defects per Million Opportunities
2           | 308,567.0
3           | 66,807.0
4           | 6,210.0
5           | 233.0
6           | 3.4

(A worked example of the calculation follows.)
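As a worked illustration (the numbers here are hypothetical, not from the presentation), DPMO is computed from the defect count, the number of units processed, and the number of defect opportunities per unit:

$$\text{DPMO} = \frac{\text{defects}}{\text{units} \times \text{opportunities per unit}} \times 10^{6}$$

For example, if a monthly load processes 500 source files, each with 10 opportunities to fail (missing file, truncated records, bad encoding, and so on), and 3 files arrive defective:

$$\text{DPMO} = \frac{3}{500 \times 10} \times 10^{6} = 600$$

which places the process between the 4-sigma and 5-sigma levels in the table above.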
LEAN
Six Sigma focuses on getting variation out of our processes
LEAN focuses on waste elimination, leveraging common tools and techniques from Six Sigma
LEAN tools include:
End-to-End System Map
Physical Process Map
Time Value Map
Spaghetti Chart
Product Process Flow Analysis
Seven Types of Waste
Defects: incomplete or corrupt source system feeds (low chip yield caused the Xbox shortage for Christmas 2005)
Overproduction: excessive storage space, redundant data
Transportation: moving data multiple times through the information pipeline
Waiting: processes that don't kick off until all source files are in hand
Inventory: wasted storage space, excessive history
Motion: putting the bun on the burger only to remove it to put on the pickles
Processing (too much): three nested views on top of an aggregate table in a datamart (see the sketch below)
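A minimal SQL sketch of the over-processing example above; the table and view names are hypothetical. Each query against the outermost view drags the optimizer through three layers of view resolution before it ever touches the aggregate:

-- The aggregate table itself (already-summarized data)
CREATE TABLE agg_cust_bal_mon (
    per_ccyymm   INTEGER       NOT NULL,
    bal_cd       INTEGER       NOT NULL,
    tot_bal_am   NUMERIC(18,2) NOT NULL
);

-- Three nested views, each adding a thin layer of "processing"
CREATE VIEW v_bal_filtered AS
    SELECT * FROM agg_cust_bal_mon WHERE tot_bal_am <> 0;

CREATE VIEW v_bal_labeled AS
    SELECT per_ccyymm, bal_cd, tot_bal_am,
           CASE WHEN tot_bal_am < 0 THEN 'DR' ELSE 'CR' END AS dr_cr_ind
    FROM v_bal_filtered;

CREATE VIEW v_bal_report AS
    SELECT per_ccyymm, dr_cr_ind, SUM(tot_bal_am) AS tot_bal_am
    FROM v_bal_labeled
    GROUP BY per_ccyymm, dr_cr_ind;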
Defining Value
Value Added Activity: an activity that CHANGES the size, shape, fit, form, or function of material or information (for the first time) to meet customer requirements.
Example: transformation of dimensional data to load to a fact table
Non-Value Added Activity: all other activities that take time or resources but do not satisfy customer requirements.
Examples: partitioning of data, indexing, creation of aggregate tables, denormalization (a short SQL contrast follows)
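To make the distinction concrete, a hedged sketch with hypothetical object names: the first statement changes the information itself into the form the customer consumes and is value added; the remaining statements exist only to make the database perform and are non-value added by this definition:

-- Value added: transforms source data into the fact the customer consumes
INSERT INTO fct_cust_bal (per_ccyymm, sk_cif, bal_am)
SELECT s.period_cd, d.sk_cif, SUM(s.balance)
FROM   stg_balances s
JOIN   dim_customer d ON d.cust_nbr = s.cust_nbr
GROUP BY s.period_cd, d.sk_cif;

-- Non-value added: changes nothing the customer sees
CREATE INDEX ix_fct_cust_bal_period ON fct_cust_bal (per_ccyymm);

CREATE TABLE agg_cust_bal_mon AS
    SELECT per_ccyymm, SUM(bal_am) AS tot_bal_am
    FROM   fct_cust_bal
    GROUP BY per_ccyymm;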
The Activity of the Product
The only four things a product can be doing:
Storage (white space)
Transportation
Inspection
Processing (value added or non-value added)
By the way, what is our product within data warehousing?
The Activity of the Equipment
Value added and non-value added activity:
Run (automated processing, manual processing)
Load
Unload
Setup
Replenish
Complete
Idle
Maintain
Typical Data Warehouse Development Tasks, Categorized (VAP = value added processing)
Procure and install multi-terabyte database/storage system
Develop logical data model: VAP
Develop physical data models: Required Waste
Calculate storage requirements for all objects: Required Waste
Design database partitions/strategy: Required Waste
Design indexes: Required Waste
Design aggregates and materialized views: Required Waste
Create DDL: VAP (for data tables; all else is required waste)
Create physical data tables including partitions: Required Waste
Create indexes: Required Waste
Create aggregates and materialized views: Required Waste
Create users, roles, entitlements: VAP
Load data: VAP
Create datamarts and cubes: Required Waste
Performance tune loads, extracts, and reports: Required Waste
Actual Results: DDL Migration
DDL from a traditional RDBMS was migrated to a data warehouse appliance within 1 day. The process removed 95+% of the content from the existing DDL files, including tablespace, partition, and index declarations.

BEFORE (1,821 lines of SQL):

CREATE TABLE CMVFACT.FCT_MKV_CUST_BAL (
    PER_CCYYMM          NUMBER(6)     NOT NULL,
    SK_CIF              NUMBER(15)    NOT NULL,
    SK_PR_L5            NUMBER(8)     NOT NULL,
    BAL_CD              NUMBER(5)     NOT NULL,
    BUS_SEG_L3_ID       NUMBER(4)     NOT NULL,
    SEG2_CD             NUMBER(2)     NOT NULL,
    ISO_CRNCY_CD        CHAR(3)       NOT NULL,
    CO_CST_CTR_ID       CHAR(10)      NOT NULL,
    CAL_MON_DAY_NR      NUMBER(2)     NOT NULL,
    CAL_MON_BUS_DAY_NR  NUMBER(2)     NOT NULL,
    BAL_AM              NUMBER(15,2)  NOT NULL,
    BAL_ORIG_CRNCY_AM   NUMBER(18,2)  NOT NULL,
    SRCE_SYS_CD         NUMBER(4)     NULL
)
TABLESPACE CMVFACT NOLOGGING
PCTFREE 0 PCTUSED 90 INITRANS 1 MAXTRANS 255
STORAGE(BUFFER_POOL DEFAULT) NOPARALLEL NOCACHE
PARTITION BY RANGE(PER_CCYYMM) (
    PARTITION P200205 VALUES LESS THAN (200206) STORAGE(FREELISTS 1 FREELIST GROUPS 1),
    PARTITION P200206 VALUES LESS THAN (200207) STORAGE(FREELISTS 1 FREELIST GROUPS 1),
    PARTITION P200207 VALUES LESS THAN (200208) STORAGE(FREELISTS 1 FREELIST GROUPS 1),
    PARTITION P200208 VALUES LESS THAN (200209) STORAGE(FREELISTS 1 FREELIST GROUPS 1),
    ...

AFTER (17 lines of SQL):

CREATE TABLE FCT_MKV_CUST_BAL (
    PER_CCYYMM          integer        NOT NULL,
    SK_CIF              bigint         NOT NULL,
    SK_PR_L5            integer        NOT NULL,
    BAL_CD              integer        NOT NULL,
    BUS_SEG_L3_ID       integer        NOT NULL,
    SEG2_CD             integer        NOT NULL,
    ISO_CRNCY_CD        CHAR(3)        NOT NULL,
    CO_CST_CTR_ID       CHAR(10)       NOT NULL,
    CAL_MON_DAY_NR      integer        NOT NULL,
    CAL_MON_BUS_DAY_NR  integer        NOT NULL,
    BAL_AM              numeric(15,2)  NOT NULL,
    BAL_ORIG_CRNCY_AM   numeric(18,2)  NOT NULL,
    SRCE_SYS_CD         integer        NULL
) distribute on (sk_cif);
Typical Data Warehouse Load Tasks
Drop indexes
Drop constraints
Drop aggregates
Sort data before loading
Transform and load data
Look at why a job that ran in 2 minutes yesterday now runs in 2 hours
Add storage to accommodate data growth
Regenerate indexes
Add storage to accommodate data growth
Regenerate constraints
Recreate aggregates
Regenerate materialized views
Recompute database statistics
Run reconciliation routines
(A sketch of this drop/load/rebuild cycle follows.)
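A compressed, hypothetical sketch of that cycle in generic SQL (object names invented for illustration); in practice each step is scripted and scheduled, and everything except the load itself is the waste this presentation targets:

-- 1. Tear down everything that would slow the bulk load
DROP INDEX ix_fct_cust_bal_period;
ALTER TABLE fct_cust_bal DROP CONSTRAINT fk_fct_cust_bal_cif;
DROP TABLE agg_cust_bal_mon;

-- 2. The only value added step: transform and load the data
INSERT INTO fct_cust_bal (per_ccyymm, sk_cif, bal_am)
SELECT period_cd, sk_cif, balance FROM stg_balances;

-- 3. Rebuild everything that was torn down
CREATE INDEX ix_fct_cust_bal_period ON fct_cust_bal (per_ccyymm);
ALTER TABLE fct_cust_bal ADD CONSTRAINT fk_fct_cust_bal_cif
    FOREIGN KEY (sk_cif) REFERENCES dim_customer (sk_cif);
CREATE TABLE agg_cust_bal_mon AS
    SELECT per_ccyymm, SUM(bal_am) AS tot_bal_am
    FROM fct_cust_bal GROUP BY per_ccyymm;

-- 4. Recompute optimizer statistics (syntax varies by platform)
ANALYZE fct_cust_bal;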
Time Value Map Inputs: Data Warehouse Load Tasks (time in minutes)

Step | Description                      | Store | Trans | Inspect | NV Process | VA Process
001  | Migrate Data from Source Systems |       | 120   |         |            |
002  | Reconcile Data Back to Source    |       |       | 20      |            |
003  | Preprocess, sort and prep data   |       |       |         | 30         |
004  | Drop Indexes                     |       |       |         | 5          |
005  | Drop Constraints                 |       |       |         | 1          |
006  | Drop Aggregates                  |       |       |         | 2          |
007  | Drop Materialized Views          |       |       |         | 2          |
008  | Load Data                        |       |       |         |            | 30
009  | Create Constraints               |       |       |         | 180        |
010  | Create Indexes                   |       |       |         | 90         |
011  | Create Materialized Views        |       |       |         | 60         |
012  | Create Aggregates                |       |       |         | 120        |
013  | Gather Statistics                |       |       |         | 300        |

Actual times for a weekly datamart load. (The value added fraction is worked out below.)
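Under this categorization (a reading of the table above, not a figure stated in the presentation), the value added fraction of the load window works out to:

$$\frac{\text{VA time}}{\text{total time}} = \frac{30}{120+20+30+5+1+2+2+30+180+90+60+120+300} = \frac{30}{960} \approx 3.1\%$$

so roughly 97% of the weekly load window is transportation, inspection, or non-value added processing, consistent with the >90% waste figure in the summary.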
Time Value Map
Categorize each process step according to the Activity of the Product
Identify value added vs. non-value added processing
White space is bad and needs to be eliminated
"The largest and most difficult waste to find is time, and you can never get it back." (Henry Ford)
[Chart legend: value added processing vs. non-value added processing]
Actual Results: 4TB Production Timeline, Monthly
With the traditional RDBMS and hardware, the CMV monthly batch (assume batch begins the 17th calendar day at 5:00 PM PST) runs Load Stage, Dimension/FTP, Fact Build, Aggregation, Index/Swap, and Certification, spanning the 17th through the 26th calendar days.
Using the data warehouse appliance, the same batch runs Load Stage, Dimension/FTP, Fact Build, Aggregation, and Certification, with no Index/Swap stage, spanning the 17th through the 21st.
8 DAYS removed from the monthly batch process, or 96 days of increased data availability per year (8 days x 12 months)!
Information Sprawl: Tracking the Activity of the Product (in LEAN terms, an E2E Process Map or Spaghetti Chart)
[Diagram slide]
Various Six Sigma Methods for Determining Root Cause
Failure Modes and Effects Analysis (FMEA): focuses priority on the most critical areas
Cause and Effect Diagram (Fishbone), which feeds the FMEA: shows influences on a process's potential failures
Example: the load failed. Why? Ran out of tablespace; partition not defined for the new month; incomplete data file; data anomalies; bad ETL logic
The 5 Whys: ask "Why?" 5 times
The 5 Whys: Get to the Root Cause of Failures
"Hey Boss, we missed our SLAs for the users again today."
Why? "The job that always runs in 2 minutes ran for 2 hours."
Why? "We think the database chose a different execution path."
Why? "The statistics might have been out of date."
Why do we need statistics? "To enable the database to make intelligent choices about which indexes to use and how to join tables."
Why do we need indexes? "Because traditional RDBMSs have significant design flaws."
Why? "You'll have to ask them."
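For the stale-statistics failure in the dialogue above, a hedged Oracle-style sketch (the schema and table names are borrowed from the earlier DDL example; the diagnosis query assumes Oracle 11g-era dictionary views):

-- Check whether optimizer statistics on the fact table are stale
SELECT table_name, last_analyzed, stale_stats
FROM   dba_tab_statistics
WHERE  owner = 'CMVFACT' AND table_name = 'FCT_MKV_CUST_BAL';

-- Regather statistics so the optimizer can pick a sane execution plan
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => 'CMVFACT', tabname => 'FCT_MKV_CUST_BAL');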
Why Do We Do It?
Creation of indexes: queries and loads would perform abysmally without them
Creation of aggregates: users want data summarized in many instances; queries and reports would perform abysmally without them
Creation of data marts: denormalize the data for reporting efficiencies; need to offload the impact of reporting from the data warehouse
Specialized transformations
DDL to support required waste: 95% of the DDL has nothing to do with the data itself (see the contrast below)
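The aggregate case illustrates the trade: on a traditional RDBMS the summary must be precomputed so reports perform acceptably, while an appliance that scans fast enough can serve the same answer straight from the fact table. A hedged sketch, reusing the table from the DDL example:

-- Traditional RDBMS: precompute the summary as required waste
CREATE MATERIALIZED VIEW mv_cust_bal_mon AS
    SELECT per_ccyymm, bal_cd, SUM(bal_am) AS tot_bal_am
    FROM   fct_mkv_cust_bal
    GROUP BY per_ccyymm, bal_cd;

-- Appliance: run the same summary directly against the fact table on demand
SELECT per_ccyymm, bal_cd, SUM(bal_am) AS tot_bal_am
FROM   fct_mkv_cust_bal
GROUP BY per_ccyymm, bal_cd;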
Today's Challenges
Larger data volumes and increased data retention
Increased reliance on data by the business
Smaller and smaller maintenance/load windows
Increased SLAs; support for international availability windows
Highly sophisticated users with a thirst for data
Increased risk, compliance, and regulatory oversight of data
What if We Could Eliminate the Need for Indexes?
Eliminate storage needs for indexes, usually 2x to 3x the data needs
Eliminate time-consuming index regeneration, usually longer than the data load times
Eliminate ongoing tuning/maintenance of indexes; less administrative DBA support needed
Deliver data faster; get it into the hands of decision makers quicker
Enable user self-empowerment; unlock the data
Increase reliability; fewer places to fail while loading
What if We Could Eliminate the Need for Aggregates?
Eliminate storage needs for aggregation tables and associated indexes
Eliminate time-consuming development and testing of aggregation routines, which can run to man-months
Eliminate time-consuming daily/weekly aggregate regeneration
Data is more current; no need to limit users to weekly or monthly views only (an artificial constraint)
Data is delivered faster, allowing earlier access by the users
Increase reliability; fewer places to fail. Many aggregation routines have flawed logic; the users just don't know it
What if We Could Eliminate the Need for Datamarts?
Eliminate the total cost of ownership for datamart hardware and software licenses
Eliminate development, maintenance, and production costs of marts
Single source of the truth
Data is delivered faster, allowing earlier access by the users
Summary
On average, data warehousing is >90% required waste or non-value added processing
The cost of this waste translates to:
Unacceptably long time to market
Unnecessary hardware and software license costs
Terabytes of wasted storage
Elongated development and data load cycles
Longer periods of data unavailability
Stale data
Poorly performing loads and queries
Excessive administrative costs
Final Thought
Henry Ford said: "Time waste differs from material waste in that there can be no salvage. The easiest of all wastes, and the hardest to correct, is the waste of time, because wasted time does not litter the floor like wasted material."
We need to eliminate the waste in data warehousing
Reclaim the time we currently waste
Questions?
John.G.McManus@BankofAmerica.com