Chapter 1 The Big Picture: A Itroductio to Data Warehousig Itroductio I 1977, Jimmy Carter was Presidet of the Uited States, Star Wars hit the big scree, ad Apple Computer, Ic. itroduced the world to the first persoal computer. Four years later, Roald Reaga was presidet, Price Charles ad Lady Diaa married, ad IBM bega sellig its IBM PC. From that begiig, computig power started movig from the maiframe to the desktop. Soo thereafter, spreadsheets ad word processig applicatios bega their jourey to replace clipboards ad typewriters. By 1985, Mikhail Gorbachev was the leader of the Soviet Uio, New Coke hit the shelves, ad data was goig everywhere. Eve iside the maiframe computers, data was fidig its ow home. Supervisors, maagers, ad executives alike were o loger able to look at a sigle clipboard to fid out how the busiess was performig. The data was hopelessly etreched i applicatios ad ooks ad craies that would ever agai see the light of day. Such was the origi of Decisio Support Systems.
Buildig ad Maitaiig a Data Warehouse Decisio Support Systems Decisio Support Systems allowed maagers, supervisors, ad executives to oce agai see the clipboard with all its iformatio. The iformatio, which previously had bee o a clipboard, had become Ralph Kimball was a co-creator of the Xerox Star Workstatio, the world s first commercially viable GUI applicatio. Ralph was the fouder ad CEO of Red Brick Systems, the group which created a extremely fast RDBMS targeted specifically for data warehousig. Whe he authored The Data Warehouse Lifecycle Toolkit, Ralph itroduced the Dimesioal Data Model (discussed i Chapter 5, Desig). a report, either prited o paper or displayed o a scree. Oe report revealed oe sigle busiess area. Aother report revealed a differet busiess area. By 1992, Widows 3.1 was i the stores ad Ralph Kimball ad Bill Imo were figurig out how to gather data from two busiess areas ad figurig out how to warehouse the data of a eterprise (Figure 1.1). Kimball ad Imo, workig separately, arrived at a commo set of guidelies (or priciples). These priciples are: Subject Orietatio: Data will be grouped by subject, rather tha author, departmet, or physical locatio. So, all maufacturig data goes together, ad the sales data, ad the promotios data, etc., regardless of where it came from. Data Itegratio: Eve though data comes from separate applicatios, departmets, etc., differeces should be smoothed out so they have the same look ad feel. Bill Imo was the creator of the Corporate Iformatio Factory ad Govermet Iformatio Factory. I so doig, Bill also established may of the priciples of Data Warehousig. Form: Whe two data elemets (e.g., phoe umbers) have differet layouts (e.g., 123-123-1234 ad (123) 123-1234), oe layout will be superimposed o both of them. Fuctio: Whe two data elemets idetify the same thig (e.g., a hammer) with two differet ames (e.g., part 32G ad part B49), these two ames will be replaced with oe ame. Grai: Whe two data elemets apply differet hierarchies (e.g., regio ad district) to the same thig, or differet levels of detail (e.g., miles ad feet), the two data elemets will be resolved to the same level of hierarchy or detail. Novolatility: Ulike the data i operatioal applicatios, which is discarded oce the compay is fiished usig it, the data i a data warehouse will remai i the warehouse.
The Big Picture Source Data Operatioal Applicatio Operatioal Applicatio Operatioal Applicatio Figure 1.1 The big picture. Data Acquisitio ad Itegratio ETL ELT Data Quality Repository Data Quality Data Quality Measuremets Desig Relatioal Relatioal Relatioal RDBMS BI Reportig Predefied Reports Iteractive Reports OLAP Metadata Repository Metadata Metadata Applicatio(s)
Buildig ad Maitaiig a Data Warehouse Time Variat: All data has a cotext at a momet i time. A data warehouse will keep that cotext. So, all data from 1995 will retai its cotext withi 1995. Oe Versio of the Truth: The proliferatio of data i the 1980s ad 1990s yielded may copies of the same data. Oly the oe, true gold, stadard copy of each data elemet would be icluded i a data warehouse. Log-Term Ivestmet: A data warehouse should be flexible eough to absorb chages i the compay ad the world, ad scalable eough to grow with the compay. By doig so, a data warehouse ca add value to the compay for a log time. Dimesioal ad Third Normal Form Data Models Kimball ad Imo arrived at the same set of priciples, yet each used completely differet desigs. Kimball created the Dimesioal Data Model (Figure 1.2). Also kow as the Star Schema (because it resembles a star), a Dimesioal Data Model has a distict shape. I the middle is a Fact table (a Fact is a evet, trasactio, or somethig that happes at a sigle momet i time). Surroudig the Fact table are Dimesio tables. Each Dimesio table holds all the permutatios of a sigle hierarchy of the compay (e.g., geography: city, couty, state, regio, district, etc.; or time: secod, miute, hour, day, fiscal week, payroll week, fiscal quarter, etc.). Bill Imo preferred the Third Normal Form Data Model (Figure 1.3). Rather tha capture hierarchies ad relatioships i Dimesio tables, the Third Normal Form allowed the data to have the same flexibility as the compay. Withi the data warehousig commuity, a debate emerged. Which was better, the Dimesioal Data Model or the Third Normal Form data model? By the twety-first cetury, the aswer was clear both. Both desigs had their stregths ad their weakesses. Rather tha apply a oe size fits all midset, data warehouse desigers leared to apply the stregths ad avoid the weakesses of both i each situatio. Storig the Data While the debate betwee Dimesioal Data Model ad Third Normal Form Data Model was still goig o, the data warehousig commuity was also decidig how to physically store the data. Three methods were foud: a cetral Eterprise Data Warehouse (EDW), several distributed Data Marts, ad a Operatioal Data Store (ODS). The cetral EDW held all the data from all the busiess subjects i oe database (Figure 1.4). The Data Mart held oe subject area oly (Figure 1.5). If aother
The Big Picture subject area were eeded, the that would be aother Data Mart. The best method of feedig data to a Data Mart was to itegrate that data ito a Eterprise Data Warehouse first ad the sed the data o to its Data Mart. The Operatioal Data Store (Figure 1.6) lives o the other side of the EDW ad retrieves operatioal data from the busiess, itegrates the data, ad stores the data i its ow database. Ulike the EDW, the data i a ODS is volatile. Volatile meas the ODS oly stores the value for a data elemet (e.g., balace o had) that is true at real-time (e.g., balace o had as of right ow). This is differet from the ovolatile data i a EDW (e.g., balace o had for every day for the past two years). Whe a ODS is preset, the EDW ca gather its data from the ODS rather tha from the busiess. There s o eed to ask the busiess the same questio twice. Time Dimesio Date Time Week Moth Year Thig Dimesio Thig Thig Name Thig Weight Thig Height Trasactio Evet Place Dimesio Place Place Name Place Address Place Purpose Perso Dimesio Perso Perso Name Perso Class Perso Type Time Thig Place Perso Equipmet Equipmet Dimesio Equipmet Equipmet Name Equipmet Descriptio Equipmet Purpose Figure 1.2 Dimesioal Data Model.
Buildig ad Maitaiig a Data Warehouse Trasactio Header Trasactio_ID Trasactio Time Date Time Trasactio Thig Thig Trasactio Place Place Caledar Year Year Time of Day Time Thig Thig Place Place Caledar Moth Moth Weight Height Address Thig Weight Thig Height Place Address Caledar Week Week Name Caledar Date Date Thig Name Place Name Equipmet Name Perso Name Figure 1.3 Third Normal Form Data Model. Trasactio Equipmet Trasactio Perso Equipmet Perso Equipmet Equipmet Perso Perso Class Purpose Place Purpose Equipmet Purpose Descriptio Equipmet Descriptio Perso Class Type Perso Type
The Big Picture EDW Purchasig Persoel Logistics Marketig Sales Maufacturig Figure 1.4 Eterprise Data Warehouse (EDW). The set of applicatios that gather data from the busiess ad brig that data ito the data warehouse are called extract, trasform, ad load (ETL) applicatios. The ETL aalyst is resposible for makig the data warehouse philosophy happe. Data Itegratio: ETL applicatios itegrate the data from the busiess, regardless of its origi, form, fuctio, or grai. Novolatility: ETL applicatios itroduce ew data without destroyig old data. Time Variat: ETL applicatios store the data with a key structure that poits to a timeframe. Oe Versio of the Truth: ETL applicatios referece oly the oe gold stadard for every data elemet. Log-Term Ivestmet: Populate data ito a data warehouse, realizig the log-term flexibility of the data warehouse desig. Data Availability The data iside a data warehouse is of o use to a busiess without a way to use that data. Busiess Itelligece Reportig, also kow as BI Reportig, is a set of applicatios by which a busiess ca haress data ad iformatio i a data warehouse. Data is idividual bits of facts ad figures. By itself, data tells the busiess very little. Iformatio is the compilatio of idividual bits of data ito a observatio or coclusio, which adds value to the busiess. BI Reportig icludes various methods by which data ad iformatio ca be available to the busiess. Predefied reports, a staple of all iformatio systems, ca dissemiate aswers to the same questios (e.g., who, how may, where) o a daily basis. Iteractive reports allow the busiess to ask a ew questio, or revise a existig questio, ad the receive the aswer. OLAP (olie aalytical processig)
Buildig ad Maitaiig a Data Warehouse DataMart EDW DataMart Marketig Purchasig Persoel Marketig Sales Logistics Maufacturig Logistics DataMart Purchasig Figure 1.5 EDW ad Data Marts. Busiess Uit ODS Operatioal Applicatio Marketig EDW Purchasig Persoel Logistics Marketig Sales Maufacturig Figure 1.6 Operatioal Data Store.
The Big Picture reportig allows busiess aalysts to drill up ad dow, left ad right i stream of cosciousess aalysis. Usig the Iteret, all of these reportig optios are available olie. The ext frotier of BI Reportig is data miig, the search for correlatios i the busiess, which caot be see. Metadata provides the backgroud ad cotext, which gives cocrete meaig to the facts ad figures i a data warehouse. Every four years, o the first Tuesday of November, all Americas use the same metadata the umber of votig precicts reportig. With 50 percet of the precicts reportig, o oe believes the result. With 75 percet of the precicts reportig, we begi to take the result seriously. Whe 90 percet of the precicts have reported their umbers, we tur the TV off ad go to bed; the electio is over. Metadata i a data warehouse works the same way. How complete is the data? What is the formula for that umber? Whe did the ew umbers come i? Like the umber of precicts reportig, metadata i a data warehouse gives meaig ad cotext to data. Moitorig Data Quality Fially, data quality is the cotiuous effort to moitor the accuracy, completeess, ad cofidece of the data i a data warehouse. The world is full of surprises, ad some of them affect the data i a data warehouse. Oly the aïve assume the busiess ad its data warehouse live i a perfect world where othig goes wrog. Diligetly moitorig data before it eters the data warehouse, the goal is to deliver data ad iformatio from which a busiess ca derive its strategic ad tactical decisios with cofidece. The explosio of data ad iformatio truly was a explosio. The facts ad figures of busiess foud their ow homes i accoutig systems, ivetory databases, ad a myriad of home-grow applicatios, all of which help ru the busiess. Data warehousig gathers ad itegrates that disparate data so the busiess, through its data, ca be see i oe place.