Framework for Data Warehouse Architectural Components

Author: Jim Wendt
Organization: Evaltech, Inc., Evaltech Research Group, Data Warehousing Practice
Date: 04/08/11
Email: erg@evaltech.com

Abstract: The data warehouse framework consists of five functional components, each of which is responsible for a specific set of processes essential to the decision-support environment: Source, Load, Storage, Query, and Meta-Data. The Source, Load, and Storage components support operational data migration to the data warehouse. The Query component handles the business processes supporting decision-support data access and analysis. The Meta-Data component serves as a foundation for the other four components by providing the data that controls their processing and interactions.

Intellectual Property / Copyright Material: All text and graphics found in this article are the property of Evaltech, Inc. and cannot be used or duplicated without the express written permission of the corporation through the Office of Evaltech, Inc.
Overview

A data warehouse environment's objective is first to transform data extracted from the applications supporting the organization's OLTP environment into high-quality, integrated data. Then it must store this data in a structure optimized for end-user access within the OLAP decision-support environment. During this process, summary data is added to the warehouse to provide management with information about revenues, costs, and activity volumes.

Data is transferred from the operational to the warehouse environment on a periodic basis appropriate to the type of business analysis being performed against the data warehouse. For example, data about clients who have defected from a company by closing all their accounts must be available on a daily basis so that marketing can activate its customer retention programs. However, financial summaries for income statements and balance sheets for tracking profitability by customer, product, market segment, and business unit are required only monthly.

The data warehouse consists of five functional divisions, each of which is responsible for a specific set of processes essential to the decision-support environment: Source, Load, Storage, Query, and Meta-Data. The Source, Load, and Storage functional divisions support operational data migration to the data warehouse. The Query division handles the business processes supporting decision-support data access and analysis. The Meta-Data division serves as a foundation for the other four functional divisions by providing the data that controls their processing and interactions.

SOURCE

The Source functional division includes those processes that identify the source applications of data transferred to the data warehouse. The data warehouse is typically sourced from data in the organization's operational databases. However, warehouses are increasingly tapping external sources for data on market share distribution within the industry or demographic and profile data on potential customers. Likewise, data may come from databases that business knowledge workers maintain on private LANs or individual PCs. Determining the best source of data held redundantly in many databases can be one of the more challenging activities warehouse sourcing analysts face.

Many of the processes associated with the sourcing function, such as data mapping, data integration analysis, and data quality assessment, actually occur during the data warehouse analysis and design phase. In fact, most of the time associated with initial data warehouse development is allocated to these activities. Industry experts claim that identifying sources, defining the rules for transforming data from source applications into the integrated data necessary for the data warehouse, and detecting and resolving data quality and integration issues often consume 75 to 80 percent of project time.

Unfortunately, automating these tasks is not easy. While certain tools can help detect data quality problems and generate extraction programs, most of the information required for developing data mapping and transformation rules and resolving data quality issues exists only in the heads of the businesspeople and analysts working with the source applications. Extracting that knowledge can be extremely time-consuming. Factors that directly impact time estimates for data analysis activities include the number of source applications that must be mapped into the data warehouse and the quality of meta-data maintained about those source applications.
Applications with minimal data documentation take longer to analyze and map. The business rules that an application enforces, such as data element valid domains, derivation rules, and dependencies between data elements, are another large concern. If these rules must be extracted from the source application code, you should probably plan on doubling the time you allocate to data sourcing tasks.
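To make the sourcing deliverable concrete, the following Python sketch shows one way a data mapping and transformation rule specification might be recorded and applied. The table names, column names, and status codes are hypothetical illustrations, not part of the framework itself; real mappings come out of the analysis work described above.

# Minimal sketch of a source-to-warehouse mapping specification.
# Table and column names (CRM_CUST, cust_stat, customer_dim, etc.) are
# hypothetical; real mappings come from the sourcing analysts' work.

def standardize_status(value):
    """Map a source application's local status codes to the warehouse's
    integrated domain; unknown codes are flagged for quality review."""
    domain = {"A": "ACTIVE", "C": "CLOSED", "S": "SUSPENDED"}
    return domain.get(value.strip().upper(), "UNKNOWN")

# Each entry records where a warehouse element comes from and how it is
# transformed, i.e., the kind of rule that consumes most of the analysis time.
CUSTOMER_MAPPING = [
    {"source": "CRM_CUST.cust_no",   "target": "customer_dim.customer_id",
     "rule": str.strip},
    {"source": "CRM_CUST.cust_stat", "target": "customer_dim.status",
     "rule": standardize_status},
    {"source": "CRM_CUST.open_dt",   "target": "customer_dim.open_date",
     "rule": lambda v: v[:10]},   # keep the ISO date portion only
]

def apply_mapping(source_row, mapping=CUSTOMER_MAPPING):
    """Transform one extracted source record into its warehouse form."""
    return {m["target"].split(".")[1]: m["rule"](source_row[m["source"].split(".")[1]])
            for m in mapping}

if __name__ == "__main__":
    extracted = {"cust_no": " 00042 ", "cust_stat": "c", "open_dt": "2010-11-03T09:15:00"}
    print(apply_mapping(extracted))
    # {'customer_id': '00042', 'status': 'CLOSED', 'open_date': '2010-11-03'}

Recording mappings as data rather than burying them in extract code keeps the rules visible to both analysts and the meta-data environment discussed later.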
LOAD

The Load functional division comprises the processes associated with migrating data from source applications to warehouse databases. They include data extraction, data cleansing, data transformation, and data warehouse loading.

Data extraction involves accessing the source application's data at the appropriate time. This process is the first step of the data's journey to the warehouse environment. Various extraction alternatives let you balance the performance, timing, and storage constraints of data extraction. For example, if the source application maintains an online database, you can submit a query directly to the database to create the extract files. When adopting this approach, you must develop a strategy to guarantee the extracted data's referential integrity. It's important to ensure that no intervening update transactions occur while a related set of records is being extracted. Performance in both the source application and the data warehouse environment may drop if online transactions and data warehouse extract queries compete for processing time. An alternate solution is to create a snapshot copy of the source application's database from which to extract data. This alternative eliminates the referential integrity and performance concerns, but you will need additional disk space to hold the database copy.

Time is a crucial consideration in the extraction process. Many source applications have a batch processing cycle in which offline transactions are applied to the database. The programs that create data warehouse extract files must be incorporated into the application's processing schedule so that they generate the files at the appropriate point in the database update cycle.

After its extraction from the source application, the data must be prepared for loading into the data warehouse database. The data is assessed to determine whether any quality problems exist. Data quality problems can be handled in several ways, depending on the error's source. If the error is inherent to the source application, the data can be cleansed systematically as part of the warehouse data transformation process. Unfortunately, most errors occur because the source application performs only minimal domain validation, which lets invalid values slip through. The only way to fix these types of errors is to tighten up the data validation routines in the source applications. Finally, even the best-behaved application can't prevent users from entering a value that is valid for the data element but incorrect for the corresponding real-world entity.

The final step in preparing extracted data for loading into the warehouse database is data transformation. This process invokes the rules that convert the values the data held in a local application into the values of the decision-support environment's global, integrated perspective. With this process complete, the data is loaded into the data warehouse database. The environment must then be validated to ensure that all source application data has been loaded successfully before the period's data can finally be made available to data warehouse users. A sketch of this extract-cleanse-transform-load sequence appears below.

STORAGE

The Storage functional division encompasses the architecture necessary for integrating the various views of warehouse data. Although we often discuss the data warehouse database as if it were a single data store, its data may in reality be distributed across multiple databases managed by different DBMSs.
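Before looking at storage options in detail, here is the load sequence from the Load division expressed as a minimal Python sketch. The record layout, validation rule, and in-memory load target are hypothetical; a production pipeline would be driven by the meta-data described later rather than hard-coded.

# Minimal sketch of the extract -> cleanse -> transform -> load sequence.
# The snapshot file, field names, and warehouse table are hypothetical.

import csv

def extract(snapshot_path):
    """Read records from a snapshot copy of the source database,
    avoiding contention with online transactions."""
    with open(snapshot_path, newline="") as f:
        return list(csv.DictReader(f))

def cleanse(rows):
    """Separate rows that fail domain validation so quality problems
    can be reported back to the source application's owners."""
    valid, rejected = [], []
    for row in rows:
        if row.get("account_type") in {"CHK", "SAV", "LOAN"}:
            valid.append(row)
        else:
            rejected.append(row)
    return valid, rejected

def transform(rows):
    """Apply the rules that convert local values into the warehouse's
    integrated perspective (here, just a currency-to-cents conversion)."""
    for row in rows:
        row["balance_cents"] = int(round(float(row["balance"]) * 100))
    return rows

def load(rows, warehouse):
    """Append the period's records to the (in-memory, illustrative) warehouse
    and return a count used to validate that the load completed."""
    warehouse.extend(rows)
    return len(rows)

if __name__ == "__main__":
    warehouse_table = []
    extracted = [{"account_type": "CHK", "balance": "125.50"},
                 {"account_type": "XXX", "balance": "10.00"}]  # stands in for extract("snapshot.csv")
    valid, rejected = cleanse(extracted)
    loaded = load(transform(valid), warehouse_table)
    print(f"loaded {loaded} rows, rejected {len(rejected)} for quality review")

The rejected-row count illustrates the final validation step: the period's data is published to users only after the load is confirmed complete.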
Two classes of DBMSs are well suited to supporting data warehouse environments: relational (RDBMS) and multidimensional (MDDBMS). An MDDBMS organizes data into an "n-dimensional" array. Each dimension of the array represents some aspect of the business around which analysis is conducted. Typical dimensions are Products, Organizational Units (such as stores, branches, headquarters, and manufacturing plants), and Time. Each cell in the array represents a fact based on a combination of dimensions (for example, revenue for a specific product at a specific store in a specific time period). Data warehouse users can easily aggregate information by selecting property combinations from these dimensions. Multidimensional databases present data in a manner that data warehouse users can easily understand and access.
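The dimension-and-cell structure can be sketched with a small Python example; the products, stores, and months shown are hypothetical, and an MDDBMS would manage and index such an array natively rather than in a plain dictionary.

# Minimal sketch of a multidimensional view: each cell of the array is a
# revenue fact keyed by (product, store, month). The dimension values are
# hypothetical; an MDDBMS would manage this structure natively.

from collections import defaultdict

# Cells of the "cube": (product, store, month) -> revenue
cube = {
    ("widget", "store-01", "2011-03"): 1200.0,
    ("widget", "store-02", "2011-03"): 800.0,
    ("gadget", "store-01", "2011-03"): 450.0,
    ("widget", "store-01", "2011-04"): 1350.0,
}

def aggregate(cube, by):
    """Roll the cube up along a chosen dimension (0=product, 1=store, 2=month),
    the kind of aggregation warehouse users perform by picking dimension values."""
    totals = defaultdict(float)
    for key, revenue in cube.items():
        totals[key[by]] += revenue
    return dict(totals)

print(aggregate(cube, by=0))  # revenue per product across all stores and months
print(aggregate(cube, by=2))  # revenue per month across all products and stores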
The multidimensional data array provides a specific view of the enterprise's integrated data. Each business area may require that its own view be organized into arrays optimized to meet its own specific query requirements. Because each business area draws upon its own subset of the enterprise's integrated data resource, it's highly unlikely that the same multidimensional database will support the decision-support requirements of finance, marketing, manufacturing, and human resources. An RDBMS is usually best suited to managing an integrated database, which is neutral with respect to each business area's processing needs. While the multidimensional data view is designed to optimize ease of use among end users, the integrated data warehouse database is designed to optimize data sharing across all business areas. To distinguish between the two types of data usage, we use the term data warehouse database when discussing the integrated, enterprise-wide data store, and the term data mart when referring to the multidimensional view that meets the specific needs of one or more business areas.

The separation of data management into the enterprise data warehouse database and its satellite data marts introduces the need for a data distribution strategy that coordinates delivery of new data to the multidimensional databases. The data warehouse architect should consider whether to incorporate a replication server into the data distribution architecture to manage delivery of the right data to the right data mart at the right time. A replication server is a sophisticated application that selects and partitions data for distribution to each data mart, applies security constraints, transmits a copy of that data to the appropriate receiving sites, and logs all data transmissions.

The data warehouse architecture must also address the conditions by which historical data is moved from the online environment and archived offline. Data is archived at several levels. Near-term data is held in a medium from which it can easily be restored to the online environment. Older, rarely used data may be held in a secure but more cost-effective medium. Historical data is purged when it is so old that it no longer has any business value. Each business area may have its own criteria for determining the archiving rules for its historical data. Marketing may find that customer profile data older than three years no longer has value for predicting consumer behavior. Risk management, on the other hand, may need access to historical data over the last 10 years to detect loss trends.

QUERY

The data warehouse environment is designed to provide integrated, high-quality data to support the enterprise's decision-making processes. The Query functional division incorporates the architectural components that the organization's knowledge workers and executives use to access and analyze warehouse data in order to detect trends and determine the enterprise's health. The query environment lets end users conduct analysis and produce reports through their multidimensional ad hoc OLAP tools. However, new technology promises to support the next generation of business analysis: data mining and business simulation.

Data mining tools analyze data elements to identify unanticipated correlations among seemingly unrelated data elements. Data mining techniques have been effective in determining parameters for detecting fraud and identifying which customers are likely to be targeted by a competitor's marketing campaign.
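As a toy illustration of the kind of relationship a mining tool looks for, the Python sketch below measures how strongly two attributes move together. The field names and values are invented for illustration, and real mining tools apply far richer statistical and machine-learning techniques than a single correlation coefficient.

# Toy illustration of data mining over warehouse data: measuring how strongly
# two seemingly unrelated attributes move together. Field names and values are
# hypothetical; real mining tools apply far richer statistical techniques.

from statistics import correlation  # available in Python 3.10+

# One row per customer, drawn (hypothetically) from the warehouse.
customers = [
    {"branch_visits_per_month": 0, "closed_all_accounts": 1},
    {"branch_visits_per_month": 1, "closed_all_accounts": 1},
    {"branch_visits_per_month": 4, "closed_all_accounts": 0},
    {"branch_visits_per_month": 6, "closed_all_accounts": 0},
    {"branch_visits_per_month": 2, "closed_all_accounts": 1},
    {"branch_visits_per_month": 5, "closed_all_accounts": 0},
]

visits = [c["branch_visits_per_month"] for c in customers]
defected = [c["closed_all_accounts"] for c in customers]

# A strong negative value would suggest that infrequent visitors are the
# customers most likely to defect, a pattern marketing could act on.
print(f"correlation between visit frequency and defection: {correlation(visits, defected):.2f}")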
One of the primary purposes of these technologies is to check the effectiveness of the organization's business rules. Product lines may not be selling as well as expected. Market characteristics may be shifting. New sales channels may appear, while existing ones may no longer be effective. Analysis conducted through the data warehouse can pinpoint ineffective business rules. Simulation tools let the organization create models to test the impact of any necessary changes on the business environment. For example, marketing may want to project how introducing the Internet as a sales channel to its upscale customers may reduce the existing sales channels' effectiveness. How many customers must use the self-service Internet mode to warrant a downsizing of the telemarketing unit? Marketing can simulate the effects of various business scenarios to project possible shifts among sales channels and identify an optimal mix of
investment strategies. Simulation lets businesses project the future based on trends in the data warehouse's historical data.

Once new business rules have been established, they must be fed back into the relevant operational applications. The migration of changed business rules from the decision-support environment back to the OLTP applications is carried out by the warehouse functional division framework's closed-loop processes. For example, after analyzing stock data, brokerage houses feed buy and sell decisions to the trading systems. Financial service organizations adjust their loan approval rules based on trend analysis of data warehouse data.

Another framework component, data summarization, spans multiple functional divisions. A data warehouse architect must determine how to support the business users' data summary requirements. Numerous viable options exist: summary data can be derived during the load process and stored in the enterprise's relational database, derived when the replication server distributes data to the data mart's multidimensional database, or derived on the fly when a user submits a query or launches a simulation.

META-DATA

The fifth and final functional division of a data warehouse is Meta-Data, which serves as a foundation for the other divisions. Meta-data is as essential to knowledge workers as the data in the warehouse. The data warehouse environment requires meta-data on the elements extracted from the source application, including their domain, validation and derivation rules, and the rules for transforming these elements into the data warehouse's integrated perspective. Meta-data also describes the data warehouse databases, including the distribution rules that control data migration from the central data warehouse to related data marts.

In addition to the data about data structures, performance and monitoring data also qualifies as meta-data. The processes that monitor the data warehouse processes (such as extract, load, and usage) create meta-data that is used to determine how well the environment is performing. Likewise, meta-data that identifies data quality issues detected in the extract and load process must be available to data warehouse users, who can incorporate this knowledge as a factor in determining the accuracy levels of their analysis.

Data warehouse administrators can manage and provide access to the enterprise's meta-data through repository services. This repository-based framework assumes that a centrally managed repository tool, or set of tools, is available to integrate, version, and synchronize all warehouse-relevant meta-data with its corresponding "real" data and system counterparts. This meta-data must be administered, secured, and made available to all interested audiences. All of these activities may take some creative development.

For many organizations, developing a data warehouse provides an impetus for integrating data from various sources into a consistent format and structure, a level of integration that was not previously undertaken among separate operational and external systems. This process brings various meta-data to a central storage point where compare and merge functions can help automate the integration process. Impact analysis, undertaken from several starting points and across several types of meta-data, speeds up the decision-making involved in developing a consensus view of warehouse data.
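To make these requirements concrete, here is a minimal Python sketch of the kind of element-level meta-data record a repository might hold. Every field name and value shown is a hypothetical illustration rather than a prescribed schema; the point is that domain, derivation rules, quality notes, and version history travel together with the element they describe.

# Minimal sketch of an element-level meta-data record such as a repository
# might integrate, version, and synchronize. All names and values are
# hypothetical illustrations, not a prescribed schema.

from dataclasses import dataclass, field

@dataclass
class ElementMetadata:
    warehouse_element: str            # where the element lives in the warehouse
    source_element: str               # which source application field feeds it
    domain: list                      # valid values in the integrated perspective
    derivation_rule: str              # plain-language transformation rule
    quality_notes: str = ""           # issues detected during extract and load
    version: int = 1                  # versioned as the algorithm changes
    history: list = field(default_factory=list)

    def revise(self, new_rule):
        """Record a new derivation rule while preserving how past warehouse
        data was calculated, so analysts can interpret historical values."""
        self.history.append((self.version, self.derivation_rule))
        self.derivation_rule = new_rule
        self.version += 1

status_meta = ElementMetadata(
    warehouse_element="customer_dim.status",
    source_element="CRM_CUST.cust_stat",
    domain=["ACTIVE", "CLOSED", "SUSPENDED", "UNKNOWN"],
    derivation_rule="Map single-letter CRM codes to integrated status values",
    quality_notes="Source performs minimal domain validation; unknown codes flagged",
)
status_meta.revise("Map CRM codes; treat accounts inactive for over 24 months as SUSPENDED")
print(status_meta.version, status_meta.history)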
In the past, systems users received minimal amounts of meta-data about an operational system at the time of production implementation (in the form of user documentation manuals or brief field-sensitive online help text). Because production systems rarely changed significantly, meta-data documentation rarely changed. However, data warehousing has introduced an environment in which data will continually increase and will be summarized and aggregated according to
ever-changing algorithms. Meta-data must keep pace. The warehouse environment requires more meta-data: for example, to substantiate the current algorithms and track how warehouse data was calculated in the past. Meta-data about "real" warehouse data that has become historical is also required and needs versions to the extent that the "real" warehouse data has versions.

User documentation manuals, which are cumbersome to produce and keep up to date, no longer suffice as a meta-data source for warehouse users; neither do IT-oriented mechanisms, to which not all knowledge workers have access, and which have low user comfort levels. Meta-data must now be provided through business-friendly mechanisms that are current with the warehouse access technology itself, and in time frames and formats that meet business needs. For example, periodic subject area attribute lists or business cycle transformation rules may be needed in report format to facilitate warehouse project signoffs, warehouse usage audits, and other warehouse management milestones.

As additional warehouse data is made available to potential knowledge workers, and more knowledge workers need to know what warehouse data is available, it is increasingly important that these knowledge workers have the appropriate meta-data access permissions. It is no longer sufficient to make the meta-data available within the same constraints as systems data. Warehouse meta-data administrators must now mirror the administration functionality of traditional production systems.

Meta-data about warehouse data must undergo more extensive synchronization than meta-data for the production phase of operational systems. When warehouse data is replicated to remote locations, corresponding meta-data must also be replicated. When warehouse data is selected and aggregated to satisfy departmental or functional perspectives for a data mart, or when it is judged to have little continuing analytical value and is slated for archival, the relevant meta-data must accompany it. Whatever meta-data integration work remains must be completed before warehouse meta-data can be managed in production.

Moving and managing meta-data among warehouse development tools is the largest integration issue reflected by the meta-data management framework. For one tool type to reuse the meta-data developed in another tool type, meta-data formats and structures must be compatible across any existing tool interfaces. A particularly challenging issue is the recognition and timely handling of changes in the data sources that feed the warehouse environment. Proactive recognition and triggering mechanisms will minimize negative impact on the warehouse's Source, Load, and Storage functions. How well warehouse tool interfaces and a central repository toolset provide these mechanisms is a significant evaluation criterion and implementation requirement.

The meta-data management functions contained in this framework must be implemented for a warehouse environment, whether or not they are provided through a centrally managed repository toolset. The most common alternative to a repository-based meta-data management architecture involves pair-wise interfaces between tools. In this approach, the tool considered to be the primary meta-data source propagates the appropriate meta-data directly to those tools that use it. For example, data quality assessment and data transformation tools need access to meta-data describing the source applications' data structures. Data transformation tools need access to meta-data about the warehouse's database structure.
The data transformation tool may be the "system of record" for data mapping and transformation rules between the source applications and warehouse data. Warehouse users need access to this same meta-data through their query tools. Tools can often extract the meta-data they need from the source tool. For example, data quality assessment and data transformation tools can reverse engineer meta-data about source application data structures, and query tools often have a facility for importing meta-data.

Direct interfaces between tools do allow meta-data sharing, but at what cost? Tasks such as reverse engineering must be performed redundantly. A tool may be able to interface with only a subset of available tools,
leaving warehouse developers to build interfaces to other products using a generic import/export facility. Finally, few tools support version control and impact analysis.

The five data warehouse functional divisions provide a framework for organizing a data warehouse's architectural components. The framework describes the transformation data must undergo in its journey from an OLTP to an OLAP environment. Future articles will illustrate how you can use the functional division framework as a road map when developing your data warehouse's infrastructure.

Figure: Enterprise Data Warehouse Architecture