Making SAP Information Steward a Key Part of Your Data Governance Strategy, Part 2: SAP Information Steward Overview and Data Insight Review

Part 1 in our series on Data Governance defined the concept of Data Governance and gave suggestions on how to implement an initial program at a corporate level. The definition that we use is: Data Governance is your organization's management strategy to meet the data quality needs of final data users and consumers. It verifies that data meets your organization's security requirements and ensures that it complies with any regulatory laws. It is the marriage of data quality, data management, and risk management principles, and it is implemented via corporate policies, procedures, controls, and software.

Now that we know what it is and how to start a program, let's discuss how SAP Information Steward can fit into a data governance initiative. SAP Information Steward is an enterprise-level data quality solution that allows you to profile data, perform impact and lineage analysis, construct a corporate dictionary, and define custom cleansing rules for incoming data. Each of these functions is performed by a different module of the software: Data Insight, Metadata Management, Metapedia, and Cleansing Package Builder. Your initial data governance goal will determine which of these to utilize first.

Data Insight is the data profiling tool and data quality monitor. Metadata Management is the impact and lineage analysis tool that can determine where a piece of data is used throughout the enterprise and what may affect that data. Metapedia is the corporate dictionary where business terms can be defined for use throughout the organization. Finally, Cleansing Package Builder is the data quality tool that allows data area experts to define transformations and cleansing rules in order to standardize a particular set of data. This post covers Data Insight in detail, while subsequent posts will break down the other modules of the Information Steward tool.

Data Insight allows you to profile data from a range of sources, including standard relational databases, SAP HANA, SAP ERP, SAP Master Data Services, and even flat files. Data profiling is simply the process of analyzing the data that exists in a source and collecting statistics from that analysis. It answers the question "What does my data source actually contain?", as there is often a disparity between what a source should contain and what it contains in reality. Data profiling is the starting point for data integration tasks, data warehouse projects, and many data governance programs. Without this starting point, one cannot properly measure the data quality improvements that are achieved through a data governance or data quality program.

There are several flavors of data profiling. The basic Column Profiling option collects statistics for a column such as the percentage of null values, the distinct values found, the minimum and maximum values in the field, common patterns in the data, and much more. The sketch below illustrates the idea.
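To make the concept concrete outside the tool, here is a minimal column-profiling sketch in Python with pandas. This is not Information Steward's engine, just an illustration of the statistics a Column Profiling run collects; the table and column names are hypothetical.

# Minimal column-profiling sketch (pandas); illustrative only, not
# Information Steward's actual profiling engine.
import re
import pandas as pd

def profile_column(series: pd.Series) -> dict:
    """Collect basic statistics similar to a Column Profiling run."""
    stats = {
        "null_pct": round(series.isna().mean() * 100, 2),
        "distinct_values": series.nunique(dropna=True),
        "min": series.min(),
        "max": series.max(),
    }
    # Pattern analysis: map digits to 9 and letters to X, then count
    # how often each resulting pattern occurs.
    patterns = (
        series.dropna()
        .astype(str)
        .map(lambda v: re.sub(r"[A-Za-z]", "X", re.sub(r"\d", "9", v)))
    )
    stats["top_patterns"] = patterns.value_counts().head(5).to_dict()
    return stats

# Hypothetical usage with a customer table loaded from CSV:
# df = pd.read_csv("customers.csv")
# print(profile_column(df["ZIP_CODE"]))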
Advanced column profiling is also available through Address profiling, Redundancy profiling, Uniqueness profiling, and Dependency profiling. Address profiling utilizes the Data Services job engine to parse address data through its Address Cleansing transform and return a percentage breakdown of good and bad addresses in the data. It also displays the percentage of addresses that would be correctable if the data were run through a Data Services Address Cleanse transform. Redundancy profiling checks the amount of overlap in data between two sources and is good for rooting out referential integrity issues. Uniqueness profiling determines the percentage of unique values in a field. And lastly, Dependency profiling allows you to determine the degree of dependency that two or more columns of data have upon each other. (A rough sketch of the uniqueness and redundancy checks appears below, after the screenshots.)

The screenshot below shows the results of column data profiling of a database table. Note that certain results only appear when appropriate; for example, an average appears only for numeric columns, and string length results appear only for character columns.

Figure 1: Sample results of column data profiling

Information Steward also stores sample rows of data that fit a given profiling result. This allows you to drill down to record-level detail without having to leave the tool. In the screenshot below, I first selected the Value result of the GPO_NAME column. I then selected the value NOVATION that appeared in the list of values in the right-hand panel. Note that these values are sorted from most to least frequent, as determined by the number of occurrences in the dataset. The bottom panel contains the raw data rows that fit these selections.

Figure 2: The list of values for the column GPO_NAME appears on the right, and sample data rows appear in the bottom panel for the value NOVATION
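As promised above, here is a rough pandas sketch of what Uniqueness and Redundancy profiling measure conceptually. The DataFrames and key columns are hypothetical stand-ins for the sources you would select in Data Insight.

# Conceptual sketches of Uniqueness and Redundancy profiling; the
# source names below are illustrative, not tool APIs.
import pandas as pd

def uniqueness_pct(series: pd.Series) -> float:
    """Percentage of non-null values that occur exactly once."""
    counts = series.value_counts()
    return round(counts[counts == 1].sum() / len(series.dropna()) * 100, 2)

def redundancy_pct(left: pd.Series, right: pd.Series) -> float:
    """Percentage of left-side values that also appear on the right;
    a quick check for referential-integrity gaps between two sources."""
    return round(left.isin(right).mean() * 100, 2)

# Hypothetical usage:
# orders = pd.read_csv("orders.csv"); customers = pd.read_csv("customers.csv")
# uniqueness_pct(customers["CUSTOMER_ID"])   # should be 100 for a key column
# redundancy_pct(orders["CUSTOMER_ID"], customers["CUSTOMER_ID"])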
Data Insight's second focus is that of a data quality monitor. This is achieved through the use of custom data validation rules and data quality scorecards. A data validation rule defines the form that a piece of data should take and scores the results when the rule is run against one or more sources. For example, if a database column should contain only US zip codes, a rule for that column could require the data to be either five digits in length or five digits followed by a hyphen and four more digits (e.g. 99999 or 99999-9999). The rule can then be bound to any source field that should contain US zip codes. Furthermore, rule scores are stored in the Information Steward repository and can be analyzed over time to determine whether data quality is improving or getting worse. The rule building tool allows a wide range of flexibility to build rules that suit your organization's needs.

The screenshot below shows the various score results of a rule that has been selected in the left panel. This rule has been bound to multiple sources and has been run multiple times. Take note of the scores per source binding, the colors that correspond to those scores (which are configurable per binding), and the From Last arrow that indicates whether the score improved or worsened since the last time it was calculated. This screen also displays the total number of rows that were analyzed and the number that failed the rule.

Figure 3: Rules are displayed in the left panel along with their corresponding bindings on the right. Scores are calculated and kept historically.
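To make the zip-code example concrete, here is a minimal Python sketch of the same rule expressed as a reusable function rather than in Information Steward's rule editor. The column and source names are illustrative only.

# The zip-code rule from the text: 99999 or 99999-9999.
import re
import pandas as pd

ZIP_RULE = re.compile(r"^\d{5}(-\d{4})?$")

def score_rule(series: pd.Series, pattern: re.Pattern) -> dict:
    """Run a validation rule against one bound column and return its score."""
    values = series.dropna().astype(str)
    passed = values.map(lambda v: bool(pattern.match(v))).sum()
    total = len(values)
    return {
        "rows_tested": total,
        "rows_failed": total - passed,
        "score_pct": round(passed / total * 100, 2) if total else None,
    }

# Like a rule in the tool, the same check can be "bound" to several
# sources (the source names here are hypothetical):
# for name, df in {"CRM": crm_df, "ERP": erp_df}.items():
#     print(name, score_rule(df["ZIP"], ZIP_RULE))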
Scorecards are the final visualization piece of data quality monitoring. They allow data validation rules to be aggregated to higher levels of analysis: Data Quality Dimensions and Key Data Domains. Data Quality Dimensions are generic categories that rules fall under, such as Quality, Accuracy, and Completeness, and are assigned when a rule is created. Key Data Domains are user-defined areas of focus for which data quality is to be analyzed, and they can be as broad or as granular as desired. The relationship between the two is that Key Data Domains are the highest level at which data quality is measured, and Data Quality Dimensions are a subset of Key Data Domains. Information Steward calculates total scores for each and keeps historical scores for analysis over time as well.

In the screenshot below, a scorecard is broken down into Key Data Domains of Distributors, which I've blurred out because it is customer-sensitive information. Data Quality Dimension scores are broken out in each Key Data Domain panel to show each result. Finally, the Quality Trend line at the bottom of each panel shows the score of the Key Data Domain over time.

Figure 4: Key Data Domains are broken out into separate panels. Quality Dimensions are displayed as well, and an overall Quality Trend for the Key Data Domain appears at the bottom.
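The roll-up logic is easy to picture in a few lines of Python. This is a minimal sketch of the aggregation idea only; the rule names, dimension assignments, and the use of a simple average are all hypothetical, since the weighting Information Steward applies is configured in the tool.

# Hypothetical rule scores keyed by (Data Quality Dimension, rule name).
from statistics import mean

rule_scores = {
    ("Completeness", "ZIP code populated"): 97.5,
    ("Accuracy",     "ZIP code format"):    92.1,
    ("Accuracy",     "State code valid"):   88.4,
}

# Roll rule scores up to their Data Quality Dimensions...
dimensions = {}
for (dimension, _rule), score in rule_scores.items():
    dimensions.setdefault(dimension, []).append(score)
dimension_scores = {d: round(mean(s), 2) for d, s in dimensions.items()}

# ...then average the dimensions into a Key Data Domain score.
domain_score = round(mean(dimension_scores.values()), 2)

print(dimension_scores)   # {'Completeness': 97.5, 'Accuracy': 90.25}
print(domain_score)       # 93.88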
The next screenshot shows the drill-down screen for the right-most Key Data Domain of the previous screenshot. This is accessed by selecting Show more in the upper right of the Key Data Domain panel. Notice that lists of Key Data Domains, Data Quality Dimensions, and Validation Rules all appear with a score breakdown. You can select a value from any of these to display a Quality Trend in the lower left, as Information Steward keeps historical data for all three of these dimensions.

Figure 5: The "Show more..." screen appears with a selection made from the list of Key Data Domains on the left. Data Quality Dimensions and Validation Rules can also be selected on this screen to display the respective data quality trend in the lower left.

One last thing to note is the View failed data button that appears in the lower right panel. Clicking it brings up the rows of data that failed the validation rules for the selected Key Data Domain (or Data Quality Dimension or Validation Rule, if one of those is selected instead). With this button you can see the raw data records and the validation rule(s) they failed directly within the tool. An option to export those records to Excel is also available if further analysis is necessary.
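Conceptually, View failed data is the inverse of the rule score: collect the records that fail the rule and hand them off for analysis. Here is a hedged sketch of that idea in pandas; the DataFrame, column, and file names are illustrative, and the Excel export stands in for the tool's built-in export option.

# Collect rows failing the zip-code rule and export them for analysis.
import re
import pandas as pd

ZIP_RULE = re.compile(r"^\d{5}(-\d{4})?$")

def failed_rows(df: pd.DataFrame, column: str, pattern: re.Pattern) -> pd.DataFrame:
    """Return the records whose value in `column` fails the rule.
    Nulls become the string 'nan' and so are counted as failures here."""
    mask = df[column].astype(str).map(lambda v: not pattern.match(v))
    return df[mask]

# Hypothetical usage (to_excel requires the openpyxl package):
# failures = failed_rows(crm_df, "ZIP", ZIP_RULE)
# failures.to_excel("failed_zip_records.xlsx", index=False)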
In conclusion, the Data Insight module within SAP Information Steward can be a highly influential asset in starting a data governance program or data quality initiative. Its powerful data analysis and data quality monitoring capabilities will go a long way toward building a case for data governance in your organization. In Part 3 of the series, we will discuss the remaining modules of the SAP Information Steward solution: Metadata Management, Metapedia, and Cleansing Package Builder. In subsequent posts we will discuss some case studies where we have used Information Steward in the real world.
Rich Hauser, Senior Business Intelligence Consultant
Decision First Technologies
Richard.Hauser@decisionfirst.com

Rich is a senior business intelligence consultant specializing in Enterprise Information Management. He has delivered customized SAP BusinessObjects solutions for customers of all sizes across a variety of industries. With Decision First Technologies, Rich utilizes SAP Data Services and SAP Information Steward.