UNIVERSITY OF LJUBLJANA
FACULTY OF ECONOMICS

MASTER THESIS

COMPARISON OF SELECTED MASTER DATA MANAGEMENT ARCHITECTURES

Ljubljana, February 2013
Katerina Atanasovska

TABLE OF CONTENTS

INTRODUCTION
   Research problem and purpose of master thesis
   Research goals
   Research methods
1. DEFINITION OF DATA, DATA TYPES, DATA DIMENSIONS AND DATA INCONSISTENCIES
   1.1. DATA TYPES
        Analytical data
        Transactional data
        Master data
        Metadata
   1.2. DATA QUALITY DIMENSIONS
        Intrinsic data quality
        Contextual data quality
        Representational data quality
        Accessibility data quality
   1.3. DATA INCONSISTENCY
   1.4. DATA QUALITY IMPROVEMENT
2. MASTER DATA MANAGEMENT
   2.1. DEFINITION
   2.2. GOALS OF MDM
   2.3. MDM ACTIVITIES
   2.4. BENEFITS FROM MDM
3. MASTER DATA MANAGEMENT SOLUTIONS
   3.1. HISTORICAL REVIEW OF MDM SOLUTIONS
   3.2. FUNCTIONALITIES, CONCEPTS AND ARCHITECTURE
   3.3. ARCHITECTURE OF MDM DESCRIBED THROUGH SELECTED MDM SOLUTIONS
        Microsoft Master Data Services
        SAP NetWeaver
        IBM InfoSphere
        Oracle MDM Suite
4. ANALYSIS OF SELECTED MASTER DATA MANAGEMENT ARCHITECTURES
   4.1. MDM OF SELECTED ARCHITECTURES AND QUALITY DIMENSIONS
   4.2. COMPARISON OF SELECTED ARCHITECTURES THROUGH THE THREE-DIMENSIONAL MODEL
   4.3. COMPARISON OF SELECTED ARCHITECTURES THROUGH THE FIVE MDM ACTIVITIES
5. CASE STUDY OF MDM SOLUTION USED IN STUDIO MODERNA
   5.1. PROBLEMS WITH PRODUCT DATA MANAGEMENT
   5.2. CENTRAL PRODUCT REGISTER (CPR)
        Product statuses
        Product data security
   5.3. BENEFITS OF CPR
   5.4. COMPARISON OF CPR AND SELECTED MDM ARCHITECTURES
   5.5. BUILD VS BUY MDM SOLUTION
CONCLUSION
LIST OF REFERENCES

LIST OF FIGURES

Figure 1: Definition of master data and the master record
Figure 2: Applications used for MDM
Figure 3: Enterprise data
Figure 4: List of data attributes
Figure 5: List of techniques for solving data inconsistencies
Figure 6: Workflow of MDM
Figure 7: The data quality activity levels
Figure 8: MDM Activities
Figure 9: Evolution of IBM MDM applications
Figure 10: Dimensions of master data management
Figure 11: Traditional MDM architecture
Figure 12: MDM architecture with additional published services
Figure 13: MDS data model
Figure 14: Table types
Figure 15: Key mapping during import and export
Figure 16: Logical model
Figure 17: Domain model and physical model
Figure 18: Physical model
Figure 19: Example of field mappings during data import
Figure 20: Example of SSN pattern match
Figure 21: Example of record merge
Figure 22: List of predefined tables for Customer entity
Figure 23: Example of cross reference between PARTIES and SYS_REFERENCE
Figure 24: Example of data validation workflow
Figure 25: Example of data flow in CPR

LIST OF TABLES

Table 1: An example estimating the positive impact of customer MDM
Table 2: Gartner's Magic Quadrant for Data
Table 3: MDS repository objects vs. relational database objects
Table 4: Advantages and disadvantages of MDS
Table 5: Advantages and disadvantages of SAP
Table 6: Advantages and disadvantages of IBM InfoSphere
Table 7: Advantages and disadvantages of Oracle MDM
Table 8: DQ dimension and MDM
Table 9: MDM solutions and three-dimensional model
Table 10: MDM overview through four data management phases
Table 11: CPR solutions for product data management
Table 12: Comparison of MDM architectures and CPR's three-dimensional model
Table 13: Comparison of MDM architectures and CPR's MDM phases
Table 14: Comparison of MDM architectures and CPR's time and cost

INTRODUCTION

Research problem and purpose of master thesis

Most businesses today perform and track their everyday transactions with the help of various information systems. Companies use these systems to automate their business processes, store their data and make further business decisions based on the results produced by various applications. The success of these systems rests not only on the complex processing logic in their backend software, but also on the friendly user interfaces that make such software easy to work with. The development of new and sophisticated information technologies (IT) in the past decade has resulted in the growth and expansion of numerous business solutions on the market. The benefits of this development are seen in improved workflows in many companies. However, the side effects of the fast growth of IT created additional headaches for businesses and again redirected them back to IT vendors in search of solutions. One of the major problems users of such applications are dealing with is the constant growth of dirty data in their systems. There are two reasons why IT is responsible for producing bad data:

1. Trying to get closer to the customer, vendors focused on application design and various business scenarios, neglecting data validations and filters in the overall architecture. This weakened the system's ability to track the content of entered data;
2. Service-oriented architecture (SOA) allows the integration of different applications into one system. Since each application carries its own database, the same data may well be stored in different sources, which automatically produces data redundancy in the system.

Figure 1: Definition of master data and the master record
Source: J. Bracht et al, Smarter Modeling of IBM InfoSphere MDM Solutions, 2012, p. 29

The problem of bad data became most visible and hard to handle when companies started experiencing revenue loss, increased costs, customer complaints, employee frustration and so on.

The statistics below, based on research by Arlbjørn and Haug (2010, p. 294), show the alarming situation companies find themselves in because of poor data quality:

- 88 per cent of all data integration projects either fail completely or significantly overrun their budgets;
- 75 per cent of organizations have identified costs stemming from dirty data;
- 33 per cent of organizations have delayed or cancelled new IT systems because of poor data;
- $611bn per year is lost in the US in poorly targeted mailings and staff overheads alone;
- According to Gartner, bad data is the number one cause of CRM system failure;
- Less than 50 per cent of companies claim to be very confident in the quality of their data;
- Business intelligence (BI) projects often fail due to dirty data, so it is imperative that BI-based business decisions are based on clean data;
- Only 15 per cent of companies are very confident in the quality of external data supplied to them;
- Customer data typically degenerates at 2 per cent per month or 25 per cent annually;
- Organizations typically overestimate the quality of their data and underestimate the cost of errors;
- Business processes, customer expectations, source systems and compliance rules are constantly changing.

Working as a database analyst in Studio Moderna, I deal with examples of bad data every day. Duplicates, misspellings and missing values are some of the irregularities that appear in customer databases. It is very hard to work on statistics and analysis knowing that the numbers contain duplicates, but the large data volumes and the ever-present time constraints do not allow you to go through and cleanse what is considered obsolete in such cases. In the end, the picture you present for the requested business scenario may be irrelevant for the time being; not because of incorrect query statements or miscalculations, but because of the content of the data involved in the process. It is very frustrating for anyone working on data analysis to spend time hunting for an error, trying to find the reason for mismatching results, only to discover that it is just another misspelled name or missing address.

There are several techniques that help solve problems with bad data, among them data mining, data cleansing, data profiling and data governance. Depending on the tasks they perform, these techniques are divided into four major groups: techniques to clean, consolidate, govern and share data. Today, all of them fall under Master Data Management (MDM), a discipline that brings together any method, technique or technology that deals with data quality improvement.
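To make the irregularities described above more concrete, the sketch below shows the kind of check a first data-profiling pass performs over a customer table: it looks for missing required values and for records that are probably duplicates of one another. This is only an illustration; the field names and sample rows are hypothetical, and the matching rule (a normalized e-mail address) is deliberately naive.

```python
# A minimal data-profiling sketch: flag missing values and likely duplicate
# customers in a small record set. Field names and sample rows are
# hypothetical; real profiling tools use far more sophisticated matching.
from collections import defaultdict

customers = [
    {"id": 1, "first_name": "Ana",   "last_name": "Novak", "email": "ana.novak@example.com"},
    {"id": 2, "first_name": "Anna",  "last_name": "Novak", "email": "Ana.Novak@example.com"},
    {"id": 3, "first_name": "Marko", "last_name": "",      "email": None},
]

required = ("first_name", "last_name", "email")

# Completeness check: which required fields are empty?
for row in customers:
    missing = [field for field in required if not row.get(field)]
    if missing:
        print(f"record {row['id']} is missing {missing}")

# Duplicate check: group records by a simple, normalized match key.
by_email = defaultdict(list)
for row in customers:
    key = (row["email"] or "").strip().lower()
    if key:
        by_email[key].append(row["id"])

for key, ids in by_email.items():
    if len(ids) > 1:
        print(f"records {ids} look like duplicates (shared e-mail '{key}')")
```

In practice a match key would combine several attributes (name, address, date of birth) and use fuzzy comparison, which is exactly where the techniques grouped under MDM come in.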

In many cases throughout the literature, MDM is defined as software for improving data quality, but Master Data Management covers a much broader area than that. In a more formal definition given by Mauri and Sarka (2011, p. 16), MDM is a set of coordinated processes, policies, tools and technologies used to create and maintain accurate master data. There is no single tool for MDM; anything that is used to solve a data quality issue falls under the category of MDM. For example, running nightly data cleansing procedures, defining table constraints to check inserted data, or defining table users and permissions can all be considered managing data. Master data is singled out in this discipline because it represents the core data of every enterprise and needs to be correct and precisely maintained in systems, so that the company can work with the lowest possible number of data issues. In addition to this strategy, vendors have developed sophisticated MDM software solutions in which they implemented numerous techniques for improving data quality. These solutions are designed for large, medium and small enterprises. Entire software suites are appropriate for larger companies that work with great amounts of data. Another example where MDM suites are used are companies that extended their business through mergers or acquisitions and are confronting problems of bad data created by introducing new systems into their existing environment. Individual modules of the suites are appropriate for medium and small companies, where certain MDM applications are used for analysis and data cleaning.

There is a significant number of established vendors who offer Master Data Management products. D&B/Purisma Data Hub, DataFlux MDM, Data Foundations OneData, i2 MDM, IBM InfoSphere MDM Server, Initiate Systems Master Data Services, Kalido MDM, Liaison Technologies MDM, Microsoft MDM, Oracle Customer Data Hub, Oracle Hyperion DRM, Oracle UCM, Orchestra Networks EBX, SAP NetWeaver MDM, Siperian MDM Hub, Sun MDM Suite/Mural, Teradata MDM, TIBCO CIM and VisionWare MultiVue are just part of the list of MDM applications. Considering that MDM is a fairly new technology that has been establishing itself on the market over the past 10 years, it is hard to decide which of the listed products could be the best solution for a given organization.

Market reviews predict a bright future for MDM vendors. The aggregate MDM market will grow from US$2.8 billion to US$4 billion over the forecast period, including revenues from both MDM packaged solutions and implementation services as well as the billion-plus dollars related to data service providers such as Acxiom and Dun & Bradstreet. The aggregate enterprise MDM market (customer and product hubs, plus systems implementation services) totaled US$730 million at YE2007 and will reach US$2 billion by the end of the forecast period. Software sales are but one portion, as MDM systems integration services reached US$510 million alone during 2007 and are projected to exceed US$1.3 billion per year by 2012 (Zornes, 2009, p. 3).

Despite these predictions, the majority of companies still favor an in-house solution over packaged MDM software. In 2006, Ventana Research surveyed 515 companies on this

matter. Their findings were that only 4% of the interviewed companies had completed their MDM implementation project, 7% were still in the implementation phase and 33% had a project in progress. Less than half of these companies have some kind of packaged software, whereas 20% have their own in-house developed solution. Nearly half of them are considering implementing some data governance tool, but only 24% are planning to do so some time in the future (Smith, 2006). Similar numbers were recently confirmed by Messerschmidt and Stuben (2011, p. 5). They interviewed 49 companies from 12 different countries and eight industries, including small and large businesses. The numbers showed that most of these companies are willing to implement MDM software but are still using their own in-house MDM solutions. Figure 2 represents the answers companies gave regarding the MDM application they use; most answered that they still use in-house development instead of packaged software.

Figure 2: Applications used for MDM
Source: M. Messerschmidt & J. Stuben, Hidden Treasure, 2011, p. 33

From the various statistics presented earlier, it seems that the majority of organizations are looking into implementing some kind of MDM tool, but are still not quite ready for the packaged software available on the market. When an organization has a certain budget to invest in a technological upgrade, it strives to make the best decision money can buy. That decision introduces the problem of this master thesis: the never-ending debate over packaged vs. custom-built solutions. The problem is examined through the architectures of four already established vendors, Microsoft, IBM, SAP and Oracle, and through a case study of a custom-built solution developed for the requirements of Studio Moderna. The thesis is structured in two parts. The first part explores the problem of bad data, discussed through some general concepts of data, data quality, standards for quality data and possible causes of data inconsistencies. The second part covers the purpose of my thesis, which is an analysis of the data management process implemented in selected MDM software solutions offered on the market, and of how their MDM architecture assists in improving data quality. This analysis is made by researching and comparing different MDM architectures and the way they perform data modeling, validation, import and

export of data, and the security of the system. The MDM software solutions compared in this thesis are Microsoft Master Data Services, SAP NetWeaver MDM, IBM InfoSphere and Oracle MDM Suite. There are many vendors who offer MDM solutions, but I chose these four because they are already known for their database management systems as well as for many business intelligence (BI) tools. In addition to these four products, I included one custom-made solution called Central Product Register (CPR), developed for product data management in Studio Moderna.

Research goals

The goal of my work is to create a comparison model for the products of the selected MDM vendors. This model discusses domain, method of use and implementation style as the main dimensions that characterize each MDM system. I will also discuss in more detail some of the techniques used to consolidate, cleanse, govern and share data. The comparison describes how each vendor understands and implements data management in its solution. It also highlights the advantages and disadvantages of each product, and tries to find out whether implementing such a solution really benefits the business or whether it is just another fancy application for better organization and viewing of data that does not actually solve the core problem of data quality. The custom-built solution is included as a case study to show how master data management is understood within a company, and to describe the company's attempt to solve the problem without help from an off-the-shelf product. Discussing the company's internal data management introduces another goal of this thesis: to show that users should not rely on MDM software as the only path to quality data. In most cases the problem should be examined much more deeply, not in the data itself but in the sources that produce the data, whether that is a user or an application. Often the problem lies in a lack of knowledge of or experience with business processes and the company's workflows. In such cases, MDM products can improve the data and solve current issues, but they are not a long-term solution, because the problem exists elsewhere and sooner or later it will produce bad results again. If proven, this finding can be very useful, because it can save users the time and money spent on purchasing and implementing software that was not the right choice in the first place.

Research methods

There are two research methods used in this thesis:
1. Comparative analysis;
2. Case study of the in-house MDM solution in Studio Moderna.

All of the data and statistics used in this paper are collected from different literature and publications, so only secondary data from other sources is used. Since this topic compares products from a technical point of view, I chose literature, white papers and technical notes as the most appropriate sources for my thesis. Based on the title of my research, the most suitable method for the comparison is comparative analysis itself. This method is used when

researching the collected materials and creating a general summary of the four different MDM solutions. The other method used in this research is a case study of the in-house MDM solution built for Studio Moderna's needs. I consider this case study a suitable example because it deals with the problem of data quality and covers data management processes, the same as the four products sold by Microsoft, Oracle, IBM and SAP. It also extends the discussion of data management to business process changes and workflow restructuring, not just data cleansing and governance as options for data quality improvement. In addition to these two methods, I also used unstructured interviews. The following people were interviewed:

- Mr. Bostjan Kos, Information Management Client Technical Professional at IBM, Slovenia;
- Tadej Zajc, Sales Representative at Oracle, Slovenia; and
- Sasa Strah, project lead for the Central Product Register solution in Studio Moderna.

These were informal interviews conducted over e-mail, and contained questions regarding the Master Data Management products that the representatives listed above work with, as well as their experience with customers who use their software.

1. DEFINITION OF DATA, DATA TYPES, DATA DIMENSIONS AND DATA INCONSISTENCIES

1.1. Data types

Data is part of our everyday life. Words, numbers, dates and pictures are all examples of data. 'Data' represents a collection of unorganized facts, which can be organized into useful information. 'Processing' refers to a group of actions which convert inputs into outputs. The series of operations performed to convert unorganized data into organized information is called data processing, and it includes resources like people, procedures and devices to convert data into information (Minimol & Sarngadharan, 2010, p. 85). As introduced earlier in this thesis, the main goal of MDM is to improve data quality in organizations, therefore organizational (enterprise) data will be discussed further in this thesis. Enterprise data represents all the inputs that are produced, processed and stored in an enterprise. It can be used in different business scenarios and for different purposes of the company. For easier management, enterprise data is divided into three categories: analytical, transactional and master. This grouping is based not on the content or format of the data, but on the different ways the same data is used. There is no strict rule that splits data and places it in one of these three categories. One record can be defined as analytical or transactional data depending on the way it is used in a given business scenario, as the sketch below illustrates.
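The next paragraph develops this point with sales data; the short sketch below makes the same point in code, assuming a hypothetical set of sales rows: kept row by row they are transactional data, aggregated per month they become analytical data.

```python
# The same hypothetical sales rows serve two purposes: row by row they are
# transactional data, aggregated per month they become analytical data.
from collections import defaultdict
from datetime import date

sales = [  # transactional view: one row per business event
    {"order_id": 1001, "sku": "TOP-1", "amount": 59.90, "day": date(2012, 11, 5)},
    {"order_id": 1002, "sku": "DOR-3", "amount": 24.50, "day": date(2012, 11, 5)},
    {"order_id": 1003, "sku": "TOP-1", "amount": 29.95, "day": date(2012, 12, 6)},
]

# Analytical view: revenue per month, the kind of figure a BI report would show.
revenue_per_month = defaultdict(float)
for row in sales:
    revenue_per_month[(row["day"].year, row["day"].month)] += row["amount"]

print({month: round(total, 2) for month, total in revenue_per_month.items()})
# {(2012, 11): 84.4, (2012, 12): 29.95}
```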

For example, sales data can be seen as transactional data representing the daily sales transactions in a company. On the other hand, sales can also be used for analytical purposes, to present the sales status of an organization for a certain time period. Such an example puts sales data in two groups, transactional and analytical, depending on the way it is used in a given situation.

Figure 3: Enterprise data
Source: An Oracle White Paper on Master Data Management, 2011, p. 4

Analytical data

Analytical data is used to provide a general picture of a company's work. It is the end result of statistics, analysis or other calculations performed over collected inputs. Its main use is to show the business situation in a given time period. It is usually stored in the business intelligence (BI) part of the company's system and is shown in reports, OLAP cubes, graphs etc. Examples of analytical data are a client demographics overview, yearly profit and loss, or any summary results collected at the global enterprise level. It helps in making major business decisions and often determines the course of the company's progress.

Transactional data

Transactional data represents records which refer to transactions in a system. Transactions are activities related to business events, for example payments, sales, creation of a new account or creation of a new student record; in other words, any change related to an object at a given time. Compared to analytical data, this type is much more detailed and it tracks and records every new insert, update or delete in a system. That is why the amount of transactional data increases every day, proportionally with the growth of the number of transactions. Even though analytical and transactional data are opposite and completely different categories, they still cannot function without one another. It is very hard to review the numerous transaction records created on a daily basis, so analytical data is used to summarize transactional data and provide the final number of daily changes in the system. On the other hand, we can always examine anomalies in analytical data by reviewing each transactional record behind those numbers.

Master data

Master data is the core data of each enterprise and contains detailed information about its main domains. Since every enterprise is engaged in a different business, it has different domains as well. Examples would be customer, product, location etc. Master data can be categorized according to the kinds of questions a user will address; three of the most common questions - Who?, What?, and How? - return the most common domains: party, product, and account. Each of them represents a class of things - for example, the party domain can represent any kind of person or organization, including customers, suppliers, employees, citizens, distributors, and organizations. Similarly, the product domain can represent all kinds of things that companies sell or use - from tangible consumer goods to service products such as mortgages, telephone services, or insurance policies. The account domain describes how a party is related to a product or service that the organization offers. What are the relations of the parties to this account, and who owns the account? Which accounts are used for which products? What are the terms and conditions associated with the products and the accounts? And how are products bundled? (Dreibelbis, Hechler, Milman, Oberhofer, van Run & Wolfson, 2008, p. 14). However, this grouping cannot be taken as a general rule that all companies apply. Depending on the business, its rules and its logic, each enterprise has its own master data objects defined for the needs of the company. Based on some of the domains given above, examples of master data would be a customer's date of birth, gender, name and address, or a product's name, SKU, price, supplier etc. Master data is entered once in the system and changes only on rare occasions. Because the business relies on this information, it is very important to maintain its consistency through time. It is a problem for a company to lose sales records for a customer, but it would be even more critical to lose the personal or contact data of that same customer. Managing this type of data is discussed later in the thesis.

Metadata

Another group of enterprise data worth mentioning is metadata. "Data about data" is the well-known definition of metadata found throughout the literature. However, metadata has a much broader value and meaning for the enterprise, especially when Master Data Management is discussed. Metadata helps the enterprise relate the correct information to the appropriate business terms. For example, it helps in differentiating concepts with similar meanings, like client, customer and buyer. There are two types of metadata: (1) semantic and (2) syntactic (Sheth, 2003).

13 (1) Semantic metadata describes contextually relevant or domain-specific information about content (in the right context) based on an industry-specific or enterprisespecific custom meta data model or ontology; (2) In contrast, syntactic metadata focuses on elements such as size of the document, location of a document or date of document creation that do not provide a level of understanding about what the document says or implies. Another categorization of metadata is based on its type of usage. In this case there are three broad categories (Berson and Dubov, 2007, p. 129): (1) Business metadata includes definitions of data files and attributes in business terms. It may also contain definitions of business rules that apply to these attributes, data owners and stewards, data quality metrics, and similar information that helps business users to navigate the information ocean. (2) Technical metadata is created and used by the tools and applications that create, manage, and use data. Technical metadata typically includes database system names, table and column names and sizes, data types and allowed values, and structural information such as primary and foreign key attributes and indices. (3) Operational metadata contains information that is available in operational systems and run-time environments. It may contain data file size, date and time of last load, updates, and backups, names of the operational procedures and scripts that have to be used to create, update, restore, or otherwise access data, etc. Based on the definitions and categorization of metadata I can conclude that this type of enterprise data supports MDM in two ways: (1) It contains background information for the context and technical properties of data, which helps MDM in more precise data modeling, and also appropriate mapping of data with master domains; (2) It sets general data rules for business and technical definitions of data, which supports data standardization, another process in managing master data Data quality dimensions Companies need to be acquainted with data quality standards, so they can easily detect deficiencies in their data. In my opinion, quality in general associates to how much we can expect to gain from something and how reliable or useful that is. With this being said, data quality shows how much information we can gain from given data and how reliable that information is for us as users. Classic definition found in literature defines data quality as fitness for use, i.e. the extent to which some data successfully serves the purposes of users. (e.g. Tayi and Ballou, 1998; Cappiello et al., 2003; Lederman et al., 2003; Watts et al., 2009) 9

14 Defining data quality is very subjective and is not seen equally by everyone. Some users may consider data very reliable, whereas others may argue that there are still improvements to be done. To avoid such opposite views, literature sets some common standards for data quality defined through data dimensions. Data dimensions define data quality as multidimensional concept and help in determining data s fitness for use. According to Strong and Wang (1996, p. 6), data quality dimensions are set of data quality attributes that represent a single aspect or construct of data quality. Attributes are characteristics of data and the easiest way to define them is when answering simple data related questions. For example, the question Which data is duplicated? returns uniqueness as an attribute, or What data is incorrect? imposes accuracy, and so on. The table below lists several questions for determining data attributes. Figure 4: List of data attributes Source:R. Hillard.Information-Driven Business : How to Manage Data and Information for Maximum Advantage, 2010, p. 136 There are many attempts in the literature to determine which attributes are most important and best define data quality. For example, Strong and Wang (1996, p. 7) took (1) intuitive, (2) theoretical and (3) empirical approach to find out what are the most important data characteristics. (1) Intuitive approach is based on authors intuitive understanding of importance of attribute. They take the freedom of choosing which attributes are most important to define data quality and in this case researchers don t question what data attributes are important for system users; (2) Theoretical approach on the other hand, does not rely on researcher s subjectivity, seen in the previous example, and defines data characteristics based on data deficiencies that can be found in a system. Data attributes are defined in reverse connotation, based on data deficiencies one system has. If there is duplicate data in the system for example, then uniqueness is the attribute that is missing and it s crucial for quality data. This approach, same as the previous example, doesn t consider what data attributes are important for users; 10

15 (3) In the third empirical approach, data quality is defined in terms of data attributes that are only important for system users. Even though this approach tries to be as objective as it can and use general opinion of consumers, still the final results can be very diverse and inconsistent because of the different opinions collected by different people. The difficulty in this approach is setting some basic rules upon which dimensions would be compared. General conclusion from all these approaches is that there are no strict rules or certain attributes that define data quality. Data quality dimensions are relative to user requirements and often times these requirements are subject to change, therefore priority and importance of data quality dimensions can change as well. Wang and Strong (1996, p. 21) used a two-stage survey and a two-phase sorting study to develop hierarchical framework that consolidates 118 data-quality attributes collected from data consumers into fifteen dimensions, which in turn are grouped into following four categories, each focusing on a key issue: 1. Intrinsic - What degree of care was taken in the creation and preparation of information? ; 2. Contextual - To what degree does the information provided meet the needs of the user? ; 3. Representational - What degree of care was taken in the presentation and organization of information for users? ; 4. Accessibility - What degree of freedom do users have to use data and to define and/or refine the manner in which information is entered, processed, or presented to them? Intrinsic data quality Intrinsic data quality, according to Wang et al (2005, p. 7), implies that information has quality in its own right. Attributes in this category show how truthful and real data describes objects around us. This group refers to data that comes along with the object and don t change due to some requirements. For example, name of a person is given as it is, and doesn t change because of some business requirements. Same as person s weight, height or eye color. Such values are intrinsic for person and the only anomalies that are found with this data are NULLs or badly formatted data. So, quality in this case is measured in the existence and correctness of the input, not whether it satisfies business needs. Below is a list of the most commonly used dimensions along with their definitions (Kahn, Strong and Wang, 2002, p ): - Believability - The extent to which data are accepted or regarded as true, real and credible; - Accuracy - The extent to which data are correct, reliable and certified free of error; - Objectivity - The extent to which data are unbiased (unprejudiced) and impartial; 11

16 - Reputation - The extent to which data are trusted or highly regarded in terms of their source or content Contextual data quality Contextual data dimensions highlight the requirements that information quality should be considered within the context of the task at hand (Wang et al, 2005, p. 7). Based on category s name, dimensions define how precise data captures the context of business objects. If again I take person as example and his address as representative data element, what I will be interested in is if this is the only address that can be assigned to the person, and if this address is current for the time being. In order to improve quality in contextual terms, every business needs to increase the amount of data related to its business objects, and update them in appropriate time, to avoid old and obsolete information in the system. Following are some dimensions that are defined in this group (Kahn, Strong and Wang, 2002, p ): - Value-added - The extent to which data are beneficial and provide advantages from their use; - Relevancy - The extent to which data are applicable and helpful for the task at hand; - Timeliness - The extent to which the age of the data is appropriate for the task at hand; - Completeness - The extent to which data are of sufficient depth, breadth, and scope for the task at hand; - Appropriate amount of data - The extent to which the quantity and volume of available data is appropriate Representational data quality Representational data dimensions address the way computer systems store and present information (Wang et al, 2005, p. 8). This category is explored more from technical rather than content aspect. Data quality in this case depends on how well data model and business logic are integrated in systems. If database model is well designed then business objects are represented by correct and unique data. Otherwise, there are duplicates, orphan records, obsolete data that just use database space and have no use in particular. In order to meet these data dimensions correctly, companies need to focus on technical planning and development of their information systems. Representational data quality category includes the following dimensions(kahn, Strong and Wang, 2002, p ): - Interpretability - The extent to which data are in appropriate language and units and the data definitions are clear; - Ease of understanding - The extent to which data are clear without ambiguity and easily comprehended; - Representational consistency - The extent to which data are always presented in the same format and are compatible with previous data; 12

17 - Concise representation - The extent to which data are compactly represented without being overwhelming (i.e., brief in presentation, yet complete and to the point) Accessibility data quality Accessibility data quality is another category that defines dimensions from technical perspective. This multidimensional nature of information quality means that organizations must use multiple measures to fully evaluate whether their data are fit to use for a given purpose by a given consumer at a given time (Wang et al, 2005, p. 8). The ability of today s systems to serve multiple users at the same time in many occasions can cause erroneous data. Duplicates, overwriting of important information, database changes are some of the risks that systems undertake in their every day usage. In order to lower such risk, companies spent some quality time building security model and limit the access to system s data. Data dimensions of this type are not defined by the content of data, but by the system s security model. There are two known dimensions from accessibility group (Kahn, Strong and Wang, 2002, p ): - Accessibility - The extent to which data are available or easily and quickly retrievable; - Access security - The extent to which access to data can be restricted and hence kept secure 1.3. Data inconsistency Data inconsistencies are irregularities found in data, such as: duplicates, misspellings, undefined values. They are the bad data in systems. Any data that is obsolete, incorrect, and unuseful falls into this category. Bad data can have tangible and intangible effect for a business. According to some older researches by Haapasalo et al (2010, p. 147), it is estimated that incorrect data in retail business costs alone $40 billion annually and at the organizational level, costs are approximately 10 percent of revenues. It is said that the decisions company makes are no better than the data on which they are based and better data leads to better business decisions. Looking into the intangible consequences, data inconsistencies also cause mistrust in existing data. Working with different data versions for the same enterprise object is time consuming, requires additional work for tracking the errors, and causes frustration among employees. Incorrect data cannot give accurate picture for a business, and it cannot help in bringing the right business decision for future progress and success. Two factors play major role in producing bad data: human factor and system design. Human errors occur every day, usually on input or during various calculations. From my personal experience, most of the work I do is data analysis, and high number of errors I see are misspellings or wrong data imported into inadequate data fields. Database cannot track if 13

18 customer s last name was entered in the field for first name or vice versa therefore, erroneous data is produced, unless detected on time and corrected at that moment of input. System design is another reason for producing bad data. Wand and Wang (1996, p. 91) discuss four states of design deficiencies in systems: (1) incomplete, (2) ambiguous, (3) meaningless and (4) incorrect. These states are based on deficiencies that appear when user definitions (what users expect to see in the system) are improperly mapped to the system s values. (1) Incomplete state occurs when there is no system value to represent user definitions. This state can lead to inaccurate and incomplete data; (2) Ambiguous state appears when two or more user definitions are represented by same value. In this case precision and accuracy are affected; (3) Meaningless state produces irrelevant data which can t be used for any of the requirements. It s an orphan value that stays in the system and it s not used. This state may not have immediate effect on data, but in future may lead to ambiguity or incorrectness if new user definition is required and it happens to map to that same orphan value; (4) Incorrectness appears when data refers to the wrong user definition. Therefore data is incorrect and unreliable. Data issues can be of technical or business character. Technical data issues refer to data structure and representation. An example of such technical errors would be (Gryz and Pawluk, 2011,p. 3): - Different or inconsistent standards in structure, format, or values; - Missing data, default values; - Spelling errors; - Data in wrong fields; - Buried information in free-form fields. Business issues, on the other hand, are unique for each organization. They refer to the context of the data and appear as a result of incorrect representation of business terms and relations. For example, address for one customer is entered as home instead of work, or person is related to transactions that he never committed. It s hard to define some general list of business data inconsistencies that will cover irregularities of data for all organizations as it was the case with technical issues. Best way to detect and define such business errors, is through data analysis, which will reveal if the entered data corresponds to the defined business concepts. Despite the fact that bad data lowers data quality and produces incorrect information, in many cases it s an advantage and can predict what changes need to be undertaken to improve the work of systems. As seen from the previous chapter, another way to explore data quality is through data deficiencies that can be found in systems. Based on this approach, existence of 14

data inconsistencies can predict the missing factors for data quality. Data errors are the starting point for solving the problems that produce them. Once these problems are detected, appropriate measures can be used to fix them and improve data quality in the system.

1.4. Data quality improvement

Data quality improvement is a systematic process that occurs in several phases. It starts with looking for the source of the problem, continues through cleansing of the errors, and ends with setting data standardization rules that will prevent future problems. Data quality improvement executes in the following order (Rivard et al, 2009, p. 62), sketched in code at the end of this subsection:
(1) data profiling - analyzes data to find inconsistencies, data redundancy and incomplete information;
(2) data cleansing - corrects, standardizes and verifies data;
(3) data integration - semantically links data; reconciles, merges and associates;
(4) data augmentation - improves data by using internal and external sources, and removes duplicates;
(5) data monitoring - monitors and checks the integrity of data over time.

Figure 5: List of techniques for solving data inconsistencies
Source: F. Rivard et al, Transverse Information Systems: New Solutions for IS and Business Performance, 2009, p. 62

There are various tools that support the DQ improvement stages listed in Figure 5; leaders among them are Informatica, SAP, IBM and SAS/DataFlux (Gartner, 2012, p. 2). The DQ improvement phases are unified in master data management. Its concepts and goals will be discussed in the second part of this thesis.
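As a rough illustration of how these five phases chain together, the sketch below runs a toy record set through profiling, cleansing, integration, augmentation and monitoring. The record layout, the match key and the rules are hypothetical, and each step is reduced to a few lines; commercial DQ tools implement each phase far more thoroughly.

```python
# Toy walk-through of the five DQ improvement phases (after Rivard et al):
# profile, cleanse, integrate, augment, monitor. All rules and field names
# are hypothetical and deliberately simplified.
def profile(records):
    """Profiling: measure how incomplete the data is (missing phone numbers)."""
    return sum(1 for r in records if not r.get("phone"))

def cleanse(records):
    """Cleansing: standardize customer names (trim whitespace, title case)."""
    return [{**r, "name": (r.get("name") or "").strip().title()} for r in records]

def integrate(records):
    """Integration: link records describing the same customer via a match key."""
    groups = {}
    for r in records:
        groups.setdefault((r["name"], r.get("zip")), []).append(r)
    return groups

def augment(groups):
    """Augmentation: merge each group, keeping the first non-empty value per field."""
    golden = []
    for rows in groups.values():
        merged = {}
        for row in rows:
            for field, value in row.items():
                if value and not merged.get(field):
                    merged[field] = value
        golden.append(merged)
    return golden

def monitor(records):
    """Monitoring: recheck a simple integrity rule (every record has a name)."""
    return all(r.get("name") for r in records)

customers = [
    {"name": " ana novak ", "zip": "1000", "phone": ""},
    {"name": "Ana Novak",   "zip": "1000", "phone": "+386 1 555 0100"},
]
print("missing phones:", profile(customers))      # missing phones: 1
golden = augment(integrate(cleanse(customers)))
print(golden, "integrity ok:", monitor(golden))   # one merged record, True
```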

20 2. MASTER DATA MANAGEMENT 2.1. Definition The problem of bad data is well known to every company. There aren t any enterprises that have perfect data without errors, therefore, they are constantly trying to improve data quality and prevent their systems from further data inconsistencies. Earlier I discussed four types of enterprise data: transactional, analytical, master and metadata. All of these types are equally important in every organization, but core data that describes organizational business domains is master data. Therefore management of this type of enterprise data (master data) will be discussed further in the thesis. There are number of data stewards, administrators of databases, software architects, business analysts, who work with different software platforms, data methods and techniques used for data cleansing and governance. All these people, software and methods that are involved in solving master data errors are united in a discipline called Master Data Management (MDM). Often Master Data Management (MDM) is defined as software package for improving data quality. But in fact, MDM is much more than just a software application for data cleansing. It is special IT discipline that includes people, software tools, and business rules for managing master data. Different literatures share same views on what MDM is. For example, Berson and Dubov (2010, p. 79) define MDM as framework of processes and technologies aimed at creating and maintaining an authoritative, reliable, sustainable, accurate and secure data environment that represents a single and holistic version of the truth, for master data and its relationships, as well as an accepted benchmarks used within an enterprise as well as across enterprises and spanning a diverse set of applications, lines of business, channels, and user communities. Loshin (2008, p. 8) defines MDM as a collection of best data management practices that orchestrate key stakeholders, participants, and business clients in incorporating the business applications, information management methods, and data management tools to implement the policies, procedures, services, and infrastructure to support the capture, integration, and subsequent shared use of accurate, timely, consistent, and complete master data. Figure 6: Workflow of MDM Source: D. Loshin, MDM -Paradigms and Architectures, 2008, p. 9 In other words, MDM is developed to improve, maintain and govern company s master data following the business rules of that enterprise. 16

Even though there are three main types of enterprise data, MDM's main concern is master data. This does not underestimate the significance of the other two types, analytical and transactional; the choice is made because every company's business processes are designed and developed around master data. Master data holds information about the key objects of every enterprise. There has always been a need for MDM, but in recent years the interest has been growing constantly, especially in large and complex companies. Many reasons can be found for this urge for new management standards for data quality, for example: (1) lines of business, (2) mergers and acquisitions and (3) new packaged software (Dreibelbis et al, 2008, p. 6-11).
(1) Lines of business - What these reasons have in common is that they bring additional data into the system, which in many cases is a different version of already existing data. Lines of business, for example, create different modules in the same enterprise and each module functions independently. They work with the same master business domains, but each line of business keeps its own master data for the common enterprise objects. In a sales company, customers can make purchases through different channels such as store, online, catalogue etc. If each sales channel represents a different line of business, then several versions of the same customer may be created, one for each sales channel.
(2) Mergers and acquisitions - It is very common nowadays for one company to purchase another, or for two companies to merge their business and become a large enterprise. In such cases, master domains from both companies are included in the new business. The same problem as in the first example can show up here. Even though I am taking the example of large businesses that may work with different sets of customers, there can still be a group of people stored in both systems. With the merging of two data storages, duplicate data is automatically created.
(3) Packaged software - As a result of SOA architecture and all the independent platforms on the market, companies often decide to use different applications for different business processes. They can use Enterprise Resource Planning (ERP) software for managing their sales, purchases and stock, or Customer Relationship Management (CRM) software to manage their customers. In both cases there needs to be some connection between these different applications, so they can communicate and share the same data for the key objects of the company. MDM is the link in this case.

Among all the existing ERP, CRM and SCM solutions, the question often comes up of why companies need another management tool when there are already so many on the market. Why can't the existing management solutions, which have been on the market long before MDM appeared, solve the problems just explained? The answer to this question is described in the following four factors (Loshin, 2008, p. 13):
(1) Despite the recognition of their expected business value, to some extent many of the aspects of these earlier projects were technology driven and the technical challenges

22 often eclipsed the original business need, creating an environment that was information technology centric. IT-driven projects had characteristics that suggest impending doom: large budgets, little oversight, long schedules, and few early business deliverables; (2) MDM s focus is not necessarily to create yet another silo consisting of copies of enterprise data (which would then itself be subject to inconsistency) but rather to integrate methods for managed access to a consistent, unified view of enterprise data objects; (3) These systems are seen as independent applications that address a particular standalone solution, with limited ability to embed the technologies within a set of business processes guided by policies for data governance, data quality, and information sharing; (4) An analytical application s results are only as good as the organization s ability both to take action on discovered knowledge and to measure performance improvements attributable to those decisions. Most of these early projects did not properly prepare the organization along these lines. From all that was stated above, MDM is no new technology or approach for improving data quality but some standardization of a workflow for data management, something that wasn t formally defined before. There were data stewards, data management methods used in different systems, times and places, but they didn t belong to any category, even though were doing the same job which was data integration, cleansing, governance and sharing. MDM is now this category which expands with every new master data management method that is defined. Considering the serious role it has in governing master data, MDM has yet to develop and prove as efficient tool for data quality improvement Goals of MDM Most of the literature researches refer to creation of single source of trust for master data to be the main goal of MDM. Yang (2005, p. 3), for example, stated that the main goal of MDM is to allow unrelated applications to share a common pool of synchronized data. Per Berson and Dubov (2007, pg. 3), the focus of MDM is to create an integrated, accurate, timely and complete set of data needed to manage and grow business. Other than this goal for golden record, MDM focuses on lowering cost and complexity through standards, and supporting business intelligence and information integration (Otto, 2011, p. 2). Some of the most important goals of MDM include (Mauri and Sarka, 2011, p. 17): - Unifying or at least harmonizing master data between different transactional, or operational systems; - Maintaining multiple versions of master data for different needs in different operational systems; 18

23 - Integrating master data for analytical and CRM systems; - Maintaining history for analytical systems; - Capturing information about hierarchies in master data, especially useful for analytical applications; - Supporting compliance with government prescriptions (e.g., Sarbanes-Oxley) through auditing and versioning ; - Having a clear CRUD process through prescribed workflow; - Maximizing Return Of Investment (ROI) through re-usage of master data. MDM goals in the list above can be summarized into two main goals one that strives to cleanse the data and another goal that tries to maintain the data clean. This being said, goal of MDM is a two-step process that helps increasing data quality. The first step in achieving this goal is to review, organize and cleanse existing data. The second step is its maintenance and governance. As stated in the definition by Mauri and Sarka (2011, p. 17), MDM has a list of numerous goals that need to be accomplished, therefore defining MDM just as creating single version of data for enterprise key objects, is a partial explanation which doesn t cover the whole issue of data quality. This may be the final point that needs to be accomplished when MDM is implemented for the first time, but data management doesn t stop here. Knowing that data changes on daily basis, one time data reorganization and cleansing doesn t solve the problem for bad data because with the new data load, previously mentioned problem may reappear again. What one company needs is some long-term solution for its bad data problem and this is achieved in the second step of the MDM goal realization which is constant governance of data quality standards. MDM is a long-term solution for keeping enterprise data quality on satisfactory level. Each MDM project should strive to achieve the top data quality level and show proactiveness towards managing data. Creating single source of data and governing with the same, provides flexibility for one organization to grow and increase its information pool without confronting issues of redundant data. Figure 7: The data quality activity levels Source: H. Haapasalo et al, Managing one master data challenges and preconditions, 2011, p
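The second step, keeping master data clean once it has been consolidated, can be pictured as validation applied at the moment a new record arrives, rather than as periodic clean-up afterwards. The sketch below is only an illustration of that idea; the register, the rules and the field names are hypothetical.

```python
# Illustration of ongoing governance: a new master record is checked against
# simple rules before it is accepted into a (hypothetical) central register.
master_register = {"ana.novak@example.com": {"name": "Ana Novak"}}

def validate_new_customer(record, register):
    """Return the list of rule violations; an empty list means the record may be stored."""
    errors = []
    if not record.get("name"):
        errors.append("name is required")
    email = (record.get("email") or "").strip().lower()
    if "@" not in email:
        errors.append("e-mail address looks invalid")
    elif email in register:
        errors.append("a master record with this e-mail already exists")
    return errors

print(validate_new_customer({"name": "", "email": "Ana.Novak@example.com"}, master_register))
# ['name is required', 'a master record with this e-mail already exists']
```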

In order to accomplish the desired goal, MDM should have a business focus instead of a technology focus (Loshin, 2009). The main concern of MDM is master data, the type of data that defines the business in each enterprise; technology in this case is just a tool for realizing the management, whereas business standards are the core issues MDM should deal with. In addition to this, Smith and McKeen (2008) have defined four prerequisites for successful MDM: (1) developing an enterprise information policy, (2) defining business ownership, (3) data governance and (4) the role of IT systems. In this list, only the last prerequisite includes IT as a requirement; the other three points are all business focused.

MDM Activities

MDM provides the following activities to accomplish the goals discussed: (1) profile, (2) consolidate, (3) govern, (4) share and (5) leverage. These five categories contain different methods, techniques and tools that support the activities appropriate for each of the groups (An Oracle White Paper, 2011, p. 14).
(1) Profile - This is the first phase of data management, which examines the current data quality state of all sources. It is nothing more than a data assessment checking whether the current data follows predefined rules in the master repository, for example the completeness of the data, the distribution of occurring values, the acceptable range of values etc. Profiling can also be done during data import as well as during data integration tasks;
(2) Consolidate - In this phase data from different sources is integrated. Depending on the MDM architecture, data can be integrated into the master repository, or key references to external applications are updated or created;
(3) Govern - Major changes can happen in this phase because the actual data updates occur in this stage. Deduplication, cleansing, updates and deletions are done based on the assessment results provided by data profiling;
(4) Share - Once data is cleansed, it is passed on to external sources. Master data synchronization between the master repository and external applications is supported by SOA, an architecture that allows sharing data between different system platforms;
(5) Leverage - This last phase is used for analytical purposes. Enriched master data is a great source for BI tools and gives a complete view of the master business domains.

Figure 8: MDM Activities

25 Source: An Oracle White Paper, 2011, p. 14 Managing master data follows this order. It s logical that this workflow starts with data analyzes and ends with data reporting. However, all phases don t always occur at the same time. It would be very expensive and time consuming if one organization runs column analyzes or matching on daily basis. Sharing of data, on the other hand, may be more frequent, especially if external applications send direct request for data retrieval Benefits from MDM Successful MDM solution can be of positive value to an enterprise providing benefits of intangible as well as tangible character. Intangible benefits are seen in the following areas: (1) data quality (2) business processes and (3) users and customers (Dreibelbis, 2008, p. 37). (1) MDM offers improved data quality, seen through some of the dimensions discussed at the beginning. Better accuracy, consistency, completeness are some of the few dimensions that are improved with this strategy. Also, same version of data is shared across the system and used by various applications; (2) Business process and workflows are better organized due to correct data. They are not improved just because of the data they worked with, but are also reorganized to produce and maintain correct data that will result in reliable information. This reorganization of business processes is also of predictable nature, to detect most valuable data and trigger new business innovations and more profitable decisions for future progress of the enterprise; (3) Users trust in data is returned back because now they can rely on the same version of data across the information system. Customers are also more satisfied because most of the delays and irregularities that were present as a result of wrong data, are greatly reduced due to MDM. Tangible benefits are seen in the actual profit that organizations gain after implementing MDM. An example of this quantitative data is shown in Table 1. Benefits with the highest amounts in the table are sales, customer loyalty (which again leads to increased sales) and 21

26 efficiency (of sales representatives and IT systems). Based on these facts, MDM improves the organizational work from business and technical perspective. Table 1: An example estimating the positive impact of customer MDM Source: Building the Business Case for Master Data Management (MDM), 2011, p. 9 Often times, MDM is identified with MDM software. This confusion appears because MDM applications present data management processes in the most accurate manner. Following is a review of four MDM lines of products. Discussion about them would cover architecture, processes and usage of MDM in enterprises. 3. MASTER DATA MANAGEMENT SOLUTIONS 3.1. Historical review of MDM solutions As seen from the definitions from MDM in the previous chapter, every process or person who is involved in data quality improvement is part of MDM. Data mining, cleansing, redefining business rules, changing application logic; it s all part of managing data. Therefore, I can say data management appears with the first introduction of databases. However, standardization of methods and rules is becoming more popular in the recent years. First attempts of managing data are found in CRM and ERP applications. However, the main problem with these applications was that they were managing their own data and they couldn t provide solution for single common source of master data between different solutions. Master Data Management appeared in the late 1990 s with the release of Customer Data Integration (CDI) and Product Information Management (PMI) on the market. Development of MDM applications historically goes into two directions: (1) functionality centric and (2) domain centric. (1) First approaches of MDM were made through data warehouses. However, this type of managing data didn t prove as efficient. The idea was to centralize data in one place. But, managing doesn t mean just keeping everything in same storage, it also requests for some functionality implementation, which was missing in the data warehouse approach. 22

The second idea for managing data was through enterprise application integration (EAI). The development of this new technology made it possible for different applications to work together and exchange data. The missing part in this case was the central storage that would keep the single source of truth. MDM evolved from these two ideas as a common ground that creates and maintains the single source of truth and shares it with the various applications in the system; (2) Because master data represents the key objects of a business, customer and product are the main domains found in enterprise data. Understandably, MDM started with customer master data implemented in the well-known customer data integration (CDI) applications. Customer data models were initially of account-centric design, which means that they were designed based on the different roles customers can have in the system (buyer, sales person, administrator etc.). Because the number of such business models was growing together with the customer data, it became difficult for organizations to maintain several databases for a single type of entity and consolidate data from all of them. Therefore, the account-centric model was replaced with an entity-centric model, which represents one schema design for buyers, sales persons, administrators or client organizations. They were all included in the Customer domain. After the solution for the customer domain, vendors came up with product information management (PIM), applications that support product master data. Nowadays, the latest trends implement several domain types into one master data management application, called multi-domain MDM. The evolution of IBM MDM applications is a good example of these two development directions (IBM Multiform Master Data Management, 2007). In the development cycle of IBM MDM applications there are two significant points: the first is the transition from a data-centric tool to a functional-centric application, and the second is the transition from a single usage style or domain to a multiform application. These two points are important in MDM evolution because they represent the culmination of problems found in the data management tools of that time, which caused the transition from one approach to another. The first approach that MDM used was through index and reference tools. In this case there was no significant storage for keeping the master data; only the indexes (IDs, references) were kept in a single repository. This approach exposed the various versions of data for the master domains, but it lacked the functionality to deal with them and resolve them. This is the point when the first evolution chasm appeared and caused MDM solutions to develop as applications from that time on. The second chasm appeared while MDM was being developed as a functional approach that has its own physical storage of data as well as functions to manage that data. Initial MDM applications were focused either on a single usage style or a single domain. Such an approach created difficulties in the exchange of master data between different domains. Knowing that enterprises have different lines of business and multiple domains, it was hard to merge and maintain

data from a uniform MDM application. This problem introduced the next step in the development of MDM applications: the launch of multiform MDM applications. Multiform MDM applications are functional-centric solutions that support various domains as well as various usage styles. Several vendors still produce single-domain applications, but their mission and vision are directed towards multiform applications. An example of MDM evolution can be seen in the graph below.

Figure 9: Evolution of IBM MDM applications
Source: IBM Multiform Master Data Management: The evolution of MDM applications, 2007, p. 9

MDM solutions grow and develop along with technology. The newest trend of cloud computing is also present in this discipline, and the focus of MDM nowadays is on developing multi-domain cloud solutions.

Functionalities, concepts and architecture

The benefits of Master Data Management are best seen when an MDM solution is implemented in an enterprise and used to manage its data. System, application and hub are terms that refer to an MDM system in general, which is why they will be used interchangeably further in the text. An MDM system is a solution that creates a single version of master data, maintains master data through various processes and makes it available to other legacy applications in the information system.

29 Per Gryz and Pawluk (2011, p. 2-3), MDM solutions should offer the following functionalities: - Consolidate data locked within the native systems and applications; - Manage common data and common data processes independently with functionality for use in business processes; - Trigger business processes that originate from data change; - Provide a single understanding of the domain-customer, product, account, location for the enterprise. Functionalities of MDM are mainly developed to support data unification and are manifested though import and export of data, business rules, validation and any other method that assists in data consolidation and transfer. Even though different vendors try to provide different functionalities so they can be leaders in the MDM area, there are still some basic concepts on which MDM solutions are built. Best way to describe MDM system s functionality for data management, concepts of work and their architecture, is through the three dimensional model. This model is a shortened version of the 30 viewpoints framework proposed by Zachman. Main dimension that describe MDM systems are: (1) domain, (2) methods of use and (3) implementation styles. There are three main guidelines that define the scope of the three dimensional model (Dreibelbis et al, 2008, p. 12): (1) Business scope determines the number of domains; (2) Primary business drivers determine the methods of use; (3) Data volatility (instability) determines the implementation styles. Figure 10: Dimensions of master data management 25

30 Source: A. Dreibelbis et al, Enterprise Master Data Management, 2008, p. 12 (1) First dimension, domain, is based on the business nature and the type of master data domain works with. Each enterprise has different lines of business which work with various key objects. Most common domains are: customer, product and account. The domain Location is often times added to this list. However, this classification can be further expanded with new domains, depending on enterprise requirements. Names of these domains are pretty much self explanatory and describe the business objects they cover. Customer covers: people, organizations, and all the roles in the system they can have. For example, supplier, buyers, employee, employer etc. Products depend on the lines of business and they cover various items company may work with. Account domains cover relationships between customers and products. Depending on the type of business there are different types of accounts as well, checking, savings, student accounts etc. Based on the number of domains MDM can support, there are singledomain MDM solutions as well as multi-domain solutions which work with several different domains; (2) Second dimension, methods of use, is defined according to the different purposes of use each business have. Based on this dimension, MDM systems can belong to three groups: collaborative, operational and analytical. Collaborative MDM is used to support complex workflows and the data that comes from different sources. Best example of such usage would be when introducing new item (product) in the system. In this case there is a list of people involved for defining product properties, approving them and launching this product on the market. Data validations, integration of different properties as well as triggering approvals for this item are all supported by collaborative MDM. Main functionalities that this style of MDM solution should have are: task management, data validation, data integration of properties from different legacy applications. Operational MDM acts as an Online-Transaction Processing (OLTP) system that responds to requests from multiple applications and users. However, this type of MDM is used to support processes that are predefined by the MDM users, and doesn t have this decisive role as Collaborative MDM. Operational MDM method of use is best seen in SOA services as well as main database operations, where MDM supports transactions to retrieve data, update, create and delete. Analytical MDM has completely different method of use and it is about the intersection between Business Intelligence (BI) and MDM. It s a one way communication where data from different systems is sent to the MDM hub for data consolidation and preparation for analytical systems. Knowing that MDM repository stores all master data, cleansed, organized and managed, this is an excellent source for OLAP, star schema for data warehouses, data mining, predictive analysis based on scoring etc. 26
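To make the analytical method of use more concrete, below is a minimal SQL sketch of how cleansed master data from an MDM repository could feed a dimension table in a data warehouse star schema. All table and column names are invented for this illustration and do not correspond to any particular vendor's model.

```sql
-- Minimal sketch (invented names): cleansed master data from an MDM hub
-- feeding a customer dimension of a star schema in a data warehouse.

-- Single, cleansed version of each customer maintained by the MDM hub
CREATE TABLE mdm_customer (
    customer_id  INTEGER      PRIMARY KEY,  -- surrogate key assigned by the hub
    full_name    VARCHAR(200) NOT NULL,
    country_code CHAR(2)      NOT NULL,
    segment      VARCHAR(50)
);

-- Dimension table in the data warehouse, loaded from the master repository
CREATE TABLE dim_customer (
    customer_key INTEGER      PRIMARY KEY,
    customer_id  INTEGER      NOT NULL,     -- reference back to the MDM record
    full_name    VARCHAR(200) NOT NULL,
    country_code CHAR(2)      NOT NULL,
    segment      VARCHAR(50)
);

-- Periodic refresh: the warehouse consumes already consolidated records
INSERT INTO dim_customer (customer_key, customer_id, full_name, country_code, segment)
SELECT customer_id, customer_id, full_name, country_code, segment
FROM mdm_customer;
```

The point of the sketch is that the warehouse loads already consolidated records from the hub, instead of reconciling customer data from each source system on its own.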

31 However, MDM systems cannot be strictly divided in these three categories. Depending on different business processes in each enterprise and the frequency of their change, often times MDM systems can cross over from one type to another. (3) The last dimension, implementation styles, is based on the different ways data attributes can be stored in the system. This dimension covers various architectural styles of MDM.There are four general implementation styles defined throughout the literature: external reference, registry, reconciliation engine and transaction hub. External reference architecture is the simplest solution for MDM. It acts as system of reference instead of system of record, because it doesn t contain actual data, but reference to data which remains in the legacy systems. External reference architecture may be simple and easy for implementation, but it lacks control over its data. All this architecture can provide is just reference for data in its legacy system, but any functionality is disabled because MDM doesn t have access to it. Registry style is on higher architectural level where MDM solution is represented as limited size data storage that contains only unique identity attributes. What this means is that instead of containing all data from several applications in one storage, MDM system stores only unique attributes for an object such as: ID, name and description, and references the other data attributes that remain in the legacy systems. This implementation style is step forward in MDM development because it stores some basic info and also integrates data from different system with the help of references. Disadvantage of this architecture is that MDM still doesn t have all data available. Keeping references still doesn t solve the problem of bad data. Also, often times cannot retrieve all information due to legacy systems failure. Reconciliation engine In this architectural style, there is an opportunity of exchanging data in both directions: MDM database to legacy applications and vice versa. MDM system can store complete set of data attributes for some domain, but it s not the only place that manages data. Legacy applications still manage their data and synchronize it with the one that is stored in the MDM system. The ongoing matching and synchronization in MDM repository keeps the master data up to date. The only challenge that appears in this architecture is that data is still changed in other systems, and can often cause unreliable data in MDM system. With the growth of data attributes in other external sources, it is more difficult and complex to keep up with the synchronization updates. Transaction hub is the most sophisticated architectural style of MDM systems. This implementation style is the actual system of record for other applications. Central data storage is placed in the MDM system, where master data is cleansed, organized and managed. All the other external systems are using the data from the MDM repository. This architectural concept is the core of master data management, and achieves all goals for single version of data that should be accomplished. However, the complexity 27

of this structure carries some difficulties when implementing it among external legacy systems. There are two major changes that need to be made during the implementation of this architecture: (1) data needs to be integrated and centralized into one data storage and (2) the other systems need to be changed to work with the new transaction hub. The idea of the transaction hub fulfills all requirements for data quality improvement, but its realization can be impossible in some large enterprises with complex systems. The fourth implementation style, the transaction hub, is shown in Figures 11 and 12. As seen in both figures, data from external processes is imported with an Extract, Transform and Load (ETL) process and accessed through different user interfaces (UI).

Figure 11: Traditional MDM architecture
Source: A. Berson and L. Dubov, Master Data Management and Data Governance, 2011, p. 108

Figure 12 shows a more advanced model practiced in the latest solutions, where MDM architects try to solve the problem of data sharing among the MDM repository and external applications, including various SOA services. The goal is to make the MDM solution a metadata-driven SOA platform that provides and consumes services that allow the enterprise to resolve master entities and relationships and move from traditional account-centric legacy systems to a new entity-centric model rapidly and incrementally (Berson and Dubov, 2011, p. 85).

Figure 12: MDM architecture with additional published services

33 Source: A. Berson and L. Dubov, Master Data Management and Data Governance, 2011, p. 108 As seen from the discussion above, there are different styles of MDM systems based on the three dimensional model. Which approach is chosen depends on the business requirements of one enterprise. In the recent years, trying to serve all types of business, vendors are moving towards multiform MDM systems, solutions that implement various domains, methods of use and implementations styles in order to develop solution that would be suitable for every type of business Architecture of MDM described through selected MDM solutions Despite the great variety of MDM solutions on the market, I chose Microsoft, IBM, Oracle and SAP because they are already known and well established vendors for database software as well as Business Intelligence (BI) solutions. According to Gartner (2012, p. 2), they are placed in the leader s quadrant for 2012 for Data Warehouse Database Management Systems and also BI platforms. Since MDM systems main concern is data management and they are also involved in BI processes, I was curious to find out what these leaders have to offer for the MDM market. Table 2: Gartner s Magic Quadrant for Data Magic Quadrant for Data Warehouse Database Management Systems Magic Quadrant for BI platforms 29

Source: Gartner, 2012, p. 2

I will present a short overview of the MDM solutions of the four vendors mentioned above. The following concepts will be covered for each of them: history of development; data modeling; data import and export; data validation; data security; advantages and disadvantages.

Microsoft Master Data Services

Master Data Services (MDS) is a product which Microsoft acquired from Stratature in 2007. Already a customer of Stratature, Microsoft had been impressed with the rapid time to value and the ease of customization that Stratature's +EDM product provided. Microsoft initially planned to ship its MDM solution as part of SharePoint, because information workers are the primary consumers of master data. However, because IT plays a significant role in managing MDM solutions, MDS moved to the Server and Tools division and became a feature of SQL Server 2008 R2 (Graham and Selhorn, 2011, p. 6). MDS can be installed as an additional feature of SQL Server 2008 R2 or any newer version.

Data modeling
The MDS system comes with a blank database, which means that there is no data in the MDM repository, no tables and no predefined data models. There is a metadata model that comes with every installation, and also sample models for Product, Customer and Account, but they serve more as examples than as templates which can be used as a starting point for developing the data model. The MDS model is based on the relational database model, just with different

terminology. There are four data objects made available to users: entity, attribute, member and hierarchy, and they correspond to specific data objects in the relational data model (Graham and Selhorn, 2011, p. 56). Below is a table that relates MDS and relational database objects.

Table 3: MDS repository objects vs. relational database objects
- Entity corresponds to a table;
- Attribute corresponds to a column;
- Member corresponds to a row;
- Hierarchy corresponds to a relationship.

MDS supports several models in a repository; however, it allows relationships only between entities from the same model. Hierarchical relationships are supported, and this parent-child structure allows grouping of data in collections and hierarchical groups for better organization and maintenance of data. Figure 13 shows an example of the MDS Model explorer. As seen in this picture, there are several models: Chart of Accounts, Customer and Product, each having its objects organized in a tree structure. Hierarchies are supported only between entities of the Product model, while relationships between objects from Product and any other model are not allowed.

Figure 13: MDS data model
Source: Bullerwell, Kashel & Kent, Microsoft SQL Server 2008 R2 Master Data Services, 2011, p. 75

Data import is done through an ETL package created in SSIS. There is no feature that provides a direct connection between SSIS and MDS, so special skills are required to set up the whole loading process. However, MDS does not rely only on Microsoft SSIS; it can also use ETL tools from other vendors such as Informatica or InfoSphere DataStage. In order to protect data, each repository comes with staging tables that are copies of the tables of the existing

data model objects. Staging tables are used during data load to store newly imported data before it enters the production data model.

Data export is done by publishing subscription views on a defined server. Subscription views are nothing more than views of the tables in MDS. Once exported, data from these views can be queried in SQL Server Management Studio with a simple select statement. These subscription views can be set up to run every night in case a frequent update of the data is needed. In order to be used in other systems, they can be exported as flat files and afterwards imported into different databases. Web services are also available in this system, so data import and export can be performed programmatically as calls to such services.

Data validation is done in different areas of MDS. Data management, validation and cleansing can be performed on data import and also using matching techniques or validation rules. The first line of defense is data import: since data is first loaded into staging tables, this is a checkpoint against bad data. Also, using the same structure of predefined tables for every model sets general standards for the organization of each data domain. Data is additionally checked when loaded from the staging tables into the actual data model. Batches that run in MDS check whether data is compatible with the MDM model structure and report errors on any inconsistencies, like NULLs, improper format, length etc. The main tool for detecting duplicates and cleansing data is the matching operator in MDS. This operator has two values, Match and Does Not Match, and works on a user-defined similarity level. The similarity level is a decimal number which defines how precisely the user considers that two values match; the closer the value is to 1, the closer the match has to be to the entered value. To prevent entry of erroneous data, MDS uses validation rules, which are also logical operators and return an error when wrong data is entered. Examples would be detection of NULLs, defining a range of allowed values for some attributes etc. MDS also helps speed up some workflow processes by sending e-mails or notifications to users who are in charge of an action or approval. These notifications are usually triggered by data changes. This feature is an attempt towards the collaborative method of use and a step towards turning MDS into a solution that fully supports all the defined methods of use.

Data security is supported through different roles assigned to MDS users. There is an admin role with full permissions in the system, and specific groups of roles with limited access to data in MDS, such as browse, edit etc. Another way of securing data is through the creation of versions, which are snapshots of the current information in the MDS repository. Versions are used (1) to track changes that were made in the past, (2) to roll back changes and (3) to track the version of the model that each external system is using (Graham and Selhorn, 2011, p. 234). Even though versioning is not linked to roles and permissions, the ability to save data at a certain point in time can save the company time and work in case of major database failures.
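To illustrate the export path described above, here is a small, hypothetical T-SQL example of reading a subscription view from SQL Server Management Studio. The view and column names are assumptions made for the sake of the example; the real names depend on how the subscription view was defined in MDS.

```sql
-- Hypothetical example of consuming an MDS subscription view from
-- SQL Server Management Studio. The view and column names are invented;
-- real names depend on how the subscription view was defined in MDS.

SELECT Code,           -- member code (business key)
       Name,           -- member name
       Category_Name   -- domain-based attribute resolved to its display name
FROM   mdm.Product_SubscriptionView
WHERE  Category_Name = 'Kitchen appliances';
```

A nightly SSIS package or a flat-file export could run the same kind of query, so that downstream systems always consume the validated version of the master data.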

A new addition in SQL Server 2012 is DQS (Data Quality Services). These services work on a knowledge base that the organizational data steward maintains. Based on this knowledge base, DQS cleanses, matches and validates data according to business rules. These services are integrated with MDS through the MDS Excel Add-In for Matching.

Advantages and disadvantages of MDS are presented in Table 4. Based on the discussed functionalities and architecture, there are many ways in which MDS can improve the quality of organizational data. However, due to various limits in its design, the user still needs to use workarounds and extend the data model to support certain business requirements related to the enterprise data.

Table 4: Advantages and disadvantages of MDS
Advantages:
- Domain neutral, doesn't limit the user to specific types of master objects;
- Familiar database structure similar to an RDBMS;
- Simple interface that doesn't require IT skills or programming knowledge;
- Versioning of data is enabled.
Disadvantages:
- No prebuilt data model, which requires time, work and user knowledge to define a data model for each domain;
- No relationships are allowed between different domains;
- No support for multi-valued attributes, which requires additional tables and relationships to implement them;
- Data import and export are done with different tools and require special skills and knowledge to set up the whole loading environment.

Despite the fact that Microsoft has long been on the database management software market and is one of the leaders according to Gartner's Magic Quadrant for 2012, it did not keep its place in the Magic Quadrant for MDM applications. MDS is a simple solution for small and medium-sized enterprises, but due to the limited functionality discussed earlier it cannot support complex enterprise businesses. The way MDS is designed now, all it can offer is a simple user interface for data model structure and maintenance in limited database scenarios; it requires a lot of remodeling and additional functionality to fully develop in the following areas: data import and export tooling, integration with other external systems, support for complex decision workflows, thorough data analysis functionality and support for relationships among different domains.

SAP NetWeaver

SAP NetWeaver MDM is part of the NetWeaver computing platform, which consists of several core products such as Application Server, Business Intelligence, Enterprise Portal etc. MDM was introduced to this family of products in 2004, when SAP purchased a small vendor in the PIM space called A2i (Ferguson, 2004). Because this code was specifically intended for the product domain, the first release, SAP MDM 5.5, was considered a PIM solution rather than a general MDM system. In 2008 SAP released an enhanced version of MDM, called SAP NetWeaver MDM 7.1, and a year later they launched a full MDM suite containing various

38 applications as well as improved MDM technology to build pre-packaged business scenarios and integration. Current version of SAP MDM NetWeaver Suite contains the following components (Rao, 2011, p. 21): MDM Import Manager; MDM Import Server ; MDM Data Manager; MDM Syndication Server; MDM Syndicator. Data modeling SAP MDM solution stores its data into a central MDM repository. It s a complex structure of several types of tables so they can store different type of data, from simple integers to pdf files and pictures. Figure 14: Table types Source: L. Heilig et al, SAP NetWeaver Master Data Management, 2007, p. 192 SAP data model reminds of star schema. Master data attributes are stored in main tables, which are flat type in most of the cases. They contain main data attributes and references to subtables where additional data attributes for the master objects are stored. MDM repository supports various types of data fields. Novelty here is that multi value attributes are supported, which is a plus because it saves the repository from additional tables that should be created to store the values for such attributes. Also, relationships and hierarchies are implemented in the same way as in relational model (Heilig et al, 2007, p. 193). This MDM system can support any type of domain. Despite the neutral environment that SAP offers, there is also possibility of several predefined repositories for Customer, Employee, Supplier, Material, Business Partner, and Product domain which can be used as starting point for data model development and extension depending on the business requirements each enterprise has. Data import SAP MDM suite has automated the process of data load in the repository. Also, importing is done field by field instead record by record, which significantly speeds up the 34

process. There are two ways in which data is imported into the system. Both are done in the Import Manager, but SAP MDM allows either loading the actual data into its tables or assigning key mapping pairs, used mostly for external systems. In the first case of data load, SAP supports different data sources such as database servers, XML, text and Excel files etc. During import, the preparation of the source data requires more time and work than the import itself. Data needs to be validated, matched and mapped to the existing fields in the MDM repository. However, this is done only once, and the whole process can be saved as an import map and reused on the next import. Another way of integrating data from different systems is through key mapping. Instead of loading the actual data from the legacy repositories, key-value pairs are created in the SAP MDM database. They contain a unique MDM ID that is the same for identical records in different external systems. In this case, the original record ID is kept in the legacy repository and the unique MDM ID is the link between the external source and the master data stored in the central database (SAP NetWeaver Master Data Management (MDM). MDM Import Manager, 2011, p. 407).

Data export is done in a similar manner as data load. SAP MDM calls this process data syndication. Exporting can also be done automatically, but in this case users need to create export maps that define the flow of data between the MDM repository and the destination items. The final product of this export is an XML schema or flat files that are then imported into other systems. What users need to be careful of are changes in the master data that may happen during export. In case master data is updated while an export is being executed, the exported file may contain a mix of old and new data.

Figure 15: Key mapping during import and export
Source: L. Heilig et al, SAP NetWeaver Master Data Management, 2007, p. 201

Another way of using MDM data is with the key mapping technique, discussed earlier for the import. External systems can access master data in SAP using the unique MDM IDs assigned to each record from the legacy systems (Figure 15) (SAP NetWeaver Master Data Management (MDM). MDM Syndicator, 2012).

Data validation
This MDM suite validates data on several occasions: during data import, data export and data management. SAP is built in such a way that any work with data is related to data validation and management. Matching is the core functionality used to check

for duplicates, cleanse them from the repository and prevent their import into the system. In order to detect duplicates and validate data, SAP has put a lot of thought into this process and developed it as a complex set of rules and strategies. All processes that fall under matching are used to validate data and cleanse wrong values from the database. Transformations, matching functions, matching rules, strategies and substitutions are some of the features that are part of MDM matching. Identical records are detected during matching based on user-defined similarity scores. Other matching rules and functions use logical operators to determine equality between values. The whole process is record centric, which means that for each record there is a group of zero or more potential matches. Once matches are found, they are merged into one record. An additional advantage for better data management is the architectural structure of the MDM database, which supports various types of tables, fields and relationships.

Data security
By default, MDM servers are not password protected and everyone can access them. Therefore, there has to be an admin user who creates passwords and restricts user access to the system. There are two levels of password protection: server level, which includes password protection for the various applications in the system, and repository level, which covers repository passwords and access. User roles and permissions are stored in separate tables in the MDM repository. For example, the record for a user of the system contains a username, a password and a reference to the privileges table where additional permission values are stored. Another way of keeping the central data safe is through copies of the repository supported by the master/slave concept. The master is the place where changes occur and the slave is an auxiliary repository that gets updated by synchronizing with the master repository. There is another type of slave repository, called a publication slave, that acts as a backup version of the master repository. Once data is loaded into a publication slave repository it stays unchanged unless this repository is loaded into the system again and put online for work. Another way to keep versions of master data is by duplicating the existing MDM repository. This copy of the data can be saved on other disks and loaded anytime the user needs it (Heilig et al, 2007, p. 211).

Advantages and disadvantages of SAP are listed in Table 5. Most of the advantages are related to the various types of data objects supported in the database as well as the automation of import and export, which greatly facilitates the user's work. The disadvantages refer to the complex interface and the great amount of work that needs to be done when preparing automated imports and exports.

Table 5: Advantages and disadvantages of SAP
Advantages:
- Domain neutral;
- Offers prebuilt repositories for certain domains;
- Supports different types of tables, data types and relationships, and multi-valued attributes;
- Automated import and export;
- Effective matching rules that cleanse and prevent duplicates in the repository;
- Various IT and business scenarios for MDM implementation and usage;
- Security architecture that enables different roles for work with the data.
Disadvantages:
- Complex interface;
- Requires time for the user to get acquainted with it;
- Time-consuming preparation process for import and export;
- Inconsistent updates and exports can lead to a mix of old and new data being exported;
- The key mapping approach can introduce data inconsistency;
- Not suitable for small enterprises.

The SAP MDM system offers a lot of functions for mastering data. Complex matching processes, various table objects and a domain-neutral data model create a solution that gives users great freedom to manage any kind and any type of data. The key mapping functionality allows data communication with external systems without changes in the code of those legacy applications. Automated imports and exports make data loads and distribution faster and more precise. However, all this freedom of choice brings an additional burden to users during data preparation. There is a long checklist that needs to be completed before processes are ready for execution. A helpful circumstance is the ability to save this preparation work for similar scenarios in the future. The general conclusion for this suite is that SAP succeeded in a great part of its intention to automate master data management processes, but system functionality needs to be improved so that users have less work in the preparation process. Due to its massiveness and complexity, this solution is not appropriate for small enterprises but for large and complex businesses.

IBM InfoSphere

IBM offers a great variety of products for data integration and management. InfoSphere is the line of applications that supports these processes. Therefore, I cannot limit myself to just one application when reviewing the implementation of MDM through IBM solutions, but have to mention several of them to explain the different MDM processes. The first IBM MDM developments started off in 2004 with acquisitions of products from different vendors. For example, the launch of IBM InfoSphere Information Server was first made in 2004, when IBM purchased the data integration company Ascential Software and rebranded their suite as IBM Information Server. The same year IBM also acquired Trigo, a product MDM software vendor, and renamed their software WebSphere Product Center. The next year, IBM acquired Customer Data Integration software from DWL and rebranded the product as WebSphere Customer Center. In 2008 IBM released the full version of InfoSphere Information Server. IBM Master Data Management Server has a similar development history. It was released in 2008 and is a combination of IBM's customer integration tools from WebSphere Customer Center (WCC) with workflow capabilities from WebSphere Product Center (WPC) (Press release notes from IBM, retrieved February 7, 2013).

42 Other known products that fall under IBM InfoSphere brand and are used for managing data are (Zhu et al., 2011, pg. 47): IBM InfoSphere Blueprint Director; IBM InfoSphere Business Glossary; IBM InfoSphere Discovery; IBM InfoSphere Metadata Workbench; IBM InfoSphere Asset Manager; IBM InfoSphere Information Analyzer; IBM InfoSphere QualityStage; IBM InfoSphere Audit Stage; IBM InfoSphere FastTrack; IBM InfoSphere DataStage; InfoSphere Data Architect; IBM offers rich applications suite that covers all processes in master data management, from documenting business rules, workflows and terminology to cleansing, merging duplicate records and their distribution to external systems or files. From the list above, each component performs different functionalities, and same functionality can be supported in different applications. Data modeling There is no particular database vendor or database schema that IBM follows during data modeling. Trying to provide product that is platform independent, IBM made MDM solution that is domain, software platform and database neutral. IBM MDM repository can be prebuilt in case there is data model for specific domain, or blank, where user can start building its database from scratch. Planning and building master repository is a three-step process supported by three different types of models: (1) Logical, (2) Domain and (3) Physical (Wilson et al., 2011, p ): (1) Logical model - is the first step of data model development and it s where the planning process occurs. It s a diagram of entities, attributes and relationships which represent the database structure and the workflows of business processes that work with master data; (2) Domain model - is used to define data tables that will store the future master data. It follows the logical model, defined above, to draw data objects of the master repository. Lowest level of data domain that can be modeled here is data field, along with its data type, length and restrictions (if there are any). Same as the logical model, domain model is vendor non-related and it s used to set some general standards for master database architecture. (3) Physical model this is the final step of data modeling when the actual database is created. It is vendor related model so users have to choose the appropriate database management system. Data objects and rules are created based on the concepts defined in the first two models. 38

43 Data models are built in IBM InfoSphere Data Architect solution. Once the modeling process is done, IBM InfoSphere Data Architect is capable of creating database-specific data definition language (DDL) scripts based on the physical model. DDL scripts contain queries to create, update or drop data objects and can be run on a specific database server. Figure 16: Logical model Figure 17: Domain model and physical model Source: E.Wilson et al, InfoSphere Data Architect, 2011, p. 60, 82 Figure 18: Physical model Source: E.Wilson et al, InfoSphere Data Architect, 2011, p.90 Once created, existing database models and objects can be updated with another application called IBM InfoSphere Asset Manager. The Asset Manager is used to import physical models, create data objects or update the existing ones. And as every advanced MDM system, IBM also uses staging area where all data changes are stored first, and once validations are passed the changes are implemented in the actual central data storage. Data modeling is not the only option for designing master database. IBM also supports reverse engineering that allows users to convert already existing data objects to physical model. This feature allows reuse of existing database structures when building central repository, instead of just starting from scratch. Another advantage is the possibility to easily compare different databases before merging data from different sources (Wilson et al., 2011, p. 129). 39
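As a rough illustration of the kind of DDL script such a physical model might be forwarded into, the sketch below creates two related master tables. The schema, table and column names are assumptions made for this example; the actual script generated by InfoSphere Data Architect depends on the modeled entities and the chosen target database.

```sql
-- Illustrative DDL for two related master tables. Schema, table and column
-- names are assumptions for this example; a script generated by InfoSphere
-- Data Architect depends on the modeled entities and the target database.

CREATE TABLE MDM.PARTY (
    PARTY_ID     BIGINT       NOT NULL,  -- internal key assigned by the MDM hub
    PARTY_TYPE   VARCHAR(20)  NOT NULL,  -- e.g. PERSON or ORGANIZATION
    DISPLAY_NAME VARCHAR(255) NOT NULL,
    CREATED_DATE DATE         NOT NULL,
    PRIMARY KEY (PARTY_ID)
);

CREATE TABLE MDM.PARTY_ADDRESS (
    ADDRESS_ID   BIGINT       NOT NULL,
    PARTY_ID     BIGINT       NOT NULL,  -- relationship taken from the logical model
    ADDRESS_LINE VARCHAR(255) NOT NULL,
    COUNTRY_CODE CHAR(2)      NOT NULL,
    PRIMARY KEY (ADDRESS_ID),
    FOREIGN KEY (PARTY_ID) REFERENCES MDM.PARTY (PARTY_ID)
);
```

Such a script can then be run against the target database server, or compared against an already existing schema when the reverse engineering path described above is used.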

44 Data import IBM InfoSphere provides different ways for importing data. Depending on database structure and business scenarios, data can be loaded through batch transactions or one of the applications mentioned earlier. Batch transaction processing is used in cases of empty database, when large amounts of data need to be loaded in the repository. Each record to be imported is read, parsed and distributed in the appropriate business objects. MDM assigns unique identification key that serves as internal key to every record imported in the master repository. Data files for this type of import must be in SIF (Standard Interface Format) which is pipe delimited file format. InfoSphere FastTrack is application used for data import. It s mostly used during updates, merges and data smaller data loads. The most important thing in this import is mapping of the source file data to the appropriate database columns in the master repository. The whole process is similar to the already known ETL process. Figure 19: Example of field mappings during data import Source: IBM InfoSphere FastTrack, 2011, p. 10 Data export Data from the master repository can be shared through direct transfer from master repository tables to external applications tables or through web services. The first option for data export is available in InfoSphere FastTrack and the export process is similar to the import, just the data flow is in the opposite direction. The mapping that is done is from master data objects to external system tables. Since IBM MDM architecture is based on SOA, another way of master data sharing is through web services. External applications can retrieve master data with web service requests for certain entity. There is no specific rule which approach is used; it all depends on business scenarios and the choice of users. Data validation IBM Infosphere validates data in the same manner as the other MDM solutions. Data is validated before import, during import and afterwards. Techniques for data cleansing and management are organized in four steps: (1) understand organizational goals and how they determine user s requirements, (2) understand and analyze the nature and 40

content of the source data, (3) design and develop the jobs that cleanse the data and (4) evaluate the results (IBM InfoSphere QualityStage, 2011, p. 2-5). (1) In order to properly manage master data, users need to get acquainted with the business requirements. The role IBM InfoSphere has in this first step is to assist users in the graphical representation of their business rules. As discussed earlier, this is done while building the logical model and defining the business entities; (2) IBM InfoSphere applications offer different kinds of data analysis. The application most used for analyzing data content is InfoSphere Information Analyzer. This application provides different kinds of analysis, among which column, cross-domain and key analysis are the best known. Column analysis is performed on the data in a certain column and gives a general overview of the column properties as well as detecting anomalies in the column's data records. Cross-domain analysis matches data between different tables in order to find duplicate and redundant data. Key column analysis is used to detect relationships between tables and columns and to define primary and foreign keys based on the uniqueness of data. Another way to explore data content is through matching. IBM InfoSphere tools provide matching by value and by pattern. Value matching is similar to a free-form lookup where data is matched against a given value. Pattern matching looks for data that matches a given data format, like an SSN or an e-mail address. IBM uses regular expressions to perform pattern matching. Below is an example of results from SSN pattern matching, in which the results list all tables that contain fields in SSN format.

Figure 20: Example of SSN pattern match
Source: J. Zhu, Metadata Management with IBM InfoSphere Information Server, 2011, p. 241

(3) Once data is analyzed, the next step is to define the jobs that will match and cleanse data. IBM MDM offers prebuilt matching jobs that ship with the product. However, users can define their own matching jobs and rules based on business requirements. The matching process is similar to the ones found in the other vendors' solutions. It is based on starting points (cutoffs) and weights that measure the similarity of data. An interesting attempt that IBM introduces here is speeding up data matching jobs by setting up rules which group data into different blocks. Such an approach is used to lower the number of combinations that appear when two columns are to be matched. Blocking works on the rule of sort-group-divide. However, this may turn into a costly operation

which requires building complex subqueries for data processing and comparison. Also, incorrect data blocks may result in false negatives, when a record pair that does represent the same entity is not identified as a match because the records are not members of the same block. (4) The last step in data validation is the evaluation of results and setting up rules to prevent further inconsistencies. In case several duplicates are found for one master entity, IBM MDM rules merge all the unique data representations that refer to the same master object. The goal is to retrieve as much information as possible for the master record.

Figure 21: Example of record merge
Source: IBM InfoSphere QualityStage, 2011, p. 150

Similar to the matching engine that contains predefined matching jobs, IBM also offers a rule engine to save all user-defined rule jobs so they can be reused later. Besides data rules, consistency of master data can also be achieved with data transformations. This usually applies to common values like gender codes, streets and addresses. Data is transformed into a general format that is used across the whole master database.

Data security
Security in the IBM Information Suite is based on user/password authentication, role-based permissions and monitoring. User permissions are checked on several levels. As mentioned earlier, the IBM InfoSphere platform is based on SOA architecture and user transactions are web service requests to the MDM server. Therefore, the security system checks whether the user has permission to invoke such requests and to make updates, and it also controls the visibility of data objects in the master repository to which users do or do not have access. Another benefit of this system is that MDM is configured to keep a history log of all changes, so changed records can be reconstructed at any time. Monitoring is done when users connect to the system; the administrator can observe and control their actions (work sessions).

Advantages and disadvantages of IBM InfoSphere: as Table 6 shows, there are far more advantages than disadvantages due to the size and variety of this solution. These characteristics support different models and functionalities that are compatible with different types of business requirements.

Table 6: Advantages and disadvantages of IBM InfoSphere
Advantages:
- Domain neutral, but also offers prebuilt models for the Party, Product and Account domains;
- Provides documentation of workflow processes and their further reuse;
- Systematic planning of the master data model through several types of models: logical, domain and physical;
- Export of models into reusable files: XML and DDL scripts;
- Offers prebuilt matching jobs and a rule engine for rule definitions and their shared use;
- A variety of data content analyses, data transformations and standardizations;
- Compatible with different kinds of databases and platforms;
- Blocking techniques for more efficient matching;
- Data can be shared through web requests, so external applications don't have to make major changes to their databases;
- Reverse engineering;
- Security provided at different levels in the system.
Disadvantages:
- The same functionalities are repeated in different applications in the InfoSphere portfolio;
- Many of the applications are intertwined, which can often be confusing to users;
- Special file adjustments to the SIF format are needed during batch transaction processing;
- Excessive mapping needs to be done before importing data from external sources; the same happens on export, too;
- The great variety of data transformations and standardizations can change the data substantially and may produce completely new records; such transformations can result in false positives, matches that are not actual matches;
- The blocking process during matching can be efficient but also complex and time consuming; irregular block division can cause false negatives, records that match but are not detected because they were placed in different blocks.

IBM InfoSphere is a rich portfolio of tools for data management and integration. It offers a great variety of applications that cover all the processes of data management, from planning and modeling to cleansing, merging and distribution to external systems. It supports all domains, implementation styles and methods of use, is platform independent and can be made compatible with all types of software. IBM didn't create a solution just for the time being; it also includes features that allow users to save all the documentation, modeling and rules into a common knowledge base which they can recall afterwards. Another novelty IBM can be proud of is the possibility of reverse engineering, which facilitates system integration during a merger or acquisition. However, this otherwise strong solution has a few disadvantages. With the various data transformation techniques supported in QualityStage, users are given the freedom to transform data for easier matching. However, there is no limit to how far the user can go in

changing data, and often such transformations change the data in a way that it loses its context and no longer represents the correct business object. Defining complicated subsets of data for faster matching can be expensive and time consuming and can create wrong results, the false negatives and positives mentioned earlier. It is good that users have the freedom to work with master data any way they want, but there still need to be some system restrictions that give users guidance and warn them about possible mistakes. Another thing I noticed is that many similar features can be found in different applications. For example, data cleansing can be done in DataStage and QualityStage; data analysis in Information Analyzer and Information Discovery; import and export of data in Metadata Workbench and Asset Manager, but also in any other component. IBM's intention for this shared functionality was perhaps to broaden each application's set of features and not force the user to work in several applications to get clean data, but there should be either one application that supports the whole data management process or several components with a precise set of features, so that it is less confusing for the user. Overall, IBM InfoSphere is a mature solution that implements great techniques for data management. Both Information Server and MDM Server can be used for managing data from large and complex systems. Many modules from InfoSphere can be acquired and used independently, for data analysis and cleansing. Therefore, the InfoSphere line of products is suitable for all sorts of enterprises and lines of business.

Oracle MDM Suite

Oracle introduced its MDM products ten years ago, starting first with programs for managing customer and product data and ending up with solutions for data management called the Customer and Product Data Hubs. The whole idea of developing applications in the MDM area started internally, when Oracle's E-Business Suite was dealing with customer data quality issues. They first developed a program to manage the customer data model, called Oracle Customers Online, and shortly after its release they built Oracle Advanced Product Catalogue, another program for the same suite, to manage product data. Adding data quality, source system management and application integration capabilities, these two products grew into the Oracle Customer Data Hub and the Oracle Product Hub. A major breakthrough on the MDM market happened when Oracle acquired Siebel and Hyperion Data Relationship Management (DRM). After releasing the Customer and Product Hubs, Oracle expanded its MDM line of products to the Finance, Site and Supplier Hubs (Butler, 2011). Oracle is currently focused on developing Fusion versions of its existing Hubs. These Fusion applications are a combination of SOA and MDM. They provide integration, management and distribution of master data among applications from external systems. So far, the Customer, Product and Accounting Fusion Hubs are available on the market.

49 MDM solutions from this vendor will be discussed through several products from the Oracle MDM Suite. Below is a list of applications that belong to the Oracle MDM portfolio (Oracle Master Data Management. Retrieved March 5, 2012, from , p. 1) Enterprise Data Quality; Oracle Customer Hub; Oracle Product Hub; Oracle Supplier Hub; Oracle Site Hub; Oracle Higher Education Constituent Hub; Hyperion Data Relationship Management. Data modeling Oracle s MDM products come with already predefined data models for each entity. Users don t have the ability to start from blank database, but what they can do is update already existing tables in the master repository with new columns. Data models that Oracle uses are based on the Trading Community Architecture (TCA) model. Oracle Trading Community Architecture (TCA) is a data model that allows users to manage complex information about customers, organizations and customer s accounts. The base of this model is used and readjusted when designing models for other types of domains such as product, site etc. Tables in the master repository have standardized names; each starting with HZ prefix followed by the name of entity which attributes are stored. For example, HZ_PARTIES stores data for Parties, HZ_CONTACT_POINTS for party s contact points etc. This database is of relational type, organized in tables (entity), columns (attributes) and relationships (hierarchies) (Oracle Trading Community Architecture, 2006, p. 1). Figure 22: List of predefined tables for Customer entity Source: S. Anand, Trading Community Architecture,
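The sketch below illustrates the HZ naming pattern in a heavily simplified form. The real TCA tables contain many more columns and use a more generic linking mechanism; only a few fields are shown here, and the foreign key is added purely to keep the example readable.

```sql
-- Heavily simplified sketch of the HZ naming pattern. The real TCA tables
-- contain many more columns and use generic OWNER_TABLE_NAME/OWNER_TABLE_ID
-- linking; the foreign key below is added only to keep the example readable.

CREATE TABLE HZ_PARTIES (
    PARTY_ID   NUMBER        PRIMARY KEY,
    PARTY_TYPE VARCHAR2(30)  NOT NULL,   -- e.g. PERSON or ORGANIZATION
    PARTY_NAME VARCHAR2(360) NOT NULL
);

CREATE TABLE HZ_CONTACT_POINTS (
    CONTACT_POINT_ID   NUMBER        PRIMARY KEY,
    OWNER_TABLE_ID     NUMBER        NOT NULL,  -- the owning party
    CONTACT_POINT_TYPE VARCHAR2(30)  NOT NULL,  -- e.g. PHONE or EMAIL
    EMAIL_ADDRESS      VARCHAR2(2000),
    FOREIGN KEY (OWNER_TABLE_ID) REFERENCES HZ_PARTIES (PARTY_ID)
);

-- Typical lookup: all contact points registered for one party
SELECT p.PARTY_NAME, c.CONTACT_POINT_TYPE, c.EMAIL_ADDRESS
FROM   HZ_PARTIES p
JOIN   HZ_CONTACT_POINTS c ON c.OWNER_TABLE_ID = p.PARTY_ID
WHERE  p.PARTY_ID = 1001;
```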

50 Data import Since Oracle provides predefined data model, data is imported in the HZ tables discussed earlier. There are several different ways for data import: (1) SQL/ETL Load, (2) D&B Load and (3) File Load (Oracle Trading Community Architecture, 2006, p. 8-18): (1) SQL/ETL Load: data is first extracted with scripts or tools, values are transformed to meet the data requirements of the interface tables, and afterwards data is loaded; (2) D&B Load: data is prepared by D&B sent in standard D&B bulk file which is next run through the D&B Import Adapter and automatically mapped and loaded into the interface tables; (3) File Load: data is loaded from a comma-separated value (CSV) file, or file delimited by another allowed character with Oracle Customers Online (OCO) or Oracle Customer Data Librarian (CDL). Before loading data in the master repository, data is first imported into staging tables, matched and cleansed and afterwards imported in the interface tables. The staging tables are copies of the existing tables and are temporary storage for the external data that is being imported in the repository. Even after importing data, TCA runs post import processes for data standardization. There are various data transformations such as: name conversions to meet database standards, replacement of letters in phone numbers, removing of NULLs etc. Data export Data for certain entity can be exported in Excel spreadsheet. However, for data distribution to external applications Oracle uses cross referencing. Cross referencing is approach that assigns unique key IDs for each record in the central repository and maps it to the appropriate record from the external systems (similar to the key mapping discussed in the SAP solution earlier) Figure 23: Example of cross reference between PARTIES (master table) and SYS_REFERENCES (external systems) Source: Better Information through Master Data Management MDM as a Foundation for BI, 2011, p. 9 With the help of Application Integration Architecture (AIA), data can be shared with other application through web services. This enables external applications from different platform to receive managed data from Oracle MDM Hubs. 46
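A minimal sketch of the cross-referencing idea is shown below. The table names are invented for this example (the actual structures used by the Oracle hubs differ); the point is that every source-system record keeps its own identifier, while the hub maintains the mapping to the single master record.

```sql
-- Minimal cross-referencing sketch with invented table names; the structures
-- used by the Oracle hubs differ. Each source system keeps its own record ID,
-- while the hub stores the mapping to the single master record.

CREATE TABLE MASTER_PARTIES (
    MASTER_ID  NUMBER        PRIMARY KEY,   -- unique key assigned in the hub
    PARTY_NAME VARCHAR2(360) NOT NULL
);

CREATE TABLE XREF_SYS_REFERENCES (
    MASTER_ID     NUMBER       NOT NULL,    -- golden record in the hub
    SOURCE_SYSTEM VARCHAR2(30) NOT NULL,    -- e.g. 'CRM', 'ERP', 'BILLING'
    SOURCE_ID     VARCHAR2(60) NOT NULL,    -- record identifier in the source system
    PRIMARY KEY (SOURCE_SYSTEM, SOURCE_ID),
    FOREIGN KEY (MASTER_ID) REFERENCES MASTER_PARTIES (MASTER_ID)
);

-- Resolving a legacy CRM record to its master version
SELECT m.MASTER_ID, m.PARTY_NAME
FROM   XREF_SYS_REFERENCES x
JOIN   MASTER_PARTIES m ON m.MASTER_ID = x.MASTER_ID
WHERE  x.SOURCE_SYSTEM = 'CRM'
  AND  x.SOURCE_ID = 'C-000123';
```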

Since every hub that Oracle MDM offers has its own domain of concern and a different architecture, there are also differences in the cross-referencing processes. There are two possibilities for cross referencing: one-way and two-way cross reference. In the first approach the data flow occurs one way, from the hub to the external applications. This means that data is managed and updated only in the hub and afterwards sent out to the other systems. This approach is used in the Product Hub. The two-way cross reference is implemented in the Customer Hub, and the data flow is managed in two directions: from the hub to the external systems and vice versa. Data is managed in the hub, but can also be updated in the external systems and sent to the master repository for import. Changed data that is sent from external systems needs to pass predefined validations before it is loaded into the central database. This type of data sharing gives external systems the freedom to use managed data from the Oracle hubs without major changes to their legacy applications (Cross-Referencing for Master Data Management with Oracle Application Integration Architecture Foundation Pack, 2008, p. 5).

Data validation
As mentioned earlier, data is checked for errors right after being imported into the repository. There are several techniques that Oracle uses to validate, cleanse and manage data. Most of them are similar to the ones discussed for the previous MDM solutions. The data validation techniques are based on transformation, matching and merging and are part of Data Quality Management (DQM), the mechanism for managing data found in the TCA model.

Figure 24: Example of data validation workflow
Source: Data Quality Management, 2012, p. 9

Oracle MDM examines data through several steps (Data Quality Management, 2002, p. 2-25): (1) Step one - Transformation functions. These functions include character or blank space replacement, removing double letters, or any other data changes that achieve certain standards throughout the database. Also, Oracle uses word replacement, which replaces similar word variations with one standard word. Often times users enter
