
Enterprise Information Flow White Paper

Table of Contents

1. Why EIF
   Answers to Tough Questions
2. Description and Scope of Enterprise Information Flow
   Data and Information Structures
   Data Attributes
   Data Sources
   Data Targets
   Data Flows
   Information Transformations
   Analytic Processing
3. Attributes
4. Terms and Abbreviations

Authors

Ondrej Zyka, Head of Data Management at Profinit
Ondrej has more than 15 years of Data Management experience and has spent his entire career focusing on Data Management, Data Quality, and Application Integration.

Ivo Mouka, Senior Consultant at Profinit
Ivo is a Senior Consultant at Profinit with 20+ years of experience in Information Management, Business Intelligence, and Data Quality, with a background in Healthcare and Digital Publishing.

1. Why EIF

Enterprise Information Flow (EIF) is a new field that studies the life cycle of information in its entirety, including all interactions between the information and the surrounding environment. The huge increase in the amount of data processed in recent decades can be attributed to enterprises' hunger for ever more precise information to give them a competitive edge, to regulatory requirements, and to the fact that previously unavailable information is now accessible through new information channels, such as big data distilled from websites, public databases, or the information automatically collected about the users of various systems and devices. In an environment this complex, traditional solutions are bound to fail; a more intelligent approach is needed. The aim of EIF is to prevent wrong decisions by rigorously analyzing the sources and processing of information, identifying critical spots in the process, and eliminating emotion and the human factor when evaluating information quality.

The vision for EIF is to provide help in situations similar to the following cases:

- In 1999, NASA lost a $125 million Mars orbiter because a Lockheed Martin engineering team used English units of measurement while the agency's team used the more conventional metric system for a key spacecraft operation. Tom Gavin, the JPL administrator to whom all project managers reported, said: "This is an end-to-end process problem. A single error like this should not have caused the loss of Climate Orbiter. Something went wrong in our system processes, in checks and balances that we have, that should have caught this and fixed it."
- Experience with high-risk operations on financial markets led the EU to introduce the Solvency II Directive for insurance companies, scheduled to come into effect in January 2016. Exact descriptions of the data flows and data calculations within the organization are essential to prove the appropriateness, completeness, and accuracy of information to regulatory bodies. This is one of the key capabilities of EIF.

Answers to Tough Questions

Could we prevent similar sudden crises that could result in considerable losses or even the fall of a company? Is our asset valuation sufficiently transparent? How correct and relevant is the data we use for cash flow predictions, and are the algorithms we use proven enough?

If these cases appear to be isolated events which rarely occur in your field of business, there is still the common danger of creeping inefficiency that affects many large or growing companies and negatively impacts their financial results. John Schmidt of Informatica, in his blog article of May 2013, presents the following finding based on data in Forrester's 2013 IT Budget Planning Guide for CIOs: the cost of IT as a percentage of revenue increases as organizations get larger. What is going on here? What happened to economies of scale? A key part of his answer is that IT spends a lot of time helping the organization automate business processes, but spends almost nothing on automating its own IT processes; IT is largely a manual activity.

The main benefit of EIF is that it introduces order into work with information, which is the precondition for automation and the effective use of tools. EIF seeks to answer a number of questions.

At the organizational level:

- Where and how was the information we use for our decisions created? Who is responsible for its quality?
- Is the analysis trustworthy? Who did it?
- On what basis was the client classified as credible?
- Are the estimates based only on our company figures, or have other sources, such as Central Bank estimates or EU predictions, also been taken into consideration?
- Who did the data enrichment, and what information did they use?

At the personal level:

- Who is using information that concerns me, and for what purpose?
- Is sensitive information being passed on to other parties? Is the information adequately aggregated or anonymized?
- Who is getting information about my phone calls, and how detailed is this information?
- Who can see my data on Facebook?

The introduction of an Enterprise Information Flow discipline is a standard part of Enterprise Information Management in any organization which is striving to raise its Data Management Maturity level. One area where the EIF approach comes to good use is compliance with directives, such as Solvency II for the insurance sector or BASEL III for banks, with their strict demands on data quality.

2. Description and Scope of Enterprise Information Flow

Enterprise Information Flow overlaps, to some extent, with existing and well-defined areas dealing with enterprise data: Data Management, Metadata Management, and Data Quality. To get a closer look at the behavior of information in an enterprise, it is necessary to expand the reach of the individual disciplines, to broaden the range of observed attributes, and to use more sophisticated methods for their processing.

Data management essentially means setting the rules, establishing the organizational structure, and defining and performing the required processes with the help of suitable tools. The Enterprise Information Flow approach involves looking at these segments of management from the perspective of individual data elements. In other words, it means monitoring what rules apply to each individual data element, who is responsible for its availability, its quality, and so on.

The basic entities Enterprise Information Flow is concerned with are:

- Data and information structures
- Large collections of data attributes
- Sources and targets
- Data flows
- Information transformations
- Analytic processing

Let's describe the individual entities, their place in the Enterprise Information Flow concept, and the contribution of EIF compared to established approaches. The key features are described in the Data Flows and Information Transformations chapters, but let's not get ahead of ourselves; let's start with the basics.

Data and Information Structures

Metadata management deals with data structures in detail, including their technical realization and business purpose (the Business Glossary and its link to the data structures). This concept is well suited for structured data. However, we must be able to work equally effectively with semi-structured or unstructured data.

In the case of semi-structured data, we can use the information contained in definition files such as XSD, DTD, RDF, Dublin Core, e-GMS, and AGLS. Similarly, we can use definitions from systems where the data description is part of the data file itself; examples include XML, UML, HTML, and other similar formats. In these cases, the definition is not static as with structured data; it is directly linked to the data contents. Consequently, it is necessary to deal not only with the structures that store the data, but also with the data itself.

The description of unstructured data and Big Data datasets is a major problem. We can often only obtain information on the existence and location of a dataset and a description of its contents, or perhaps of the way it was created. Some examples are data from emails, discussion forums, or document management systems.
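
The point that, for semi-structured data, the description is directly linked to the data contents can be made concrete with a small sketch. The following Python example, a minimal sketch that assumes an XML input file named orders.xml, derives technical metadata (element paths and occurrence counts) from the document itself rather than from a static schema.

    # Minimal sketch: derive technical metadata (element paths and counts)
    # directly from a semi-structured XML document. The input file name is
    # a placeholder; real EIF tooling would feed the result into a metadata
    # repository rather than print it.
    from collections import Counter
    import xml.etree.ElementTree as ET

    def element_paths(xml_file):
        """Return a Counter of element paths found in the document."""
        paths = Counter()
        stack = []
        for event, elem in ET.iterparse(xml_file, events=("start", "end")):
            if event == "start":
                stack.append(elem.tag)
                paths["/" + "/".join(stack)] += 1
            else:
                stack.pop()
                elem.clear()  # keep memory use flat for large files
        return paths

    if __name__ == "__main__":
        for path, count in sorted(element_paths("orders.xml").items()):
            print(f"{count:6d}  {path}")

Run against a sample feed, such an inventory of record structures can be compared across deliveries of the same source to spot structural drift.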

It is not technologically difficult to obtain analytical data such as the number of records, the average size of a record, the creation time, or the structure of the key identifiers of individual records. But obtaining a detailed description, of the kind we are accustomed to with structured data, is not feasible.

It is necessary to be able to work with, and to integrate, technical metadata at various levels of granularity. Sometimes we have an exact description of where to find the required information; at other times only a note on the existence of the required information in particular records; and in some situations the only possibility is to analyze particular datasets without any guarantee that they contain the information we are interested in.

Business metadata poses a similar granularity problem. On the one hand, we need very precise, domain-specific definitions of terms such as nominal value, sale discount, liquidity risk, or commission rate. On the other hand, general definitions, such as customer data, document, or observation, permit defining the contents of data that is not precisely described or has not yet been analyzed. The value of a solution depends on its capability to maintain links across all layers of description granularity.

Data Attributes

Metadata management can be understood as the collection of various attributes about data. Enterprise Information Flow is characterized by collecting a large breadth of attributes on individual data components, and it puts particular emphasis on processing these sets of parameters. The objective is to obtain and process all attributes, concerning not only the structure of the data, but also its significance for individual users, security, sources, processing, utilization, and data quality. The structure of the information obtained for particular types of data can be quite varied. Similarly, the sources of individual attributes are not only data repositories, but also all other systems that process data, and even the users of the data. The emphasis on processing the attributes leads to the necessity of an open solution in which, based on the processing and analysis of attributes, new attributes for individual data items are dynamically created and the values of existing attributes are supplied or modified. Attributes may describe individual data fields or records (as with XML documents), or provide a description for a structure (as with relational databases) or for specific data sets (for example, all orders from one day, the results of a census, Big Data datasets, etc.).

Data Sources

The key to the description of a data source is the description of its structure. Moreover, EIF requires the capability to precisely define all source data and information. Data sources must be assessed primarily from the point of view of reliability. It is necessary to identify the owners of the sources and their trustworthiness. For every source, the important data indicators must be defined, together with their historical values recorded on a regular basis. From the organizational point of view, we distinguish internal and external sources. Other features of data sources we can observe are:

- Is the data recorded manually or automatically?
- Is the data the result of analytical processes or of direct input?
- Who are the other users of the data?
- What is the opinion of the other users on data quality?
The concept of data source description must be sufficiently open to support possible substitutions of one source for another, and more particularly to allow the modelling of how such a substitution may affect the entire life cycle of the data and its interaction with other systems and its surroundings. A practical example is the capability to analyze available sources of data, such as the Open Data Directory, and to assist in the search for alternative sources for the data currently utilized.
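
As a concrete illustration of recording important indicators for every source together with their historical values, here is a minimal Python sketch of a data source descriptor; the class, field names, and indicator are illustrative assumptions, not part of any particular metadata tool.

    # Minimal sketch of a data source descriptor with historically recorded
    # reliability indicators. Names and values are illustrative only.
    from dataclasses import dataclass, field
    from datetime import date

    @dataclass
    class DataSource:
        name: str
        owner: str
        external: bool                     # internal vs. external source
        manual_entry: bool                 # recorded manually or automatically
        history: list = field(default_factory=list)  # (date, indicator, value)

        def record(self, day: date, indicator: str, value: float):
            """Append one indicator measurement to the source's history."""
            self.history.append((day, indicator, value))

        def latest(self, indicator: str):
            """Return the most recent value of the given indicator, if any."""
            values = [(d, v) for d, i, v in self.history if i == indicator]
            return max(values)[1] if values else None

    # Example: track completeness of an external reference source over time.
    cb_rates = DataSource("central_bank_fx_rates", owner="Treasury",
                          external=True, manual_entry=False)
    cb_rates.record(date(2015, 6, 1), "completeness_pct", 99.2)
    cb_rates.record(date(2015, 7, 1), "completeness_pct", 97.8)
    print(cb_rates.latest("completeness_pct"))  # 97.8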

Data Targets

As with data sources, the key to the description of data targets is the description of their data structure. Other important attributes of data targets concern their usability and their ability to interact with data users and the environment. The main features of data targets we can observe are similar to those of data sources. They include:

- Is it possible to process the target data automatically?
- What is the latency of the target data?
- Who uses the data? What about user satisfaction?
- What types of decisions are based on the data?
- Is the data for internal use only?

Data Flows

The description and analysis of data and information flows is a principal foundation of EIF. Technologically formalized data flows (FTP, ETL procedures, data replication, XSLT, the use of ESB and web services, BPM systems and BPEL transformations, etc.) are proficiently handled and supported by contemporary metadata tools. Appropriate metadata tools should also be able to analyze data flows formed by scripts and SQL procedures.

Another big challenge in processing data flows lies in mastering repositories which unify data from various sources: MDM systems, integrated entities in data warehouses, or shared file systems. In such a situation, a plain data flow based on data structures and static transformation analyses links all the source and target data files transferred through the concentrator. Such information on source-to-target links is entirely inadequate. The control of data flows requires the exact identification of the data lineage through concentrators for each individual data record. Two methods can be used: either the required information is added to every data record, or more sophisticated methods based on the identification and transformation of specific data attributes are used for the data going through the concentrator (the first method is sketched at the end of this section).

Yet another challenge of EIF is to process and document not only the data actually transmitted (or transformed in the process), but also the other data that is required for a particular transmission even though it is not itself transmitted. An example from relational databases would be the data used in the WHERE clause of an SQL statement. This data is not included in the transmitted data stream, but it is necessary for understanding the formation of the resulting data sets.
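
To illustrate the first of the two methods above, adding the required lineage information to every data record, here is a minimal Python sketch of records carrying their lineage through a toy concentrator; the record layout and hub logic are illustrative assumptions, not a description of any specific MDM product.

    # Minimal sketch: propagate per-record lineage through a concentrator.
    # Each record carries the list of systems it has passed through, so that
    # source-to-target links can be resolved per record, not just per structure.
    from dataclasses import dataclass, field

    @dataclass
    class Record:
        key: str
        payload: dict
        lineage: list = field(default_factory=list)  # ordered list of hops

    def tag(record: Record, system: str, transformation: str) -> Record:
        """Append one lineage hop (system and transformation) to the record."""
        record.lineage.append({"system": system, "transformation": transformation})
        return record

    def concentrator(records):
        """Toy MDM-style hub: merge records by key, keeping lineage of all inputs."""
        merged = {}
        for rec in records:
            tag(rec, "mdm_hub", "merge_by_key")
            if rec.key in merged:
                merged[rec.key].lineage.extend(rec.lineage)
            else:
                merged[rec.key] = rec
        return list(merged.values())

    crm = tag(Record("C001", {"name": "ACME"}), "crm", "extract")
    erp = tag(Record("C001", {"name": "ACME Ltd."}), "erp", "extract")
    for out in concentrator([crm, erp]):
        print(out.key, [hop["system"] for hop in out.lineage])
    # -> C001 ['crm', 'mdm_hub', 'erp', 'mdm_hub']

The design choice here is simply that lineage travels with the record itself; the alternative mentioned above keeps the records untouched and reconstructs lineage from distinguishing attribute values instead.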

Information Transformations

Unlike data transformations and data flows, where usable technical solutions exist, the situation at the information level is considerably less developed. EIF concentrates on the following areas.

The overall description of the information flow must be able to deal with information flows on multiple levels and to detect information in flows such as decisions based on watching data on a monitor or reading reports, telephone communication, and so on. Even the mere responsibility for data files must be considered a specific transformation, as it may considerably affect the trustworthiness or availability of the data.

Overall, it is necessary to work with all transformations of information in which users are engaged. Such transformations include manual fixes of data, transformations started by a staff member, and transformations with parameters entered by a user.

With growing frequency, information processing must deal with transformations that are difficult to describe technically and that significantly change the quality of the information. The use of models based on artificial intelligence for scoring customers may serve as an example. Analyses performed by data scientists are another example of transformations that create entirely new information which did not exist in the system before; at the same time, it is hard to determine what data the result was based on. Yet another type of transformation that is hard to describe is the integration transformations used in the area of Master Data Management, such as the transformations used for the cleansing and enrichment of data, algorithms for masking and anonymizing data, or algorithms for aggregating data.

EIF requires mastering these tasks concerning transformations:

- Efficient descriptions, ideally generated automatically from code, for the widest possible range of languages and tools used for transformations
- Descriptions of the input and output data sets of transformations at both the technological and the business level
- The key: the ability to define a method which will generate the attributes of the output data from the attributes of the input data (security attributes, data quality attributes, descriptions of data sources and ownership, etc.)

Analytic Processing

The strength of EIF depends on the organization's ability to analyze all the information gathered. The standard analyses, such as where-used, data lineage, impact analyses, profiling, and historical searches on metadata or profiling data, must be supplemented with many other types of analyses that draw on the richness of the collected attributes. One very useful feature is the ability to quickly and easily create new types of impact analyses and other analyses over the collected attributes. These analyses may modify existing attributes or create new ones. The analyses take into consideration the data attributes, data flows, and transformations.

The analytical procedures can vary widely. They can be as simple as the following examples (one such check is sketched at the end of this chapter):

- During the transmission of data strings, does any truncation occur?
- During the transmission of numeric values, does rounding occur?
- During transmission, are numeric values transformed into strings?
- During transmission, is there any reduction in data security requirements?
- Does the data that leaves the department or the company contain any sensitive information?
- What is the overall computational complexity (cost) of the resulting files?
- What is the critical path for obtaining the target data files?
- Do the same data quality criteria apply to all integrated data?
- How many people are involved in the process of obtaining the target data file?
- What are all the technologies utilized for obtaining the target data file? Which technologies contain sensitive data?

They can also be complex:

- What part of the data processing, from sources to results, is the most expensive, the slowest, has the weakest technical support, or is the most vulnerable from the security point of view?
- If an ETL tool were replaced, or if the enterprise switched to an ELT solution, how would the system parameters change?
- Is there a critical spot (a person, department, or system) through which all information regarding a particular decision process passes?
- What is the relationship between the costs of individual transformations and the number of decision processes for which they are used?
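
As an example of the simple checks listed above, the following minimal Python sketch evaluates one source-to-target column mapping for string truncation and for a reduction in security classification; the column metadata and the security levels are illustrative assumptions, not a prescribed attribute model.

    # Minimal sketch: two simple EIF analyses over collected column attributes,
    # checking a single source-to-target mapping for truncation risk and for a
    # drop in security classification. Attribute names are illustrative.
    SECURITY_LEVELS = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}

    def truncation_risk(source_col, target_col):
        """Strings may be truncated if the target column is shorter than the source."""
        return (source_col["type"] == "string"
                and target_col["type"] == "string"
                and target_col["length"] < source_col["length"])

    def security_reduced(source_col, target_col):
        """Flag mappings where the target is classified lower than the source."""
        return SECURITY_LEVELS[target_col["security"]] < SECURITY_LEVELS[source_col["security"]]

    src = {"name": "CUSTOMER_NAME", "type": "string", "length": 100, "security": "confidential"}
    tgt = {"name": "CUST_NAME",     "type": "string", "length": 50,  "security": "internal"}

    print("truncation risk:", truncation_risk(src, tgt))    # True
    print("security reduced:", security_reduced(src, tgt))  # True

In practice such rules would be run across every mapping recorded in the data flows, turning the questions in the list above into repeatable, automated analyses.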

3. Attributes

EIF is based on the collection and processing of a large number of various attributes. Let's list some examples:

Structural attributes
- Placement of data in systems and environments
- Data structures
- Precision of the description of data structures
- Definition of data types
- Technical-level descriptions
- Business- or conceptual-level descriptions
- Data storage technology

Security attributes
- Physical access rights for modifying and using data
- Assignment of data security levels according to the security rules of the enterprise
- Access rules for modifying and using data
- Logical rights at the access level for modifying and using data
- Flags indicating that data has a link to a security incident
- Legal security requirements
- Requirements for anonymization and encryption

Data source attributes
- What is the source of the data
- What transformation created the data
- Was the data obtained manually or collected using technology
- Data quality rules used in data creation
- External or internal source
- Who else uses the source
- Known incidents associated with the source of the data
- Evaluation of the source by other users

Processing attributes
- Last transaction used
- Granularity of the primary data
- Technology used for processing
- Computational complexity of processing
- Latency of availability compared to the original data
- Latency of availability compared to the last time the data was stored
- Use of manually controlled transformations
- Up-to-date status of the data

Administration attributes
- Data owner
- Data steward
- Rules for data administration
- Last audit of administration
- Possibility of manual interventions
- Previous data owner
- Mode of receiving / handing over data

Usage attributes
- Users of the data
- User satisfaction
- Known incidents related to the use of the data
- Possible ways of publishing the data
- Rules for the use of the data
- Attributes for the availability of the data
- Automated processability of outputs

Data quality attributes
- Data quality indicators
- Data quality indicator values (profiling results)
- Opinions of users
- DQ-related issues
- List of corrections of the data
- Duplication and rules for integration
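
To show how such attribute groups might be carried and extended on a single data element, here is a minimal Python sketch of an open attribute record in which analyses can derive new attributes from existing ones; the grouping, names, and derivation rule are illustrative assumptions rather than a prescribed model.

    # Minimal sketch: an open attribute record for one data element. Attributes
    # are grouped (structural, security, source, ...) and analyses may derive
    # new attributes from existing ones. Group and attribute names are examples.
    from collections import defaultdict

    class DataElementAttributes:
        def __init__(self, name):
            self.name = name
            self.groups = defaultdict(dict)  # group -> {attribute: value}

        def set(self, group, attribute, value):
            self.groups[group][attribute] = value

        def get(self, group, attribute, default=None):
            return self.groups[group].get(attribute, default)

    def derive_publishable(elem):
        """Example derived attribute: data may leave the company only if it is
        classified no higher than 'internal' or has been anonymized."""
        level = elem.get("security", "classification", "restricted")
        anonymized = elem.get("security", "anonymized", False)
        elem.set("usage", "publishable_externally",
                 level in ("public", "internal") or anonymized)

    phone_calls = DataElementAttributes("CALL_DETAIL_RECORD")
    phone_calls.set("structural", "storage_technology", "Teradata")
    phone_calls.set("security", "classification", "confidential")
    phone_calls.set("security", "anonymized", True)
    derive_publishable(phone_calls)
    print(phone_calls.get("usage", "publishable_externally"))  # True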

4. Terms and Abbreviations

BPEL - Business Process Execution Language
BPM - Business Process Management
ESB - Enterprise Service Bus; open, standards-based distributed messaging middleware
ETL - Extract, Transform, and Load; a key process for data acquisition in data warehousing
ELT - Extract, Load, Transform; an alternate process for data manipulation, unlike ETL in that the data is extracted, loaded into the database, and only then transformed
MDM - Master Data Management
SOA - Service Oriented Architecture

About Profinit and Manta Tools

Profinit is a member of the multinational New Frontier Group, a leader in the digital transformation of organizations and companies in Central and Eastern Europe. With more than 1,600 employees in 16 countries, we are also one of the 10 largest providers of ICT services in the entire CEE region and rank among the leaders in made-to-order software development, data management, data storage, and business intelligence.

Profinit's key Enterprise Information Management product is Manta Tools, the only solution which tells you what is really happening inside your BI environment. Manta Tools optimizes enterprise data flow, reveals data lineage, and helps to perform impact analyses and ensure the stability of the data architecture. It works with systems based on multiple technologies (Teradata, Informatica, Oracle, and IBM Cognos).