IBM Software Thought Leadership White Paper May 2014 The IBM Agile Information Governance Process
We are drowning in data today. IDC estimates that the amount of information in the digital universe exceeded 1.8 zettabytes, or 1.8 trillion gigabytes, in 2011, and is doubling every two years.[1] Companies are increasingly turning to big data to drive analytic and operational applications. For example, a credit card company may analyze transactions in real time to uncover fraud. A retailer may engage in listening efforts to understand what its customers are saying in social media. Finally, a utility may use smart meter readings to encourage customers to move their electricity consumption to off-peak hours.

Big data has the following characteristics:

Volume: Enterprises are awash with data, easily amassing terabytes and petabytes of information, and even zettabytes in the future.
Velocity: Often time-sensitive, streaming data must be analyzed with millisecond response times to bolster real-time decisions.
Variety: Big data includes structured, semi-structured and unstructured data such as emails, audio, video, clickstreams, log files and biometrics.
Veracity: Veracity is different from the other Vs. As volume, velocity and variety grow, the veracity, or confidence in your data, decreases. If organizations do not have confidence in the underlying big data, then they will not be able to trust the analytics and insights that emanate from this data.

Companies are also using big data technologies to modernize their legacy infrastructures. As they do this, information governance best practices, including metadata, data quality, master data, data security and data lifecycle management, are critical to the success of these initiatives. This white paper will focus on the veracity (or confidence) dimension of big data.
Figure 1. The IBM Agile Information Governance Process. [Diagram: a continuous loop of six steps across three phases. Plan: (1) Define business problem, (2) Obtain executive sponsors, (3) Align teams, (4) Understand data risk and value. Act: (5) Implement project(s). Assess: (6) Measure results. Step 5 lists nine use cases: Enhanced 360-degree view of the customer; Big data exploration; Security/intelligence extension; Application development and testing; Application efficiency; Application consolidation and retirement; Security and compliance; Data warehouse augmentation; Operations analysis.]
The IBM Agile Information Governance Process, shown in Figure 1, consists of six steps across three distinct phases. In the Plan phase, information governance teams define the business problem, obtain executive sponsorship, align teams and understand data risk and value. In the Act phase, organizations implement one or more projects based on common use cases. Finally, in the Assess phase, information governance teams measure results.

The IBM Agile Information Governance Process is built as a continuous loop. As information governance teams measure results on one project, they start anew by defining the business problem that may spawn additional projects. This white paper will explore the steps in the IBM Agile Information Governance Process.

Step one: Define the business problem

The information governance team should begin by defining the business problem. Best practices across clients often line up with nine common use cases (see "Step five: Implement projects" for more on the use cases). Here are a few business problems with a big data orientation:

Increasing system performance based on log analytics
IT departments are turning to big data to analyze application logs for slivers of insight that can improve system performance. Because application vendors' log files are in different formats, they need to be standardized first to promote IT's confidence in the results.

Optimizing water, gas and electricity consumption through smart meters
Several utilities are rolling out smart meters to measure the consumption of water, gas and electricity at regular intervals of an hour or less. These smart meters generate copious amounts of interval data that need to be governed appropriately. Utilities must safeguard the privacy of this interval data because it can potentially indicate a subscriber's household activities, as well as the comings and goings from his or her home.
In addition, utilities need to establish policies for the archiving and deletion of interval data to reduce storage costs.

Masking sensitive data within call-center voice recordings
Many call centers make voice recordings of some or all of their calls for quality assurance purposes or to comply with regulations. These voice recordings may contain sensitive or personally identifiable information (PII) such as Social Security numbers, as well as Payment Card Industry data such as the three-digit or four-digit card verification code and primary account numbers. Therefore, they must be safeguarded against unauthorized access and use.

Step two: Obtain executive sponsorship

Information governance programs historically focused on traditional types of data. Bringing big data within the scope of the information governance program requires strong executive sponsorship. The executive sponsor may be someone from business or IT, but must have the ability to talk to both constituencies. The executive sponsor should be able to prioritize projects, obtain funding and manage headcount. Examples of a cross-functional executive sponsor include the chief information officer, the chief data officer or the vice president of enterprise data management. Specific functional executives from marketing, risk and supply chain departments may also provide executive sponsorship if big data constitutes a competitive advantage for the organization.

Step three: Align teams

Organizations also need to update certain information governance roles to account for big data. For example, the Information Governance Lead may need to assume the following additional responsibilities to govern big data:
- Determine the types of big data that need to be governed
- Assist with the development of a business case to support big data governance
- Evangelize big data with business stakeholders
- Support activities that integrate big data with master data management (MDM)
- Expand the scope of the business glossary to support terms relating to big data (for example, a "session" relating to clickstream analytics)
- Align with multiple organizations, including legal, marketing, privacy and senior management, to establish policies regarding the acceptable use of big data
- Drive policies regarding the retention, compression and archiving of big data
- Foster policies to improve the security and privacy of big data
- Oversee data steward activities relating to big data

In addition, the role of a data steward may need to change to accommodate big data. For example, a customer data steward may need to assume the following additional responsibilities to govern social media:
- Provide input into the attributes to match semi-structured and unstructured data, such as social media profiles, with MDM records
- Work with internal teams to determine what big data should be moved, federated, archived and ignored
- Leverage the data stewardship console to link, deduplicate and merge unstructured data with customer MDM records (for example, determine if the "Susie Smith" from Facebook is the same as "Susan Smith" in the customer MDM hub)
- Work with legal counsel, the privacy department and business stakeholders to establish privacy policies regarding the acceptable use of social media

IBM InfoSphere Business Glossary and IBM InfoSphere Master Data Management (InfoSphere MDM) support the ownership of data by the business by allowing data stewards to manage business terms, policies and data rules.

Step four: Understand data risk and value

Data discovery and profiling are a basic requirement of an information governance program. The following examples highlight some of the ways that understanding data can impact a broader information governance program.
Support data stewardship
IBM InfoSphere Information Analyzer supports data profiling, including analysis of data at the column, key, source and cross-domain levels. The IBM InfoSphere Data Quality Console extends the capabilities of InfoSphere Information Analyzer by providing data stewards with the ability to drill down into exceptions. For example, a data steward can use the InfoSphere Data Quality Console to drill down into expired life insurance policies by date of expiration.

Manage reference data
InfoSphere Information Analyzer can discover reference tables during the data profiling process. These reference tables can then be exported into the IBM InfoSphere Master Data Management Reference Data Management Hub. If these tables are updated in the latter, they can then be imported back into InfoSphere Information Analyzer.

Link data rules with business terms
Data rules within InfoSphere Information Analyzer can be linked to information governance rules in InfoSphere Business Glossary. For example, a data rule called CITY_EXISTS in InfoSphere Information Analyzer can be linked to an information governance rule called "Data quality rules for customer address" in InfoSphere Business Glossary.

Identify sensitive data
Organizations struggle with hidden sensitive data such as Social Security numbers located in a field called EMP_NUM. IBM InfoSphere Discovery can locate sensitive information such as PII so that data privacy rules can be enforced appropriately.
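As a minimal sketch of the idea behind sensitive data discovery, the Python snippet below inspects column values rather than column names, so that Social Security numbers hidden in a field such as EMP_NUM are still flagged. It does not represent how InfoSphere Discovery works internally; the function name find_sensitive_columns, the 80 percent threshold and the sample rows are assumptions made purely for illustration.

```python
import re

# Illustrative pattern for US Social Security numbers written as 123-45-6789.
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")


def find_sensitive_columns(rows, threshold=0.8):
    """Return columns where at least `threshold` of non-empty values look like SSNs.

    `rows` is a list of dicts (column name -> value). The threshold is a
    hypothetical tuning knob, not a product setting.
    """
    matches = {}
    totals = {}
    for row in rows:
        for column, value in row.items():
            text = "" if value is None else str(value).strip()
            if not text:
                continue  # ignore empty cells when computing the match rate
            totals[column] = totals.get(column, 0) + 1
            if SSN_PATTERN.match(text):
                matches[column] = matches.get(column, 0) + 1
    return sorted(
        column
        for column, total in totals.items()
        if matches.get(column, 0) / total >= threshold
    )


# Hypothetical sample data: the column name gives no hint that it holds SSNs.
sample_rows = [
    {"EMP_NUM": "123-45-6789", "CITY": "Somers"},
    {"EMP_NUM": "987-65-4321", "CITY": "Armonk"},
    {"EMP_NUM": "555-12-3456", "CITY": "Austin"},
]

flagged = find_sensitive_columns(sample_rows)
```

Profiling on content rather than on names is the design choice that matters here: a name-based check would pass EMP_NUM as harmless, whereas a value-based check flags it for the data steward to review.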
Discover complete business objects
Complete business objects are logical groupings of related objects such as customers. InfoSphere Discovery can locate complete business objects that can then be archived using the IBM InfoSphere Optim Data Growth Solution.

Because of its extreme volume, velocity and variety, big data should be handled differently than traditional types of data. Table 1 contrasts traditional and big data quality programs.

Here is an example scenario: Acme Corporation uses IBM InfoSphere BigInsights to conduct sentiment analysis of Twitter feeds. The social listening department may adopt the following business rules to determine whether mentions of "acme" refer to Acme Corporation or are noise that needs to be filtered out:
- If tweet contains @Acme, then confidence level = 99 percent
- If tweet contains "Acme" and Acme product names, then confidence level = 75 percent
- If tweet is on the ignore list, then confidence level = 0 percent

Step five: Implement projects

In working with customers across many industries, the following sweet spots emerge as the most common starting points in big data and governance projects. These are represented as part of the Act phase of the IBM Agile Information Governance Process shown in Figure 1. The information governance implications of these big data projects are discussed in the following pages.

Enhanced 360-degree view of the customer
Gaining a full understanding of customers (what makes them tick, why they buy, how they prefer to shop, why they switch and so on) is strategic for virtually every company. In fact, in a recent IBM study, the number-one recommendation was that organizations should focus their big data efforts first on customer analytics that enable them to truly understand customer needs and anticipate future behaviors.[2]

Forward-thinking organizations recognize the need to equip their customer-facing professionals with the right information in context to help them solve customer problems and improve up-selling and cross-selling. However, they need to consider several information governance implications as well.

InfoSphere MDM and IBM Watson Explorer combine structured and unstructured information in context from customer relationship management, content management, supply chain, order tracking databases, email and many more systems to present a 360-degree view of the client. These offerings can also integrate in-context analytics from social media and other types of big data from InfoSphere BigInsights and IBM InfoSphere Streams.
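The Acme Corporation tweet rules from the Step four scenario can be sketched in a few lines of Python. This is an illustrative reading of those three rules, not BigInsights code. The paper does not say which rule wins when several apply or what happens when none applies, so the sketch assumes the ignore list is checked first and returns None for tweets that match no rule. The product names and ignore-list phrases are hypothetical.

```python
import re

# Hypothetical reference data; a real program would maintain these lists
# through its data stewards.
PRODUCT_NAMES = {"anvil", "rocket skates"}
IGNORE_PHRASES = {"acme thread", "acme of"}


def acme_confidence(tweet):
    """Return the confidence (in percent) that a tweet refers to Acme Corporation.

    Assumed precedence: ignore list, then @Acme mention, then "Acme" plus a
    product name. Returns None when no rule applies.
    """
    text = tweet.lower()
    if any(phrase in text for phrase in IGNORE_PHRASES):
        return 0
    if re.search(r"@acme\b", text):
        return 99
    mentions_acme = re.search(r"\bacme\b", text) is not None
    mentions_product = any(name in text for name in PRODUCT_NAMES)
    if mentions_acme and mentions_product:
        return 75
    return None


scores = [
    acme_confidence("Thanks @Acme for the quick delivery"),
    acme_confidence("My Acme anvil arrived today"),
    acme_confidence("She reached the acme of her career"),
    acme_confidence("Nothing to see here"),
]
```

Making the precedence explicit matters because the same tweet can satisfy more than one rule; a steward reviewing results needs to know which confidence level takes priority.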
Dimension: Frequency of processing
  Traditional data quality: Batch-oriented
  Big data quality: Real-time and batch-oriented

Dimension: Variety of data
  Traditional data quality: Data format is largely structured
  Big data quality: Data format may be structured, semi-structured or unstructured

Dimension: Confidence levels
  Traditional data quality: Data needs to be in pristine condition for analytics in the data warehouse
  Big data quality: Noise needs to be filtered out but data needs to be good enough. Depending upon confidence levels and intended use, poor data quality may or may not impede analytics and insights

Dimension: Timing of data cleansing
  Traditional data quality: Data is cleansed prior to loading into the data warehouse
  Big data quality: Data can be loaded as-is because the critical data elements and relationships may not be fully understood. Volume and velocity of data may require streaming, in-memory analytics to cleanse data, thus reducing storage requirements

Dimension: Critical data elements
  Traditional data quality: Data quality is assessed for critical data elements such as customer address
  Big data quality: Data may be quasi- or ill-defined and subject to further exploration, hence critical data elements may change iteratively

Dimension: Location of analysis
  Traditional data quality: Data moves to the data quality and analytics engines
  Big data quality: Data quality and analytics engines may move to the data to ensure speed of processing

Dimension: Stewardship
  Traditional data quality: Stewards can manage a high percentage of the data
  Big data quality: Stewards can manage only a smaller percentage of data due to high volumes and/or velocity

Table 1. Traditional versus big data quality programs.

Marketing organizations often need to match lists of prospects against internal records to remove any customers who have made do-not-call elections. These large data sets push the limits of existing computational resources when IT needs to match 200 million prospects against a database of 100 million customers and return the results to marketing in 24 hours. IBM InfoSphere MDM can manage massive data sets comprising as many as one billion records.
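To make the do-not-call matching problem concrete, here is a small Python sketch that suppresses prospects who closely match a customer with a do-not-call election. It uses simple string similarity from the standard library as a stand-in and is not the InfoSphere MDM matching engine. The record fields (name, postal_code, do_not_call), the 0.85 threshold and the sample records are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(record):
    """Build a lowercase, whitespace-normalized comparison key (assumed fields)."""
    return " ".join(f"{record['name']} {record['postal_code']}".lower().split())


def similarity(a, b):
    """Return a similarity score between 0.0 and 1.0 for two records."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def suppress_do_not_call(prospects, customers, threshold=0.85):
    """Return the prospects that do not closely match any do-not-call customer."""
    blocked = [c for c in customers if c["do_not_call"]]
    return [
        p
        for p in prospects
        if not any(similarity(p, c) >= threshold for c in blocked)
    ]


# Hypothetical sample records.
customers = [
    {"name": "Susan Smith", "postal_code": "10589", "do_not_call": True},
    {"name": "John Doe", "postal_code": "73301", "do_not_call": False},
]
prospects = [
    {"name": "SUSAN  SMITH", "postal_code": "10589"},
    {"name": "John Doe", "postal_code": "73301"},
    {"name": "Maria Garcia", "postal_code": "94105"},
]

callable_names = [p["name"] for p in suppress_do_not_call(prospects, customers)]
```

This sketch compares every prospect with every blocked customer, which is exactly the approach that stops scaling at 200 million by 100 million records; production matching typically groups candidate records first and distributes the comparisons across many machines.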
IBM has also implemented the InfoSphere MDM probabilistic matching engine within a MapReduce framework on InfoSphere BigInsights. This has helped organizations implement probabilistic matching on ultra-large data sets in hours rather than days or weeks.

Big data exploration
The first step in leveraging big data is to find out what you have and to establish your ability to access it and use it to support decision making and day-to-day operations. Big data exploration is the way to get started.
Users should be able to explore big data in the context of operational and analytical applications. For example, call center agents should be able to search for content within the company's intranet portal. Watson Explorer automates the discovery of big data, regardless of its format or where it resides, providing a federated view of key business information necessary to drive new initiatives. The technology is characterized by its unique index and search capabilities that uncover data from multiple repositories. As a result, customer service representatives are able to reduce average handle time by doing text searches while on the call with the customer.

IBM has also integrated Watson Explorer with InfoSphere Business Glossary. As shown in Figure 2, a customer is now able to search on the term "taxation" in the business glossary and pull up instances of that term within unstructured content.

Figure 2. Watson Explorer supports text search from InfoSphere Business Glossary.

Security/intelligence extension
To combat sophisticated threats, organizations must adopt new approaches that help spot anomalies and subtle indicators of attack by leveraging all available data. This may include:
- Traditional log and event data
- Network flow data
- Vulnerability and configuration information
- Identity context
- Threat intelligence data

For example, InfoSphere Streams can correlate real-time feeds from multiple motion sensors to detect any threats to a physical environment. The software provides specialized data quality techniques when handling high volumes of data in real time without landing interim results to disk.
InfoSphere Streams can also discover the temporal offset when joining, correlating and matching data from different sources. For example, a streaming application that needs to combine data from two sensors needs to know that Sensor A generates events every second, while Sensor B generates events every three seconds. If InfoSphere Streams does not receive a sensor event as expected, it can generate an alarm.

Application development and testing
The tremendous size and complexity of big data projects create challenges for testers. Because production data contains PII, organizations need to mask that data in test environments. However, big data applications must be delivered rapidly and testing teams must create realistic, right-sized, masked test data sets in short order.

The IBM InfoSphere Optim Test Data Management solution streamlines the creation and management of test environments; subsets and migrates data to build realistic and right-sized test databases; masks sensitive data; automates test result comparisons; and helps eliminate the expense and effort of maintaining multiple database clones. The InfoSphere Optim Test Data Management solution can facilitate the following tangible financial benefits:
- Increased revenues from faster time-to-market due to automated generation of test data sets
- Reduced storage space for test data
- Fewer production defects due to better testing
- Reduced downtime as developers spend less time waiting for the refresh of their test environments

Application efficiency
Data growth often has an adverse impact on application performance and costs. Big data at rest includes smart meter readings, sensor data, RFID data and web logs that might reside in relational databases, file systems, NoSQL databases and Apache Hadoop. However, there is a myth that more data equals better analytics.
According to the CGOC Summit 2012 Survey, approximately 69 percent of enterprise information has outlived its usefulness and can be subject to defensible disposition practices. Organizations can improve the signal-to-noise ratio in their big data environments, moving from data swamps to data lakes, by retaining only the right data. Companies should compress and archive big data at rest to reduce storage costs and to improve application performance. Because Hadoop avoids data loss by replicating the same data across multiple nodes in a cluster, organizations should consider InfoSphere BigInsights for fault-tolerant data archiving.

The InfoSphere Optim Data Growth solution helps organizations reduce storage costs and improve application performance by archiving structured data. Compared with Hadoop, the InfoSphere Optim Data Growth solution can archive structured data in an immutable format where user access is tightly controlled and audited. Archived data can be subject to legal holds and is easily retrievable during legal proceedings. This data can be defensibly disposed of to minimize legal impact. Archived data in the InfoSphere Optim Data Growth solution is accessible to business intelligence and enterprise applications, supports search and can be easily restored back to the source.

Application consolidation and retirement
Chief information officers are always on the lookout for ways to reduce costs, including application consolidation and retirement. For example, one organization had eight instances of SAP across several versions. It was expensive to maintain these versions and to aggregate data for corporate reporting. However, application teams were reluctant to retire legacy applications for fear that they might need the underlying data for legal, regulatory or analytic purposes.
InfoSphere Optim Data Growth and InfoSphere Information Server facilitate application consolidation and retirement by archiving data from decommissioned applications while providing ongoing access to the underlying data. Tangible cost savings accrue from lower software license and maintenance, hardware and labor costs.

Organizations can now leverage the power of Hadoop to perform blended analytics of archived data in InfoSphere Optim as well as structured and unstructured data from other sources. Organizations can store their data in immutable format in the InfoSphere Optim Data Growth solution, which can now create query-able data archive files for storage in HBase. As a result, organizations can combine the immutability of an InfoSphere Optim archive with the processing power and cost-effectiveness of InfoSphere BigInsights Hadoop capabilities. Plus, Watson Explorer can also search for data within InfoSphere Optim archive files.

Security and compliance
This use case complements the security/intelligence extension project mentioned above and focuses on securing and protecting sensitive data, such as credit card numbers and health records. Companies are leveraging new types of internal and external data, which are now being consumed by innovative applications and new users. Because this data may be sensitive, organizations have to consider compliance and reputational risks. This is especially true in the US, as the Securities and Exchange Commission has ordered publicly traded companies to declare their security breaches. Although this sensitive data may be embedded within production, test, training and business intelligence environments, it needs to be protected regardless of location.

Data masking is the process of systematically transforming confidential data elements, such as trade secrets and PII, into realistic, but fictionalized, values. IBM InfoSphere Optim Data Masking on Demand allows organizations to invoke masking algorithms in real time irrespective of the data type and for data-at-rest and data-in-motion. Developers can invoke this data masking functionality directly from MapReduce routines and Jaql scripts. In addition, this feature has been packaged as database-specific user-defined functions (UDFs) so that data moving into and out of the Hadoop Distributed File System (HDFS), HBase, InfoSphere BigInsights, InfoSphere Streams, the IBM PureData System for Analytics, IBM DB2, IBM DB2 for z/OS and Oracle can be masked on demand.

Organizations also need to establish policies to monitor access to sensitive big data by privileged users. The Hadoop Activity Monitoring feature of IBM InfoSphere Guardium allows activity monitoring on Hadoop just like traditional environments. This approach not only has minimal impact on the network but also offers an audit trail with granular details of big data activity.

Data warehouse augmentation
Data warehouses are built for massive scale. However, many organizations are struggling to manage data storage costs as their volumes grow. As a result, they are integrating Hadoop file systems and data warehouse capabilities to increase operational efficiency.

Organizations are taking advantage of Hadoop's relatively inexpensive data storage by using it as a staging area before determining what data should be moved to the warehouse. They can process and analyze streaming data in real time to determine what should be stored, either in Hadoop or directly in the warehouse. Additionally, data can be cleansed and transformed before loading into the warehouse, enabling data exploration and ad hoc queries (see Figure 3).
Figure 3. Data flows through multiple zones as it is ingested, transformed and analyzed. [Diagram: IBM Watson Foundations. Data sources (transaction and application data; machine and sensor data; enterprise content; image, geospatial and video data; social data; third-party data) flow through data zones (real-time data processing and analytics; operational data zone; landing, exploration and archive data zone; deep analytics data zone; enterprise data warehouse and data mart zone), underpinned by information integration and governance. Analytics capabilities (decision management; discovery and exploration; cognitive; predictive analytics and modeling; reporting and analytics) feed new and enhanced applications. The IBM Big Data & Analytics infrastructure layer covers systems, security and storage, on premise, in the cloud or as a service.]

IBM InfoSphere DataStage includes the Big Data File Stage, which supports reading and writing multiple files in parallel from and to the HDFS. The Big Data File Stage leverages the parallel engine within InfoSphere DataStage to provide massive scalability. Developers can also use the Balanced Optimization functionality to design a job in the InfoSphere DataStage environment and then deploy all or part of it in InfoSphere BigInsights or Cloudera Enterprise. InfoSphere DataStage autogenerates Jaql for MapReduce through the Balanced Optimization technology.

IBM InfoSphere Business Information Exchange supports data lineage and impact analysis in highly heterogeneous environments. It helps organizations create, manage and share enterprise-wide common business terminology, policies and rules relating to all types of data, including big data.
For example, InfoSphere Business Information Exchange may define the term "unique visitor" as a unit used to count individual users of a website for the purposes of clickstream analytics. It may also include a business rule that governs how unique visitors are calculated. All of these decisions are made with the help of an easy-to-use web interface designed to simplify collaboration between business and IT.
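A business rule for "unique visitor" only becomes unambiguous once it is written down precisely. As a hedged illustration, the Python sketch below assumes one possible rule: a visitor is counted once per calendar day, identified by a visitor ID in each clickstream event. The rule, the event fields (visitor_id, timestamp) and the sample events are assumptions, not definitions taken from InfoSphere Business Information Exchange.

```python
from collections import defaultdict
from datetime import datetime


def unique_visitors_per_day(events):
    """Count distinct visitor IDs per calendar day.

    Assumed rule: repeat visits by the same visitor ID on the same day
    count once. Each event is a dict with an ISO 8601 `timestamp` and a
    `visitor_id` (hypothetical field names).
    """
    visitors_by_day = defaultdict(set)
    for event in events:
        day = datetime.fromisoformat(event["timestamp"]).date().isoformat()
        visitors_by_day[day].add(event["visitor_id"])
    return {day: len(ids) for day, ids in sorted(visitors_by_day.items())}


# Hypothetical clickstream events.
events = [
    {"visitor_id": "v1", "timestamp": "2014-05-01T09:15:00"},
    {"visitor_id": "v1", "timestamp": "2014-05-01T17:40:00"},
    {"visitor_id": "v2", "timestamp": "2014-05-01T10:05:00"},
    {"visitor_id": "v1", "timestamp": "2014-05-02T08:00:00"},
]

daily_counts = unique_visitors_per_day(events)
```

A different rule, such as counting each visitor once per month or excluding known crawlers, would produce different numbers from the same events, which is why the definition and its calculation rule belong together in the business glossary.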
Operations analysis
InfoSphere BigInsights provides robust sentiment analysis, enabling companies to leverage vast volumes of social media, machine data and other types of big data to improve the efficiency of their day-to-day operations.

A popular brand-name global retailer was experiencing declining product profit margins due to increased promotional activity. In order to address this business challenge, the company decided to collect and analyze product feedback from customers in social media such as Twitter and other websites to determine the pricing strategy for new products. If sentiment was not strongly positive during the product launch, the company would update its pricing in the master product catalog and offer discounts of 30 percent. This would replace its usual practice of selling merchandise at the end of the season at a 70 percent discount. As a result, the retailer was able to significantly improve its profit margins.

The same retailer also piloted a flash event lasting just one afternoon to promote a new line of swimwear. The marketing team used only social media to attract customers to the event and anticipated that the communication would go viral. While the event was extraordinarily successful and sales exceeded projections, the marketing team uncovered some issues when analyzing clickstream data. Customers who had taken pictures of the new line could not easily find the product online. After examining the root cause, the retailer modified its product hierarchy so that boardshorts could be found in shorts, swimwear and within its own subclass of boardshorts. InfoSphere MDM provides strong capabilities to manage product hierarchies and other attributes of product master data.

Step six: Measure results

As a final step, the organization should assess the results of the information governance program and make adjustments.
After this assessment is completed, the information governance program loops back to define new business problems or to make adjustments to existing business problems. The process then starts over again.

Many organizations can leverage their existing governance programs to both accelerate big data initiatives and reduce the time to implement further projects. IBM InfoSphere provides a foundation for big data, integration and governance to support these initiatives and their success.

For more information
Want to learn more about IBM InfoSphere capabilities? Call your IBM sales representative to schedule a Client Value Engagement at no cost or visit: ibm.com/software/data/infosphere

About the author
Sunil Soares is the founder and managing partner of Information Asset, LLC, a consulting firm that specializes in helping organizations build out their data governance programs. Prior to this role, he was the Director of Information Governance at IBM, and worked with clients across six continents and multiple industries. Soares has written four books on information governance: The IBM Data Governance Unified Process; Selling Information Governance to the Business; Big Data Governance; and IBM InfoSphere: A Platform for Big Data Governance and Process Data Governance.
Copyright IBM Corporation 2014

IBM Corporation
Software Group
Route 100
Somers, NY 10589

Produced in the United States of America
May 2014

IBM, the IBM logo, ibm.com, BigInsights, DataStage, DB2, Guardium, IBM Watson, InfoSphere, Optim, PureData, and z/OS are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at "Copyright and trademark information" at ibm.com/legal/copytrade.shtml

This document is current as of the initial date of publication and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates.

It is the user's responsibility to evaluate and verify the operation of any other products or programs with IBM products and programs. THE INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS" WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided.

The client is responsible for ensuring compliance with laws and regulations applicable to it. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the client is in compliance with any law or regulation.

[1] 2011 Digital Universe Study: Extracting Value from Chaos. IDC, 2011.
[2] IBM Institute for Business Value Executive Report. Analytics: Real-World Use of Big Data in Telecommunications. IBM, 2013.

IMW14737-USEN-01