The Infrastructure for Information Management: A Brave New World for the CIO WHITE PAPER
SAS White Paper

Table of Contents

Trends and Drivers for Information Infrastructure
Objectives for Organizational Information Management: The New Dial Tone
Emerging Drivers for Common Enterprise Information Analytics
Industry Interests Also Drive Enterprise Information Management
    Financial Services
    Health Care and Life Sciences
    Government
    Telecommunications
    Energy and Utilities
Implications for Information Management
    Data Integration
    Data Virtualization
    Event Stream Processing
    Metadata Management
    Data Quality Management
    Data Governance
Summary

Content for this paper was provided by David Loshin, President of Knowledge Integrity Inc. Loshin is a recognized thought leader and expert consultant in the areas of data quality, master data management and business intelligence. He has written numerous books, white papers and Web seminars on a variety of data management best practices. His book Master Data Management has been endorsed by data management industry leaders, and his MDM insights can be reviewed at mdmbook.com. He can be reached at loshin@knowledge-integrity.com.
The chief information officer (CIO) title has been well established for many years. And although the role the CIO plays has slowly evolved in alignment with (and sometimes in reaction to) changes in the world of technology, dramatic changes in the perception of the creation, use and employment of information have somewhat skewed the direction that the CIO role has taken.

In the past, the main focus of information management was subsidiary to the execution of business as usual, typically framed within the development framework for applications that implement operational or transactional business processes. Here, the data acquired, created, modified and used was solely intended to guarantee the proper completion of the process. This allowed siloed business functions to develop the same or similar data models, interfaces and functionality. In this context, the CIO's main focus was system infrastructure: ensuring that the system (including processing engines, storage and networking) was configured to meet business needs.

Trends and Drivers for Information Infrastructure

Organizations continued to recognize that data sets (previously presumed to be byproducts of the operational environments) actually held significant value. As data sets were collected and combined for reporting or analysis, their repurposing introduced new demands and uncovered new constraints in the information infrastructure. There are a number of trends that are worth noting:

• Technology adaptation: Innovative technologies, such as the explosive use of smartphones that both generate and consume information, can disrupt the presumed information infrastructure needs. Some industries are particularly sensitive to technical changes; one example is the energy industry's adoption of new smart meters that generate orders of magnitude more data than before.
• Big data and big data analytics: This trend only confirms the need for a sound information infrastructure, as more organizations seek to absorb larger volumes of data from varied sources and of varied structure.

• Integrated predictive analytics: The time gap for exploiting information is rapidly closing as organizations focus on competitiveness. Many organizations are tightly coupling their analytics engines to their operational systems to inform decision making in real time.

• Management of auditable compliance: Whether one examines the result of the recent financial credit crisis, deregulation of industries, new laws enacted governing health care reform, or numerous other legislative initiatives, the implication is that demonstrating compliance with regulations requires access to historical data.

• Data governance: Increased reuse and repurposing of information, coupled with the expanded scope of information management, has highlighted the gaps in which the absence of defined and enforced data policies can impede the business. Operational data governance requires retooling of the environment to enable inspection, monitoring and reporting of data policy compliance.
Any one of these trends would imply the need for sound information management practices. However, the nexus of all the trends creates the impetus to institute the proper information management policies and infrastructure to capture, filter and analyze data and turn it into knowledge that drives positive business results.

Objectives for Organizational Information Management: The New Dial Tone

The stark reality is that as information ascends in perception of value and importance to the organization, there is an ever-growing need for the CIO to demonstrate more immediate value from data. Adopting innovative technology is one way to provide the perception of progress. However, introducing analytical appliances or implementing Hadoop while ignoring the critical information infrastructure aspects necessary to support the collection and management of data may lead to questionable results when the level of trust in the usability of the information can easily be challenged. That means that increased attention to fundamental capabilities for information management must accompany any adoption of new technology.

As opposed to implementing data management components on a project-by-project basis, the time has come to view information management as an organizational business imperative. Business user expectations for data accessibility, availability and quality are approaching the sustained need for standard services, like telephony and network access. This dial-tone approach to information management services establishes a baseline enterprisewide capability for data utility, and includes components for:

• Data integration. What used to be called extraction, transformation and loading (ETL) has evolved beyond the original scope of data warehouse population to include the end-to-end mechanisms for data sharing, access and delivery.

• Data federation and virtualization. The desire for real-time, integrated analytics has ramped up the demand for high-speed data access to heterogeneous sources. Data federation enables semantically correct mappings across data assets and makes heterogeneous data access transparent to the end users. Virtualization smooths the delivery and presentation of federated data and provides caching to make access times predictable.

• Event stream processing. With the desire to absorb data from numerous sources, the business may want to apply filters or trigger actions based on streaming data. Event stream processing provides the infrastructure to support these types of actions.

• Managed metadata. Merging a variety of data sources without a common agreement on definitions and meanings will always lead to confusion. Establishing a metadata management practice using the right components will help alleviate some of these concerns.

• Data quality management. Any business environment will be compromised without establishing a level of trust in the usability and quality of the data. Parsing, standardization and cleansing all contribute to a predictable level of data quality.

• Data governance. Data governance technologies enable inspection, monitoring and reporting of compliance with data quality rules and policies. In addition, tools to alert data stewards to data issues and monitor remediation progress help deploy corporate data policies.
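As a rough illustration of the event stream processing component in the list above, the following Python sketch applies a filter over a sliding time window and triggers an alert when the windowed average exceeds a threshold. The Event fields, the function name windowed_alerts and the sample values are assumptions made for this illustration; they do not come from any particular ESP product, and the sketch assumes events arrive in timestamp order.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Event:
    source: str       # hypothetical stream name, e.g. "meter-17"
    timestamp: float  # seconds since some reference point
    value: float      # the reading carried by the event


def windowed_alerts(
    events: Iterable[Event], window_seconds: float, threshold: float
) -> Iterator[Tuple[Event, float]]:
    """Yield (event, windowed_mean) whenever the mean of the readings seen
    in the last `window_seconds` exceeds `threshold`.

    Assumes events arrive ordered by timestamp.
    """
    window: Deque[Event] = deque()
    for event in events:
        window.append(event)
        # Evict readings that have fallen out of the time window.
        while event.timestamp - window[0].timestamp > window_seconds:
            window.popleft()
        mean = sum(e.value for e in window) / len(window)
        if mean > threshold:
            yield event, mean


if __name__ == "__main__":
    readings = [
        Event("meter-17", 0.0, 1.0),
        Event("meter-17", 1.0, 1.0),
        Event("meter-17", 2.0, 10.0),
        Event("meter-17", 3.0, 10.0),
    ]
    for event, mean in windowed_alerts(readings, window_seconds=1.5, threshold=5.0):
        print(f"alert at t={event.timestamp}: windowed mean {mean}")
```

Because the function is a generator, it processes each event as it arrives rather than waiting for a complete data set, which is the distinction between stream processing and querying static data.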
The demand for analytics incorporates a wide variety of data sources, including social media data, machine-generated data, and mixed-format content (such as documents and websites containing text, images, video, etc.). It is worthwhile examining how the demand for analytics, both in common use cases and in industry-specific cases, drives the need for predictable and trustworthy information management.

Emerging Drivers for Common Enterprise Information Analytics

Nearly every business is continuously seeking better ways of increasing revenues, decreasing operational costs and extending profitable customer relationships. A closer inspection of the popular approaches for achieving these goals reveals what could be called common analytics, which are not specific to any particular industry. Some examples include:

• Customer profiling and segmentation, which divides the customer community into categories based on key variables as a way of developing predictive models for behavior analysis.

• Customer/product affinity analysis, which examines which customer segments have affinities to specific products (or products within organized categories).

• Market basket analysis, which looks at predispositions to purchasing certain products at the same time.

All of these are examples of analytical approaches to drive increased product sales via up-selling, cross-selling, understanding customer price sensitivity, or through the purchase of product bundles with higher profit margins.
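To make the market basket analysis item above concrete, here is a minimal Python sketch that counts how often pairs of products are purchased together and derives two commonly used measures: support (the share of all baskets containing both items) and confidence (the share of baskets containing the first item that also contain the second). The function name pair_rules and the sample baskets are assumptions for illustration only; production analyses typically use dedicated algorithms and far larger data sets.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


def pair_rules(
    baskets: List[Iterable[str]], min_support: float = 0.0
) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """Return {(antecedent, consequent): (support, confidence)} for item pairs."""
    total = len(baskets)
    item_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore duplicate items within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules: Dict[Tuple[str, str], Tuple[float, float]] = {}
    for (a, b), together in pair_counts.items():
        support = together / total
        if support < min_support:
            continue
        # Confidence is directional, so record both a -> b and b -> a.
        rules[(a, b)] = (support, together / item_counts[a])
        rules[(b, a)] = (support, together / item_counts[b])
    return rules


if __name__ == "__main__":
    sample = [
        ["bread", "milk"],
        ["bread", "butter"],
        ["bread", "milk", "butter"],
        ["milk"],
    ]
    for (first, second), (support, confidence) in sorted(pair_rules(sample, 0.5).items()):
        print(f"{first} -> {second}: support={support:.2f}, confidence={confidence:.2f}")
```

A high-confidence rule suggests a candidate for cross-selling or bundling, although the result is only as trustworthy as the transaction data behind it.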
While the ability to execute projects enabling these analyses has typically been reserved for only the largest organizations with the biggest analytics budgets, a combination of factors is increasingly enabling a much broader spectrum of companies to benefit from analytics, including:

• Data volumes: Not only are the volumes of data expanding, but the rate of expansion of newly created digital content continues to increase.

• Positive marketing: The information management industry has done a good job in marketing the purported benefits of analytics, effectively generating a blossoming demand.

• Feasibility: Larger organizations may have already had the resources to implement large-scale analytics programs, but with high-performance platforms deployed on collections of easily acquired commodity hardware components, the barrier to entry for implementing an analytics program has been significantly lowered.

• Right-time delivery: As the time windows for responding to emerging opportunities continue to shrink, there is a growing appetite for near-real-time delivery of useful knowledge to drive trustworthy decision making.
The result is that a greater number of smaller organizations are seeking to employ more sophisticated analysis techniques over a broader variety of digital content that spans both structured and unstructured sources. For example, these types of digital content, among others, are growing in importance:

• Structured data sets acquired either directly through the Web or through data aggregator vendors.

• Social media data, such as the unstructured comments and posts streamed through Twitter or Facebook.

• Machine-generated data, such as periodic reading of smart energy meters installed across a residential network.

• Mixed-format content, such as documents and websites containing text, photo images, graphic images, video, etc.

Analytic applications such as customer profiling, segmentation and classification can be greatly enhanced with data from a wider variety of sources. But as the demand grows for applications incorporating different types of data sources, the data management environment must be able to scale with the size and complexity of the data, and not just from a strict throughput performance perspective. There must be processes for extracting entity data from an unstructured source, identifying that entity, and augmenting the entity's profile with discovered characteristics.

All aspects of data utility have to be taken into account and aligned with the idea of data management as a dial-tone service. Enabling predictive analytics implies that data management services must meet a base level of expectations: The performance for data delivery must be predictable; the framework must provide trustworthy information; there must be ways to ensure that commonly used terms are not confused by downstream reinterpretation; and data and business rules must be effectively incorporated directly into developed applications as part of the system development life cycle.
Industry Interests Also Drive Enterprise Information Management

Distinct industries have particular business challenges whose solutions can be impeded by the absence of a sound information management strategy. The business objectives within any industry remain aligned with the same core dimensions of value (namely increasing revenues, decreasing operational costs, managing risk and enhancing profitable customer experiences). Reviewing some examples of discrete challenges facing different industries allows one to see how the business drivers can be specifically linked to establishing best practices for information management.

Naturally, all of these industries have different business applications that rely on good data management practices. And they all share key dependencies: predictability of data availability, data accessibility, timeliness, and importantly, consistency from a structural and semantic perspective.
Financial Services

In the banking and financial services industry, the aftermath of the recent credit crisis has led to increased concern about accuracy in assessing and managing risk. A number of governmental agencies, as well as international oversight authorities, have drafted regulations intended to ensure that financial institutions are properly capitalized to guard against excessive risks. Most of these regulations and guidelines are tightly coupled with data delivery and information management. For example:

• Dodd-Frank: In the United States, the Dodd-Frank Wall Street Reform and Consumer Protection Act empowered new government research agencies to request data from banking institutions to ensure financial stability.

• Solvency II: These insurance industry regulations empower a supervisory authority to request information that can comprise qualitative or quantitative elements, historic, current or prospective elements, and data from internal or external sources. This information must reflect the nature, scale and complexity of the business, must be accessible, complete in all material respects, comparable and consistent over time, and must be relevant, reliable and comprehensible. (See the Solvency II site for more details.)

• Basel III: These international banking accords establish a standard for bank capital adequacy, stress testing and market liquidity risk. Basel III accords rely on the availability of accurate information for analytical calculations of capitalization requirements intended to improve risk management and provide enhanced predictability of financial stability.

Health Care and Life Sciences

In the United States, there are certainly issues for Medicare and Medicaid driven by the Affordable Care Act after it was upheld by the Supreme Court in 2012.
This affects the information management expectations for government, health care providers, payers, pharmaceutical companies and medical device manufacturers. Some examples include:

• Health information exchanges: The desire to migrate toward electronic health records requires the creation of health information exchanges (HIEs) that enable the electronic exchange of health care information within a region or among a community of participants.

• Dual-eligibles: Both state and federal government agencies are charged with aligning the care provided to dual-eligibles, or individuals who are covered by both Medicare and Medicaid programs. Reducing duplicated service and payments requires an effective way of managing and merging data from multiple systems.

• Physician sunshine reporting: Increased scrutiny of pharmaceutical incentives to health care providers has led to increased requirements for reporting and analysis, both by the reporting companies and the agencies collecting the data. Accurate and complete reports require broad data visibility across multiple business functions and systems. At the same time, government agencies aggregating reported exchanges of value to providers must be able to uniquely identify providers as well as link their records across a multitude of reports filed by pharmaceutical and medical device companies.
Government

Aside from the examples we have already seen related to government information management, there are additional drivers within government to devote increased attention to improved information management. Some examples include:

• Management and retention: Agencies may collect significant amounts of information over time, requiring the implementation of data archiving and retention policies.

• Cross-agency integration: Cooperation among agencies for providing social services, assessing government debt obligations and analyzing security threats suggests the benefits of establishing methods for efficient and high-quality data exchanges.

• Transparency: Directives for transparency of government operations have driven the creation of data sets for public consumption. The processes for creating and publishing these data sets require data accessibility, data integration and organization.

Telecommunications

The telecommunications industry is no stranger to the need for information management. These companies have a long history of collecting massive amounts of transaction data associated with call detail records to support both high-quality communications connectivity and business operations (managing accounts, issuing statements and collecting payments). Yet the industry continues to adapt in ways that seem to expand beyond its original focus, including providing a wider range of services such as wireless, television and Internet, among others. The amount of information now available collides with the key drivers facing the industry, including:

• Coordination across provided services. Providing the perception of a unified provision of services requires the ability to unify customer/product information across all areas of the business.

• Improved marketing of bundled services. Telecommunications marketing becomes more complex as the array of services and potential product bundles widens.
Again, visibility into customer account information, relationships among customers, and customer profiles and preferences will enable more productive marketing campaigns.

• Compliance with regional and local taxation. A variety of governmental taxes, fees and other charges must be applied according to geography, and this demands accuracy in account and invoicing data.

• Enhanced customer experience management. Customer profitability analysis helps drive VIP levels of service to the best customers, which again requires accuracy of insight into the customer/product mix.

• Fraud analysis. The growing range of provided services creates new opportunities for abuse and fraud, especially in the context of revenue leakage, which requires governed management of enterprise data for rapid and accurate analyses.
Energy and Utilities

The combination of deregulation, eco-awareness and technical improvements in data collection and monitoring via smart meters is driving monumental changes in the energy and utilities industries, requiring improved information management capabilities. Some specific drivers include:

• Data volumes from smart metering. The trend of installing smart meters heralds a new age in data management for energy utilities, as both the volume and the speed of data generation will explode. The combination of size and velocity of this data will drive a re-engineering of the enterprise data management infrastructure.

• Increased needs for monitoring energy grids. With increased violent weather patterns, there is a corresponding increase in events leading to widespread outages. Proactive monitoring of the energy grid for sentinel patterns indicating imminent outages can help utilities more effectively allocate resources for remediation efforts.

• Monitoring pipeline sensor networks. As more cross-continental pipeline projects appear imminent, there is a corresponding need for the information management components to monitor and analyze data streaming from thousands more (or perhaps orders of magnitude more) sensors strategically placed along the entire length of the pipeline network.

• Eco-awareness and corresponding analytics. Energy companies, in reaction to increased eco-awareness, have begun to provide analytical feedback to customers in terms of absolute and relative energy consumption. This requires data accessibility and the ability to deliver reports to customers.

Implications for Information Management

We have considered two tracks of business drivers for deploying best practices in information management across the enterprise.
First, there are common demands that are shared across many different industries, such as the need for useful knowledge about customers and products to drive increased revenues and longer customer relationships. Second, there are characteristics for operational and analytical needs associated with specific industries. Either of these business drivers points to the need for increased agility and maturity in coupling well-defined information management practices with the technologies that compose an end-to-end information management framework. That suggests the need for a core understanding of the fundamental components of an information management strategy.

Data Integration

Data integration has become the lifeblood of the enterprise. Organizations continually recognize how critical it is to share data across business functions, and that suggests a continued need for increasing reliability, performance and access speed for data integration, particularly in these fundamental capabilities:
• Data accessibility. Organizations must support a vast landscape of legacy data systems, especially due to the desire to scan historical data assets for potential business value. One key aspect of data integration is accessibility, and the information management framework must provide connectors to that wide variety of data sources, including file-based and tree-structured data sets, relational databases, and even streamed data sources.

• Data transformation, exchange and delivery. Once data sets can be accessed from their original sources, the data integration framework must be able to efficiently move the data from source to target. There must be a capability to transform the data from its original format into one that is suited to the target, with a means of verifying that the data sets are appropriately packaged and delivered.

• Data replication and change data capture. The need to regulate the accessibility and delivery of ever-growing data volumes within expected time frames is impeded by data delivery bottlenecks, especially in periodic extractions from source systems and loading into data warehouses. Data replication techniques enable rapid bulk transfers of large data sets. You can synchronize the process by using a method known as change data capture that monitors system logs and triggers updates to the target systems as changes happen in the source.

Data Virtualization

Efficient data integration can address some of the issues associated with increasing demands for accessing data from numerous sources and of varied structure and format. Yet some complications remain in populating data warehouses in a timely and consistent manner that meets the performance requirements of consuming systems. When the impediments are linked to the complexity of synchronous extraction and transformation, you run the risk of inconsistencies between the consumers of data and the original source systems.
One way to address this is by reducing the perception of data latency and asynchrony. Data virtualization techniques have evolved and matured to address these concerns. Data virtualization tools and techniques provide three key capabilities:

• Federation: They enable federation of heterogeneous sources by mapping a standard data model to the access methods for the variety of sources constituting the federated model.

• Caching: Managing accessed and aggregated data within a virtual (cached) environment reduces data latency, thereby increasing system performance.

• Consistency: Together, federation and virtualization abstract the methods for access and combine them with the application of standards for data validation, cleansing and unification.

A virtualized data environment can simplify how end-user applications and business analysts access data without forcing them to be aware of source data locations, data integration or application of business rules.
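The federation and caching capabilities described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of any data virtualization product: the class name FederatedView, the field mappings and the two in-memory "sources" are all hypothetical. Each source is registered with a mapping from the standard model's field names to its own field names, and fetched rows are cached for a fixed time-to-live so repeated queries do not touch the underlying source.

```python
import time
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


class FederatedView:
    """Presents several sources through one standard model, with a simple cache."""

    def __init__(self, ttl_seconds: float = 60.0) -> None:
        # name -> (fetch function, {standard field: source field})
        self._sources: Dict[str, Tuple[Callable[[], List[Record]], Dict[str, str]]] = {}
        # name -> (time fetched, rows already mapped to the standard model)
        self._cache: Dict[str, Tuple[float, List[Record]]] = {}
        self._ttl = ttl_seconds

    def register(
        self, name: str, fetch: Callable[[], List[Record]], field_map: Dict[str, str]
    ) -> None:
        self._sources[name] = (fetch, field_map)

    def _rows_from(self, name: str) -> List[Record]:
        now = time.monotonic()
        cached = self._cache.get(name)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # cache hit: the source is not contacted
        fetch, field_map = self._sources[name]
        # Federation step: rename source fields to the standard model.
        rows = [{std: raw[src] for std, src in field_map.items()} for raw in fetch()]
        self._cache[name] = (now, rows)
        return rows

    def query(self, predicate: Callable[[Record], bool] = lambda row: True) -> List[Record]:
        """Return matching rows from every source, in registration order."""
        return [
            row
            for name in self._sources
            for row in self._rows_from(name)
            if predicate(row)
        ]


if __name__ == "__main__":
    view = FederatedView(ttl_seconds=60.0)
    view.register(
        "crm",
        lambda: [{"cust_id": 1, "full_name": "Ada"}],
        {"customer_id": "cust_id", "name": "full_name"},
    )
    view.register(
        "billing",
        lambda: [{"CustomerNumber": 2, "Name": "Grace"}],
        {"customer_id": "CustomerNumber", "name": "Name"},
    )
    print(view.query())
```

The caller sees only the standard field names and never learns which system a row came from, which is the transparency the section describes. A real environment would also need cache invalidation and the validation and cleansing standards mentioned under consistency.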
Event Stream Processing

Traditional business intelligence systems may be insufficient to address the active capturing, monitoring and correlation of real-time event information into useful knowledge. To address this, a technique called event stream processing (ESP) enables real-time monitoring of patterns and sequences of events flowing through streams of information.

ESP systems help organizations rapidly respond to emerging opportunities that can result from the confluence of multiple streams of information. These systems allow information management professionals to model how participants within an environment are influenced by many different data input streams, and analyze patterns that trigger desired outcomes. ESP systems can continuously monitor (in real time) all potentially influential streams of events against the expected patterns and provide low-latency event processing within defined event windows. When there is a variance from expectations or identification of new opportunities, the systems can generate alerts to the right individuals, who can take action much more rapidly than in a traditional data analysis scenario.

ESP networks can monitor high data volumes from multiple input data sources with very low latencies for event processing. The ability to continuously monitor a wide variety of streaming inputs in a scalable manner allows you to recognize and respond to emerging scenarios because of the lower latencies and turnaround time for analysis. In essence, instead of running dynamic queries against static data, you can look at ESP as a method for simultaneously searching through massive amounts of dynamic data for many defined patterns.

Metadata Management

The drive for cross-functional data sharing and exchange exposed the inherent inconsistencies associated with data systems designed, developed and implemented separately within functional silos.
And since early metadata management approaches only focused on structural, technical aspects of data models (to the exclusion of the meanings and semantics that are relevant to the business), metadata management projects often foundered. This means that the modern enterprise information management environment must enable business-oriented metadata management, including tools and methods for:

• Business term glossaries to capture frequently used business terms and their authoritative definitions.

• Data standards, such as naming conventions, defined reference data sets, and standards for storage and exchange.

• Data element definitions that reflect the connection to business terms and provide context-relevant definitions for use within business applications.

• Data lineage that shows the relationships between data element concepts and their instantiation across different models and applications.

• Integration with data governance policies to support validation, compliance and control.
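One way to picture how a business term glossary, data element definitions and data lineage fit together is the small Python sketch below. All class, method and field names here (BusinessTerm, DataElement, MetadataRegistry and the sample systems) are hypothetical and chosen only for illustration. The registry refuses to record a data element unless it points to a defined business term, and it can list every place a term is instantiated.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BusinessTerm:
    name: str
    definition: str  # the authoritative, business-facing definition


@dataclass(frozen=True)
class DataElement:
    system: str  # application or database holding the element
    table: str
    column: str
    term: str    # name of the business term this element instantiates


class MetadataRegistry:
    """A toy glossary that links physical data elements to business terms."""

    def __init__(self) -> None:
        self._terms: Dict[str, BusinessTerm] = {}
        self._elements: List[DataElement] = []

    def define_term(self, name: str, definition: str) -> None:
        if name in self._terms:
            raise ValueError(f"term already defined: {name}")
        self._terms[name] = BusinessTerm(name, definition)

    def definition_of(self, name: str) -> str:
        return self._terms[name].definition

    def register_element(self, element: DataElement) -> None:
        # Every element must trace back to an agreed business term.
        if element.term not in self._terms:
            raise KeyError(f"unknown business term: {element.term}")
        self._elements.append(element)

    def lineage(self, term_name: str) -> List[DataElement]:
        """List every registered instantiation of a business term."""
        return [e for e in self._elements if e.term == term_name]


if __name__ == "__main__":
    registry = MetadataRegistry()
    registry.define_term("customer", "A party that has purchased at least one product.")
    registry.register_element(DataElement("crm", "accounts", "cust_id", "customer"))
    registry.register_element(DataElement("billing", "invoices", "CustomerNumber", "customer"))
    for element in registry.lineage("customer"):
        print(element.system, element.table, element.column)
```

Requiring a term before an element can be registered is one simple way to keep the business meaning attached to the technical structure, which is the gap the earlier, purely structural approaches left open.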
Data Quality Management

Best practices for data quality management are intended to help organizations improve the precision of identifying data flaws and errors as well as simplify the analysis and remediation of root causes of data flaws. At the same time, data quality tools and techniques must support the ability to standardize and potentially correct data when possible, flag issues when they are identified, notify the appropriate data steward and facilitate the communication of potential data issues to the source data providers. These objectives can be met within a formal framework for data quality management that incorporates techniques for:

• Data parsing and standardization: Scanning data values with the intent of transforming nonstandard representations into standard formats.

• Data correction and cleansing: Applying data quality rules to correct recognized errors, cleanse the data and eliminate inconsistencies.

• Data quality rules management: Centrally managing data quality requirements and rules for validation and verification of compliance with data expectations.

• Data quality measurement and reporting: Providing a framework for invoking services to validate data against data rules and report anomalies and data flaws.

• Standardized data integration validation: Continual validation of existing data integration processes and embedded verification of newly developed data integration processes.

• Data quality assessment: Source data assessment and evaluation of data issues to identify potential data quality rules using data profiling and other statistical tools.

• Incident management: Standardized approaches to data quality incident management (reporting, analysis/evaluation, prioritization, remediation, tracking).

Data Governance

Finally, no modern enterprise information management environment would be complete without techniques for validating data rules and compliance with data policies.
At the very least, that would be supported with tools for managing the data policy life cycle, which includes drafting policies, proposing policies to the data governance committee, providing reviews and revisions, seeking approval, and moving rules into production. These tasks must be aligned with the design and development tasks within the organization's system development life cycle. This permeates the lifetime of information management, from the analysis and synthesis of data consumer requirements through conceptual modeling, logical and physical design, and subsequent implementation.
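Once rules have moved into production, validating data against them and reporting compliance can be sketched as follows. This is a minimal Python illustration under assumptions: the DataRule type, the compliance_report function, the two sample rules and the sample records are all invented for the example and do not represent any specific governance tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class DataRule:
    name: str
    check: Callable[[dict], bool]  # returns True when a record complies


def compliance_report(records: Iterable[dict], rules: List[DataRule]) -> Dict[str, dict]:
    """Validate every record against every rule and summarize the results."""
    rows = list(records)
    total = len(rows)
    report: Dict[str, dict] = {}
    for rule in rules:
        failed = [index for index, row in enumerate(rows) if not rule.check(row)]
        report[rule.name] = {
            "failed_rows": failed,  # positions a data steward would review
            "pass_rate": 1.0 if total == 0 else (total - len(failed)) / total,
        }
    return report


# Two hypothetical production rules.
RULES = [
    DataRule("email_present", lambda r: bool(r.get("email"))),
    DataRule(
        "state_is_two_letter_code",
        lambda r: isinstance(r.get("state"), str)
        and len(r["state"]) == 2
        and r["state"].isupper(),
    ),
]

if __name__ == "__main__":
    customers = [
        {"id": 1, "email": "a@example.com", "state": "NC"},
        {"id": 2, "email": "", "state": "NC"},
        {"id": 3, "email": "c@example.com", "state": "North Carolina"},
        {"id": 4, "email": "d@example.com", "state": "VA"},
    ]
    for name, result in compliance_report(customers, RULES).items():
        print(name, result)
```

Keeping the rules as data, separate from the code that applies them, lets the same definitions be reviewed by a governance committee and then reused in monitoring and in newly developed applications.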
Summary

If businesses are recognizing the need for a dial-tone approach to establishing data utility services for meeting user expectations for data accessibility, availability and quality, it is incumbent upon the information management practitioners to ensure that the organization is properly prepared, from both a policy/process level and a technology level. This paper has provided an overview of the key drivers and then presented an enumeration of the critical capabilities that must compose the information management landscape. Organizational readiness implies three things:

1. There has been an evaluation of the current and future organizational needs for information management practices, services and technologies.

2. There has been an assessment to determine whether the existing capabilities and organizational maturity are sufficient to satisfy current and future needs, as well as identify potential gaps.

3. There is a plan to acquire and embed the methods and technologies that are required to eliminate any gaps.

Enumerating the business needs for information management capabilities within your organization and assessing the capability gap will provide you with a strategy and a road map for developing an enterprisewide information management framework.
About SAS

SAS is the leader in business analytics software and services, and the largest independent vendor in the business intelligence market. Through innovative solutions, SAS helps customers at more than 65,000 sites improve performance and deliver value by making better decisions faster. Since 1976 SAS has been giving customers around the world THE POWER TO KNOW.

SAS Institute Inc. World Headquarters +1 919 677 8000
To contact your local SAS office, please visit: sas.com/offices

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies. Copyright 2013, SAS Institute Inc. All rights reserved. 106535_S107489_0713