Data virtualization: Delivering on-demand access to information throughout the enterprise

IBM Software
Thought Leadership White Paper
April 2013

Contents

Introduction
Challenges and drivers of data virtualization
Data virtualization delivers transparent, on-demand access to information
Benefits of data virtualization
Key considerations for a data virtualization solution
Strategic data virtualization use cases
Conclusion

Introduction

It's no surprise that data is growing. But with information sources ranging from RFID tags and smart meters to mobile phones, GPS-enabled devices and social media, data growth isn't just a volume issue anymore; it's also a complexity issue. The number of sources, the diversity of information consumers and the number of connections between them are also increasing.

Massive upswings in the volume, variety and velocity of data are precipitating a new era of computing. Today, we have access to a wealth of data that previously was not available. If we harness this data effectively, it can create exciting opportunities for growth and positive social impact by providing deep insight into what is happening and how to influence it.

In this environment, access to data is more important than ever before. Both business and technical users need the ability to explore, combine and analyze large volumes of information from across and beyond the enterprise.

However, as the number of data sources grows, trust in data can decline. Data is often named as one of an organization's greatest sources of value, but it can also be a great source of risk. Poor information management practices often lead to bad business decisions and greater exposure to compliance violations. Inconsistent information can cause a disconnect between business goals and IT programs.

Data virtualization offers an approach that enables easy access to information on demand.
When coupled with strong information integration and governance practices, data virtualization can help organizations achieve several objectives:

- Know what relevant data is available
- Understand the quality and trustworthiness of that data
- Access data easily based on how it will be used

Challenges and drivers of data virtualization

Although the new era of computing offers many new opportunities, the business and IT challenges are unchanged and, in some cases, magnified by the increasing demand for data. Several common pain points lead organizations to investigate data virtualization solutions as a data integration approach:

Slow implementation and adoption

While all integration approaches have the same goal (to effectively combine data from multiple sources and provide a unified view of that data), the time to delivery and ease of adoption for each may vary. A lengthy delivery cycle means the business must make do with the status quo until the solution is implemented. And if the solution is too complex, users will adopt it slowly or not at all, which means that the IT department may not be able to recoup the cost of implementation.

Lack of common access to and understanding of information

As the number of data sources and applications increases, organizations that fail to provide a common method for understanding and accessing all available information will find that the massive volume of data in their systems is essentially useless. Multiple information access points also increase the total cost of ownership, since each access point requires administration, security and management resources.

Limited budgets

Yesterday's information integration technologies were not designed with today's rapidly expanding data volumes and information-hungry consumers in mind. Existing solutions may struggle to handle the demanding environment and may lack functionality that was once optional but is now business-critical, such as data governance capabilities. Reliance on these older technologies eventually requires organizations to purchase more hardware, more databases and more software with more capabilities, all of which require budget that is in short supply. Because an integration solution is just one of the many expenses on the balance sheet, it must address key challenges cost-effectively.

Data growth

Whether it originates internally or is driven by an acquisition, growth in an organization is positive. However, when there is no means to effectively manage the data associated with that growth, it can lead to information bottlenecks throughout the enterprise. Companies need a flexible information integration strategy that can scale seamlessly and transparently as the environment changes.

Increasing competition

The new era of computing has given consumers far more choices than they had in the past, as well as the convenience of switching services easily, often online, in a matter of minutes. With so much choice at their fingertips, customers have higher expectations. To build loyalty and trust, enterprises must demonstrate a deep understanding of who their customers are and what they want.
This means IT departments need ways to ensure that customer information is understood, cleansed, monitored, transformed, delivered and easily accessible.

Source data complexity

From social media posts to machine data such as temperature readings and bar code scans, source data is becoming more complex and varied. Extracting and storing this data is no trivial task, but deriving insight from both structured and unstructured information is quickly becoming a requirement for maintaining a competitive edge. The complexity of the data sources is irrelevant to end users; all they need are answers to their questions. Organizations need an integration approach that can simplify data access and help boost productivity.

To address these challenges, organizations must make it easy for business users to leverage data confidently for any purpose. Data virtualization meets that goal.

Data virtualization delivers transparent, on-demand access to information

Designed to provide easy access to information on demand, data virtualization focuses on simplifying access to data by isolating the details of storage and retrieval and making the process transparent to data consumers. By simplifying access to data spread throughout an organization, data virtualization reduces the time required to take advantage of disparate data, making it easier for users and processes to get the information they need.
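To make the idea of isolating storage and retrieval details more concrete, the following is a minimal Python sketch rather than a description of any product. The `VirtualDataLayer` class, the `csv_source` and `sqlite_source` helpers and the dataset names are all hypothetical and exist only for illustration. Consumers ask for a dataset by its logical name and receive rows in one shape, whether the data lives in a relational table or in a delimited file.

```python
import csv
import io
import sqlite3
from typing import Callable, Dict, List

Row = Dict[str, str]
Fetch = Callable[[], List[Row]]


class VirtualDataLayer:
    """Hypothetical access layer: maps logical dataset names to fetchers."""

    def __init__(self) -> None:
        self._sources: Dict[str, Fetch] = {}

    def register(self, name: str, fetch: Fetch) -> None:
        self._sources[name] = fetch

    def read(self, name: str) -> List[Row]:
        # Consumers ask by logical name; storage details stay hidden.
        return self._sources[name]()


def csv_source(text: str) -> Fetch:
    """Wrap delimited text (standing in for a file or feed) as a fetcher."""
    def fetch() -> List[Row]:
        return [dict(row) for row in csv.DictReader(io.StringIO(text))]
    return fetch


def sqlite_source(conn: sqlite3.Connection, query: str) -> Fetch:
    """Wrap a SQL query against a relational source as a fetcher."""
    def fetch() -> List[Row]:
        cursor = conn.execute(query)
        columns = [col[0] for col in cursor.description]
        return [dict(zip(columns, map(str, values))) for values in cursor.fetchall()]
    return fetch


# Assumed sample sources: an in-memory database and a small CSV snippet.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.execute("INSERT INTO customers VALUES (1, 'Ada')")

layer = VirtualDataLayer()
layer.register("customers", sqlite_source(crm, "SELECT id, name FROM customers"))
layer.register("meter_readings", csv_source("meter_id,kwh\nm1,12.5\n"))

for dataset in ("customers", "meter_readings"):
    print(dataset, layer.read(dataset))
```

The calling loop at the end never mentions SQL or CSV, which is the transparency the paragraph above describes. A real solution would add security, metadata and query capabilities that this sketch leaves out.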

Data federation is one approach to achieving data virtualization. In fact, a Gartner survey on the adoption and use of data integration tools in 321 organizations worldwide showed that the use of data federation/virtualization capabilities is increasing.[1] Access to information is becoming increasingly important because of the breadth of data sources available. Given that trend, and given data virtualization's transparency and simplicity, it's no surprise that more and more organizations are implementing this approach.

"Data federation/virtualization, as an increasingly important component of a comprehensive data integration strategy, is gaining expanded interest as organizations begin to recognize its potential role in supporting the LDW [logical data warehouse], and in rendering data resources useful regardless of how they are deployed or where they are residing."
Gartner, "The Logical Data Warehouse Will Be a Key Scenario for Using Data Federation," September 26, 2012

Two primary strategies exist for data virtualization: data federation and data services. In both cases, data is exposed to be more consumable, accessible and easily reusable by users, customers or business processes throughout the enterprise.

Data federation

This data virtualization strategy involves virtually consolidating data from multiple sources, making them appear as a single data source to the end user. Data federation enables the end user to access data anywhere in the enterprise, regardless of its format or vendor. The complexities that are typically associated with querying data from multiple sources (such as database type, schema or structural differences) are hidden from the user.

Data services

With data services, data is provided to consumers on demand, regardless of where they are located.
Details about where the data is located or how it is obtained are invisible to end users; they simply send a data services request and the required data is retrieved and returned. This level of transparency makes it extremely easy to access information that is spread across the enterprise.

Benefits of data virtualization

Data virtualization eliminates the need to move or duplicate data. As a result, organizations can use third-party data more easily, simplify compliance efforts and reuse existing infrastructure more effectively. Data virtualization also delivers several other benefits for users across the enterprise, as described below.

Cost-efficiency

Because data virtualization involves virtual consolidation of data rather than physical movement, there is no need to create new databases or purchase additional hardware to store the consolidated data. This is a very attractive benefit for organizations looking for a cost-effective integration approach. In situations where physical data movement is not an option due to legal or compliance requirements or ownership issues (for example, when data is external to the organization), cost-effectiveness is an added bonus.

Versatility

Once data has been virtualized, it is available to any authorized user in the organization. By providing a data access layer for multiple projects, ranging from a customer service portal to self-service analytics, data virtualization transforms information throughout the enterprise into a powerful asset.

Agility

As the volume and complexity of data increase, physical data migrations become more difficult and cumbersome. Some business requirements (for example, ad hoc querying or prototyping scenarios) demand faster data access than traditional solutions can provide. Removing the extraction, delivery and other steps associated with physically moving data means the data is available to the business more quickly, which enables more agile business decision making.

Low risk

All integration approaches yield considerable benefits, but solutions that require invasive IT changes often introduce risk. Data virtualization does not require additional hardware or IT infrastructure changes, so it is a very low-risk integration solution that fits seamlessly into an existing IT environment.

Complexity hidden from end users

With data virtualization, the complexity and disparateness of the data sources do not matter because those details are hidden. The end user will never know, and doesn't need to know, the details about the data sources. This means changes to source systems will not affect downstream applications. The burden on IT is reduced since the solution does not have to be reconfigured each time a change is made.

Fast adoption and time-to-value

Data virtualization helps to enable faster results by making information easily accessible to the enterprise. Building virtual data stores is much quicker than creating physical ones since data does not have to be physically moved. Because of its positive impact and fast time-to-value, a data virtualization solution typically gains widespread adoption and support quickly, leading to faster business results.
Key considerations for a data virtualization solution

While data virtualization provides a unified data access layer for the enterprise, just virtualizing the data is not enough. To be valuable, data must be trusted. It must be understood, timely and accurate, and it must be high in quality. There are several key areas to consider when building a successful and effective data virtualization solution.

Data governance

To deliver business value, data must be governed properly. Data governance requires business and IT to share a common understanding of the data. In addition, it requires the capability to trace data back to its source to verify its lineage, and a set of rules designed to ensure that specific criteria (such as required fields, acceptable values or integrity checks) are met. Without these data governance capabilities, a data virtualization solution will likely not be able to support self-service data access initiatives. Proper information governance is also becoming a business imperative because end users are creating their own views of data more frequently than in the past, introducing more chances for ungoverned data to enter the system, and decisions made based on untrustworthy or incorrect information can have dire consequences for the organization.
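The rule types named above (required fields, acceptable values and integrity checks), together with a simple lineage tag, can be sketched in a few lines of Python. This is an illustrative sketch only: the `Rule` class, the helper functions, the `_source` field and the sample orders are assumptions made for the example, not part of any governance product.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List, Set

Record = Dict[str, Any]


@dataclass(frozen=True)
class Rule:
    """A named check that a record either passes or fails."""
    name: str
    check: Callable[[Record], bool]


def required(field: str) -> Rule:
    # Required field: the value must be present and non-empty.
    return Rule(f"required:{field}", lambda rec: rec.get(field) not in (None, ""))


def allowed(field: str, values: Set[str]) -> Rule:
    # Acceptable values: the value must come from a fixed set.
    return Rule(f"allowed:{field}", lambda rec: rec.get(field) in values)


def references(field: str, known_keys: Set[str]) -> Rule:
    # Integrity check: the value must point at an existing key.
    return Rule(f"integrity:{field}", lambda rec: rec.get(field) in known_keys)


def violations(record: Record, rules: Iterable[Rule]) -> List[str]:
    """Return the names of all rules that the record fails."""
    return [rule.name for rule in rules if not rule.check(record)]


# Hypothetical reference data and rule set for order records.
KNOWN_CUSTOMERS = {"C1", "C2"}
ORDER_RULES = [
    required("order_id"),
    allowed("status", {"open", "shipped"}),
    references("customer_id", KNOWN_CUSTOMERS),
]

# "_source" is an assumed lineage tag naming where each record came from.
orders = [
    {"order_id": "O1", "status": "open", "customer_id": "C1", "_source": "erp.orders"},
    {"order_id": "", "status": "lost", "customer_id": "C9", "_source": "web.orders"},
]

for order in orders:
    print(order["_source"], violations(order, ORDER_RULES))
```

Because every failed rule is reported alongside the record's source, a data steward can trace a problem back to the system that produced it instead of discovering it in a downstream report.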

Data quality

The same quality rules apply to virtualized data as to nonvirtualized data. Data quality processes must be sustainable and ongoing, not just point-in-time solutions. In a virtualized data environment, high data quality needs to be ensured at the source, because new data is virtualized immediately when it is created. Therefore, poor-quality data can automatically trickle into the data virtualization solution and permeate the enterprise if it is not addressed before the data is virtualized.

Query optimization

As the amount of data increases, a data virtualization solution must scale accordingly to maintain acceptable levels of performance. Because there are many different ways to execute a single query, a query optimizer should be in place to determine the best approach, whether that means the fastest route or the one that consumes the fewest system resources. The optimizer must also take into account several criteria, such as whether the operations should be executed by the federated server or by the source where the data is stored; the best order of the operations; and which implementations to use to perform local portions of the query. While small, simple queries can be handled easily by any virtualization solution, larger, more complex, enterprise-class data challenges demand a data virtualization solution that is exceptionally stable and robust.

Data caching

Caching facilitates both physical and virtual data consolidation. The localization of frequently accessed remote data enables queries to be executed locally and quickly, without the need for access to remote data sources. In the case of more complex queries that combine ad hoc and report-driven requests, some parts of the queries can be pre-populated to accelerate the execution.
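As a rough illustration of localizing frequently accessed remote data, the Python sketch below keeps query results in a local cache and only goes back to the remote source when an entry is missing or too old. The `QueryCache` class, the time-to-live policy and the simulated `fetch_remote` function are assumptions chosen for brevity. A time-based expiry is a simpler stand-in for real synchronization with the source and should not be read as how any particular product keeps its cache current.

```python
import time
from typing import Callable, Dict, List, Tuple

Rows = List[tuple]


class QueryCache:
    """Hypothetical local cache for the results of remote queries."""

    def __init__(self, fetch_remote: Callable[[str], Rows], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch_remote = fetch_remote
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Rows]] = {}

    def get(self, query: str) -> Rows:
        now = self._clock()
        entry = self._entries.get(query)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]                      # served locally
        rows = self._fetch_remote(query)         # remote round trip
        self._entries[query] = (now, rows)
        return rows


# Simulated remote source that records every call it receives.
remote_calls: List[str] = []


def fetch_remote(query: str) -> Rows:
    remote_calls.append(query)
    return [("widget", 100)]


# A controllable clock keeps the demonstration deterministic.
fake_now = [0.0]
cache = QueryCache(fetch_remote, ttl_seconds=60.0, clock=lambda: fake_now[0])

cache.get("SELECT * FROM sales")   # miss: goes to the remote source
cache.get("SELECT * FROM sales")   # hit: answered from the local cache
fake_now[0] = 61.0
cache.get("SELECT * FROM sales")   # expired: goes to the remote source again
print(len(remote_calls))
```

Of the three requests, only the first and the third reach the remote source. The expiry interval is the trade-off to tune: a longer interval saves more remote calls but lets local data drift further from the source.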
Data replication and ETL

Organizations that cache remote data must synchronize locally stored data with the source data to ensure that it is accurate and up to date. A real-time data replication solution can address this concern by replicating data changes to the locally stored data to ensure it reflects the most current state. In some cases, data must be transformed (to ensure consistency with other data) before it can be virtualized. The transformation capabilities within an extract, transform and load (ETL) engine can perform this function.

Strategic data virtualization use cases

Trusted information is becoming a business imperative, no matter what industry you are in. Data virtualization is an excellent complement to other available integration approaches, helping to deliver trusted information simply, quickly and cost-effectively, as illustrated by the following use cases.

Self-service business intelligence and analytics

With the breadth and depth of data available, extracting insight through analytics becomes vitally important. As data volumes grow, analysts typically struggle with gathering and consolidating data before they can even begin analysis. In cases that do not require complex joins of tables, data virtualization can help analysts easily access the data they need without any IT intervention or time-consuming provisioning processes, making their business intelligence (BI) and analytics truly self-service.

Cloud data integration

The cost-effectiveness of the cloud makes it an attractive option to budget-conscious organizations that opt for software-as-a-service (SaaS) solutions. However, the decentralized nature of the cloud makes it challenging to integrate SaaS application data with the rest of the enterprise, given that this data is widely dispersed throughout multiple locations. In some instances, organizations may not be able to physically move the data from the cloud due to ownership issues. These scenarios are a great fit for a data virtualization solution. By virtually consolidating cloud application data with the rest of the enterprise, IT departments can give users a complete view of the business, both on premises and in the cloud, from a single access point.

Master data management

Data virtualization can enhance master data management (MDM) initiatives by providing a single view of master data throughout the enterprise. After master data is defined in a data model, it can be retrieved and made available throughout the organization on demand via data virtualization. This is a highly agile consolidation approach because no extensive coding or pre-built consolidation logic is required.

Logical data warehousing

In some cases, data beyond the warehouse is needed to complement the physical warehouse data, but doesn't need to be stored in the warehouse. For example, a user may need to run a product sales report that requires both historical sales information in the warehouse and real-time customer orders from the company's e-commerce application. For requests of this nature, data virtualization is an ideal fit. It can quickly extend and augment the aggregated warehouse data by virtually joining information from external data sources that could not be easily accessed otherwise. This approach enables a current, fast and complete view of the enterprise.
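The product sales report described above can be approximated with a small Python sketch. Two separate SQLite databases attached to one connection stand in for the warehouse and the e-commerce application; this is only a convenient way to show a single query spanning two stores without copying rows between them, and it is not how a federated server connects to remote sources. The table names, column names and sample figures are all invented for the example.

```python
import sqlite3

# "main" plays the warehouse; "ecommerce" is a second, separate database
# attached to the same connection (an assumed stand-in for a remote source).
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS ecommerce")

conn.execute("CREATE TABLE main.sales_history (product TEXT, units INTEGER)")
conn.execute("CREATE TABLE ecommerce.open_orders (product TEXT, units INTEGER)")

conn.executemany(
    "INSERT INTO main.sales_history VALUES (?, ?)",
    [("widget", 100), ("gadget", 40)],
)
conn.executemany(
    "INSERT INTO ecommerce.open_orders VALUES (?, ?)",
    [("widget", 5), ("gizmo", 2)],
)

# One query combines historical sales with current orders in place;
# nothing is loaded from the order system into the warehouse tables.
REPORT_SQL = """
SELECT product, SUM(units) AS total_units
FROM (
    SELECT product, units FROM main.sales_history
    UNION ALL
    SELECT product, units FROM ecommerce.open_orders
)
GROUP BY product
ORDER BY product
"""


def product_sales_report(connection: sqlite3.Connection) -> list:
    """Return (product, total_units) rows across both data stores."""
    return connection.execute(REPORT_SQL).fetchall()


for product, total_units in product_sales_report(conn):
    print(product, total_units)
```

The report picks up "gizmo", which exists only in the order system, without any change to the warehouse schema.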
In addition, making schema changes in a data warehouse is often a time-consuming and cumbersome process. Data virtualization is a good prototyping approach to use if the information does not require restructuring before it is combined. Because data virtualization does not involve physically moving data and does not require any IT changes, it can provide a way to prototype these changes by virtually representing a subset of data rather than physically and permanently moving data to the warehouse.

Information as a service/data services

Data virtualization is one of many enablers of information as a service and data services, which make information readily available and easily accessible to the entire business. Users do not need to spend time searching for, extracting and aggregating the information they need to do their jobs effectively. Data services make all of these processes seamless and transparent to end users, creating a virtualized data layer that business processes, applications or any other method can draw upon. This approach significantly increases the availability and accessibility of information throughout the enterprise by eliminating the complexity caused by data silos.

Search

With data sources scattered throughout the enterprise, knowing where to look for information can be a challenge in itself. Data virtualization simplifies the search challenge significantly by creating a single data access layer for all sources, which makes the physical location of data irrelevant for end users.

Conclusion

An effectively governed data virtualization solution with high-quality data provides a flexible and cost-effective approach for making data easier to access and use. As part of a broad information integration architecture, not only does it simplify the typically complex data landscape for end users, but it also helps power self-service initiatives by greatly increasing the accessibility of data, making it easily available to the entire enterprise.

IBM offers IBM InfoSphere Federation Server and other offerings to address your requirements for federation, virtualization and information integration across the enterprise.

For more information

To learn more about data virtualization and related IBM solutions, please contact your IBM representative or IBM business partner, or visit: ibm.com/software/products/us/en/ibminfofedeserv

Copyright IBM Corporation 2013

IBM Corporation
Software Group
Route 100
Somers, NY 10589

Produced in the United States of America
April 2013

IBM, the IBM logo, ibm.com, and InfoSphere are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at "Copyright and trademark information" at ibm.com/legal/copytrade.shtml

This document is current as of the initial date of publication and may be changed by IBM at any time.

THE INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS" WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided.

The client is responsible for ensuring compliance with laws and regulations applicable to it. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the client is in compliance with any law or regulation.

[1] Gartner, "The Logical Data Warehouse Will Be a Key Scenario for Using Data Federation," September 26, 2012.

IMW14694-USEN-00