TDWI RESEARCH

TDWI CHECKLIST REPORT
DATA REPLICATION FOR REAL-TIME DATA WAREHOUSING AND ANALYTICS
By Philip Russom
Sponsored by
tdwi.org
APRIL 2012

TDWI CHECKLIST REPORT
DATA REPLICATION FOR REAL-TIME DATA WAREHOUSING AND ANALYTICS
By Philip Russom

TABLE OF CONTENTS
FOREWORD
NUMBER ONE  Know the compelling use cases for data replication.
NUMBER TWO  Understand what modern data replication is and does.
NUMBER THREE  Recognize the importance of real-time data integration.
NUMBER FOUR  Connect data replication to heterogeneous sources and targets.
NUMBER FIVE  Repurpose replicated data via light transformation.
NUMBER SIX  Replicate data in multiple directions across multiple systems.
NUMBER SEVEN  Choose data replication tools carefully.
ABOUT OUR SPONSOR
ABOUT THE TDWI CHECKLIST REPORT SERIES
ABOUT THE AUTHOR
ABOUT TDWI RESEARCH

1201 Monster Road SW, Suite 250, Renton, WA 98057
T 425.277.9126  F 425.687.2842  E info@tdwi.org  tdwi.org

© 2012 by TDWI (The Data Warehousing Institute™), a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or in part are prohibited except by written permission. E-mail requests or feedback to info@tdwi.org. Product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies.
FOREWORD

According to a 2011 TDWI Best Practices Report, data replication is the second-most used data integration technique, second only to extract, transform, and load (ETL).1 See Figure 1. In fact, almost half of data integration specialists are using some form of data replication today. Yet we seldom hear much about replication in the IT press. Instead of taking data replication for granted, you should nurture its use for several reasons:

Data replication is a remarkably flexible technology. Fully modern replication tools can be configured to operate in many different ways, ranging from real time to batch, from single database brands to broadly heterogeneous environments, from one to many databases, from small data sets to big data, and from unaltered copies of data to transformed data. Furthermore, replication is straightforward to set up and maintain, it's less intrusive to source and target systems than most forms of data integration, and most data management professionals already have experience with it.

Data replication can satisfy many business requirements for data. For example, real-time configurations support fast-paced business practices, such as operational business intelligence and just-in-time inventory. Replication can synchronize 360-degree views of customers and other business entities across heterogeneous applications. Data replicas are an important component of business continuity, and the use cases for replication span both operational and analytic applications.

This TDWI Checklist Report gives data replication the recognition it deserves as a highly useful data integration technique for real-world applications in business intelligence (BI), data warehousing (DW), analytics, and general data management. Your peers use data replication in these and other compelling business and technology use cases, as described in this report. You, too, should consider replication for those cases.
But not all replication tools are created equal, and this report explains which capabilities you should look for in a tool.

1. See Figure 1 in the April 2011 TDWI Best Practices Report, Next Generation Data Integration, available via free download at tdwi.org/best-practices-reports.

Which of the following DI techniques are you using in your DI solutions today?
Extract, transform, and load (ETL)      95%
Replication or data synchronization     46%
Messaging or application integration    39%
Extract, load, and transform (ELT)      36%
Data federation or virtualization       31%
Event processing                        21%
Figure 1. Based on 323 responses. (Source: TDWI)
NUMBER ONE
KNOW THE COMPELLING USE CASES FOR DATA REPLICATION.

Operational BI. Operational BI is the most widely adopted BI practice of recent years. Operational BI fetches fresh data from operational databases and applications, and then presents that data as metrics or key performance indicators (KPIs) in a management dashboard or scorecard. Although various types of tools can fetch operational data, replication is ideal for this use case because it is relatively nonintrusive for the applications, it has interfaces to most application data, and it can operate in real time.

Real-time analytics. Reporting has accelerated into real-time data (as seen in operational BI), and analytics is now experiencing the same acceleration. Hence, there's a growing need for real-time data in support of time-sensitive analytics, such as customer profiling, sales forecasting, price optimization, production yields in manufacturing, fraud detection, and risk calculations. As with operational BI, real-time analytics can be enabled by replication.

Real-time data warehousing. The real-time data for operational BI and real-time analytics must be managed somehow. For this purpose, a real-time data warehouse integrates and aggregates data that will feed into reports and other BI products that are refreshed frequently or on demand. A real-time data warehouse also serves up time-series data to provide a historic context for real-time data. The secret sauce of a real-time data warehouse is a real-time or near-time data integration technique, typically federation, messaging, microbatch ETL, or replication.2

360-degree views. Integrating data across business units and applications is on the rise, especially in the form of the 360-degree view. For example, complete views of customers can improve customer service and retention, as well as customer analytics. Similar benefits come from views of other business entities (e.g., products, financials, employees).
Again, replication is ideal for 360-degree views because of its fast, noninvasive access to application data, plus its support for bidirectional data synchronization across heterogeneous systems.

Database high availability (HA). A database management system (DBMS) or data warehouse cannot share data or operate in real time if it is not highly available. Hence, database HA is a requirement for any application involving real-time data or 360-degree views, as well as any time-sensitive or mission-critical application. Note that data replication is by far the most common enabler of database HA in use today.

Big data. Some configurations of data replication tools can handle the trickling and streaming data that's common with big data coming from Web servers, robotics, and sensors. Although we automatically think of replication's real-time capabilities (due to its long service in HA), modern replication tools can also extract and load large data sets in near time (say, an hour per terabyte or faster), which is imperative for analytics with big data.

2. Real-time data warehousing is defined and discussed in detail in the October 2010 TDWI Best Practices Report, Operational Data Warehousing, available via free download at tdwi.org/best-practices-reports.
NUMBER TWO
UNDERSTAND WHAT MODERN DATA REPLICATION IS AND DOES.

A handful of advanced functions distinguish modern data replication tools and techniques from simpler approaches. That replication solutions range from simple to complex shows what a truly flexible technology replication is. The advanced functions are worth knowing, because more and more data replication solutions need them.

THE ADVANCED FUNCTIONS OF DATA REPLICATION

Real-time and right-time operation. There are multiple approaches to replication, depending on which layer of the technology stack replication extracts data from. For example, a replication tool may access data from a transaction log, through a changed data capture (CDC) mechanism, or directly from a DBMS via SQL. Depending on data volume and complexity, all these approaches can be configured to operate in real time or close to it. At the other end of the spectrum, many replication tools can treat the contents of a log (whether transactions, messages, or events) as a queue and then process that queue in an overnight batch or frequent microbatches. Real time is relative, especially with big data. Another possible configuration is to extract a multiterabyte snapshot of data and then replicate it into a target relatively quickly, within an hour or so; this use case can enable real-time DW and analytics. From real time to batch, a fully modern replication tool can integrate data at whatever the right time is for a given application.

Multidirectional data flow. Most replication configurations move data one way, from a master database to a replica database, an approach that is typical of HA and some master data publication solutions. However, data replication is inherently bidirectional or multidirectional when it synchronizes data. By definition, data synchronization (or simply data sync) moves data in two or more directions among multiple databases, files, and applications so that they all share the same information.
For example, data sync often updates 360-degree views of various types.

Conflict resolution. The multidirectional nature of data synchronization creates a need for resolving conflicting data values. After all, if data sources and targets are being updated regularly, it's inevitable that some data values will conflict when they are compared during synchronization. Note that the development of a data sync solution usually entails defining rules for resolving data conflicts.

Heterogeneous sources and targets. Simple replication solutions (say, for database HA or data distribution) may only involve a single brand of DBMS, plus a few other common data sources and targets (such as flat files, popular SAP application modules, Microsoft SQL Server, and so on). However, a replication solution of any complexity will involve multiple brands of DBMSs, applications, and file types, including emerging ones such as Hadoop and analytic appliances, because heterogeneous sources and targets are common in most IT environments.

Data transformation. Basic replication configurations only need to copy data unaltered from a data source to a target. However, heterogeneous data environments demand data transformation capabilities, for the sake of normalizing and merging data coming from diverse schemas. Even so, note that the transformations required in these situations are fairly light, perhaps just reordering the fields of a record or making a simple calculation.

Figure 2 summarizes and compares the basic and advanced functions of replication. Subsequent sections of this Checklist Report drill into replication's advanced functions and relate them to tool selection.
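The queue-and-microbatch pattern described under real-time and right-time operation can be sketched in a few lines of Python. This is an illustrative sketch only, not any vendor's API; the change-event format, the dict-based target table, and the batch size are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterator, List

@dataclass
class ChangeEvent:
    """One entry read from a transaction log (hypothetical format)."""
    op: str        # "insert", "update", or "delete"
    table: str
    key: int
    values: dict

def microbatches(log: Iterator[ChangeEvent],
                 batch_size: int) -> Iterator[List[ChangeEvent]]:
    """Treat the change log as a queue and yield it in fixed-size microbatches."""
    batch: List[ChangeEvent] = []
    for event in log:
        batch.append(event)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:          # flush the final partial batch
        yield batch

def apply_batch(target: dict, batch: List[ChangeEvent]) -> None:
    """Apply one microbatch to a target table, modeled here as a dict keyed by row key."""
    for e in batch:
        if e.op == "delete":
            target.pop(e.key, None)
        else:          # insert or update
            target[e.key] = e.values
```

Shrinking the batch size (and the interval between batches) moves the same solution along the spectrum from overnight batch toward near-real-time operation, which is the flexibility the checklist item describes.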
Capability            Basic             Advanced
Processing schedule   Batch             Real time
Sources and targets   Homogeneous       Heterogeneous
Data repurposing      Unaltered copy    Light transformations
Data flow             One way           Two or more directions
Architecture          One-to-one        One-to-many, many-to-one, or many-to-many
Figure 2. Basic and advanced functions of data replication. (Source: TDWI)

NUMBER THREE
RECOGNIZE THE IMPORTANCE OF REAL-TIME DATA INTEGRATION.

We all know that the pace of business just keeps accelerating. The rapid dissemination of operational BI over the last few years is the best evidence of this acceleration. We're now seeing a corresponding uptick in operational analytics, which has the same real-time and on-demand requirements as operational BI but applied to analytics. However, what a lot of people don't realize is that these real-time analytic practices (so easily seen at the user interface level) don't work without data handled in real time by data warehouses, appliances, and other databases (which are not so easily seen). Complicating perceptions further, real-time databases get their real-time functionality primarily from forms of real-time data integration, especially data replication.

According to a recent TDWI survey, the types of data integration functionality that are seeing the hottest adoption and growth today are those that involve real-time operation.3 This includes DI techniques that inherently run in real time, such as replication, federation, and event processing. Even ETL is adapting to new real-time requirements by executing intraday microbatches that augment the usual overnight batch processing, as well as by integrating with messaging middleware and service buses.

There are good reasons why real-time data integration is so popular among users right now. For example, data integration's ability to operate in real time makes it a good technology choice for time-sensitive, data-driven analytic practices, such as operational BI, real-time analytics, and real-time DW.
Real-time data integration also enables fast-paced operational practices, such as just-in-time inventory, facility monitoring and analysis, and self-service information portals. Furthermore, big data is often big because it's fed continuously by streaming data, which replication and some other data integration technologies can handle. As business management practices accelerate into real-time decision making based on complete views of customers and other parties, real-time data integration becomes even more useful.

3. See Figure 13 in the April 2011 TDWI Best Practices Report, Next Generation Data Integration, available via free download at tdwi.org/best-practices-reports.
NUMBER FOUR
CONNECT DATA REPLICATION TO HETEROGENEOUS SOURCES AND TARGETS.

IT is heterogeneous by nature. The average enterprise has multiple brands of DBMSs, applications, operating systems, and legacy platforms, plus multiple versions of each. The challenge for data replication is to support interfaces (both standard and proprietary) that enable efficient replication both in and out of all required IT systems.

Related systems are sometimes diverse by design. For example, many database HA configurations involve a hefty hardware server and DBMS license for the primary database (or master), and the primary database feeds a replica on a less expensive and different DBMS brand. Similarly, a hefty DW platform may feed federated data marts on diverse platforms.

Replication extends the life of legacy data. For example, replication disseminates hard-to-reach legacy data so it is available to applications on more modern platforms. Replication can also help migrate and consolidate legacy data when a platform comes to the end of its life cycle. In short, legacy platforms are part of the heterogeneous mix that replication must address.

New sources and targets are increasing data heterogeneity. User organizations continue to deploy more applications, whether homegrown or packaged, which demands more replication to synchronize applications. Firms are updating their business-to-business data exchange solutions to include more modern data integration techniques, such as replication. Whether or not a company is Internet based, all are deepening their leverage of Web data, and most replication tools handle file-based data well, as in Web logs. In a similar vein, many firms look forward to leveraging social media. In the future, replication and other data integration techniques will more often communicate through various types of services and buses, to reach the growing number and growing heterogeneity of sources and targets.

Diverse applications and databases have diverse schemas. With replication in heterogeneous data environments, it's unavoidable that the data models of sources and targets differ from one another. A data replication solution must deal with this diversity by providing data transformation and mapping capabilities that move data from one data model to another, not just one database to another.

NUMBER FIVE
REPURPOSE REPLICATED DATA VIA LIGHT TRANSFORMATION.

Transform replicated data to fit the purpose of the target system. The term replication suggests that replication technologies simply copy data without altering it. This is true for some applications of replication, especially database HA. But most other use cases require that data be transformed midstream to meet the purposes of target systems.

The transformations of replication are usually light. For example, most replication solutions handle data one record at a time. A transformation may simply reorder the fields of a record and recalculate some fields to fit the record structure of the target system. If a replication architecture is many-to-one, merging data (as when joining tables) typically involves one or more transformations, plus mappings from the source models to the target model.

Despite some overlap, replication and ETL are complementary. Because of the evolutionary convergence of data integration tools, most replication tools are capable of light transformations, and some ETL tools are capable of light replication. Note that the light transformations of replication don't replace the need for an ETL tool, or vice versa. An ETL tool is probably the best choice for complex transformations of large data volumes in heavily heterogeneous environments, as is typical of multidimensional data warehouses.
However, data replication can complement ETL by providing real-time data with light transformations, as required for operational BI, real-time DW, and many analytic applications.

Real-time data warehouses and analytics are heterogeneous by nature. After all, the point of these BI applications is to collect enterprise data from diverse systems and integrate it into data structures that are conducive to BI purposes. Replication can make significant contributions here when the tool in use supports many heterogeneous platforms and can integrate data from them in real time or close to it.
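The light, record-at-a-time transformations described in this checklist item amount to little more than a field mapping plus an occasional derived value. A minimal sketch in Python, assuming hypothetical source and target record layouts:

```python
def transform(record: dict) -> dict:
    """Lightly transform one replicated record to fit a hypothetical
    target schema: rename and reorder fields, merge two source fields,
    and derive one calculated value."""
    return {
        "customer_id": record["cust_no"],                     # field rename
        "full_name": f'{record["first"]} {record["last"]}',   # merge two fields
        "order_total": record["qty"] * record["unit_price"],  # simple calculation
    }
```

Anything heavier than this, such as multi-table joins, aggregation, or complex cleansing of large volumes, is the territory of an ETL tool, as the section notes.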
NUMBER SIX
REPLICATE DATA IN MULTIPLE DIRECTIONS ACROSS MULTIPLE SYSTEMS.

Replication and synchronization are two slightly different techniques. Replication usually involves moving data in one direction, from a source to a target. Synchronization is where data flows in two directions. Admittedly, the distinction is somewhat academic, especially when you consider that a single tool can enable both replication and synchronization. But the distinction helps separate basic functions (simple replication) from advanced ones (multidirectional synchronization with conflict resolution).4

Data synchronization must handle conflicting values and exceptions. With data flowing in two or more directions among continuously updated databases, it's normal that values get out of sync. For this reason, a data synchronization solution must resolve such conflicts as automatically as possible. Even so, there may still be some exceptions that require human intervention to resolve.

Data synchronization enables real-time data warehousing. As discussed earlier, a real-time data warehouse typically includes a table or similar structure that is kept synchronized with the operational data required for operational BI and real-time analytics. Depending on the design, data may replicate one way from operational applications into the real-time data warehouse. Increasingly, however, users are updating their designs to close the loop. In other words, data from a real-time data warehouse flows back upstream to enhance operational applications. This way, the results of analytic models or metrics calculated in the warehouse provide additional information to the users of operational applications. With data flowing in two directions, data synchronization is required instead of one-way replication.

Customer data is a common application for data synchronization.
Many organizations have multiple applications for customer relationship management (CRM) or similar customer-facing functions, such as sales force automation (SFA), call center, order entry, billing, and shipping. All these applications share common information about customers, and business units are increasingly under pressure to have a view of customers that is as complete as possible, for the sake of customer service, consistent operations, and cross-selling. Hence, many data replication solutions synchronize customer data across multiple customer-oriented applications.

Data synchronization has many beneficial use cases. Similar to the situation with CRM applications, many firms have enterprise resource planning (ERP) applications from multiple vendors or multiple instances of one vendor's application. Data synchronization can make all the instances look like one global instance. Furthermore, some master data management (MDM) solutions use data replication to synchronize reference data across multiple applications. In addition, some replication tools synchronize mobile devices with enterprise databases.

4. For a discussion of data synchronization use cases, see the October 2010 TDWI Best Practices Report, Operational Data Warehousing, available via free download at tdwi.org/best-practices-reports.
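One common family of conflict-resolution rules for the synchronization scenarios above is "latest timestamp wins," with genuinely ambiguous cases routed to an exception queue for human review. A sketch under those assumptions (the value/timestamp tuple format is hypothetical, not any product's representation):

```python
from typing import Optional, Tuple

def resolve(a: Tuple[str, float],
            b: Tuple[str, float]) -> Tuple[Optional[str], bool]:
    """Resolve two conflicting (value, timestamp) versions of the same field.
    Returns (winning_value, needs_review). The later timestamp wins; an exact
    tie with different values cannot be resolved automatically, so it is
    flagged for human review (the exception path described in the text)."""
    value_a, ts_a = a
    value_b, ts_b = b
    if ts_a > ts_b:
        return value_a, False
    if ts_b > ts_a:
        return value_b, False
    if value_a == value_b:     # same value on both sides: no real conflict
        return value_a, False
    return None, True          # tie with different values: exception
```

Real solutions layer further rules on top (source priority, per-field policies), but the shape is the same: automatic resolution wherever possible, exceptions queued for people.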
NUMBER SEVEN
CHOOSE DATA REPLICATION TOOLS CAREFULLY.

When evaluating tools and platforms for data replication, it's best to judge a tool by how well it supports the advanced functions of replication. These advanced functions differentiate the available tools; selecting a tool with advanced functionality gives you ample room to grow in size and sophistication. Let's take another look at the advanced functions of data replication, but in terms of how they affect tool selection.

ADVANCED DATA REPLICATION FUNCTIONALITY

Real-time and right-time operation. Data sets have differing requirements relative to how quickly or frequently they need to be updated, ranging from true real time to overnight processing. Ideally, a data replication tool should support both extremes, plus some gradations in between. For a tool to support the full range, it may need to provide more than one approach to replication, including common approaches based on transaction logs, changed data time stamps, table dumps, bulk loading, snapshots, and SQL.

Data synchronization and conflict resolution. Note that multidirectional data synchronization solutions may encounter conflicting data values among source and target systems during the data sync process. One approach to this potential problem, which works for most data sync use cases, is for users to design each solution such that conflicts are naturally avoided; this is accomplished by partitioning data or by controlling which sources can update which targets and when. A different approach may be required for more complex multidirectional, multimaster configurations. For these rare configurations, look for tools with strong user interfaces for designing the rules that detect, categorize, and resolve data conflicts. In deployment, resolution should be as automatic as possible, to avoid human intervention.
Even so, look for tools that include functions where users can manually handle exceptions to the rules, but can also turn their exception handling into rules for greater automation.

Data transformation. The more heterogeneous your replication environment is, the more sophisticated data transformations need to be. Look for replication tools with a solid user interface for designing transformations, as well as ways to reuse transformations across solutions. However, remember that highly complex transformations may require an ETL tool.

OTHER FACTORS FOR TOOL SELECTION

Advanced functionality aside, other factors can affect tool selection:

Light replication built into DBMSs. Most mature brands of DBMSs support robust data replication capabilities. Yet the replication utilities built into DBMSs are inherently limited to their database brand, plus a very short list of other sources and targets. DBMS-based replication is best applied in a homogeneous DBMS environment.

Independent replication tools. Consider an independent, standalone data replication tool for the broadest heterogeneity and advanced features, without a DBMS-brand bias. Furthermore, a neutral and centralized replication solution fosters a number of desirable outcomes, such as organized architectures for data integration, enterprise data standards, reuse across solutions, and an enterprise view of data as a global asset.

Integration between replication and other data management tools. Leading data integration vendors now offer broad integrated platforms that include tools for ETL, replication, and federation. Some platforms also include tools for data quality, MDM, event processing, and more. In this tool environment, developers working in multiple data management disciplines can share development artifacts, interfaces, and metadata, which fosters integrated solutions, governed data access, and developer productivity.
An integrated tool platform is a good choice when data replication must be coordinated with other data disciplines.

Heterogeneous data environments. Supporting these environments demands a tool with many interfaces, including standard ones (SQL over ODBC and JDBC) and proprietary ones (platform-specific APIs and call interfaces). Replication is regularly the link that connects legacy platforms and packaged applications, so check that a tool supports your inventory of these platform types. Also look for service-oriented interfaces; even if you don't need them today, you will soon enough.
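The conflict-avoidance approach mentioned under data synchronization and conflict resolution (designing the solution so conflicts never arise, by controlling which sources can update which targets) can be sketched as a simple ownership check. The ownership map, field names, and system names here are hypothetical examples, not part of any tool:

```python
# Conflict avoidance by partitioning update ownership: each field has exactly
# one system of record allowed to update it, so multidirectional sync never
# produces two competing writes to the same field.
OWNERSHIP = {
    "customer.address": "crm",      # only the CRM may update addresses
    "customer.balance": "billing",  # only billing may update balances
}

def may_update(source_system: str, field: str) -> bool:
    """Return True if this source is the designated owner of the field."""
    return OWNERSHIP.get(field) == source_system

def apply_update(target: dict, source_system: str, field: str, value) -> bool:
    """Apply an update only when the source owns the field; reject otherwise."""
    if may_update(source_system, field):
        target[field] = value
        return True
    return False
```

Because no two systems can ever write the same field, no conflict-resolution rules are needed at all, which is why this design works for most data sync use cases before resorting to multimaster conflict handling.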
ABOUT OUR SPONSOR

Informatica Corporation (NASDAQ: INFA) is the world's number-one independent provider of data integration software. Organizations around the world rely on Informatica for maximizing return on data to drive their top business imperatives. Worldwide, over 4,630 enterprises depend on Informatica to fully leverage their information assets residing on-premises, in the cloud, and across social networks. For more information, call +1 650-385-5000 (1-800-653-3871 in the U.S.), or visit www.informatica.com. Connect with Informatica at www.facebook.com/informaticacorporation, www.linkedin.com/company/informatica, and twitter.com/informaticacorp.

Informatica Data Replication is a heterogeneous, real-time transaction replication solution that is highly scalable, reliable, and easy to configure. It allows IT organizations to share information across different systems in a heterogeneous environment by replicating data between different hardware platforms and data sources, including appliances and big data, while maintaining the transactional integrity of the data. It provides highly optimized information extraction from heterogeneous sources, rapid loading into destinations, and the ability to scale for large data volumes. Using log-based changed data capture (CDC) to minimize the impact on source systems, the software consumes few system resources while handling the transactional volumes required, all without compromising performance.

ABOUT THE AUTHOR

Philip Russom is the research director for data management at The Data Warehousing Institute (TDWI), where he oversees many of TDWI's research-oriented publications, services, and events. He's been an industry analyst at Forrester Research and Giga Information Group, where he researched, wrote, spoke, and consulted about BI issues. Before that, Russom worked in technical and marketing positions for various database vendors. Over the years, Russom has produced over 500 publications and speeches.
You can reach him at prussom@tdwi.org.

ABOUT TDWI RESEARCH

TDWI Research provides research and advice for business intelligence and data warehousing professionals worldwide. TDWI Research focuses exclusively on BI/DW issues and teams up with industry thought leaders and practitioners to deliver both broad and deep understanding of the business and technical issues surrounding the deployment of business intelligence and data warehousing solutions. TDWI Research offers in-depth reports, commentary, inquiry services, and topical conferences, as well as strategic planning services to user and vendor organizations.

ABOUT THE TDWI CHECKLIST REPORT SERIES

TDWI Checklist Reports provide an overview of success factors for a specific project in business intelligence, data warehousing, or a related data management discipline. Companies may use this overview to get organized before beginning a project or to identify goals and areas of improvement for current projects.