White Paper

Informatica and the Vibe Virtual Data Machine: Preparing for the Integrated Information Age
This document contains Confidential, Proprietary and Trade Secret Information ("Confidential Information") of Informatica Corporation and may not be copied, distributed, duplicated, or otherwise reproduced in any manner without the prior written consent of Informatica. While every attempt has been made to ensure that the information in this document is accurate and complete, some typographical errors or technical inaccuracies may exist. Informatica does not accept responsibility for any kind of loss resulting from the use of information contained in this document. The information contained in this document is subject to change without notice. The incorporation of the product attributes discussed in these materials into any release or upgrade of any Informatica software product, as well as the timing of any such release or upgrade, is at the sole discretion of Informatica. Protected by one or more of the following U.S. Patents: 6,032,158; 5,794,246; 6,014,670; 6,339,775; 6,044,374; 6,208,990; 6,850,947; 6,895,471; or by the following pending U.S. Patents: 09/644,280; 10/966,046; 10/727,700. This edition published May 2013.
Table of Contents

The History of the Virtual Data Machine
The Power Driving the Informatica Platform: The Vibe VDM
What Is the Vibe Virtual Data Machine and How Does It Work?
The Vibe Virtual Data Machine at Work
    Automatically convert virtual prototypes to ETL for physical data movement
    Develop Hadoop integration jobs without having to know Hadoop
    Embed data quality in your application
    Enabling hybrid IT with cloud and on-premise integration
Virtual Data Machine and the Integrated Information Age
The History of the Virtual Data Machine

Since the founding of Informatica Corporation 20 years ago, we have followed a philosophy of separating the development of data integration from the actual run-time implementation. This is what Informatica means when we say that the Informatica PowerCenter data integration product is metadata driven. Metadata driven means that a developer does not have to know C, C++, or Java to perform data integration. The developer operates in a graphical development environment, using drag-and-drop tools to visualize how data will move from system A, be combined with data from system B, and ultimately be cleansed and transformed when it arrives at system C. At the most detailed level of the development process, you might see icons representing data sets and lines representing relationships flowing from one data set into another, with descriptions of how the data is transformed along the way (see Figure 1). You never see code, just the metadata describing how the data will be modified.

Figure 1: The Informatica Developer drag-and-drop graphical development environment

The idea is that a person who is knowledgeable about data integration concepts, but who is not necessarily a software developer, can develop data integration jobs that convert raw data into high-quality information, allowing organizations to put their data potential to work. The implication is that far more people are able to develop data integration jobs: through the use of graphical tools, we have democratized data integration development.
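To make the metadata-driven idea concrete, the sketch below shows what such a design-time description might look like once a graphical mapping is saved: a declarative recipe that a run-time engine interprets, with no procedural code written by the developer. The structure and field names here are invented for illustration and are not Informatica's actual metadata format.

```python
# A hypothetical, simplified rendering of a saved mapping: pure metadata,
# no procedural code. A run-time engine (not shown) would interpret this.
# None of these keys or names are Informatica's real format.
mapping = {
    "name": "load_customer_dim",
    "source": {"system": "A", "object": "crm.customers"},
    "join": {"system": "B", "object": "erp.accounts", "on": "customer_id"},
    "transformations": [
        {"type": "trim", "fields": ["first_name", "last_name"]},
        {"type": "standardize", "field": "country", "rule": "ISO-3166"},
    ],
    "target": {"system": "C", "object": "dw.dim_customer"},
}
```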
Over time, however, data integration has become more complicated. It has moved from being just extract, transform, and load (ETL) for batch movement of data to also include data quality, real-time data, data virtualization, and now Hadoop. In addition, the integration process can be deployed both on premise and in the cloud. As data integration has become more complex, it has forced a blended approach that often requires many or most of the capabilities just mentioned, while the mix of underlying technologies keeps expanding.

This entire time, Informatica has continued to separate the development environment from the underlying data movement and transformation technology. Why is this separation so important? Because as new data integration approaches come along, with new deployment models such as software as a service (SaaS), new technologies such as Hadoop, and new languages such as Pig and Hive (and languages yet to be invented), existing Informatica users don't have to learn the details of how the new technology works in order to take advantage of it. Moreover, the pace at which the underlying technologies in the data integration and management market change is increasing. By separating development from deployment, end users can continue to design and develop using the same interface while, under the covers, they take advantage of new kinds of data movement and transformation engines to virtualize data, move it in batch, move it in real time, or integrate big data, all without having to learn the details of the underlying language, system, or framework.
The Power Driving the Informatica Platform: The Vibe VDM

So what do a metadata-driven approach and the separation of development environments from run-time environments have to do with virtual data machines (VDMs)? The VDM is what enables this entire concept, and it is the underlying technology that has been embedded in Informatica products for years. To separate development from the run-time environment, Informatica had to create a run-time engine that could execute the instruction set independently of the development environment. Over time, this engine grew to handle not only batch data movement and transformation instructions but also data cleansing, matching, and masking instructions; to expose RESTful and Web services APIs; and to provide data virtualization services. It can run from the cloud or even directly on a Hadoop cluster.

The VDM is not the same as code generation. Other companies have data integration code generators, but they have one code generator that works for Java, another that works for Hadoop, and so on. These code generators are environment specific and require that you know Java to use the Java generator or Hadoop to use the Hadoop generator. The expectation is that you will use their high-level graphical tool to create the integration framework, then generate Java, Hive, or Pig code that you manually modify. This means that if you use one of these code generators to create an ETL mapping that runs inside a Java container, the mapping would have to be completely re-implemented to run with Hive on Hadoop, and rewritten yet again to run with Pig on Hadoop.

The difference between traditional code generators and a VDM is that with a VDM you can take the same code-less, graphical integration mapping and run it virtually, as an ETL job, in the cloud, or on Hadoop, without making coding changes. In addition, unlike with a traditional code generator, you do not have to know the underlying language (Java, Pig, Hive, etc.). Technically speaking, code is eventually generated, but the key difference is that the developer never has to modify it by hand. The separation of the development environment from the run-time environment also insulates the developer from changes to the underlying technology and helps you deal with new use cases as they come along. As the number of new use cases has grown, and data integration has converged into the broader information management discipline, the value of the VDM has grown as well. The VDM has evolved beyond being merely metadata driven to delivering a machine that makes it possible to map once and deploy anywhere. Because of this expanded value, Informatica is now giving our VDM a name: Vibe.
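As a rough illustration of the map once, deploy anywhere contrast with code generators, the sketch below keeps one logical mapping fixed and swaps the run-time engine underneath it. The classes and method names are hypothetical, not Informatica's actual SDK.

```python
# A minimal sketch of "map once, deploy anywhere": one logical mapping,
# several interchangeable run-time engines. Illustrative names only.
from abc import ABC, abstractmethod


class Executor(ABC):
    """A run-time engine that can execute a logical mapping."""

    @abstractmethod
    def run(self, mapping: dict) -> None: ...


class NativeEtlExecutor(Executor):
    def run(self, mapping: dict) -> None:
        print(f"running {mapping['name']} as a native batch ETL job")


class HadoopExecutor(Executor):
    def run(self, mapping: dict) -> None:
        # A real VDM would translate the mapping into whatever the cluster
        # speaks today (Hive, Pig, ...); the developer never edits that code.
        print(f"running {mapping['name']} on a Hadoop cluster")


def deploy(mapping: dict, engine: Executor) -> None:
    engine.run(mapping)  # the mapping itself never changes


mapping = {"name": "load_customer_dim"}
deploy(mapping, NativeEtlExecutor())  # batch ETL today
deploy(mapping, HadoopExecutor())     # big data tomorrow, no rework
```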
What Is the Vibe Virtual Data Machine and How Does It Work?

The Informatica Vibe virtual data machine is a data management engine that knows how to ingest data and then very efficiently transform, cleanse, manage, or combine it with other data. It is the core engine that drives the Informatica Platform. The Vibe VDM works by receiving a set of instructions that describe the data source(s) from which it will extract data; the rules and flow by which that data will be transformed, analyzed, masked, archived, matched, or cleansed; and ultimately where that data will be loaded when processing is finished. Vibe consists of a number of fundamental components (see Figure 2):

Transformation Library: A collection of useful, prebuilt transformations that the engine calls to combine, transform, cleanse, match, and mask data. For those familiar with PowerCenter or Informatica Data Quality, this library is represented by the icons that the developer drags and drops onto the canvas to perform actions on data.

Optimizer: Compiles the data processing logic into an internal representation to ensure effective resource usage and efficient run time, based on data characteristics and execution environment configurations.

Executor: A run-time execution engine that orchestrates the data logic using the appropriate transformations. The engine reads and writes data through an adapter or streams it directly from an application.

Connectors: Informatica's connectivity extensions, which provide data access to a wide range of data sources. This is what allows Informatica Platform users to connect to almost any data source or application for use by a variety of data movement technologies and modes, including batch, request/response, and publish/subscribe.

Vibe Software Development Kit (SDK): While not shown in Figure 2, Vibe provides APIs and extensions that allow third parties to add new connectors as well as transformations. Developers are therefore not limited to the already extensive capabilities of the Vibe VDM; they can add their own capabilities as needed.

Figure 2: Virtual data machine architecture, with a transformation library to define logic, the optimizer to deploy in the most efficient manner, the executor as the run-time engine for physical execution, and connectors to data sources.
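To show how these components could cooperate, here is a toy pipeline mirroring Figure 2: a transformation library supplies the operations, an optimizer compiles the plan, and an executor runs it over rows delivered by a connector. Every name in this sketch is hypothetical; it is not Informatica's implementation.

```python
# A hypothetical sketch of the four Vibe components working together.
from typing import Callable, Iterable

Row = dict
Transform = Callable[[Row], Row]

# Transformation Library: prebuilt, reusable operations on rows.
LIBRARY: dict[str, Transform] = {
    "upper_name": lambda row: {**row, "name": row["name"].upper()},
    "mask_ssn": lambda row: {**row, "ssn": "***-**-" + row["ssn"][-4:]},
}


def optimize(steps: list[str]) -> list[str]:
    """Optimizer: compile the logical steps into an efficient plan.
    (Trivially a pass-through here; a real optimizer would reorder,
    combine, or push steps down to the source.)"""
    return steps


def execute(rows: Iterable[Row], steps: list[str]) -> list[Row]:
    """Executor: run the optimized plan over data from a connector."""
    plan = [LIBRARY[name] for name in optimize(steps)]
    out = []
    for row in rows:
        for transform in plan:
            row = transform(row)
        out.append(row)
    return out


# Connector: anything that yields rows from a source system.
source = [{"name": "ada", "ssn": "123-45-6789"}]
print(execute(source, ["upper_name", "mask_ssn"]))
# [{'name': 'ADA', 'ssn': '***-**-6789'}]
```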
The Vibe Virtual Data Machine at Work

Now that you understand the basic concepts behind Vibe, here are some examples of how Informatica has put Vibe to work to solve real problems facing our customers.

Automatically convert virtual prototypes to ETL for physical data movement

One of the well-known challenges in data integration projects is the classic misalignment between business and IT. The business describes its problems and needs, then developers go off to write the data integration code and generate a data warehouse schema, only to find that the requirements have changed. At that point the process starts over again. This lack of alignment has caused cost and time overruns on many data integration projects.

The industry has responded to this issue with data virtualization (also known as federation) technology, which allows business users to develop their own business intelligence reports without first creating or modifying a data warehouse. Business users can effectively prototype their final reports without a lengthy iterative process back and forth with IT. However, when they want to convert this environment into a physical data warehouse, the prototype typically has to be entirely recoded, because virtualization technology and ETL technology are generally completely independent of each other.

Not so with Vibe. The Informatica Data Services (IDS) product implements data virtualization using the Vibe VDM, the same VDM that runs PowerCenter ETL jobs. IDS allows users to perform the kinds of virtualized and federated integration described above without having to create a physical data warehouse. And because IDS shares the same VDM that is used for ETL, when customers want to convert a virtualized view of data into a physical one, they can do so with a few clicks of the mouse. No recoding is required. This integration of virtual and physical is what allows our customers to accelerate data integration projects from months to days, thanks to the power of Vibe.
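The sketch below illustrates the virtual-to-physical idea in miniature: a single logical view definition either answers queries on the fly (federation) or is materialized into a warehouse table, with no recoding in between. All names are hypothetical and do not represent the Informatica Data Services interface.

```python
# One view definition, two deployment modes. Illustrative names only.
customer_view = {
    "name": "v_customer_360",
    "sources": ["crm.customers", "erp.orders"],
    "join_on": "customer_id",
}


def serve_virtually(view: dict, query: str) -> None:
    # Federation: resolve the query against the live sources at read time.
    print(f"answering {query!r} by joining {view['sources']} on the fly")


def materialize(view: dict, target: str) -> None:
    # Physical: reuse the identical definition as a batch ETL load.
    print(f"loading {view['sources']} into {target} as a scheduled job")


serve_virtually(customer_view, "SELECT * FROM v_customer_360")  # prototype
materialize(customer_view, "dw.customer_360")                   # promote
```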
Develop Hadoop integration jobs without having to know Hadoop

Modern data scientists have come to rely on cost-effective, scalable technologies such as Hadoop to store and process the massive amounts of data now available to organizations for data mining and analysis. However, Hadoop development currently presents several challenges.

First, data scientists are very difficult to find and hire. Industry analysts estimate that over 60 percent of data scientist positions will go unfilled, and Hadoop developers are equally hard to find. In fact, data scientists are often forced to learn Hadoop development out of necessity.

Second, Hadoop technology is constantly changing. While today's development might be done in Pig or Hive, tomorrow's is likely to be done with YARN or something not yet invented. (So please do not complain that this paper is out of date; things are moving so fast in the Hadoop world that it is likely to be out of date within two months of publication.)

Third, based on real-world experience, approximately 80 percent of the effort of a big data initiative on Hadoop is not actually data analysis but data preparation. That means the very rare, very expensive data scientist is spending 80 percent of his or her time doing tasks that used to be performed by a data analyst, data architect, or data integration developer.

The problem with this situation is that Hadoop development is not economically scalable. While the Linux/Intel hardware Hadoop runs on is much cheaper than a traditional data warehouse appliance, development today is much more costly: data scientists and Hadoop developers are expensive, and there are not enough of them to do the work.

Fortunately, thanks to Vibe, this is no longer the case. Informatica has ported Vibe to run directly on Hadoop. Recall the earlier discussion of the separation between the development environment and the run-time environment: if the VDM can run directly on Hadoop, then data integration, data quality, data profiling, and data parsing jobs described in the PowerCenter Designer graphical environment can also run on a Hadoop cluster. As a result, any PowerCenter developer is now a Hadoop developer, with no additional training required. And as the underlying technology of Hadoop keeps evolving, the PowerCenter developer does not have to learn each new language, wasting time on languages that are likely to disappear. The separation of the development environment from the run-time environment protects developers from refactoring the assets they build as data volumes grow and new data types and technologies emerge, thereby future-proofing their work. This is precisely the advantage delivered by the map once, deploy anywhere philosophy of the Vibe architecture.
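To make "any PowerCenter developer is now a Hadoop developer" concrete, here is a toy translator that turns one declarative filter step into HiveQL. The spec format and function are invented for illustration; a real VDM handles complete mappings and keeps any generated code internal, so the developer never sees or edits it.

```python
# Toy compilation of a declarative step into HiveQL. Hypothetical format.
def to_hiveql(step: dict) -> str:
    cols = ", ".join(step["select"])
    return (f"INSERT OVERWRITE TABLE {step['target']} "
            f"SELECT {cols} FROM {step['source']} "
            f"WHERE {step['where']}")


step = {
    "source": "raw_clicks",
    "target": "clean_clicks",
    "select": ["user_id", "url", "ts"],
    "where": "url IS NOT NULL",
}
print(to_hiveql(step))
# INSERT OVERWRITE TABLE clean_clicks SELECT user_id, url, ts
#   FROM raw_clicks WHERE url IS NOT NULL
```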
Embed data quality in your application

Another trend we have seen over the past few years is the desire to embed data quality directly into applications. For example, an application might have a data entry point where the user types something, and the application, if well written, checks the quality of that information. Typically, applications make an external call to a data quality application or service to perform this check, which often means the customer has to buy the data quality solution (such as Informatica Data Quality) separately, or the application developer has to bundle that product with the application.

Now, with Vibe, data quality tasks can be designed externally via the Informatica Designer, and the VDM and data quality rules can be embedded directly into the application without requiring the full Informatica Data Quality application at run time. These capabilities allow companies to improve the quality of data at the point of entry without significantly increasing the footprint of their application. And if they want to change the data cleansing rule set, they only have to change the metadata instructions that Vibe is running; they do not have to change the application itself. (A minimal sketch of this pattern follows the next subsection.)

Enabling hybrid IT with cloud and on-premise integration

With the ongoing growth of cloud-based software-as-a-service (SaaS) applications such as Salesforce and Workday, companies need to integrate their on-premise and cloud applications. However, many organizations do not necessarily have IT professionals involved in managing their cloud applications, or even in integrating these SaaS applications with other data sources. Additionally, enterprise-class data integration products such as PowerCenter are often not a good fit for SaaS administrators and users, because the products run on premise, requiring installation and configuration, and because their feature set far exceeds the needs of this use case. What these SaaS users really want are purpose-built cloud data integration services that, like the SaaS applications they are integrating, provide an online point-and-click interface that steps the user through creating and scheduling data integration jobs with minimal IT infrastructure and training.

Informatica met this need with Informatica Cloud, which enables users to schedule and manage integrations between cloud and on-premise data sources using online wizards. Additionally, the Vibe run-time engine can be installed either inside the firewall or hosted in the cloud; Vibe then manages the physical execution regardless of location. Moreover, because Informatica Cloud and PowerCenter are both built on Vibe, users can import and export mappings between their PowerCenter development environment and Informatica Cloud, enabling reuse and consistency. This is yet another example of the power of Vibe and how it supports the concept of map once, deploy anywhere.
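Returning to the embedded data quality pattern described above, here is a minimal sketch assuming a hypothetical JSON rule format: the validation rules live outside the application, so updating the rules changes no application code. The function name and rule schema are invented for illustration and are not Informatica's rule format.

```python
# Point-of-entry validation driven by external rule metadata. To change
# the quality rules, edit the JSON; the application itself never changes.
# Hypothetical rule schema, not Informatica's format.
import json
import re

RULES_JSON = r"""
[
  {"field": "email", "pattern": "^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$"},
  {"field": "zip",   "pattern": "^\\d{5}$"}
]
"""


def validate(record: dict, rules_source: str = RULES_JSON) -> list[str]:
    """Return the fields that fail the externally defined quality rules."""
    rules = json.loads(rules_source)
    return [r["field"] for r in rules
            if not re.fullmatch(r["pattern"], str(record.get(r["field"], "")))]


print(validate({"email": "ada@example.com", "zip": "9406"}))  # ['zip']
```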
Virtual Data Machine and the Integrated Information Age

It used to be that to develop an application, you had to develop your own database. Then the RDBMS was invented, and people stopped developing their own databases. It used to be that to scale an application, you had to build your own underlying redundancy, security, and scalability infrastructure. Then application servers came along, and people stopped doing that work as well. It used to be that applications were built to operate standalone and would expose APIs to let their customers do their own integration. Modern applications, whether on premise or in the cloud, now depend on data originating from multiple sources, such as application databases (e.g., ERP or CRM), social media data (e.g., Facebook, Twitter, LinkedIn, Web sites, blogs), sensor device data (e.g., RFID, CDR, machine sensors), and third-party industry data (e.g., FIX, SWIFT, HL7, EDI). To build these composite applications, developers must integrate massive volumes of data of many different types across a wide variety of latencies (e.g., real time or batch).

Now, with Vibe, the future of application development will be to embed Vibe as part of an application, including a full data integration stack and making it easier to build modern enterprise applications that source data from more than one system. Informatica believes Vibe will evolve even further, with micro-VDMs and nano-VDMs that run on sensors and intelligent devices, making it easier to integrate the Internet of things by aggregating and managing data as volumes continue to grow exponentially.

The key is that as data grows, and the challenges facing the use of data grow, the technologies used to manage data also continue to change. The Informatica Vibe virtual data machine makes it easier to manage converting raw data into actionable information and to deal with a constantly changing technological environment. Informatica helps its customers put the potential offered by today's information explosion to work, preparing them to compete in the age of integrated information.

About Informatica

Informatica Corporation (NASDAQ: INFA) is the world's number one independent provider of data integration software. Organizations around the world rely on Informatica for maximizing return on data to drive their top business imperatives. Worldwide, over 4,630 enterprises depend on Informatica to fully leverage their information assets residing on premise, in the cloud, and across social networks.
Worldwide Headquarters, 100 Cardinal Way, Redwood City, CA 94063, USA
Phone: 650.385.5000  Fax: 650.385.5500  Toll-free in the US: 1.800.653.3871
informatica.com  linkedin.com/company/informatica  twitter.com/informaticacorp

© 2013 Informatica Corporation. All rights reserved. Informatica and Put potential to work are trademarks or registered trademarks of Informatica Corporation in the United States and in jurisdictions throughout the world. All other company and product names may be trade names or trademarks. IN09_0513_02460