The New Rules for Integration
A Unified Integration Approach for Big Data, the Cloud, and the Enterprise

A Whitepaper

Rick F. van der Lans
Independent Business Intelligence Analyst
R20/Consultancy

September 2013

Sponsored by Talend
Copyright 2013 R20/Consultancy. All rights reserved. The Talend Platform for Data Services, Talend Open Studio, and The Talend Unified Platform are registered trademarks or trademarks of Talend Inc. Trademarks of other companies referenced in this document are the sole property of their respective owners.
Table of Contents

1 Management Summary
2 Dispersion of Business Data Across a Labyrinth of Systems
3 From Integration Silos to an Integration Labyrinth
4 The New Rules for Integration
5 Rule 1: Unified Integration Platform
6 Rule 2: Generating Integration Specifications
7 Rule 3: Big Data-Ready
8 Rule 4: Cloud-Ready (Hybrid Integration)
9 Rule 5: Enterprise-Ready
10 Talend and the New Rules for Integration
About the Author Rick F. van der Lans
About Talend Inc.
1 Management Summary

Data is increasingly becoming a crucial asset for organizations that want to survive in today's fast-moving business world. And data becomes more valuable if enriched and/or fused with other data. Unfortunately, in most organizations enterprise data is dispersed over numerous systems, all using different technologies. Bringing all that data together is, and has always been, a major technological challenge.

For each system that requires data from other systems, a different integration solution is deployed. In other words, integration silos have been developed, and over time these have led to a complex integration labyrinth. The disadvantages are clear:

- Inconsistent integration specifications
- Inconsistent results
- Increased time to market
- Increased development costs
- Increased maintenance costs

The bar for integration tools and technology has been raised: the integration labyrinth has to disappear. It must become easier to integrate systems, and integration solutions should be easier to design and maintain to keep up with the fast-changing business world. In addition, organizations are now confronted with new technologies such as big data systems and applications running in the cloud. All these new demands are changing the rules of the integration game. This whitepaper discusses the following five crucial new rules for integration:

1. Unified integration platform
2. Generating integration specifications
3. Big data-ready
4. Cloud-ready (hybrid integration)
5. Enterprise-ready

In addition, the whitepaper describes The Talend Platform for Data Services, which fully supports data integration and application integration with one unified platform. It also explains how Talend's product meets the new rules for integration.

2 Dispersion of Business Data Across a Labyrinth of Systems

The Synergetic Effect of Data Integration

The term synergetic effect applies very strongly to the world of business data. By bringing data from multiple IT systems together, the business value of that integrated data is greater than the sum of the individual data elements; one plus one is clearly three here.
For example, knowing which customers have purchased which products is valuable. Knowing which customers have returned products is valuable as well. But it can be even more valuable to bring these data elements together. This may reveal, for example, that a particular customer who purchases many products returns most of them. This customer is probably not as valuable as some may think. Data becomes more valuable if enriched and/or fused with other data.

Dispersion of Business Data Across Many IT Systems

When all the business data is stored in one IT system, bringing data together is technically easy. Unfortunately, in most organizations business data has been dispersed over many different systems. For example, data on a particular customer may be distributed across the sales system, the finance system, the complaints system, the customer accounts system, the data warehouse, the master data management system, and so on.

Usually, the underlying reasons for this situation are historical. Through the years, organizations have created and acquired new systems; they have merged with other organizations, bringing in their systems; and they have rented systems in the cloud. In addition, when new systems were developed to replace older ones, rarely did these new systems fully replace the older ones. In most cases, these legacy systems have been kept alive and are still operational. The consequence of all this is a dispersion of business data.

Besides the fact that data is stored in many systems, an additional complexity is that many systems use different implementation technologies. Some use SQL databases to store data, others use pre-SQL systems such as IDMS, IDS, and Total, and more and more data is available through APIs such as SOAP and REST. And don't forget the new generation of NoSQL systems. The use of heterogeneous technologies for storing data increases the complexity of integrating data.

The conclusion: today it's hard for users to find and integrate the data they need to get the desired synergetic effect. To them, it feels as if their business data has been hidden deep in a complex data labyrinth.

3 From Integration Silos to an Integration Labyrinth

The Integration Silos

More and more IT systems need to retrieve and manipulate data stored in multiple systems. For example, a new portal for supporting customer questions needs access to a production database and an ERP system to get all the data required to handle the customer requests. Another example is a website designed for customers to order products online, which needs to query data from, and insert and update data in, various production applications. A third example is a report that shows what has happened with a particular business process and that also requires integrated data from multiple systems.

For most of these systems, dedicated integration solutions have been developed. For example, the website may use an integration solution developed with an ESB (Enterprise Service Bus). This bus is used to extract data from and insert data in the underlying production systems.
The company portal, on the other hand, may use a dedicated portal server, whereas the reporting environment may be supported by an integration solution based on a data warehouse and an ETL tool. Some newer self-service reporting tools deploy their own lightweight data integration technology.

The Integration Labyrinth

If we consider all these data integration efforts, one can easily speak of integration silos, because for each application or group of applications a dedicated integration solution is developed. This is a highly undesirable situation, because eventually this approach leads to an integration labyrinth.

Disadvantages of Integration Silos

Although having a dedicated integration solution may be handy for the system involved, this approach clearly has some weaknesses:

- Inconsistent integration specifications: Because the integration specifications are distributed over many integration solutions, it's difficult to guarantee that rules in different solutions for integrating the same data are implemented consistently.

- Inconsistent results: If different sets of integration specifications are applied, the results from different integration solutions may be inconsistent. In addition, this inconsistency reduces trust in the data and the supporting systems.

- Increased time to market: Because the integration specifications are replicated, changing them enterprise-wide in all relevant solutions is time-consuming. This slows down the implementation, and thus lengthens the time to market, of new systems.

- Increased development costs: When the same systems are integrated by different integration solutions, the same integration specifications have to be implemented multiple times, thus increasing the development costs.

- Increased maintenance costs: Changing integration specifications in multiple solutions implies changing them in many different tools and programming languages, and requires different development skills. This raises the costs of changing integration specifications considerably.

4 The New Rules for Integration

The new business demands described in the previous section have raised the bar for integration tools and technology; the integration labyrinth has to disappear. It must become easier to integrate systems, and integration specifications should be easier to change and maintain to keep up with the fast-changing business world. These new demands are changing the rules of the integration game.
This whitepaper describes the following five crucial new rules for integration:

1. Unified integration platform
2. Generating integration specifications
3. Big data-ready
4. Cloud-ready (hybrid integration)
5. Enterprise-ready

In the next sections, each of these five rules is explained in detail.

5 Rule 1: Unified Integration Platform

Multitude of Integration Technologies

Organizations can select from a multitude of technologies for integrating systems and data sources, such as ESBs (Enterprise Service Buses), ETL tools, data replicators, portals, data virtualization servers, and homemade code. Today, they can even use the lightweight integration technology embedded in self-service reporting tools to integrate data sources.

It's good that this wide range of integration styles exists, because no single integration style is perfect for every integration problem. For example, when data is integrated from multiple data sources and copied to one data source using a batch-oriented approach, ETL is the preferred approach, whereas when individual data messages must be transmitted from one application to another, an ESB is undoubtedly the recommended solution. So, different problems require different solutions.

However, the current situation is that organizations really are deploying all these different solutions, which leads to a duplication of integration specifications. For example, when an ETL solution has been designed to extract data from a particular database and a data replicator is using that same database, there is a big chance that comparable integration specifications have been entered for both solutions. Or, when an application is accessed by both an ESB and a portal to retrieve data, comparable specifications have probably been developed for each of them.

A Unified Integration Platform

In the situation described here, the wheel is reinvented over and over again, and this leads to integration silos. The previous section lists the disadvantages of integration silos. To solve this problem, it's essential that integration tools support many different integration styles with one single design and runtime platform. Developers should not have to switch to another tool if they want to switch from, for example, ESB-style to ETL-style integration. Nor should they have to switch to another tool if data is moved to another database platform, or if the applications use another API type. There should be one integrated platform in which all the integration specifications (logical and technical) are stored only once.

Rule 1: Integration tools must support unified integration capabilities.
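To make "stored only once" concrete, here is a minimal sketch in Python. The specification, field names, and functions are hypothetical and not taken from any particular product; the point is only that one centrally stored specification can drive both a batch, ETL-style run and a per-message, ESB-style handler.

    # A single integration specification, defined once and stored centrally.
    CUSTOMER_SPEC = {
        "source": "sales_db.customers",
        "target": "warehouse.dim_customer",
        "mapping": {"cust_id": "customer_id", "cust_name": "name", "country_cd": "country"},
    }

    def transform(record, spec):
        """Apply the logical mapping of a specification to one record."""
        return {target: record[source] for source, target in spec["mapping"].items()}

    def run_batch(records, spec):
        """ETL-style: transform a whole extract in one batch run."""
        return [transform(r, spec) for r in records]

    def handle_message(message, spec):
        """ESB-style: transform one message as it travels between applications."""
        return transform(message, spec)

    if __name__ == "__main__":
        extract = [{"cust_id": 1, "cust_name": "Jones", "country_cd": "NL"}]
        print(run_batch(extract, CUSTOMER_SPEC))          # batch-oriented integration
        print(handle_message(extract[0], CUSTOMER_SPEC))  # message-oriented integration

Because both styles read the same specification, a change to the mapping has to be made in one place only.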
6 Rule 2: Generating Integration Specifications

Logical Versus Technical Integration Specifications

Regardless of the integration technology in use, developers have to enter integration specifications. These specifications can be classified into two groups: logical and technical integration specifications. The former group deals with the what and the latter with the how. Logical specifications describe the structure of source and target systems, the required transformations, merge and join specifications, cleansing rules, and so on. They purely indicate what should be done, not how. That's where the technical specifications come in: they deal with the specific APIs of source and target systems, performance and efficiency aspects, et cetera.

It's everyone's dream that developers of integration solutions would only need to focus on the logical aspects of integration, and not on the technical aspects. For example, it should be irrelevant for developers whether data has to be extracted from a classic SQL system, from a Hadoop system, or from a Salesforce.com application running in the cloud. They should be focusing on the logical structure of the source data, the logical structure of the target system, and which transformations to apply. They should not have to focus on specific APIs, the database concepts used, encryption aspects, et cetera. Unfortunately, in many integration tools this is not the case. Developers do have to know how to extract data from Hadoop using Hive, Pig, or HBase, via an ESB using a SOAP-based interface, via a REST interface, via JMS, or via one of the many other alternatives.

Abstraction Through Code Generation

An integration solution should hide all the technical integration aspects and let developers focus on the logical aspects. Code generation is a proven technique for hiding technical aspects; it has been applied very successfully in the IT industry for many years. Numerous examples exist where code generation is used to hide technical aspects from developers:

- Starting in the 1960s, Cobol compilers generated assembler code, and by doing that they concealed many of the technical difficulties of assembler programming.

- Another successful example is SQL. The distinction between the what and the how has been the basis for SQL. SQL queries only deal with the what: what data should be retrieved from the database? Queries do not indicate how the data should be retrieved efficiently and quickly from disk. SQL database servers generate internal code to access the data, and with that they hide the technical aspects of data storage and data access.

- Probably the most popular example is Java. Java compilers generate Java byte code. By generating code, an abstraction layer is created that hides technical details and thus increases the productivity of developers.
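The same principle can be sketched for integration itself. The fragment below is a deliberately simplified, hypothetical Python example (no real integration tool generates code this way, and the table and column names are invented): it takes one logical mapping specification and generates different technical code for different target platforms, here a SQL statement and a HiveQL-style statement.

    # Logical specification: what to move and how fields map, nothing platform-specific.
    SPEC = {
        "source": "customers",
        "target": "dim_customer",
        "mapping": {"cust_id": "customer_id", "cust_name": "name"},
    }

    def generate_sql(spec):
        """Generate an ANSI SQL-style statement from the logical specification."""
        cols_in = ", ".join(spec["mapping"].keys())
        cols_out = ", ".join(spec["mapping"].values())
        return (f"INSERT INTO {spec['target']} ({cols_out}) "
                f"SELECT {cols_in} FROM {spec['source']}")

    def generate_hiveql(spec):
        """Generate a HiveQL-style statement for a Hadoop/Hive target from the same specification."""
        cols_in = ", ".join(spec["mapping"].keys())
        return (f"INSERT INTO TABLE {spec['target']} "
                f"SELECT {cols_in} FROM {spec['source']}")

    print(generate_sql(SPEC))     # the developer's intent, expressed for a SQL platform
    print(generate_hiveql(SPEC))  # the same intent, expressed for a Hive platform

The developer maintains only the specification; which dialect is emitted is a concern of the generator.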
Advantages of Code Generation

In general, the advantages of code generation are:

- Transparency
- Portability
- Productivity
- Maintenance

The perfect integration tool allows integration developers to focus on the logical specifications and hides all the technical details. For example, developers should not have to learn all the technical peculiarities of Hadoop to be able to extract data, they should not have to study all the details of an ESB, and they should not have to investigate how to extract data from a legacy database or how to insert data into a NoSQL store such as Cassandra. Integration tools should understand what the most efficient and fastest way is to work with all those source and target systems.

Is Generated Code Efficient?

The following question has been raised countless times since tools started to generate code: is generated code efficient? Is it as efficient as code written by hand? Maybe the answer is no. Maybe generated code is, in most cases, generic code and therefore not the most efficient code possible. However, if code has to be written by hand, does an organization have the specialists on board who can write more efficient code? And if they can, what are the costs of writing that code by hand? In addition, how maintainable is code written by hand? Imagine if the IT industry had never outgrown the world of assembler languages; productivity would have been horrendous.

The discussion on code generation should not be limited to the efficiency of the code. A fair comparison does not focus solely on performance and efficiency, but also includes productivity and maintenance. Nowadays, these latter two aspects are considered more important than brute performance and efficiency.

Integration Speed-Up Through Generating Integration Code

The need to integrate systems keeps increasing, and so does the pressure to finish integration projects faster. What is needed are tools that offer integration speed-up. Because a generator hides all the technical details and developers only need to focus on the what, less time has to be spent on development.

Rule 2: Integration tools must offer integration speed-up through code generation.

7 Rule 3: Big Data-Ready

The Big Data Train Keeps Rolling

There is no stopping it: the big data train left the station a few years ago and continues to travel the world. Many organizations have already adopted big data, some are already relying on these systems, some are in the process of adopting big data, and others are studying what big data could mean for them. In a nutshell, big data systems enrich the analytical capabilities of an organization.
Gartner 1 predicts that big data will drive $232 billion in spending through 2016, Wikibon 2 claims that by 2017 big data revenue will have grown to $47.8 billion, and the McKinsey Global Institute 3 indicates that big data has the potential to increase the value of the US health care industry by $300 billion and the value of Europe's public sector administration by €250 billion.

1 Gartner, October 2012; see http://techcrunch.com/2012/10/17/big-data-to-drive-232-billion-in-it-spending-through-2016/
2 Wikibon, Big Data Vendor Revenue and Market Forecast 2012-2017, August 26, 2013; see http://wikibon.org/wiki/v/big_data_vendor_revenue_and_market_forecast_2012-2017
3 McKinsey Global Institute, Big Data: The Next Frontier for Innovation, Competition, and Productivity, June 2011; see http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation

NoSQL and Hadoop

Developing big data systems is not easy, for the following reasons:

- The sheer size of a big data system makes it a technological challenge.

- A large portion of big data is unstructured, multi-structured, or semi-structured. This means that to analyze big data, structure must be assigned to it when it's being read. This is called schema-on-read and can be complex and resource-intensive.

- Some big data is sensor data. In most cases, sensor data is highly cryptic and heavily coded. These codes may indicate machines, customers, sensor devices, and so on. To be able to analyze this coded data, it must be enriched with meaningful data that exists in other, more traditional data stores, which requires some form of integration.

To tackle the above aspects, many organizations have decided to deploy NoSQL systems for storing and managing these massive amounts of data. NoSQL systems are designed for big data workloads and are powerful and scalable, but they are different from the well-known SQL systems that many developers are familiar with. Here are some of the differences:

- A NoSQL system, as the name implies, supports neither the popular SQL database language nor the familiar relational concepts, such as tables, columns, and records. This means that developers of integration solutions must learn how to handle these new concepts and how to merge them with classic concepts.

- Each NoSQL system supports its own API, database language, and set of database concepts. This means that expertise with one product can't easily be reused with another.

- NoSQL skills are still scarce. Most organizations don't have these skills, and external specialists are not easy to find.

Integration Tools Must Be Big Data-Ready

To be able to exploit the value hidden in these big data systems, the data has to be analyzed and enriched. In other words, data from big data systems has to be integrated with data from traditional systems. Thus, integration tools must be able to integrate these two types of data sources, and this requires that they support NoSQL systems. In other words, today integration tools must be big data-ready.

Rule 3: Integration tools must be big data-ready.

Requirements for being big data-ready include:

- Data Scalability: Integration tools must be able to process massive amounts of data. Technically, this means that when an ETL style of integration is deployed, they must support the native load and unload facilities offered by these NoSQL systems (if available).

- Pushdown: NoSQL systems are highly scalable platforms because they can distribute the processing of logic over an enormous set of processors. To be able to exploit this power, integration tools must be able to push down as much of the integration logic into the NoSQL systems as possible. For example, if Hadoop MapReduce is used, an integration tool must be able to generate a MapReduce program that extracts data from HDFS and executes as much of the transformation processing as it can.

- NoSQL Concepts: Integration tools must understand the new database concepts supported by NoSQL systems, such as hierarchical data structures, multi-structured tables, column families, and repeating groups. They must be able to transform such concepts to flat relational concepts, and vice versa.

- Bi-directional: Integration tools must be able to read from and write to NoSQL systems.

- Schema-on-read: Integration tools must support schema-on-read. In other words, integration tools must be able to assign a schema to big data when it's unloaded from a NoSQL system.
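As a minimal illustration of the last two requirements, the following Python sketch (the field names and the sample record are invented, and it is not tied to any specific NoSQL product) assigns a schema to raw, schema-less records at read time and flattens a hierarchical record containing a repeating group into flat, relational-style rows:

    import json

    # Raw export from a hypothetical NoSQL store: schema-less JSON lines.
    raw_lines = [
        '{"id": 17, "name": "Jones", "orders": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}',
    ]

    # Schema-on-read: the schema is applied while reading, not when the data was stored.
    SCHEMA = {"id": int, "name": str}

    def read_with_schema(line, schema):
        """Parse one raw line and apply the schema to its top-level fields."""
        record = json.loads(line)
        typed = {field: cast(record[field]) for field, cast in schema.items()}
        return typed, record.get("orders", [])

    def flatten(line, schema):
        """Turn one hierarchical record (with a repeating group) into flat rows."""
        parent, orders = read_with_schema(line, schema)
        return [{**parent, "sku": o["sku"], "qty": o["qty"]} for o in orders]

    for line in raw_lines:
        for row in flatten(line, SCHEMA):
            print(row)  # e.g. {'id': 17, 'name': 'Jones', 'sku': 'A1', 'qty': 2}

A big data-ready integration tool performs this kind of work at much larger scale, preferably pushed down into the NoSQL system itself, without the developer having to code it by hand.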
8 Rule 4: Cloud-Ready (Hybrid Integration)

The Success of the Cloud

More and more applications and data are moving to the cloud. To illustrate the growing success of cloud-based systems, here are two quotes from an IDC 2012 study 4:

- "Worldwide spending on public IT cloud services will be more than $40 billion in 2012 and is expected to approach $100 billion in 2016."

- "By 2016, public IT cloud services will account for 16% of IT revenue in five key technology categories: applications, system infrastructure software, platform as a service (PaaS), servers, and basic storage. More significantly, cloud services will generate 41% of all growth in these categories by 2016."

4 IDC, Worldwide and Regional Public IT Cloud Services 2012-2016 Forecast, August 2012.
Not only commercial organizations are moving to the cloud. According to a recent Gartner study 5 into IT spending by governments, their interest in the cloud is growing rapidly as well:

"Cloud computing continues to increase compared with prior years, driven by economic conditions and a shift from capital expenditure to operational expenditure, as well as potentially more important factors such as faster deployment and reduced risk. Between 30 to 50 per cent of respondents are seeking to adopt public and private cloud-based services and will sign up for an active IT services contract within the next 12 months."

5 Gartner, User Survey Analysis: IT Spending Priorities in Government, Worldwide, 2013, 25 January 2013.

When systems are moved to the cloud, technically this means that data is moved to the cloud, from the cloud, or within the cloud (from one system to another).

The Blurring of the Cloud

In the beginning, the cloud was special: there was a clear boundary between systems running in the cloud and systems running on-premises. Today, that boundary is becoming fuzzy. For example, enterprise systems are spilling into the cloud: they run on-premises but access services running in the cloud, data stored on-premises has to be integrated with data stored in the cloud, and on-premises applications are migrated to the cloud or vice versa. The consequence is that more and more hybrid (cloud and non-cloud) systems exist. Furthermore, there are different types of clouds, ranging from public clouds to private clouds, which also blurs the distinction between cloud and on-premises. It's more and more becoming a sliding scale from 100% cloud to 100% on-premises. Financial, privacy, security, and performance reasons determine where applications and data are best placed. The conclusion: things can and will change over time.

Cloud and Integration

What does the cloud mean for integration? In general, it should be irrelevant for developers of an integration solution where data and applications reside, in the cloud or not. Therefore, integration tools should understand what the most efficient way is to transport data and messages into, from, and within the cloud. In addition, when applications and data sources move into the cloud or back, this should not change the logical integration specifications. The reason is that when data or applications are moved, the logical integration aspects don't change, only the technical aspects. Integration tools must hide the technical cloud aspects from integration developers.

Integration Requirements for the Cloud

Technical requirements for integration tools to be able to operate successfully in the cloud are:

- Efficient data transmission: Moving data across the cloud/on-premises boundary, and moving data within the cloud, is not as fast as moving data between local systems. It's therefore important that integration solutions deploy efficient techniques for data transmission. For example, smart compression techniques should be supported.

- Location transparency: It should be hidden from developers whether a system or data source runs in the cloud or not. When a system is moved from on-premises to the cloud, this should have no effect on the logical integration specifications. Only technical integration specifications dealing with location and network communication should have to be changed.

- Support for cloud APIs: With the cloud came various new applications and systems introducing new APIs and languages. For example, for extracting data from Salesforce.com, Facebook, and Twitter, special APIs are available, and for inserting data into some cloud systems new APIs exist as well. Integration solutions should support as many of these typical cloud APIs as possible.

- Secure data transmission: Data and messages that are transmitted over public communication mechanisms must be protected against unauthorized access. Integration solutions should support various encryption mechanisms. Again, it's important that these encryption specifications are independent of the logical integration specifications, so that when another encryption mechanism is required, or when a system is moved and a different mechanism therefore becomes relevant, there is no impact on the logical specifications. All the integration work should still work.

To summarize, due to the cloud in all its forms, vendors of integration solutions should invest in supporting all the required features: integration tools must be cloud-ready. Frank Gens 4 (senior vice president and chief analyst at IDC) worded it as follows: "Quite simply, vendor failure in cloud services will mean stagnation."

Rule 4: Integration tools must be cloud-ready.
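A minimal sketch of the location transparency and efficient data transmission requirements is shown below (Python; the endpoint names are invented, and a real solution would obviously add authentication, encryption, and error handling). The logical specification never changes; only the technical endpoint profile does, and compression is applied transparently when the target sits across the cloud boundary.

    import gzip
    import json

    # Logical specification: what to move, independent of where the systems run.
    LOGICAL_SPEC = {"source": "orders", "target": "order_archive", "fields": ["id", "amount"]}

    # Technical endpoint profiles: the only thing that changes when a system moves.
    ON_PREMISES = {"host": "db01.internal", "remote": False}
    CLOUD = {"host": "archive.example-cloud.com", "remote": True}

    def ship(records, spec, endpoint):
        """Prepare a payload for transmission; compress only for remote (cloud) endpoints."""
        payload = json.dumps({"target": spec["target"], "rows": records}).encode("utf-8")
        if endpoint["remote"]:
            payload = gzip.compress(payload)  # efficient transmission across the cloud boundary
        return endpoint["host"], payload

    rows = [{"id": 1, "amount": 99.50}]
    print(ship(rows, LOGICAL_SPEC, ON_PREMISES)[0])  # db01.internal, plain payload
    print(ship(rows, LOGICAL_SPEC, CLOUD)[0])        # archive.example-cloud.com, compressed payload

Moving the target from on-premises to the cloud means swapping the endpoint profile; the logical specification, and therefore the integration logic itself, stays untouched.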
9 Rule 5: Enterprise-Ready

Enterprise-Ready

For completeness' sake, the rule of enterprise-ready has been added. Evidently, this is a rule that has always applied: integration technology has always had to be enterprise-ready. More and more organizations are relying on these solutions, and thus the need for enterprise-readiness is crucial. Enterprise-ready means the following:

- Integration tools must offer a high level of robustness, scalability, and performance.
- Integration tools must be enterprise-grade with respect to support.
- Integration tools must support all relevant security technologies, including authorization, authentication, and encryption.
- Integration tools must be DTAP-ready (Development, Testing, Acceptance, Production).
- Integration tools must be easy to monitor and manage.

However, being enterprise-ready is a moving target. What was a reasonable level of scalability five years ago can be far from sufficient today. For example, data warehouses are growing in size, the amount of data to be analyzed grows phenomenally, the number of applications to be integrated increases, and the number of messages to be transmitted between applications grows. When enterprise demands increase, integration technology must follow.

Rule 5: Integration tools must be enterprise-ready.
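As an illustration of the DTAP requirement, here is a minimal Python sketch (the environment names and settings are invented and purely illustrative): one and the same integration job is promoted through Development, Testing, Acceptance, and Production simply by selecting a different environment profile at deployment time.

    # One job definition, deployed unchanged to every DTAP environment.
    JOB = {"name": "load_dim_customer", "source": "crm.customers", "target": "dwh.dim_customer"}

    # Per-environment technical settings (Development, Testing, Acceptance, Production).
    ENVIRONMENTS = {
        "development": {"db_host": "dev-dwh.internal",  "parallelism": 1, "alerting": False},
        "testing":     {"db_host": "test-dwh.internal", "parallelism": 2, "alerting": False},
        "acceptance":  {"db_host": "acc-dwh.internal",  "parallelism": 4, "alerting": True},
        "production":  {"db_host": "prod-dwh.internal", "parallelism": 8, "alerting": True},
    }

    def deploy(job, environment):
        """Combine the environment-independent job with environment-specific settings."""
        settings = ENVIRONMENTS[environment]
        return {**job, **settings, "environment": environment}

    print(deploy(JOB, "testing"))
    print(deploy(JOB, "production"))

Because promotion only swaps the profile that is applied, what has been tested is exactly what runs in production.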
Open Source Software Is Enterprise-Ready

Some IT specialists still think that open source software is buggy, that it doesn't scale, that its functionality is poor, that the products are simpler versions of their closed source counterparts, and so on. This is an erroneous view of open source software. Open source tools and systems are being deployed in the largest IT systems, and they can easily compete with their closed source competitors. Still, some myths related to open source tools persist:

- Myth 1: Open source software is not market-ready. This myth suggests that open source software requires a lot of tweaking before it can be used: no simple install procedure exists, functionality is missing, and the software is buggy. In a way, the suggestion is made that the products are not finished. This is incorrect; professional open source tools are as market-ready as their proprietary counterparts.

- Myth 2: Open source products are evaluation versions. For many open source products, commercial and community versions exist. The community versions can be used for evaluation purposes, but they are mature enough to be deployed in operational environments. In fact, many organizations run operational systems that make use of community versions. In addition, the commercial versions may add enterprise-class features or may be more scalable than the community versions.

- Myth 3: Open source software has a steep learning curve. This is not true. Whether a product is open source or not has no relationship with the learning curve. It all depends on how the interface of the product has been developed. Many open source products are as easy to use as closed source products, and vice versa: there are closed and open source products that are very hard to use.

- Myth 4: No stable and predictable pricing model. Evidently, different pricing models apply for community and commercial versions of open source software. Different vendors use different pricing models for their commercial versions; some of these models are crystal clear and others are somewhat muddy. This is not very different from the pricing models of closed source software. It's a myth to suggest that all open source software vendors have unclear pricing models.

- Myth 5: The loosely united community leads to weak software. The community of developers working on open source software may literally be spread out over the world. Some of these developers are on the payroll of the vendors and some are not. The tools and project management techniques that make geo-distributed development easy exist today. In fact, vendors of closed source software are using this development model more and more as well. This style of development does not lead to weak software.

- Myth 6: No enterprise-grade support. Whether vendors offer enterprise-grade support does not depend on whether they offer open source software or not. It all depends on the maturity of the vendor itself and its willingness to invest in support. More and more open source software vendors are offering enterprise-grade support.
- Myth 7: Minimal connectivity options. For integration tools it's important to offer a wide range of connectors for different technologies. One of the strengths of open source software is that it's designed so that others can contribute and add features as well. When customers use very exotic systems for which no connector is available, they can develop that connector themselves and make it public. In other words, this openness makes it possible for a large community to develop a large and fast-growing set of connectivity options. With closed source software, the vendor must build all the features themselves.

10 Talend and the New Rules for Integration

Talend in a Nutshell

Talend Inc. was founded in 2005 as the first open source vendor of data integration software. In November 2006 the company released its first product, the ETL tool Talend Open Studio. On November 10, 2010, Talend acquired Sopera, and with that it gained access to a successful, high-end, open source ESB for application integration. With this, Talend had the products, the know-how, and the technology in the two main integration areas: data integration and application integration. In November 2011, Gartner rated Talend a visionary in its well-known Magic Quadrant for data integration tools.

Since the acquisition, Talend has worked hard to unify the two integration solutions. The result is the solution called The Talend Platform for Data Services, which fully supports data integration and application integration. This approach seriously minimizes the proliferation of integration specifications and makes the goal of a unified view real. Developers trained in data integration solutions can now reuse their skills when switching to other types of integration solutions.

Meeting the Five Rules for Integration

Section 4 lists the following five new rules for integration:

1. Unified integration platform
2. Generating integration specifications (code generation)
3. Big data-ready
4. Cloud-ready
5. Enterprise-ready

Rule 1: Integration tools must support unified integration capabilities

By merging the data integration and application integration technologies, Talend offers a real unified integration platform. It consists of the following modules:

- Common graphical development environment: Designers and developers can use a single development environment, called Talend Open Studio, to enter and maintain integration specifications and develop solutions. This module is based on the popular, extensible Eclipse integrated development environment. There is no need for developers to learn multiple integration tools. In other words, whether a developer wants to use Talend's ESB or ETL solution, they'll use the same development environment.
- Common repository: Technical integration specifications, such as connectors to data sources and schema definitions, need to be defined only once and are stored in one common repository. This allows them to be shared by different integration solutions. Whether an integration specification is used for ETL-style integration or for ESB-style integration, it's stored only once in the common repository.

- Common runtime environment: Next to seeing only one development environment, Talend developers see one common runtime environment. In reality, however, the code they develop is deployed on various runtime environments. Talend currently supports four runtime environments: Java, SQL, Hadoop MapReduce, and Camel. This makes it possible to generate native, optimized code for these environments. So, a developer writing code to extract data from a source system doesn't have to deal with the technical aspects that are specific to SQL systems, Hadoop systems, or cloud-based applications. If integration specifications have to run on MapReduce, optimized MapReduce code is generated, and if they have to execute on a SQL database server, optimized SQL code is generated for that platform.

- Common deployment mechanism: Whether integration logic should run ETL-style or ESB-style, developers use the same deployment mechanism. They don't have to study different deployment mechanisms; the code generator generates correct and efficient code.

- Common monitoring: Integration must be monitored. Talend supports one monitoring environment: whether ESB or ETL is deployed, the same monitoring tool is used. There is no need to learn and install multiple monitoring environments. This simplifies the entire integration environment considerably.

Rule 2: Integration tools must offer integration speed-up through code generation

In Talend, whether data integration or application integration is selected, developers design their logical integration specifications independent of the source and target systems. The code generator that generates code for the various runtime environments handles the rest. For example, Talend can generate the MapReduce code needed to access data stored in Hadoop HDFS; developers do not have to study all the complexities of this platform to deploy HDFS. In fact, when a new interface for HDFS is invented in the future, Talend will probably support that interface as well. Existing integration specifications then don't have to be changed: new code is generated for the new interface based on the existing specifications.

Rule 3: Integration tools must be big data-ready

Talend is big data-ready. As indicated, its common runtime environment supports Hadoop MapReduce. Code is generated for and executed by Hadoop MapReduce using the pushdown technique. How this is all done is completely hidden from developers. In other words, integration developers don't have to learn the specifics of Hadoop MapReduce or any of the other Hadoop layers to be able to extract data. In addition, existing integration specifications developed for a SQL system can easily be migrated to Hadoop because of this feature; no or minimal specifications have to be altered. So, if such a migration is required because of scalability problems, the current investment in integration specifications remains safe. In addition, Talend is also available for Amazon Redshift and Google BigQuery.
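To make the pushdown idea concrete, here is a deliberately simplified Python sketch (the table and column names are invented, and this is not Talend's actual generated code): instead of extracting raw rows and aggregating them locally, a generator emits a single statement that lets the source system perform the filtering and aggregation itself.

    # Logical specification: filter and aggregate order data.
    SPEC = {
        "source": "orders",
        "filter": "order_date >= '2013-01-01'",
        "group_by": "customer_id",
        "measure": "SUM(amount) AS total_amount",
    }

    def generate_pushdown_query(spec):
        """Generate one statement that executes the transformation inside the source system."""
        return (f"SELECT {spec['group_by']}, {spec['measure']} "
                f"FROM {spec['source']} "
                f"WHERE {spec['filter']} "
                f"GROUP BY {spec['group_by']}")

    # The integration platform ships this statement to the engine (SQL, Hive, and so on);
    # only the aggregated result set travels back.
    print(generate_pushdown_query(SPEC))

The same specification could just as well be rendered as a HiveQL statement or a MapReduce job; the essential point is that the heavy processing stays close to the data.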
Rule 4: Integration tools must be cloud-ready

For Talend, applications in the cloud, databases in the cloud, and messages transmitted through the cloud are all sources or targets. As with big data, the common runtime platform hides the specific aspects of these cloud systems. The result is that applications can be moved from on-premises to the cloud (or vice versa) without having to change integration specifications. The same applies when data is moved from on-premises to the cloud. Talend makes the cloud transparent for integration developers.

Rule 5: Integration tools must be enterprise-ready

Talend has always been enterprise-ready. Many customers are using Talend in large-scale environments, so there is no doubt about its enterprise-readiness. For illustration purposes, here are some large, anonymous business cases where Talend is being deployed. The numbers mentioned show how large some of these environments are:

- A large e-commerce company specializes in searching for and booking business trips and vacations. They use Talend. They have three systems, all running on MySQL databases. The system supports 300,000 users and processes tens of thousands of auctions every day, generating huge amounts of data. This big data stream is stored in a central warehouse where it can be accessed by the applications. The data warehouse currently holds over one terabyte of data, and this figure is rising rapidly.

- A leading mobile service provider uses Talend in an environment with more than 30 million customers and approximately 200 million phone calls per day. The company has to manage huge volumes of data in quasi real time. In order to improve its services, invoicing, and marketing practices, the operator needed to extract different types of information from the call detail records and then integrate this data into three different systems for marketing campaigns, pricing simulation, and revenue assurance management.

- The third case is a company that makes complex real estate data available on the web and on mobile devices. They receive data for roughly 2 million MLS (Multiple Listing Service) property listing records on a daily basis. The listings include agent and office data and about 17 million photo files. The company consolidates and standardizes all this data in order to efficiently provide property listings to major real estate companies and web portals.
About the Author Rick F. van der Lans

Rick F. van der Lans is an independent analyst, consultant, author, and lecturer specializing in data warehousing, business intelligence, service-oriented architectures, data virtualization, and database technology. He works for R20/Consultancy (www.r20.nl), a consultancy company he founded in 1987.

Rick is chairman of the annual European Enterprise Data and Business Intelligence Conference (organized in London). He writes for the eminent B-eye-Network 6 and other websites. He introduced the business intelligence architecture called the Data Delivery Platform in 2009 in a number of articles 7, all published at BeyeNetwork.com.

He has written several books on SQL. Published in 1987, his popular Introduction to SQL 8 was the first English book on the market devoted entirely to SQL. After more than twenty years, this book is still being sold, and it has been translated into several languages, including Chinese, German, and Italian. His latest book 9, Data Virtualization for Business Intelligence Systems, was published in 2012.

For more information please visit www.r20.nl, or email rick@r20.nl. You can also get in touch with him via LinkedIn and via Twitter @Rick_vanderlans.

About Talend Inc.

Talend provides integration solutions that truly scale for any type of integration challenge, any volume of data, and any scope of project, no matter how simple or complex. Only Talend's highly scalable data, application, and business process integration platform enables organizations to effectively leverage all of their information assets. Talend unites integration projects and technologies to dramatically accelerate the time-to-value for the business. Ready for big data environments, Talend's flexible architecture easily adapts to future IT platforms.

Talend's unified solutions portfolio includes data integration, data quality, master data management (MDM), enterprise service bus (ESB), and business process management (BPM). A common set of easy-to-use tools implemented across all Talend products maximizes the skills of integration teams. Unlike traditional vendors offering closed and disjointed solutions, Talend offers an open and flexible platform, supported by a predictable and scalable value-based subscription model.

6 See http://www.b-eye-network.com/channels/5087/articles/
7 See http://www.b-eye-network.com/channels/5087/view/12495
8 R.F. van der Lans, Introduction to SQL; Mastering the Relational Database Language, fourth edition, Addison-Wesley, 2007.
9 R.F. van der Lans, Data Virtualization for Business Intelligence Systems, Morgan Kaufmann Publishers, 2012.