Data processing goes big




Test report: Talend Enterprise Data Integration Big Data Edition
By Dr. Götz Güttich

Talend Enterprise Data Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors, it can shift data from any number of different sources to just as many targets. The Big Data Edition of this solution is optimized to take advantage of Hadoop and its associated databases and technologies such as HBase, HCatalog, HDFS, Hive, Oozie and Pig. IAIT decided to put the product through its paces and see how it stacks up in the real world.

When IT specialists talk about big data, they are usually referring to data sets that are so large and complex that they can no longer be processed with conventional data management tools. These huge volumes of data are produced for a variety of reasons. Streams of data can be generated automatically (reports, logs, camera footage, etc.). Or they could be the result of detailed analyses of customer behavior (consumption data), scientific investigations (the Large Hadron Collider is an apt example) or the consolidation of various data sources. These data repositories, which typically run into petabytes and exabytes, are hard to analyze because conventional database systems simply lack the muscle. Big data has to be analyzed in massively parallel environments where computing power is distributed over thousands of computers and the results are transferred to a central location.

The Hadoop open source platform has emerged as the preferred framework for analyzing big data. Its distributed file system splits the information into several data blocks and distributes these blocks across multiple systems in the network (the Hadoop cluster). By distributing computing power, Hadoop also ensures a high degree of availability and redundancy. A "master node" handles file storage as well as requests.
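To make the block distribution a little more tangible, the following minimal Java sketch lists how HDFS has split a file into blocks and which data nodes hold the replicas. It is an illustration only, based on the standard Hadoop FileSystem API; the name node address, user and file path are assumptions, not details from our test setup.

// Sketch: show how HDFS has split a file into blocks and where the replicas live.
// Host name and file path below are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // On Hadoop 1.x the key is "fs.default.name"; newer releases use "fs.defaultFS".
        conf.set("fs.defaultFS", "hdfs://hadoop-master:8020");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/customers.csv");
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        // Each block is stored on several data nodes for redundancy.
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}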

Integration

Hadoop is a very powerful computing platform for working with big data. It can accept external requests, distribute them to individual computers in the cluster and execute them in parallel on the individual nodes. The results are fed back to a central location where they can then be analyzed. However, to reap the benefits of Hadoop, data analysts need a way to load data into Hadoop and subsequently extract it from this open source system. This is where the Big Data Edition of Talend Enterprise Data Integration comes in. As previously mentioned, Talend Enterprise Data Integration can read data from just about any source, process the data as instructed by the user and then export it. Taking a simple job as an example, this tool can read CSV files, select certain fields like name or address, and export the results to an Excel file. But its capabilities don't end there: it also works with business intelligence and cloud solutions like Jaspersoft, SAP, Amazon RDS and Salesforce and supports a variety of databases such as DB2, Informix and, obviously, Hadoop.

How does it work?

Integration is a code generator. All the user needs to do is define a data source, a CSV file or a database for instance, and then specify the operation to be performed on that data. In the case of a CSV file, for example, the user will specify details like the encoding method and field separator, although the options available will of course vary from source to source. As soon as the data source is set, the user can link it to their workspace as an icon. Next, the user can define their preferred workflow. They can manipulate the data in a number of ways: filter, sort, replace, transform, split, merge and convert. There is also a map function for transforming data. This lets users select specific data fields, rearrange these fields, automatically insert additional data like numbering and much more. Each of these transformation features comes with its own icon, which can be dragged and dropped to the user's workspace and configured there.

Once the data source has been defined and the user has specified how the tool should process the information, the next task is to define the export settings. For this, Talend offers various connectors to supported target systems such as Informix or Hadoop. Integration uses icons for the connectors, too. Again, the user can simply drag and drop the icons and configure them on their workspace. Here again, the configuration options depend on the export medium. In the case of an Excel spreadsheet, for example, the user just needs to enter the export path. Lines are used to illustrate the flow of data between the icons. Users can usually just drag these (sometimes they also need to select certain connection types from a menu).

The job can be started once all of these steps have been completed. First, Integration creates the code required to execute the job. Then it starts the job and transforms the data. Depending on the technology used, the generated code can be Java or SQL or, for Hadoop, MapReduce, Pig Latin, HiveQL and others. Since each job step is symbolized by an icon, for which users only have to specify the general requirements in order to generate the code automatically, users don't need programming skills to run complex, code-intensive data processing jobs.
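Purely to give a feel for what such generated code boils down to, here is a hand-written Java sketch of a trivial job: read a delimited file, keep two fields and write the result. It is not Talend's actual generated output; the file names, the semicolon separator and the column layout are assumptions for illustration.

// Hand-written illustration only: roughly the logic of a simple
// "read CSV, keep name and city, write output" job.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SimpleCsvJob {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(
                     Paths.get("customers.csv"), StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(
                     Paths.get("customers_out.csv"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split(";");   // field separator as configured
                if (fields.length < 4) continue;     // skip malformed rows
                String name = fields[1];             // assumed column layout
                String city = fields[3];
                out.write(name + ";" + city);
                out.newLine();
            }
        }
    }
}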

The above example is a simple one to illustrate the process, but Integration can handle much more complex tasks, such as importing data and then mapping specific fields, transforming certain data types and sorting the modified output prior to export.

Versions

Talend software is available in a number of editions, starting with the freely downloadable open source products Talend Open Studio for Data Integration and Talend Open Studio for Big Data. These are available on the vendor's website and are free to download and use. The subscription-based product is called Talend Enterprise Data Integration and is available in four editions: Team, Professional, Cluster and Big Data. The main difference between the subscription and open source products is extra support options (including SLAs) and additional features such as shared repositories, wizards, shared jobs, version control and reference projects, to mention but a few. The subscription versions themselves offer different combinations of features, including support for load balancing, high availability and Hadoop. Talend is in fact one of the first vendors to provide Hadoop support. The website provides a comparative matrix of the available software versions and their functions.

For our test, we looked at the Big Data Edition of Integration, which requires a subscription. It is fair to point out, however, that the open source product Talend Open Studio for Big Data has quite an extensive set of functions and would be perfectly suitable for all internal data transformation jobs. Administrators looking to fast-track import and export script writing should look at what the subscription products have to offer.

The test

For the test we used an environment with Hadoop 1.0.3 running in a vSphere installation based on IBM's X Architecture. After installing Talend Enterprise Data Integration on a workstation running the x64 edition of Windows 7 Ultimate, we started by importing data from a CSV file. We transformed the data and exported it as an Excel spreadsheet to become familiar with the solution's functionality. We then set up a connection to our Hadoop system, imported the same CSV data again and wrote it to Hadoop. Then we re-exported the data to check that everything had worked properly. For the next step, we took company data sets of 100,000 and ten million records and analyzed this data using Pig. Finally, we used Hive to access the data in Hadoop and worked with the HBase database. We will cover these terms in greater detail later on.

As for specifications, Talend Enterprise Data Integration runs on Java 1.6, the latest version. The vendor recommends the 64-bit version of Windows 7 as the operating system and a standard workstation with 4 GB of memory.

Installing Talend Enterprise Data Integration

Installing Integration is straightforward. The user just needs to make sure that their system runs a supported

version of Java and then unzip the Talend files to a folder of their choice (e.g. c:\talend). The next step is to launch the executable file. The system will initially ask for a valid license (a license key is, of course, not required in the open source editions). Once the key has been validated, the user is shown a screen with some general license terms. After clicking to confirm, they can get started by creating a repository (with workspace) and setting up their first project. When they open this project, they are presented with the development tool's welcome screen, which guides them through the first steps.

Working with Talend Enterprise Data Integration

The design environment is based on the Eclipse platform. A repository on the left allows the user to define items like jobs, joblets and metadata. The jobs outline the operations, represented by icons, to be performed on the data. The metadata can be used to set up file, database and SAP connections, schemas, etc. And the joblets create reusable data integration task logic that can be factored into new jobs on a modular basis. The Code subfolder contains two interesting additional functionalities. Job scripts, on the one hand, are process descriptions, i.e. code generation instructions, in XML format. Job scripts can provide a full description of the processes, so users can implement functions for the import of table descriptions, for example. Routines, on the other hand, allow the user to define automated tasks like the splitting of fields.

The workspace is located in the top center of the screen. Here, the user can define jobs using icons. Underneath, the context-sensitive configuration options can be set for the selected icon. The options for starting and debugging jobs and lists of errors, messages and information are also found here. A panel on the right of the screen contains a Palette with the various components in icon form. These include the import and export connectors as well as the functions for editing data, executing commands and so on. Users can also integrate their own code into the system at any time. The Palette therefore holds all of the components that can be dragged to the workspace and dropped.

The first jobs

For the test, we carried out the first job at this point, importing a CSV file and then writing the data to an Excel spreadsheet. As we have already described this job in the introduction, we will move straight on to the job of writing the data from a CSV file to Hadoop. For this job, our source was the CSV file that was pre-defined under metadata with its configuration parameters like field separator and encoding. After dragging it to the workspace, we defined an export target. For this, we selected the tHDFSOutput component from the Big Data folder in the Palette and dragged the icon next to our source file. The HDFS in the name stands for Hadoop Distributed File System. The next task was to configure the output icon. After clicking the icon, we were able to enter the required settings in the Component tab below the workspace. These settings included the Hadoop version, server name, user account, target folder and the name of the target file in Hadoop. For our test, we again used a CSV file as our target. To finish, we had to create a connection between the two icons. We did this by right-clicking the source icon and dragging a line as far as the HDFS icon.
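For readers who want to see what such a transfer looks like outside the graphical designer, the following is a hedged Java sketch of the same idea using the plain Hadoop FileSystem API. The host name, user account and paths are placeholders, and the Talend component of course generates its own, more elaborate code.

// Sketch: copy a local CSV file into HDFS, the kind of transfer the
// tHDFSOutput component performs. Host, user and paths are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class CsvToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the name node as a specific Hadoop user (placeholder values).
        FileSystem fs = FileSystem.get(
                new URI("hdfs://hadoop-master:8020"), conf, "talend");

        Path localCsv = new Path("file:///c:/talend/data/customers.csv");
        Path hdfsTarget = new Path("/user/talend/customers.csv");

        // copyFromLocalFile(delSrc, overwrite, src, dst)
        fs.copyFromLocalFile(false, true, localCsv, hdfsTarget);
        System.out.println("Exists in HDFS: " + fs.exists(hdfsTarget));
        fs.close();
    }
}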

Once the connection was made, we were able to start the job (on the relevant tab below the workspace). When the task was completed, the Talend software showed the throughput in rows per second and the number of data sets transferred. We checked that the new file had actually arrived at its target destination using the Hadoop function "Browse the file system" at http://{name of Hadoop server}:50070. It took us less than five minutes to set up this job, and everything worked out of the box exactly as expected.

For the next part of our test, we wanted to read data from Hadoop. We started by selecting a component from the Palette called tHDFSInput, which was to be our source. We configured it using the same settings as for the target: server name, file name, etc. For the data output we added a tLogRow component, an easy way to export data from the data stream to the system console. As soon as we had created a connection between the two icons (as described above), we were able to start the job and look at the content of our original CSV file on the screen. Hadoop and the Integration solution made light work of the information import and export process, and users can check the status of their jobs at any time on the Hadoop web interface.

Working with the data

We carried out the two jobs described above to make sure that the Talend suite was able to communicate seamlessly with our Hadoop system. Satisfied that this was the case, we decided to analyze the data. The objective was to read a particular customer number from a customer file with ten million data sets. Hadoop functionality was extremely beneficial for this task. We used the Talend tool to create code which we transferred to the Hadoop system to carry out the data queries. We saved the query result as a file in Hadoop.

At this point we need to go into the technical details. Hadoop uses the MapReduce algorithm to perform computations on large volumes of data. This is a framework for the parallel processing of queries using computer clusters. MapReduce involves two steps, the first of which is the mapping: the master node takes the input, divides it into smaller sub-queries and distributes these to nodes in the cluster. The sub-nodes may then either split the queries among themselves, leading to a tree-like structure, or query the data for the answer and send it back to the master node. In the second step (reduce), the master node collects the answers and combines them to form the output, the answer to the original query. This parallelization of queries across multiple processors greatly improves process execution time.
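To make the two steps more tangible, here is a minimal MapReduce sketch in Java that counts how often each product name occurs in a semicolon-separated file, conceptually the same kind of aggregation we later ran through the Pig components. It uses the standard Hadoop mapreduce API; the column index, separator and paths are assumptions.

// Minimal MapReduce sketch: count occurrences of a product name column.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;

public class ProductCount {

    // Map step: emit (productName, 1) for every input row.
    public static class ProductMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(";");
            if (fields.length > 2) {
                ctx.write(new Text(fields[2].trim()), ONE); // assumed product column
            }
        }
    }

    // Reduce step: sum the counts collected for each product name.
    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) sum += v.get();
            ctx.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "product count");
        job.setJarByClass(ProductCount.class);
        job.setMapperClass(ProductMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/talend/customers.csv"));
        FileOutputFormat.setOutputPath(job, new Path("/user/talend/product_counts"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}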

The Pig platform is used to create MapReduce programs running on Hadoop. It is so named because its task is to find truffles in the data sets. The associated programming language is called Pig Latin. Special programs therefore need to be written to use MapReduce, and the Talend Enterprise Data Integration code generator makes this task much easier. It delivers various functions that allow users to define the data sources, queries and targets using the familiar icons in the design workspace, generate the code (e.g. MapReduce or Pig Latin), send it to the Hadoop environment and execute it there.

For the test, the first thing we did was to create a component called tPigLoad to load the data we wished to analyze. To this we assigned configuration parameters such as the Hadoop server name, Hadoop version, user account, the file to be analyzed and the schema that we had previously configured under metadata. We then created a tPigFilterRow component, specifying the value each field should have in order to execute the query. In relation to the schema, we would like to point out that since the source file consists of data like name, number, etc., Talend Enterprise Data Integration has to know which data belongs to which fields. The user can define the relevant fields as a schema under metadata and make this information available to the system. We defined the answer with an icon called tPigStoreResult, to which we added the target folder and the name of the result file. To finish, we created connections between the individual icons, not with the right-click action described earlier, but by right-clicking the relevant component and selecting the command Pig Combine in the Row menu. We did this because we wanted to create a script to be executed on the Hadoop system. We then started the job, and a short time later we were able to see the result on the Hadoop server's web interface, which was as expected. Our testing of the Pig components therefore ran completely smoothly.

For our next job, we wanted to use the entries in our customer file to determine the popularity of certain products. We started by copying the query job and replaced the tPigFilterRow component with an icon called tPigAggregate.
We specified that we wanted an output column called number and that the system should count all the product names in the database (these were indicated alongside the respective customer entries) and then write the names to a file, specifying the frequency of each name. We were able to see the result on our Hadoop server shortly after starting the job.

Working with Hive

Hive creates a JDBC connection to Hadoop with SQL. Developers can use Hive to query Hadoop systems with an SQL-like syntax, and a connection to a Hive database can be set up quickly and easily. To test Hive, we first created a new database connection under metadata to the customer database that was already in our Hadoop test system. All we had to do was select Hive as the database type, specify the server and port and click Check. On successful completion of the database connection test, we were able to see the connection in our Data Integration system and use it as an icon.

One of the configuration options for the Hive database connection is a Query field, where the user can type SQL queries. For our first query, we asked the customer database how many customers lived in Hannover. We started by entering "select count(*) from {database} where city like '%Hannover%'" in the database connection's query field. Again, we used a tLogRow component as output and created a connection between the two icons, which the system used to deliver the count value. Not long afterwards we were able to see the number of Hannover-based customers on the system console. So just like Pig, the Hive testing went extremely smoothly.
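The same count can of course also be issued from any JDBC client. The snippet below is a hedged illustration using Hive's standard JDBC driver; the driver class, URL scheme (jdbc:hive:// for the classic HiveServer, jdbc:hive2:// for HiveServer2), port, table name and credentials depend on the Hive version and setup and are assumptions here.

// Sketch: running the Hannover count over Hive's JDBC interface.
// Driver class, URL, port, table name and credentials are assumptions.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveHannoverCount {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://hadoop-master:10000/default", "talend", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "select count(*) from customers where city like '%Hannover%'")) {
            if (rs.next()) {
                System.out.println("Customers in Hannover: " + rs.getLong(1));
            }
        }
    }
}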

For our second Hive job, we attempted to write the entire database to an Excel table. First, we adapted the query in our source connection. Instead of the tLogRow component, we selected a tFileOutputExcel icon and specified the target path and the name of the target file. After that, we right-clicked to connect the two entries. Shortly after starting the job, we found all the required data in an Excel spreadsheet on our workstation. Hive is a very useful technology for SQL administrators, and it is extremely easy to use in conjunction with Talend Enterprise Data Integration.

HBase

HBase is a relatively easy-to-use, scalable database suited to managing large volumes of data in a Hadoop environment. Users rarely change the data in their HBase databases but frequently keep adding to them. To finish our test, we exported diverse data sets from our original CSV file to the HBase database on our Hadoop system and waited for the data to be output on our system console. We started by creating a new job, dragging the icon with the source CSV file to the workspace. We then used a tMap component to filter the data destined for the database from the file. Finally, we created a tHBaseOutput icon. To configure this icon, we had to enter the Hadoop version, server name and table name, and assign data to the relevant fields. When all the required connections had been made, we started the job and the data arrived in the database.

To check that everything had worked correctly, we extracted the data in the HBase environment to our system console. We did this with a component called tHBaseInput, configuring it in the same way as the output component. We completed the job configuration by adding a tLogRow icon and creating a connection between the two components. After starting the job, the data appeared on our screen as expected. Users of HBase can thus be reassured that their database works perfectly with Talend Enterprise Data Integration.

Summary

The Big Data Edition of Talend Enterprise Data Integration is a bridge between the data management tools of the past and the technology of tomorrow. Even without big data functionality, this solution provides an impressive range of functions for data integration, synchronization and transformation. And the big data functionality takes it into another league altogether. Pig support makes it easy for users to run distributed data queries in the cluster. Hive and HBase support mean that Integration can be deployed in just about any environment. Extensive data quality features and project management with a scheduling and monitoring framework round off the package. Integration works not only with the Apache Foundation's Hadoop distribution but also with solutions from Hortonworks, Cloudera, MapR and Greenplum. Database administrators and service providers will be hard pressed to find a better product.

Talend Enterprise Data Integration Big Data Edition
A powerful set of tools to access, transform, move and synchronize data, with Hadoop support.
Advantages:
- Lots of functions
- Easy to use
- Support for Hadoop, Pig, HBase and Hive
Further information: Talend, www.talend.com