Test report: Talend Enterprise Data Integration Big Data Edition
Data processing goes big
Dr. Götz Güttich
Talend Enterprise Data Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors, it can shift data from any number of different sources to just as many targets. The Big Data Edition of this solution is optimized to take advantage of Hadoop and its databases and technologies such as HBase, HCatalog, HDFS, Hive, Oozie and Pig. IAIT decided to put Talend Enterprise Data Integration through its paces and see how it stacks up in the real world.

When IT specialists talk about big data, they are usually referring to data sets that are so large and complex that they can no longer be processed with conventional data management tools. These huge volumes of data are produced for a variety of reasons. Streams of data can be generated automatically (reports, logs, camera footage, etc.). Or they could be the result of detailed analyses of customer behavior (consumption data), scientific investigations (the Large Hadron Collider is an apt example) or the consolidation of various data sources. These data repositories, which typically run into petabytes and exabytes, are hard to analyze because conventional database systems simply lack the muscle. Big data has to be analyzed in massively parallel environments where computing power is distributed over thousands of computers and the results are transferred to a central location.

The Hadoop open source platform has emerged as the preferred framework for analyzing big data. Its distributed file system splits the information into several data blocks and distributes these blocks across multiple systems in the network (the Hadoop cluster). By distributing computing power, Hadoop also ensures a high degree of availability and redundancy. A "master node" handles file storage as well as requests.
Hadoop is a very powerful computing platform for working with big data. It can accept external requests, distribute them to individual computers in the cluster and execute them in parallel on the individual nodes. The results are fed back to a central location where they can then be analyzed. However, to reap the benefits of Hadoop, data analysts need a way to load data into Hadoop and subsequently extract it from this open source system. This is where the Big Data Edition of Talend Enterprise Data Integration comes in. As previously mentioned, Talend Enterprise Data Integration can read data from just about any source, process the data as instructed by the user and then export it. Taking a simple job as an example, this tool can read CSV files, select certain fields like name or address, and export the results to an Excel file. But its capabilities don't end there: it also works with business intelligence solutions like Jaspersoft, SAP, Amazon RDS and Salesforce and supports a variety of databases such as DB2, Informix and, obviously, Hadoop.

[Screenshot: Users are presented with a welcome screen when they launch Talend Enterprise Data Integration]

How does it work? Talend Enterprise Data Integration is a code generator. All the user needs to do is define a data source (a CSV file or a database, for instance) and then specify the operation to be performed on that data. In the case of a CSV file, for example, the user will specify details like encoding method and field separator, although the options available will of course vary from source to source. As soon as the data source is set, the user can link it to their workspace as an icon. Next, the user can define their preferred workflow. They can manipulate the data in a number of ways: filter, sort, replace, transform, split, merge and convert. There is also a map function for transforming data. This lets users select specific data fields, rearrange these fields, automatically insert additional data like numbering and much more. Each of these transformation features comes with its own icon which can be dragged and dropped to the user's workspace and configured there.

Once the data source has been defined and the user has specified how the tool should process the information, the next task is to define the export settings. For this, Talend offers various connectors to supported target systems such as Informix or Hadoop. The tool uses icons for the connectors, too. Again, the user can simply drag and drop the icons and configure them on their workspace. Here again, the configuration options depend on the export medium. In the case of an Excel spreadsheet, for example, the user just needs to enter the export path. Lines are used to illustrate the flow of data between icons. Users can usually just drag these (sometimes they also need to select certain connection types from a menu). The job can be started once all of these steps have been completed. First, the software creates the code required to execute the job. Then it starts the job and transforms the data. Depending on the technology used, the generated code can be in Java or SQL; for Hadoop, it can be MapReduce, Pig Latin, HiveQL and other languages. Since each job step is symbolized by an icon for which users only have to specify the general requirements in order to generate the code automatically, users don't need programming skills to run complex, code-intensive data processing jobs.
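To make the idea of per-job code a little more concrete, here is a minimal, hand-written Java sketch of the kind of logic such a simple job encapsulates: read a semicolon-separated CSV file, keep only the name and address columns, and write the result to a new file. The file names, separator, encoding and column positions are illustrative assumptions, not output of the Talend generator.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SimpleCsvJob {
    public static void main(String[] args) throws IOException {
        String separator = ";";      // assumed field separator
        int nameColumn = 1;          // assumed position of the "name" field
        int addressColumn = 3;       // assumed position of the "address" field

        try (BufferedReader in = Files.newBufferedReader(
                     Paths.get("customers.csv"), StandardCharsets.ISO_8859_1);
             PrintWriter out = new PrintWriter(Files.newBufferedWriter(
                     Paths.get("customers_out.csv"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split(separator, -1);
                // Keep only the selected fields, mirroring the "select certain
                // fields like name or address" step described above.
                out.println(fields[nameColumn] + separator + fields[addressColumn]);
            }
        }
    }
}
```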
The above example is a simple one to illustrate the process, but Talend Enterprise Data Integration can handle much more complex tasks such as importing data and then mapping specific fields, transforming certain data types and sorting the modified output prior to export.

[Screenshot: When they use a CSV file as their source, users can configure all kinds of parameters]

Versions

Talend software is available in a number of editions, starting with the freely downloadable open source products Talend Open Studio for Data Integration and Talend Open Studio for Big Data. These are available on the vendor's website and are free to download and use. The subscription-based product is called Talend Enterprise Data Integration and is available in four editions: Team, Professional, Cluster and Big Data. The main difference between the subscription and open source products is extra support options (including SLAs) and additional features such as shared repositories, wizards, shared jobs, version control and reference projects, to mention but a few. The subscription versions themselves offer different combinations of features, including support for load balancing, high availability and Hadoop. Talend is in fact one of the first vendors to provide Hadoop support. The website provides a comparative matrix of the available software versions and their functions.

For our test, we looked at the Big Data Edition of Talend Enterprise Data Integration, which requires a subscription. It is fair to point out, however, that the open source product Talend Open Studio for Big Data has quite an extensive set of functions and would be perfectly suitable for all internal data transformation jobs. Administrators looking to fast-track import and export script writing should look at what the subscription products have to offer.

The test

For the test we used an environment with Hadoop running in a vSphere installation based on IBM's X-Architecture. After installing Talend Enterprise Data Integration on a workstation running the x64 edition of Windows 7 Ultimate, we started by importing data from a CSV file. We transformed the data and exported it as an Excel spreadsheet to become familiar with the solution's functionality. We then set up a connection to our Hadoop system, imported the same CSV data again and wrote it to Hadoop. Then we re-exported the data to check that everything had worked properly. For the next step, we took company data sets of 100,000 and ten million records and analyzed this data using Pig. Finally, we used Hive to access the data in Hadoop and worked with the HBase database. We will cover these terms in greater detail later on.

As for specifications, Talend Enterprise Data Integration runs on Java 1.6 or a later version. The vendor recommends the 64-bit version of Windows 7 as the operating system and a standard workstation with 4 GB of memory.

Installing Talend Enterprise Data Integration

Installing the software is straightforward. The user just needs to make sure that their system runs a supported version of Java and then unzip the Talend files to a folder of their choice (e.g. c:\talend).
The next step is to launch the executable file. The system will initially ask for a valid license (a license key is, of course, not required in the open source editions). Once the key has been validated, the user will see a screen with some general license terms. After clicking to confirm, users can get started by creating a repository (with workspace) and setting up their first project. When they open this project they will be presented with the development tool's welcome screen, which guides them through the first steps.

Working with Talend Enterprise Data Integration

The design environment is based on the Eclipse platform. A repository on the left allows the user to define items like jobs, joblets and metadata. The jobs outline the operations, represented by icons, to be performed on the data. The metadata can be used to set up file, database and SAP connections, schemas, etc. And the joblets create reusable data integration task logic that can be factored into new jobs on a modular basis. The Code subfolder contains two interesting additional functionalities. Job scripts, on the one hand, are process descriptions, i.e. code generation instructions in XML format. Job scripts can provide a full description of the processes, so users can implement functions for the import of table descriptions, for example. Routines, on the other hand, allow the user to define automated tasks like the splitting of fields.

The workspace is located in the top center of the screen. Here, the user can define jobs using icons. Underneath, the context-sensitive configuration options can be set for the selected icon. The options for starting and debugging jobs and lists of errors, messages and information are also found here. A panel on the right of the screen contains a Palette with the various components in icon form. These include the import and export connectors as well as the functions for editing data, executing commands and so on. Users can also integrate their own code into the system at any time. The Palette therefore holds all of the components that can be dragged to the workspace and dropped.

[Screenshot: Performing an analysis job with Pig, running here on the Hadoop server]

The first jobs

For the test, we carried out the first job at this point, importing a CSV file and then writing the data to an Excel spreadsheet. As we have already basically described this job in the introduction, we will move on to the job of writing the data from a CSV file to Hadoop. For this job, our source was the CSV file that had been pre-defined under metadata with its configuration parameters like field separator and encoding. After dragging it to the workspace, we defined an export target. For this, we selected the tHDFSOutput component from the Big Data folder in the Palette and dragged its icon next to our source file. The HDFS in the name stands for Hadoop Distributed File System. The next task was to configure the output icon. After clicking the icon, we were able to enter the required settings in the Component tab below the workspace. These settings included Hadoop version, server name, user account, target folder and the name of the target file in Hadoop. For our test, we again used a CSV file as our target. To finish, we had to create a connection between the two icons. We did this by right-clicking the source icon and dragging a line to the HDFS icon.
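Under the hood, a job step like this boils down to writing the local file into HDFS. The following is a minimal Java sketch of that operation using the standard Hadoop FileSystem API; the NameNode address, the paths and the property name (which differs between Hadoop versions) are assumptions for illustration, not settings taken from our test system.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CsvToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; on Hadoop 1.x the property is "fs.default.name".
        conf.set("fs.defaultFS", "hdfs://hadoop-server:8020");

        FileSystem fs = FileSystem.get(conf);
        try {
            // Copy the local CSV file into the target folder in HDFS,
            // mirroring what the tHDFSOutput step of the job does.
            fs.copyFromLocalFile(new Path("file:///C:/talend/customers.csv"),
                                 new Path("/user/talend/customers.csv"));
        } finally {
            fs.close();
        }
    }
}
```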
Once the connection was made, we were able to start the job (on the relevant tab below the workspace). When the task was completed, the Talend software showed the throughput in rows per second and the number of data sets transferred. We checked that the new file had actually arrived at its target destination using the Hadoop function "Browse the file system" on the Hadoop server's web interface. It took us less than five minutes to set up this job, and everything worked out of the box exactly as expected.

For the next part of our test, we wanted to read data from Hadoop. We started by selecting a component from the Palette called tHDFSInput, which was to be our source. We configured it using the same settings as for the target: server name, file name, etc. For the data output we added a tLogRow component, an easy way to export data from the data stream to the system console. As soon as we had created a connection between the two icons (as described above), we were able to start the job and look at the content of our original CSV file on the screen. Hadoop and the Talend solution made light work of the information import and export process.

[Screenshot: Users can check the status of their jobs at any time on the Hadoop web interface]

Working with the data

We carried out the two jobs described above to make sure that the Talend suite was able to communicate seamlessly with our Hadoop system. Satisfied that this was the case, we decided to analyze the data. The objective was to read a particular customer number from a customer file with ten million data sets. Hadoop's functionality was extremely beneficial for this task. We used the Talend tool to create code which we transferred to the Hadoop system to carry out the data queries. We saved the query result as a file in Hadoop.

At this point we need to go into the technical details. Hadoop uses the MapReduce algorithm to perform computations on large volumes of data. This is a framework for the parallel processing of queries using computer clusters. MapReduce involves two steps, the first of which is the mapping. This means that the master node takes the input, divides it into smaller sub-queries and distributes these to nodes in the cluster. The sub-nodes may then either split the queries among themselves, leading to a tree-like structure, or else query the database for the answer and send it back to the master node. In the second step (reduce), the master node collects the answers and combines them to form the output: the answer to the original query. This parallelization of queries across multiple processors greatly improves process execution time.
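To illustrate the two MapReduce steps, here is a hedged, hand-written Java sketch of a job similar to the product-count analysis described later: the map step emits each product name with a count of 1, and the reduce step sums those counts per name. The field positions, separator and paths are assumptions, and the sketch uses the Hadoop 2.x mapreduce API; a real Talend job generates this kind of code automatically.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ProductCount {

    // Map step: split each CSV line and emit (productName, 1).
    public static class ProductMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text product = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(";", -1);
            product.set(fields[2]);   // assumed position of the product name
            context.write(product, ONE);
        }
    }

    // Reduce step: sum the counts collected for each product name.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "product count");
        job.setJarByClass(ProductCount.class);
        job.setMapperClass(ProductMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/talend/customers.csv"));
        FileOutputFormat.setOutputPath(job, new Path("/user/talend/product_counts"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```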
[Screenshot: Pig Latin in action]

The Pig platform is used to create MapReduce programs running on Hadoop. It is so named because its task is to find truffles in the data sets. The associated programming language is called Pig Latin. Special programs therefore need to be written to use MapReduce, and the Talend Enterprise Data Integration code generator makes this task much easier. It delivers various functions that allow users to define the data sources, queries and targets using the familiar icons in the design workspace, generate the code (e.g. MapReduce or Pig Latin), send it to the Hadoop environment and execute it there.

For the test, the first thing we did was to create a component called tPigLoad to load the data we wished to analyze. To this we assigned configuration parameters such as the Hadoop server name, Hadoop version, user account, the file to be analyzed and the schema that we had previously configured under metadata. We then created a tPigFilterRow component, specifying the value each field should have in order to execute the query. In relation to the schema, we would like to point out that since the source file consists of data like name, number, etc., Talend Enterprise Data Integration has to know which data belongs to which fields. The user can define the relevant fields as a schema under metadata and make this information available to the system. We defined the answer with an icon called tPigStoreResult, to which we added the target folder and the name of the result file. To finish, we created connections between the individual icons, not with the right-click-and-drag action described earlier, but by right-clicking the relevant component and selecting the command Pig Combine in the Row menu. We did this because we wanted to create a script to be executed on the Hadoop system. We then started the job, and a short time later we were able to see the result on the Hadoop server's web interface, which was as expected. Our testing of the Pig components therefore ran completely smoothly.

For our next job, we wanted to use the entries in our customer file to determine the popularity of certain products. We started by copying the query job and replacing the tPigFilterRow component with an icon called tPigAggregate. We specified that we wanted an output column called number and that the system should count all the product names in the database (these were indicated alongside the respective customer entries) and then write the names to a file, specifying the frequency of each name. We were able to see the result on our Hadoop server shortly after starting the job.
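For readers who want to see what such a Pig job roughly looks like in code, here is a minimal sketch that runs the filter query by embedding Pig Latin statements in a Java program via Pig's PigServer class. The schema, field names, separator, cluster addresses and paths are illustrative assumptions, not the script that Talend actually generated in our test.

```java
import java.util.Properties;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class CustomerFilterJob {
    public static void main(String[] args) throws Exception {
        // Assumed cluster settings; with ExecType.LOCAL the script runs without a cluster.
        Properties props = new Properties();
        props.setProperty("fs.default.name", "hdfs://hadoop-server:8020");
        props.setProperty("mapred.job.tracker", "hadoop-server:8021");

        PigServer pig = new PigServer(ExecType.MAPREDUCE, props);

        // Load the customer file with an assumed schema (what tPigLoad configures).
        pig.registerQuery("customers = LOAD '/user/talend/customers.csv' "
                + "USING PigStorage(';') AS (number:long, name:chararray, city:chararray);");

        // Keep only the rows matching a particular customer number (tPigFilterRow).
        pig.registerQuery("result = FILTER customers BY number == 4711;");

        // Write the answer back to HDFS (tPigStoreResult).
        pig.store("result", "/user/talend/query_result");

        pig.shutdown();
    }
}
```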
Working with Hive

Hive makes Hadoop accessible via JDBC and SQL: developers can use it to query Hadoop systems with an SQL-like syntax. To test Hive, we first created, under metadata, a new database connection to the customer database that was already in our Hadoop test system. All we had to do was select Hive as the database type, specify the server and port and click Check. On successful completion of the database connection test, we were able to see the connection in our Data Integration system and use it as an icon.

[Screenshot: A connection to a Hive database can be set up quickly and easily]

One of the configuration options for the Hive database connection is a Query field, where the user can type SQL queries. For our first query, we asked the customer database how many customers lived in Hannover. We started by entering "select count(*) from {database} where city like '%Hannover%'" in the database connection's query field. Again, we used a tLogRow component as output and created a connection between the two icons, which the system used to deliver the count value. Not long afterwards we were able to see the number of Hannover-based customers on the system console. So just like Pig, the Hive testing went extremely smoothly.
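Because Hive speaks JDBC, the same count query can also be issued from a few lines of Java. The sketch below assumes the HiveServer JDBC driver of that era; the driver class, connection URL and table name are assumptions for illustration rather than the exact configuration of our test system.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveCountQuery {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer1 driver; HiveServer2 instead uses
        // org.apache.hive.jdbc.HiveDriver and a jdbc:hive2:// URL.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive://hadoop-server:10000/default", "", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "select count(*) from customers where city like '%Hannover%'")) {
            if (rs.next()) {
                System.out.println("Customers in Hannover: " + rs.getLong(1));
            }
        }
    }
}
```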
For our second Hive job, we attempted to write the entire database to an Excel table. First, we adapted the query in our source connection. Instead of the tLogRow component, we selected a tFileOutputExcel icon and specified the target path and the name of the target file. After that, we right-clicked to connect the two entries. Shortly after starting the job, we found all the required data in an Excel spreadsheet on our workstation. Hive is a very useful technology for SQL administrators, and it is extremely easy to use in conjunction with Talend Enterprise Data Integration.

[Screenshot: Data queries with an SQL-like language and Hive]

HBase

HBase is a relatively easy to use, scalable database suited to managing large volumes of data in a Hadoop environment. Users rarely change the data in their HBase databases but frequently keep adding data to them. To finish our test, we exported diverse data sets from our original CSV file to the HBase database on our Hadoop system and then had the data output on our system console. We started by creating a new job, dragging the icon with the source CSV file to the workspace. We then used a tMap component to filter the data destined for the database from the file. Finally, we created a tHBaseOutput icon. To configure this icon we had to enter the Hadoop version, server name and table name, and assign data to the relevant fields. When all the required connections had been made, we started the job and the data arrived in the database.

To check that everything had worked correctly, we extracted the data in the HBase environment to our system console. We did this with a component called tHBaseInput, configuring it in the same way as the output component. We completed the job configuration by adding a tLogRow icon and creating a connection between the two components. After starting the job, the data appeared on our screen as expected. Users of HBase can thus be reassured that their database works perfectly with Talend Enterprise Data Integration.

Summary

The Big Data Edition of Talend Enterprise Data Integration is a bridge between the data management tools of the past and the technology of tomorrow. Even without the big data functionality, this solution provides an impressive range of functions for data integration, synchronization and transformation. And the big data functionality takes it into another league altogether. Pig support makes it easy for users to run distributed data queries in the cluster. Hive and HBase support mean that the solution can be deployed in just about any environment. Extensive data quality features and project management with a scheduling and monitoring framework round off the package. The software works not only with the Apache Foundation's Hadoop distribution but also with solutions from Hortonworks, Cloudera, MapR and Greenplum. Database administrators and service providers will be hard pressed to find a better product.

Talend Enterprise Data Integration Big Data Edition
Powerful set of tools to access, transform, move and synchronize data, with Hadoop support.
Advantages: lots of functions, easy to use, support for Hadoop, Pig, HBase and Hive.
Further information: Talend