The Power of Pentaho and Hadoop in Action: Demonstrating MapReduce Performance at Scale


Introduction

Over the last few years, Big Data has gone from a tech buzzword to a value generator for many organizations. At the forefront of this trend has been Hadoop, a common framework for distributed data storage and processing that is well suited to handling very large quantities of unstructured and semi-structured data from sources like web logs, social media, and industrial sensors. Forrester expects that two-thirds of enterprises will have deployed Hadoop or a related solution for at least one use case by the end of 2016.[1]

As mainstream enterprise interest in a technology grows, so does the demand for reliable scalability and performance. Large, complex organizations require that a system not only meet today's requirements but also accommodate future needs. For Big Data integration and analytics solutions like Pentaho, this means demonstrating an ability to perform efficiently even as data volumes rise very rapidly. This report discusses a recent scalability test conducted using Pentaho Data Integration to execute and orchestrate MapReduce jobs in Hadoop, with the purpose of demonstrating sustained performance at scale.

Pentaho and Hadoop

Pentaho provides an end-to-end data integration and business analytics platform that allows users to blend, orchestrate, and analyze data from a wide array of sources, including relational databases, Hadoop distributions, NoSQL stores, enterprise applications, and many others. Part of the broader platform, Pentaho Data Integration (PDI) comprises Pentaho's data integration engine and associated server and workstation tools. It provides a flexible, user-friendly interface for creating visual data flows that transform and integrate data. Pentaho began investing in the development of data integration solutions for Hadoop in early 2009 and delivered its first commercial Hadoop integration capabilities in 2010.
For reference, Cloudera, the first company formed to commercialize Hadoop, was founded in 2008. Today, Pentaho offers extensive native integration capabilities for Hadoop that facilitate data ingestion, execution of complex data transformations in-cluster via MapReduce, and management of clusters through the Hadoop YARN infrastructure.

The scalability tests focused on Pentaho Visual MapReduce, PDI's native ability to visually design Hadoop MapReduce jobs and run them in-cluster, leveraging Hadoop's distributed cache. This capability allows technical teams to accelerate time to value by avoiding complex and time-consuming hand coding of MapReduce jobs in Java. For example, in one use case a major financial institution was able to create Hadoop MapReduce jobs in Pentaho approximately 15 times faster than it could using a hand-coded approach.

[1] Forrester research report "Hadoop Ecosystem Overview, Q4 2014," Brian Hopkins; published 21 November 2014.
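The hand coding that Visual MapReduce replaces follows a fixed pattern: a mapper emits key/value pairs, the framework shuffles and sorts them by key, and a reducer aggregates each key's group. The following is a minimal local simulation of that pattern, written in Python rather than the Java the report mentions; the order records, field names, and sum aggregation are illustrative assumptions, not details taken from the test.

```python
from itertools import groupby
from operator import itemgetter

def mapper(record):
    """Emit (key, value) pairs; here, (product_id, line_item_amount)."""
    product_id, amount = record
    yield (product_id, amount)

def reducer(key, values):
    """Aggregate all values seen for one key; here, a simple sum."""
    yield (key, sum(values))

def run_job(records):
    # Map phase: apply the mapper to every input record
    mapped = [pair for rec in records for pair in mapper(rec)]
    # Shuffle/sort phase: group pairs by key, as Hadoop does between phases
    mapped.sort(key=itemgetter(0))
    # Reduce phase: feed each key's values to the reducer
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(key, [value for _, value in group]))
    return dict(results)

orders = [("A", 10.0), ("B", 5.0), ("A", 2.5)]
print(run_job(orders))  # {'A': 12.5, 'B': 5.0}
```

Even this toy version shows why visual design saves time: the map/shuffle/reduce plumbing is identical for every job, and only the mapper and reducer logic changes.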

The Scalability Tests

According to IDC, all data created and copied worldwide is expected to grow by 40% per year through 2020.[2] As such, enterprises want to be sure they can accommodate major growth in data volume as they evaluate their current environment and new Big Data technology options. In particular, a key challenge for parallel processing systems like Hadoop is to remain efficient as work demands increase. With this in mind, Pentaho wanted to demonstrate that Pentaho Visual MapReduce delivers consistently high performance when scaling to billions of rows and multiple terabytes (TB) of data processed across large Hadoop clusters.

The data set used in these tests represented a common pattern of customer data, including orders and detailed line items with date, time, and geographic attributes. It is similar to the type of data an enterprise might expect to offload from a large data warehouse into a Hadoop cluster to optimize its overall data infrastructure, storage costs, and processing performance. The tests were conducted using Pentaho Data Integration and a 129-node Cloudera Hadoop cluster deployed on Amazon Web Services cloud-based EC2 machines.

[Figure: Test case architecture diagram - a PDI client connected to an AWS EC2 cluster of 128 data nodes plus a Hadoop Resource Manager, running Hadoop MapReduce over HDFS.]

The tested Pentaho Visual MapReduce job included operations to filter, score, and rank the data. The overall job included two component MapReduce jobs, each comprising a Mapper and a Reducer transformation. A final transformation at the end of the job prepared a correlated top-10 output set. All jobs and transformations were created in PDI's visual drag-and-drop interface.
The data processed by PDI Visual MapReduce in the Hadoop cluster was increased by a factor of 8 over the course of the tests, with each trial approximately doubling the previous data volume processed, as shown in the following table:

Test Data Characteristics by Trial

Trial | Gigabytes (GB) | Table Rows | Orders      | Customers
1     | 471            | 3 Billion  | 750 Million | 75 Million
2     | 948            | 6 Billion  | 1.5 Billion | 150 Million
3     | 1,886          | 12 Billion | 3 Billion   | 300 Million
4     | 3,773          | 24 Billion | 6 Billion   | 600 Million

[2] IDC white paper "The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things," published April 2014.
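The doubling pattern in the table can be verified with a few lines of arithmetic. The figures below are copied directly from the table; the small deviations from exact 2x growth in gigabytes reflect rounding in the reported sizes.

```python
# (trial number, gigabytes processed, total table rows) from the table above
trials = [
    (1, 471, 3_000_000_000),
    (2, 948, 6_000_000_000),
    (3, 1886, 12_000_000_000),
    (4, 3773, 24_000_000_000),
]

# Each trial roughly doubles the previous data volume...
for (_, gb_prev, rows_prev), (_, gb, rows) in zip(trials, trials[1:]):
    print(f"GB growth: {gb / gb_prev:.2f}x, row growth: {rows / rows_prev:.0f}x")

# ...for an overall increase of ~8x between trials 1 and 4
gb_factor = trials[-1][1] / trials[0][1]
row_factor = trials[-1][2] / trials[0][2]
print(f"Overall: {gb_factor:.1f}x GB, {row_factor:.0f}x rows")
```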

Results

Pentaho Data Integration delivered consistent and efficient data processing rates of approximately 1 million table rows per second over the course of the four tests. In fact, while the data processed increased by a factor of approximately 8 between trials 1 and 4, Pentaho's processing rate actually became 20% faster over that interval. While we would not necessarily expect improved performance at higher and higher data volumes, the overall results demonstrate that Pentaho Visual MapReduce delivers sustained performance at scale in a large Hadoop cluster with several terabytes of data processed.

[Figure: Data processing rates - millions of rows per second plotted against billions of table rows processed per trial.]

Conclusion

The rapidly evolving Big Data ecosystem presents challenges in creating data processing solutions that can adapt and scale efficiently. Pentaho delivers a time-tested, cost-effective data and analytics platform that meets these challenges in a way that is efficient for organizations to implement and maintain. Pentaho Data Integration has shown its ability in this test case to complement large Hadoop clusters by sustaining a high rate of processing over increased dataset sizes of multiple terabytes. Overall, PDI gives the enterprise the ability to accelerate IT productivity in bringing Big Data projects into production, while ensuring that its environment is future-proofed against large and unexpected increases in data volume.
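As a back-of-envelope illustration, the headline figures in the results (roughly 1 million rows per second, with a 20% rate improvement by the largest trial) imply elapsed times on the order shown below. These times are inferred from the stated rates and row counts; they are not measurements reported in the test.

```python
# Assumed rates, consistent with the report's stated figures
base_rate = 1_000_000           # rows/s, approximate trial-1 rate
trial4_rate = base_rate * 1.20  # "20% faster" at the largest data volume

trial1_rows = 3_000_000_000     # from the trial table
trial4_rows = 24_000_000_000

t1 = trial1_rows / base_rate    # implied seconds for trial 1
t4 = trial4_rows / trial4_rate  # implied seconds for trial 4
print(f"Trial 1: ~{t1 / 60:.0f} min, Trial 4: ~{t4 / 3600:.1f} h")
```

The point of the exercise is the shape, not the exact numbers: an 8x larger workload finished in well under 8x the time, which is what "sustained performance at scale" amounts to in practice.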

Learn more about Pentaho Business Analytics: pentaho.com/contact, +1 (866) 660-7555.

Global Headquarters: Citadel International - Suite 340, 5950 Hazeltine National Drive, Orlando, FL 32822, USA. tel +1 407 812 6736, fax +1 407 517 4575
US & Worldwide Sales Office: 353 Sacramento Street, Suite 1500, San Francisco, CA 94111, USA. tel +1 415 525 5540, toll free +1 866 660 7555
United Kingdom, Rest of Europe, Middle East, Africa: London, United Kingdom. tel +44 (0) 20 3574 4790, toll free (UK) 0 800 680 0693
France: Offices - Paris, France. tel +33 97 51 82 296, toll free (France) 0800 915343
Germany, Austria, Switzerland: Offices - Munich, Germany. tel +49 (0) 322 2109 4279, toll free (Germany) 0800 186 0332
Belgium, Netherlands, Luxembourg: Offices - Antwerp, Belgium. tel (Netherlands) +31 8 58 880 585, toll free (Belgium) 0800 773 83
Italy, Spain, Portugal: Offices - Valencia, Spain. toll free (Italy) 800 798 217, toll free (Portugal) 800 180 060

Copyright 2015 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For the latest information, please visit our web site at pentaho.com.