Performance and Scalability Overview

This guide provides an overview of some of the performance and scalability capabilities of the Pentaho Business Analytics platform.

PENTAHO PERFORMANCE ENGINEERING TEAM

Copyright 2015 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For the latest information, please visit our web site at pentaho.com.

Contents

Pentaho Scalability and High-Performance Architecture
Pentaho Business Analytics Server
64-bit Architecture Support
Clustering Support
Pentaho Reporting
Pentaho Analysis
In-Memory Caching Capabilities
Aggregate Table Support
Partitioning Support for High Cardinality Dimensionality
Pentaho Data Integration
Multi-threaded Architecture
Transformation Processing Engine
Clustering and Partitioning
Executing in Hadoop (Pentaho MapReduce)
Native Support for Big Data Sources including Hadoop, NoSQL and High-Performance Analytical Databases
Customer Examples and Use Cases

Pentaho Scalability and High-Performance Architecture

Business Analytics solutions are only valuable when they can be accessed and used by anyone, from anywhere and at any time. When selecting a business analytics platform, it is critical to assess the underlying architecture of the platform to ensure that it not only scales to the number of users and amount of data organizations have today, but also supports growing numbers of users and increasing data sizes into the future. By tightly coupling high-performance business intelligence with data integration in a single platform, Pentaho Business Analytics provides a scalable solution that can address enterprise requirements in organizations of all sizes. This guide provides an overview of just some of the performance tuning and scalability options available.

Pentaho Business Analytics Server

The Pentaho Business Analytics Server is a web application for creating, accessing and sharing reports, analysis and dashboards. It can be deployed in different configurations, from a single server node to a cluster of nodes distributed across multiple servers. There are a number of ways to increase performance and scalability:

> Deployment on 64-bit operating systems
> Clustering multiple server nodes (a minimal load balancer sketch follows this list)
> Optimizing the configuration of the Reporting and Analysis engines

[Figure: Pentaho Business Analytics platform architecture. DBA/ETL/BI developers, business users and data analysts work with Enterprise & Interactive Reporting, Interactive Analysis, Dashboards, Predictive Analysis, Direct Access, Data Integration & Data Quality and Visual MapReduce, over operational data, big data, data streams and public/private clouds.]
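Clustering, described in more detail in the next section, typically places the server nodes behind an HTTP load balancer with sticky sessions, for example Apache HTTPD. The following is a minimal, illustrative mod_proxy_balancer sketch; the host names, ports and the /pentaho context path are assumptions rather than a prescribed configuration:

    # Requires mod_proxy, mod_proxy_http, mod_proxy_balancer and a load-balancing method module
    <Proxy "balancer://pentaho-cluster">
        # One BalancerMember per Business Analytics Server node (deployed in Tomcat or JBoss)
        BalancerMember "http://node1.example.com:8080/pentaho" route=node1
        BalancerMember "http://node2.example.com:8080/pentaho" route=node2
        # Sticky sessions pin each user to the node that holds their HTTP session
        ProxySet stickysession=JSESSIONID|jsessionid
    </Proxy>
    ProxyPass        "/pentaho" "balancer://pentaho-cluster"
    ProxyPassReverse "/pentaho" "balancer://pentaho-cluster"

For stickiness to work with Tomcat-deployed nodes, each node's jvmRoute should match the corresponding route value above.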

Deployment on 64-bit Operating Systems

The Pentaho Business Analytics Server supports 64-bit operating systems, allowing larger amounts of server memory and vertical scalability for higher user and data volumes on a single server.

Clustering the Business Analytics Server

The Pentaho Business Analytics Server can effectively scale out to a cluster, or further to a cloud environment. Clusters are excellent for permanently expanding resources commensurate with increasing load; cloud computing is particularly useful if scaling out is only needed for specific periods of increased activity.

[Figure: Client requests (typically via web browser) pass through a load balancer, for example Apache HTTPD with sticky sessions, to a Pentaho BA Server cluster (deployed in Tomcat or JBoss) backed by the Business Analytics repository.]

Optimizing the Configuration of the Reporting and Analysis Engines

Pentaho Reporting

The Pentaho Reporting engine enables the retrieval, formatting and processing of information from a data source to generate user-readable output. One example of increasing the performance and scalability of Pentaho Reporting solutions is to take advantage of result set caching. When rendered, a parameterized report must account for every dataset required for every parameter, and every time a parameter field changes, every dataset is recalculated. This can negatively impact performance. Caching parameterized report result sets improves performance for larger datasets.

Pentaho Analysis

The Pentaho Analysis engine (Mondrian) creates an analysis schema, and forms data sets from that schema by using an MDX query. Maximizing performance and scalability always begins with the proper design and tuning of the source data. Once the database has been optimized, there are some additional areas within the Pentaho Analysis engine that can be tuned.

IN-MEMORY CACHING CAPABILITIES

Pentaho's in-memory caching capability enables ad hoc analysis of millions of rows of data in seconds. Pentaho's pluggable, in-memory architecture is integrated with popular open source caching platforms such as Infinispan and Memcached, which are used by many of the world's most popular social, ecommerce and multimedia websites.
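To make the Analysis workload concrete, below is a small illustrative MDX query of the kind the Mondrian engine parses, optimizes, translates to SQL and answers from its in-memory cache; the cube, hierarchy and measure names are hypothetical:

    SELECT
      {[Measures].[Sales], [Measures].[Quantity]} ON COLUMNS,
      NON EMPTY [Time].[2014].Children ON ROWS
    FROM [Sales]

Repeated and overlapping queries of this kind are the ones that benefit most from the in-memory cache described in the next section.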

[Figure: Mondrian's pluggable, in-memory caching architecture. A thin client for ad hoc analysis and data discovery sends MDX to the Mondrian server (MDX parser, query optimizer, SQL generation and an in-memory, pluggable cache backed by Infinispan or Memcached), which issues SQL over JDBC to a relational, MPP or columnar database holding the Sales fact table, the Time, Customer and Product dimensions, and a sales aggregate table.]

"We have operational metrics for six different businesses running in each of our senior care facilities that need to be retrieved and accessed every day by our corporate management, the individual facilities managers, as well as the line of business managers in a matter of seconds. Now, with the high performance in-memory analysis capabilities in the latest release of Pentaho Business Analytics, we can be more aggressive in rollouts adding more metrics to dashboards, giving dashboards and data analysis capabilities to more users, and see greater usage rates and more adoption of business analytics solutions."
BRANDON JACKSON, DIR. OF ANALYTICS AND FINANCE, STONEGATE SENIOR LIVING LLC.

In addition, Pentaho allows in-memory aggregation of data, where granular data can be rolled up to higher-level summaries entirely in memory, reducing the need to send new queries to the database. This results in even faster performance for more complex analytic queries.

AGGREGATE TABLE SUPPORT

When working with large data sets, properly creating and using aggregate tables greatly improves performance. An aggregate table coexists with the base fact table and contains pre-aggregated measures built from the fact table. Once registered in the schema, Pentaho Analysis can choose to use an aggregate table rather than the fact table, resulting in faster query performance.
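As a rough illustration of what such a pre-aggregation might look like in SQL, the sketch below rolls a hypothetical sales fact table up to year, month and product grain; the table and column names are assumptions, and the resulting table must still be registered in the Mondrian schema so the engine knows it may substitute it for the fact table:

    -- Illustrative aggregate of a sales fact table at year / month / product grain
    CREATE TABLE agg_sales_by_month_product AS
    SELECT
        t.the_year,
        t.the_month,
        f.product_id,
        SUM(f.quantity) AS quantity_sum,
        SUM(f.sales)    AS sales_sum,
        COUNT(*)        AS fact_count   -- number of fact rows collapsed into each aggregate row
    FROM sales_fact f
    JOIN time_by_day t ON t.time_id = f.time_id
    GROUP BY t.the_year, t.the_month, f.product_id;

Once registered, queries at month or product level can read a handful of pre-summed rows instead of scanning the full fact table.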

PARTITIONING SUPPORT FOR HIGH CARDINALITY DIMENSIONALITY

Large enterprise data warehouse deployments often contain attributes with tens or hundreds of thousands of unique members. For these use cases, the Pentaho Analysis engine can be configured to properly address a (partitioned) high-cardinality dimension. This streamlines SQL generation for partitioned tables; ultimately, only the relevant partitions are queried, which can greatly increase query performance.

Pentaho Data Integration

Pentaho Data Integration (PDI) is an extract, transform and load (ETL) solution that uses an innovative metadata-driven approach. It includes an easy-to-use, graphical design environment for building ETL jobs and transformations, resulting in faster development, lower maintenance costs, interactive debugging and simplified deployment. PDI's multi-threaded, scale-out architecture provides performance tuning and scalability options for handling even the most demanding ETL workloads.

MULTI-THREADED ARCHITECTURE

PDI's streaming engine architecture provides the ability to work with extremely large data volumes, and delivers enterprise-class performance and scalability with a broad range of deployment options, including dedicated, clustered and/or cloud-based ETL servers.

[Figure: Example of a data integration flow with multiple threads for a single step (Row Denormalizer): rows pass from an Import step through Sort into parallel copies of the Denormalizer step and on to a Group step.]

TRANSFORMATION PROCESSING ENGINE

Pentaho Data Integration's transformation processing engine starts and executes all steps within a transformation in parallel (multi-threaded), allowing maximum usage of available CPU resources. Done by default, this allows processing of an unlimited number of rows and columns in a streaming fashion. Furthermore, the engine is 100% metadata driven (no code generation), resulting in reduced deployment complexity. PDI also provides different processing engines that can be used to influence thread priority or limit execution to a single thread, which is useful for parallel performance tuning of large transformations.

The architecture allows both vertical and horizontal scaling. The engine executes tasks in parallel and across multiple CPUs on a single machine, as well as across multiple servers via clustering and partitioning. Additional tuning options include the ability to configure streaming buffer sizes, reduce internal data type conversions (lazy conversion), leverage high-performance non-blocking I/O (NIO) to read large blocks at a time and read files in parallel, and run multiple copies of a step to optimize usage of the Java Virtual Machine's multi-threading.
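The streaming, multi-threaded pattern described above can be illustrated outside of PDI with a minimal Java sketch: each step runs in its own thread and hands rows to the next step through a bounded buffer, so all steps work concurrently and memory use stays flat regardless of row count. This is a conceptual sketch under assumed names, not PDI's actual implementation; in PDI the buffer corresponds to the transformation's configurable row set size.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class StreamingPipelineSketch {
        // Bounded "row set" connecting two steps; the size bounds memory use
        static final int BUFFER_SIZE = 10_000;
        static final String END_OF_DATA = "__END__"; // marker telling the consumer to stop

        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<String> rowSet = new ArrayBlockingQueue<>(BUFFER_SIZE);

            // "Input" step: produces rows and blocks when the buffer is full (back-pressure)
            Thread inputStep = new Thread(() -> {
                try {
                    for (int i = 0; i < 1_000_000; i++) {
                        rowSet.put("row-" + i);
                    }
                    rowSet.put(END_OF_DATA);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            // "Transform" step: consumes rows as they arrive, in parallel with the producer
            Thread transformStep = new Thread(() -> {
                try {
                    long count = 0;
                    for (String row = rowSet.take(); !row.equals(END_OF_DATA); row = rowSet.take()) {
                        count++; // a real step would transform the row here
                    }
                    System.out.println("Processed " + count + " rows");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            inputStep.start();
            transformStep.start();
            inputStep.join();
            transformStep.join();
        }
    }

Running multiple copies of a step, as in the figure above, simply means several consumer threads draining the same buffer.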

[Figure: Clustering in Pentaho Data Integration. Source data (flat files, applications, databases) flows to a master node that distributes the workload across slave nodes working in parallel, which load the target database.]

CLUSTERING AND PARTITIONING

Pentaho Data Integration provides advanced clustering and partitioning capabilities that allow organizations to scale out their data integration deployments. Pentaho Data Integration clusters are built to increase the performance and throughput of data transformations; in particular, they perform classic divide-and-conquer processing of data sets in parallel.

PDI clusters have a strong master/slave topology: there is one master in the cluster, but there can be many slaves. This cluster scheme is used to distribute the ETL workload appropriately, in parallel, across these systems. Transformations are broken up according to the master/slave topology and deployed to all servers in the cluster, where each server runs a PDI engine that listens for, receives, executes and monitors transformations. It is also possible to define dynamic clusters, where the slave servers are only known at run time. This is very useful in cloud computing scenarios where hosts are added or removed at will. More information on this topic, including load statistics, can be found in an independent consulting white paper by Nick Goodman of Bayon Technologies, "Scaling Out Large Data Volume Processing in the Cloud or on Premise."
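In practice, the PDI engine on each cluster node typically runs as a Carte server process. A minimal sketch of bringing up one master and two slaves might look like the following; the host names and ports are illustrative, and how the slaves are referenced by the master is defined in the transformation's cluster schema and Carte configuration, not shown here:

    # Start the cluster master (Carte, PDI's lightweight server process)
    sh carte.sh master.example.com 8080

    # Start worker nodes on other hosts
    sh carte.sh slave1.example.com 8081
    sh carte.sh slave2.example.com 8082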

[Figure: Pentaho MapReduce example, executing Pentaho Data Integration inside a Hadoop cluster. The PDI engine (or PDI server) is shipped as a JAR to the Hadoop cluster; a mapper transformation (MapReduce input, parse log, combine year and month into the output key, MapReduce output) processes web logs, and a reducer transformation (MapReduce input, group on key field, MapReduce output) aggregates the results.]

EXECUTING IN HADOOP (PENTAHO MAPREDUCE)

Pentaho's Java-based data integration engine integrates with the Hadoop distributed cache for automatic deployment as a MapReduce task across every data node in a Hadoop cluster, leveraging Hadoop's massively parallel processing and high availability. For comparison, a rough hand-coded Java equivalent of the mapper shown in the figure is sketched at the end of this section.

NATIVE SUPPORT FOR BIG DATA SOURCES INCLUDING HADOOP, NOSQL AND HIGH-PERFORMANCE ANALYTICAL DATABASES

Pentaho supports native access, bulk loading and querying of a large number of databases, including:

> NoSQL data sources such as:
  > MongoDB
  > Cassandra
  > HBase
  > HPCC Systems
  > ElasticSearch

> Analytic databases such as:
  > HP Vertica
  > EMC Greenplum
  > HP NonStop SQL/MX
  > IBM Netezza
  > Infobright
  > Actian Vectorwise
  > LucidDB
  > MonetDB
  > Teradata

> Transactional databases such as:
  > MySQL
  > Postgres
  > Oracle
  > DB2
  > SQL Server
  > Teradata
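For comparison with the visual design approach, the sketch below is a rough hand-coded Java equivalent of the mapper the figure describes: parse a web log line and emit a combined year-month key. The log format, field positions and class names are assumptions; Pentaho MapReduce lets this logic be built as a transformation instead of written by hand.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Parse a web log line, combine year and month into the output key
    public class WebLogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text yearMonthKey = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumes a common log format whose 4th field is a timestamp like "[10/Jul/2014:22:01:15"
            String[] fields = line.toString().split(" ");
            if (fields.length > 3) {
                String[] dateParts = fields[3].replace("[", "").split("/|:");
                if (dateParts.length > 2) {
                    yearMonthKey.set(dateParts[2] + "-" + dateParts[1]); // e.g. "2014-Jul"
                    context.write(yearMonthKey, ONE);
                }
            }
        }
    }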

Customer Examples and Use Cases

Retail
Use case: Store Operations Dashboard
Data volume and type: 5+ TB in HP Neoview
Users: 1200 total; 200 concurrent

Telecom (B2C)
Use case: Customer Value Analysis
Data volume and type: 2+ TB in Greenplum
Users: <500 total; <25 concurrent

Social Networking
Use case: Website Activity Analysis
Data volume and type: 1 TB in Vectorwise; 10+ TB in a 20-node Hadoop cluster; loading 200,000 rows per second; 20 billion chat logs per month; 240 million user profiles
Users: >100,000 total; 3,000 concurrent

System Integration (Global SI)
Use case: Business Performance Metrics Dashboard
Data volume and type: 500 GB to 1 TB in an 8-node Greenplum cluster; 200 GB in Oracle

High-tech Manufacturing
Use case: Customer Service Management
Data volume and type: Cloudera Hadoop; loading 10 million records per hour; 650,000 XML documents per week (2 to 4 MB each); 100+ million devices dimension
Users: 200+ total; 120-200 concurrent today, with 50-100 more to be added

Stream Global (provider of sales, customer service and technical support for the Fortune 1000)
Use case: 10 Operational Dashboards
Data volume and type: data from 28 switches around the world; 12 source systems (e.g. Oracle HRMS, SAP, Salesforce.com); 20 million records per hour
Users: 49 locations across 22 countries

Sheetz
Data volume and type: 2+ TB in Teradata
Users: 80 total; 30 concurrent

Global Headquarters
Citadel International - Suite 340
5950 Hazeltine National Dr.
Orlando, FL 32822, USA
tel +1 407 812 6736
fax +1 407 517 4575

US & Worldwide Sales Office
353 Sacramento Street, Suite 1500
San Francisco, CA 94111, USA
tel +1 415 525 5540
toll free +1 866 660 7555

Learn more about Pentaho Business Analytics: pentaho.com/contact, +1 (866) 660-7555

UNITED KINGDOM, REST OF EUROPE, MIDDLE EAST, AFRICA
London, United Kingdom
tel +44 7711 104854
toll free (UK) 0 800 680 0693

FRANCE
Offices - Paris, France
tel +33 97 51 82 296
toll free (France) 0800 915343

GERMANY, AUSTRIA, SWITZERLAND
Offices - Frankfurt, Germany
tel +49 (0)89 / 37 41 40 81
toll free (Germany) 0800 186 0332

BELGIUM, NETHERLANDS, LUXEMBOURG
Offices - Antwerp, Belgium
tel +31 6 52 69 88 01
toll free (Belgium) 0800 773 83