The Business Intelligence for Hadoop Benchmark

Save this PDF as:
 WORD  PNG  TXT  JPG

Size: px
Start display at page:

Download "The Business Intelligence for Hadoop Benchmark"

Transcription

1 The Business Intelligence for Hadoop Benchmark Q1 2016

2 Table of Contents Hadoop as an Analytics Platform Executive Summary: Key Findings The Business Intelligence Evaluation Framework Benchmark Data Set Business Intelligence Query Types Testing Methodology Raw Data Set: Characteristics and Results Large Data Set: Key Findings and Observations AtScale Adaptive Cache: Characteristics and Query Results Concurrency Query Results Testing Environment and Resource Configurations Business Intelligence Benchmark Summary

3 Hadoop as an Analytics Platform The past decade has brought significant change to the way organizations store and process data. Much of this change is due to the innovation, adoption, and commercialization of Hadoop. The limitations of traditional data infrastructures have led many organizations to move to Hadoop for not only new data use cases, but for day-today operational workloads as well. As Hadoop matures, enterprises are starting to use this powerful platform to serve more diverse workloads. Hadoop is no longer just a batch-processing platform for data science and machine learning use cases it has evolved into a multi-purpose data platform for operational reporting, exploratory analysis, and real-time decision support. With the ongoing innovation of the SQL-on-Hadoop and in-memory data processing engines, Hadoop is now able to serve business-critical workloads in production. Hadoop is now ready to be the data source for business intelligence (BI) and online analytical processing (OLAP) workloads. According to the 2015 Hadoop Maturity Survey conducted by AtScale, Cloudera, Hortonworks, MapR, and Tableau, Business Intelligence (BI) is the top use case enterprises plan to migrate to Hadoop. In this research study, AtScale evaluated several SQL-on-Hadoop engines against a set of performance criteria that represent typical BI workloads and query patterns. Forrester predicts that 100% of enterprises will adopt Hadoop in the next 24 months 1

4 Executive Summary: Key Findings The goal of the Business Intelligence for Hadoop benchmark is to help technology evaluators select the best SQL-on-Hadoop engine for their use cases. Key findings include: Hadoop is a viable platform for BI: All tested engines successfully executed all of the queries in our benchmark suite and are stable enough to support business intelligence workloads. One engine does not fit all: Depending on raw data size, query complexity, and the target number of concurrent users, there is no one-size-fits-all engine. Each SQLon-Hadoop engine has its own strengths and challenges. Enterprises may consider deploying more than one engine to meet their company s goals. Performance differs for big and small query workloads: While all query engines successfully completed the query tests on large data sets (billions of rows), SparkSQL and Impala performed better on smaller data sets (thousands to a few million rows). Number of concurrent users affects performance: Impala showed the best performance with multiple concurrent users (over Hive and SparkSQL). Companies that anticipate connecting large numbers of business users to Hadoop may want to consider Impala. Innovation and feature support should be considered: Aside from performance, enterprises should consider the viability of the developer communities behind each engine. SparkSQL has frequent feature releases and consistent innovation. Cloudera donated the Impala project to the Apache Software Foundation this past November. AtScale expects these engines to continue to evolve. The majority of this benchmark study describes the testing methodology, datasets, and queries used to evaluate the Hive, Impala, and SparkSQL data processing engines. A technical summary of the benchmark results, as well as the system resource configurations can be found at the end of this paper. 2

5 The BI Benchmark Evaluation Framework AtScale considered the following key requirements when evaluating the SQL-on-Hadoop engines ability to satisfy Business Intelligence workloads: Performs on Big Data: the SQL-on-Hadoop engine must be able to consistently analyze billions or trillions of rows of data without generating errors and with response times on the order of 10s or 100s of seconds. Fast on Small Data: the engine must deliver interactive performance on known query patterns, and return results in a few seconds (or less) on small data sets (thousands to a few million rows). Stable for Many Users: the engine must be able to support concurrent queries issued by multiple users, and perform reliably under highly concurrent analysis workloads. AtScale believes that the above criteria represents the majority of requirements for running Business Intelligence (BI) workloads directly on data in Hadoop. This evaluation criteria was drawn from AtScale s experience working with a large number of companies in financial services, technology, retail, telecommunication, healthcare, and other industries. There are, of course, other aspects of the SQL-on-Hadoop engines that should be evaluated, such as SQL feature support, and performance for Data Science query workloads. However, the methodology behind the AtScale Business Intelligence benchmark focuses on traditional OLAP-style (Online Analytical Processing) queries that make extensive use of aggregation functions, GROUP BYs and WHERE clauses. AtScale provides a more exhaustive list of key requirements an organization should consider when architecting a BI on Hadoop solution. Please sign up here for a dedicated session to evaluate your company s unique requirements, and walk through the OLAP on Hadoop Selection Checklist. 3

6 Benchmark Data Set AtScale used the Star Schema Benchmark (SSB) data set to perform the tests in this benchmark study. You can find a detailed description of this data at: http: / This benchmark data set is based on the widely used TPCH data set, but has been modified to more accurately represent a typical business intelligence data model. This data set allowed AtScale to test queries across large tables, including a multi-billion row fact table and a high-cardinality dimension (over a billion unique values). The row counts for the tables used in the BI benchmarks are as follows: Star Schema Benchmark (SSB) Table Details Table Name Number of Rows Notes CUSTOMER_SMALL 30,000,000 CUSTOMER 1,050,000,000 Expanded customer dimension to test large joins LINEORDER 5,999,989,709 SUPPLIER 2,000,000 PART 2,000,000 DATE 16,799 5

7 Business Intelligence Query Types In order to truly simulate a Business Intelligence enterprise environment, AtScale tested 13 queries. The queries used for this benchmark can be summarized into the following high-level query patterns that test for data size, volume, and processing complexity: Q1.1 - Q1.3 are Quick Metric queries, which compute a particular metric value for a period of time. These queries have a small number of JOIN operations and minimal or no GROUP BY operations. Q2.1 - Q2.3 are Product Insight queries, which compute a metric (or several metrics) aggregated against a set of product and date dimension attributes. These queries include medium- sized JOIN operations and a small number of GROUP BY operations. Q3.1 - Q4.3 are Customer Insight queries, which compute a metric (or several metrics) aggregated against a set of product, customer, and date dimension attributes. These queries include both medium, large, and extra large JOIN operations plus several GROUP BY operations. 6

8 Testing Methodology AtScale ran a set of 13 queries against each engine. All queries were generated by Tableau and submitted via the AtScale Intelligence Platform. There may be small semantic differences between the queries submitted to each SQL-on-Hadoop engine, but otherwise all queries were identical in form. To find out more about the AtScale Intelligence Platform, go here. All 13 queries were executed twice: once against the raw data sets and once against the tables comprising AtScale s Adaptive Cache (aggregated data generated by the AtScale Engine based on the set of 13 queries issued). This benchmark will refer to the AtScale s Adaptive Cache tables as the aggregated data. Note: Each individual query was executed 10 times in succession. The numbers shown represent the average response time of the 10 queries. Benchmark Query Characteristics Query ID Number of Joins Largest Join Table Number of Group Bys Number of Filters Comments Q , range condition, 1 comparative filter condition directly on LINEORDER table Q , range filter conditions directly on LINEORDER table Q , range filter conditions directly on LINEORDER table, 2 conditions on joined table Q ,000, filter on p_category (less selective) Q ,000, filter on p_brand, 2 values (more selective) Q ,000, filter on p_brand, 1 value (most selective) Q ,050,000, filter on region (less selective) Q ,050,000, filter on nation (more selective) Q ,050,000, filter on city (most selective) Q ,050,000, filter on city (most selective) and month (vs. year) Q ,050,000, Q ,050,000, includes filter on year (more selective) Q ,050,000, includes filter on year and nation (most selective) 7

9 Raw Data Set: Characteristics and Results The table below shows the relative performance of Impala, SparkSQL, and Hive for the 13 benchmark queries executed against the 6 Billion-row LINEORDERS table. Benchmark Query Results for Large Tables Query Query Execution Time (in seconds) Impala 2.3 Spark-1.6 Hive 1.2 Q Q Q Q Q Q Q Q Q Q Q Q Q *Note: Fastest query time for each query is highlighted in a green 8

10 Large Data Set: Key Findings and Observations Based on the results of the Large Table Benchmarks, there are several key observations: No single SQL-on-Hadoop engine is best for ALL queries. Spark SQL and Impala tend to be faster than Hive, although for two queries (Q2.2, Q2.1) Hive is the fastest. For many queries, the performance difference between Impala and Spark SQL is relatively small. Increasing the number of JOIN operations generally increases query processing time. Going from Quick Metric to Product Insight queries (where the number of joins goes from 1 to 3) had the longest query time difference on Impala. However, as query selectivity increased, Impala performance increased significantly while SparkSQL and Hive performance remained relatively static. Increased query selectivity resulted in reduced query processing time. Selectivity involves reducing the amount of data requested using filters (WHERE clauses). Impala performance improved the most as query selectivity increased. JOIN operations between large tables increased processing time. As expected, joining two large tables was an expensive operation for all engines. This was particularly true when involving the 1 Billion-row CUSTOMERS table. SparkSQL performs best for multi-join queries. SparkSQL performance remains consistent as the number of joins increases, especially for less selective queries. The recent improvements in SparkSQL 1.6 contribute to SparkSQL s performance for multijoin queries. Benchmark Query Results for Large Tables 9

11 AtScale Adaptive Cache : Characteristics & Query Results Business Intelligence architectures rely on caching mechanisms in the form of in-memory stores, pre-materialized data structures, and aggregated data tables. As such, a thorough benchmark should evaluate performance against both the raw data and the accelerated data. AtScale re-executed the benchmark queries using the aggregated data tables in the AtScale Adaptive Cache. About AtScale Adaptive Cache Technology AtScale Adaptive Cache provides two key functions: Cache creation: After the first queries have been executed, the AtScale engine uses heuristic functions to determine if it is optimal to store partial query result sets to accelerate the processing of future queries. If so, one more aggregate tables may be created in the Adaptive Cache. Cache management: The AtScale engine manages the Adaptive Cache result sets to optimize for new queries, incorporate new data, and remove under-utilized aggregate tables. 10

12 Benchmark Query Results Using AtScale Adaptive Cache The following table shows the performance of the SQL-on-Hadoop engines against the accelerated aggregated data. Query Execution Time (in seconds) Query Impala 2.3 Spark-1.6 Hive 1.2 Q Q Q Q Q Q Q Q Q Q Q Q Q *Note: The fastest query time per query is highlighted in green. Benchmark Query Results for Large Tables 11

13 AtScale Adaptive Cache : Summary of Results The benchmark results indicate that both Impala and Spark SQL perform very well on the smaller aggregated data tables, effectively returning query results on the 6 Billion row data set with query response times ranging from under 300 milliseconds to several seconds. As expected, Hive queries against the aggregated data were slower, ranging from 3 to 15 seconds. Note that a future version of Hive will incorporate a project called LLAP (long-lived query execution) that is expected to significantly reduce Hive execution time on smaller data sets. These benchmark results also illustrate the significant performance gains that can be realized from AtScale s Adaptive Cache technology, with query performance for the same query improving by as much as 50X. In real-life deployments on larger data sets (on the order of 100s of Billions of rows) AtScale customers have seen performance gains over X as a result of this technology. 12

14 Concurrency Query Results Business Intelligence (BI) workloads often involve many data workers issuing queries at the same time. Concurrency is a key factor when evaluating performance of a BI platform. To test for performance against concurrent workloads, AtScale tested each benchmark query with 5,10, and 25 concurrent users. Each user session issued all 13 queries in succession against the aggregated data tables. Average Query Response Time for All Queries vs. User Concurrency As shown in the chart above, increasing the number of concurrent users increased the average response time for all of the SQL-on-Hadoop engines. While it is worth noting that all engines were able to scale to support up to 25+ concurrent users without a single query failure, we observed different behaviors from the various engines tested. Impala s performance scaled better than SparkSQL or Hive as the number of concurrent users increased. These results are consistent with AtScale s experience using Impala in real-world customer deployments. 13

15 Testing Environment and Resource Configurations In order to evaluate the base performance of the SQL-on-Hadoop engines. AtScale compared the performance of Cloudera Impala, Hive Tez, and SparkSQL running against a retail data set. The test environment was a 12 node Hadoop cluster with 1 master node,10 data nodes, and one dedicated AtScale gateway node. Benchmark Cluster Configuration Details RAM per node CPU specs for data (worker) nodes Storage specs for data (worker) nodes 128G 32 CPU cores 2x 512mb SSD The SQL-on-Hadoop engine benchmarks were performed by running each engine individually against the same Hadoop cluster configuration and data set. Details about engine-specific configurations and settings are below: Impala Configuration Impala Version Hive Version File Format Workers Memory Per Worker Parquet G 14

16 Cluster Environment and Engine Configurations (continued) SparkSQL Configuration Spark Version 1.6-SNAPSHOT Hive Version 1.2 File Format Parquet Workers 70 Memory per worker 14G Cores per worker 4 Hive Tez Configuration Tez Version 0.7 Hive Version 1.2 File Format ORC hive.tez.container.size 8192mb hive.cbo.enabled true hive.auto.convert.join.noconditionaltask.size NOTE: AtScale used the default configuration settings and virtual machine instances to execute this benchmark. These results should be considered an assessment of relative engine performance and not a rigorous performance evaluation of any single engine. 15

17 Business Intelligence Benchmark Summary AtScale s Business Intelligence Benchmark resulted in some valuable insights for enterprises looking to deploy a modern Business Intelligence solution using Hadoop. At a high level, this study revealed the following about the current state of the SQL-on- Hadoop engines: Different engines perform well for different types of queries: For large data sets Hive, Impala, and Spark SQL were all able to effectively complete a range of queries on over 6 Billion rows of data. The most performant engine for each query benchmark was dependent on the query characteristics (selectivity, JOIN size, number of GROUP BY operations, etc.). SparkSQL and Impala perform well on small data sets: SparkSQL and Impala both consistently deliver fast response times on small data sets (ranging from a few seconds to 100 milliseconds). Impala scales concurrently better than Hive and Spark: Impala performed best for concurrent user query workloads. It showed the least query degradation as concurrent query workload increased. SparkSQL and Impala have strong community support: AtScale noticed significant performance improvements between Spark 1.5 and Spark 1.6. An active open source ecosystem has contributed to the rapid innovation in this space. Cloudera s recent decision to donate Impala to the Apache Foundation is also a great move forward. The open source community accelerates the development and innovation of these SQL-on-Hadoop engines more than any single vendor can deliver. All engines are stable enough to support BI on Hadoop workloads: Although there are differences across the 3 engines tested, we found that Hive, Impala, and SparkSQL are all potential solutions to support BI on Hadoop workloads. A successful BI on Hadoop architecture will likely require more than one SQL-on- Hadoop engine: Each engine has its strengths: Impala offers the best concurrency scaling support, SparkSQL performs the best on large JOIN operations, and Hive remains consistent across diverse query workloads. Enterprises might consider using different engines for different query patterns. A query planning and optimization later, such as AtScale, can automatically route queries to the best query engine for the job without additional work on the client application side. Permission is granted to reproduce content, use or duplicate this material without express and written permission from this site s author and owner. Excerpts and links may be used, provided that full and clear credit is given to AtScale, Inc and with appropriate and specific direction to the original content. 16

AtScale Intelligence Platform

AtScale Intelligence Platform AtScale Intelligence Platform PUT THE POWER OF HADOOP IN THE HANDS OF BUSINESS USERS. Connect your BI tools directly to Hadoop without compromising scale, performance, or control. TURN HADOOP INTO A HIGH-PERFORMANCE

More information

Real-Time Data Analytics and Visualization

Real-Time Data Analytics and Visualization Real-Time Data Analytics and Visualization Making the leap to BI on Hadoop Predictive Analytics & Business Insights 2015 February 9, 2015 David P. Mariani CEO, AtScale, Inc. THE TRUTH ABOUT DATA We think

More information

The Internet of Things and Big Data: Intro

The Internet of Things and Big Data: Intro The Internet of Things and Big Data: Intro John Berns, Solutions Architect, APAC - MapR Technologies April 22 nd, 2014 1 What This Is; What This Is Not It s not specific to IoT It s not about any specific

More information

Integrating Cloudera and SAP HANA

Integrating Cloudera and SAP HANA Integrating Cloudera and SAP HANA Version: 103 Table of Contents Introduction/Executive Summary 4 Overview of Cloudera Enterprise 4 Data Access 5 Apache Hive 5 Data Processing 5 Data Integration 5 Partner

More information

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016 Big Data Approaches Making Sense of Big Data Ian Crosland Jan 2016 Accelerate Big Data ROI Even firms that are investing in Big Data are still struggling to get the most from it. Make Big Data Accessible

More information

Tableau Server 7.0 scalability

Tableau Server 7.0 scalability Tableau Server 7.0 scalability February 2012 p2 Executive summary In January 2012, we performed scalability tests on Tableau Server to help our customers plan for large deployments. We tested three different

More information

Microsoft Analytics Platform System. Solution Brief

Microsoft Analytics Platform System. Solution Brief Microsoft Analytics Platform System Solution Brief Contents 4 Introduction 4 Microsoft Analytics Platform System 5 Enterprise-ready Big Data 7 Next-generation performance at scale 10 Engineered for optimal

More information

Next-Gen Big Data Analytics using the Spark stack

Next-Gen Big Data Analytics using the Spark stack Next-Gen Big Data Analytics using the Spark stack Jason Dai Chief Architect of Big Data Technologies Software and Services Group, Intel Agenda Overview Apache Spark stack Next-gen big data analytics Our

More information

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect A very short talk about Apache Kylin Business Intelligence meets Big Data Fabian Wilckens EMEA Solutions Architect 1 The challenge today 2 Very quickly: OLAP Online Analytical Processing How many beers

More information

Apache Kylin Introduction Dec 8, 2014 @ApacheKylin

Apache Kylin Introduction Dec 8, 2014 @ApacheKylin Apache Kylin Introduction Dec 8, 2014 @ApacheKylin Luke Han Sr. Product Manager lukhan@ebay.com @lukehq Yang Li Architect & Tech Leader yangli9@ebay.com Agenda What s Apache Kylin? Tech Highlights Performance

More information

An Open Source Memory-Centric Distributed Storage System

An Open Source Memory-Centric Distributed Storage System An Open Source Memory-Centric Distributed Storage System Haoyuan Li, Tachyon Nexus haoyuan@tachyonnexus.com September 30, 2015 @ Strata and Hadoop World NYC 2015 Outline Open Source Introduction to Tachyon

More information

Reference Architecture, Requirements, Gaps, Roles

Reference Architecture, Requirements, Gaps, Roles Reference Architecture, Requirements, Gaps, Roles The contents of this document are an excerpt from the brainstorming document M0014. The purpose is to show how a detailed Big Data Reference Architecture

More information

The Future of Data Management

The Future of Data Management The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah (@awadallah) Cofounder and CTO Cloudera Snapshot Founded 2008, by former employees of Employees Today ~ 800 World Class

More information

Actian SQL in Hadoop Buyer s Guide

Actian SQL in Hadoop Buyer s Guide Actian SQL in Hadoop Buyer s Guide Contents Introduction: Big Data and Hadoop... 3 SQL on Hadoop Benefits... 4 Approaches to SQL on Hadoop... 4 The Top 10 SQL in Hadoop Capabilities... 5 SQL in Hadoop

More information

Self-service BI for big data applications using Apache Drill

Self-service BI for big data applications using Apache Drill Self-service BI for big data applications using Apache Drill 2015 MapR Technologies 2015 MapR Technologies 1 Data Is Doubling Every Two Years Unstructured data will account for more than 80% of the data

More information

TE's Analytics on Hadoop and SAP HANA Using SAP Vora

TE's Analytics on Hadoop and SAP HANA Using SAP Vora TE's Analytics on Hadoop and SAP HANA Using SAP Vora Naveen Narra Senior Manager TE Connectivity Santha Kumar Rajendran Enterprise Data Architect TE Balaji Krishna - Director, SAP HANA Product Mgmt. -

More information

Cisco Data Preparation

Cisco Data Preparation Data Sheet Cisco Data Preparation Unleash your business analysts to develop the insights that drive better business outcomes, sooner, from all your data. As self-service business intelligence (BI) and

More information

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics ESSENTIALS EMC ISILON Use the industry's first and only scale-out NAS solution with native Hadoop

More information

MapR: Best Solution for Customer Success

MapR: Best Solution for Customer Success 2015 MapR Technologies 2015 MapR Technologies 1 MapR: Best Solution for Customer Success Best Product High Growth 700+ Customers Premier Investors Apache Open Source 2X 2X Growth In Direct Customers Growth

More information

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future

More information

The Inside Scoop on Hadoop

The Inside Scoop on Hadoop The Inside Scoop on Hadoop Orion Gebremedhin National Solutions Director BI & Big Data, Neudesic LLC. VTSP Microsoft Corp. Orion.Gebremedhin@Neudesic.COM B-orgebr@Microsoft.com @OrionGM The Inside Scoop

More information

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect Big Data & QlikView Democratizing Big Data Analytics David Freriks Principal Solution Architect TDWI Vancouver Agenda What really is Big Data? How do we separate hype from reality? How does that relate

More information

Impact of Big Data growth On Transparent Computing

Impact of Big Data growth On Transparent Computing Impact of Big Data growth On Transparent Computing Michael A. Greene Intel Vice President, Software and Services Group, General Manager, System Technologies and Optimization 1 Transparent Computing (TC)

More information

Dell* In-Memory Appliance for Cloudera* Enterprise

Dell* In-Memory Appliance for Cloudera* Enterprise Built with Intel Dell* In-Memory Appliance for Cloudera* Enterprise Find out what faster big data analytics can do for your business The need for speed in all things related to big data is an enormous

More information

Hadoop vs Apache Spark

Hadoop vs Apache Spark Innovate, Integrate, Transform Hadoop vs Apache Spark www.altencalsoftlabs.com Introduction Any sufficiently advanced technology is indistinguishable from magic. said Arthur C. Clark. Big data technologies

More information

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce Analytics in the Cloud Peter Sirota, GM Elastic MapReduce Data-Driven Decision Making Data is the new raw material for any business on par with capital, people, and labor. What is Big Data? Terabytes of

More information

III Big Data Technologies

III Big Data Technologies III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution

More information

Monitor and Manage Your MicroStrategy BI Environment Using Enterprise Manager and Health Center

Monitor and Manage Your MicroStrategy BI Environment Using Enterprise Manager and Health Center Monitor and Manage Your MicroStrategy BI Environment Using Enterprise Manager and Health Center Presented by: Dennis Liao Sales Engineer Zach Rea Sales Engineer January 27 th, 2015 Session 4 This Session

More information

PAXATA DATA PREPARATION PERFORMANCE BENCHMARKING SPRING 15 RELEASE

PAXATA DATA PREPARATION PERFORMANCE BENCHMARKING SPRING 15 RELEASE PAXATA DATA PREPARATION PERFORMANCE BENCHMARKING SPRING 15 RELEASE February 2015 Page 1 Table of Contents Introduction... 3 Paxata Technology Stack... 3 The user interface layer... 4 Data preparation application

More information

A Reference Architecture for Next Generation Big Data and Analytics

A Reference Architecture for Next Generation Big Data and Analytics A Reference Architecture for Next Generation Big Data and Analytics A Reference Architecture for Next Generation Big Data and Analytics 2 CONTENTS Executive Summary 3 Introduction 4 Current State of Hadoop

More information

Self-service BI for big data applications using Apache Drill

Self-service BI for big data applications using Apache Drill Self-service BI for big data applications using Apache Drill 2015 MapR Technologies 2015 MapR Technologies 1 Management - MCS MapR Data Platform for Hadoop and NoSQL APACHE HADOOP AND OSS ECOSYSTEM Batch

More information

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools

More information

THE DEVELOPER GUIDE TO BUILDING STREAMING DATA APPLICATIONS

THE DEVELOPER GUIDE TO BUILDING STREAMING DATA APPLICATIONS THE DEVELOPER GUIDE TO BUILDING STREAMING DATA APPLICATIONS WHITE PAPER Successfully writing Fast Data applications to manage data generated from mobile, smart devices and social interactions, and the

More information

Driving Peak Performance. 2013 IBM Corporation

Driving Peak Performance. 2013 IBM Corporation Driving Peak Performance 1 Session 2: Driving Peak Performance Abstract We know you want the fastest performance possible for your deployments, and yet that relies on many choices across data storage,

More information

Hadoop-BAM and SeqPig

Hadoop-BAM and SeqPig Hadoop-BAM and SeqPig Keijo Heljanko 1, André Schumacher 1,2, Ridvan Döngelci 1, Luca Pireddu 3, Matti Niemenmaa 1, Aleksi Kallio 4, Eija Korpelainen 4, and Gianluigi Zanetti 3 1 Department of Computer

More information

Use of Hadoop File System for Nuclear Physics Analyses in STAR

Use of Hadoop File System for Nuclear Physics Analyses in STAR 1 Use of Hadoop File System for Nuclear Physics Analyses in STAR EVAN SANGALINE UC DAVIS Motivations 2 Data storage a key component of analysis requirements Transmission and storage across diverse resources

More information

Quantcast Petabyte Storage at Half Price with QFS!

Quantcast Petabyte Storage at Half Price with QFS! 9-131 Quantcast Petabyte Storage at Half Price with QFS Presented by Silvius Rus, Director, Big Data Platforms September 2013 Quantcast File System (QFS) A high performance alternative to the Hadoop Distributed

More information

QlikView, Creating Business Discovery Application using HDP V1.0 March 13, 2014

QlikView, Creating Business Discovery Application using HDP V1.0 March 13, 2014 QlikView, Creating Business Discovery Application using HDP V1.0 March 13, 2014 Introduction Summary Welcome to the QlikView (Business Discovery Tools) tutorials developed by Qlik. The tutorials will is

More information

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Important Notice 2010-2015 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, Impala, and

More information

Does Your SQL-on-Hadoop Solution Make The Grade?

Does Your SQL-on-Hadoop Solution Make The Grade? Does Your SQL-on-Hadoop Solution Make The Grade? Finding the right SQL Engine for your Hadoop initiatives is not an easy task. With a plethora of options and similar capabilities at the outset, you need

More information

Platfora Big Data Analytics

Platfora Big Data Analytics Platfora Big Data Analytics ISV Partner Solution Case Study and Cisco Unified Computing System Platfora, the leading enterprise big data analytics platform built natively on Hadoop and Spark, delivers

More information

Understanding the Value of In-Memory in the IT Landscape

Understanding the Value of In-Memory in the IT Landscape February 2012 Understing the Value of In-Memory in Sponsored by QlikView Contents The Many Faces of In-Memory 1 The Meaning of In-Memory 2 The Data Analysis Value Chain Your Goals 3 Mapping Vendors to

More information

Why Big Data in the Cloud?

Why Big Data in the Cloud? Have 40 Why Big Data in the Cloud? Colin White, BI Research January 2014 Sponsored by Treasure Data TABLE OF CONTENTS Introduction The Importance of Big Data The Role of Cloud Computing Using Big Data

More information

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce

More information

Hadoop and Data Warehouse Friends, Enemies or Profiteers? What about Real Time?

Hadoop and Data Warehouse Friends, Enemies or Profiteers? What about Real Time? Hadoop and Data Warehouse Friends, Enemies or Profiteers? What about Real Time? Kai Wähner kwaehner@tibco.com @KaiWaehner www.kai-waehner.de Disclaimer! These opinions are my own and do not necessarily

More information

Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA. by Christian Tzolov @christzolov

Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA. by Christian Tzolov @christzolov Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA by Christian Tzolov @christzolov Whoami Christian Tzolov Technical Architect at Pivotal, BigData, Hadoop, SpringXD,

More information

Virtualizing Apache Hadoop. June, 2012

Virtualizing Apache Hadoop. June, 2012 June, 2012 Table of Contents EXECUTIVE SUMMARY... 3 INTRODUCTION... 3 VIRTUALIZING APACHE HADOOP... 4 INTRODUCTION TO VSPHERE TM... 4 USE CASES AND ADVANTAGES OF VIRTUALIZING HADOOP... 4 MYTHS ABOUT RUNNING

More information

HDP Enabling the Modern Data Architecture

HDP Enabling the Modern Data Architecture HDP Enabling the Modern Data Architecture Herb Cunitz President, Hortonworks Page 1 Hortonworks enables adoption of Apache Hadoop through HDP (Hortonworks Data Platform) Founded in 2011 Original 24 architects,

More information

Ani Jain Senior Product Marketing Manager

Ani Jain Senior Product Marketing Manager System Monitoring with Operations Manager & Enterprise Manager: How to leverage robust operational statistics to maximize the success of your MicroStrategy implementation Ani Jain Senior Product Marketing

More information

Ubuntu and Hadoop: the perfect match

Ubuntu and Hadoop: the perfect match WHITE PAPER Ubuntu and Hadoop: the perfect match February 2012 Copyright Canonical 2012 www.canonical.com Executive introduction In many fields of IT, there are always stand-out technologies. This is definitely

More information

ORACLE BUSINESS INTELLIGENCE, ORACLE DATABASE, AND EXADATA INTEGRATION

ORACLE BUSINESS INTELLIGENCE, ORACLE DATABASE, AND EXADATA INTEGRATION ORACLE BUSINESS INTELLIGENCE, ORACLE DATABASE, AND EXADATA INTEGRATION EXECUTIVE SUMMARY Oracle business intelligence solutions are complete, open, and integrated. Key components of Oracle business intelligence

More information

Testing 3Vs (Volume, Variety and Velocity) of Big Data

Testing 3Vs (Volume, Variety and Velocity) of Big Data Testing 3Vs (Volume, Variety and Velocity) of Big Data 1 A lot happens in the Digital World in 60 seconds 2 What is Big Data Big Data refers to data sets whose size is beyond the ability of commonly used

More information

How Companies are! Using Spark

How Companies are! Using Spark How Companies are! Using Spark And where the Edge in Big Data will be Matei Zaharia History Decreasing storage costs have led to an explosion of big data Commodity cluster software, like Hadoop, has made

More information

How to Navigate Big Data with Ad Hoc Visual Data Discovery Data technologies are rapidly changing, but principles of 30 years ago still apply today

How to Navigate Big Data with Ad Hoc Visual Data Discovery Data technologies are rapidly changing, but principles of 30 years ago still apply today How to Navigate Big Data with Ad Hoc Visual Data Discovery Data technologies are rapidly changing, but principles of 30 years ago still apply today INTRODUCTION Data is the heart of TIBCO Spotfire. It

More information

locuz.com Big Data Services

locuz.com Big Data Services locuz.com Big Data Services Big Data At Locuz, we help the enterprise move from being a data-limited to a data-driven one, thereby enabling smarter, faster decisions that result in better business outcome.

More information

Big Data Patterns. Ron Bodkin Founder and President, Think Big

Big Data Patterns. Ron Bodkin Founder and President, Think Big Big Data Patterns Ron Bodkin Founder and President, Think Big 1 About Me Ron Bodkin Founder and President, Think Big I have 9 years experience working with Big Data and Hadoop. In 2010, I founded Think

More information

Dominik Wagenknecht Accenture

Dominik Wagenknecht Accenture Dominik Wagenknecht Accenture Improving Mainframe Performance with Hadoop October 17, 2014 Organizers General Partner Top Media Partner Media Partner Supporters About me Dominik Wagenknecht Accenture Vienna

More information

Oracle Big Data SQL Technical Update

Oracle Big Data SQL Technical Update Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical

More information

Beyond Web Application Log Analysis using Apache TM Hadoop. A Whitepaper by Orzota, Inc.

Beyond Web Application Log Analysis using Apache TM Hadoop. A Whitepaper by Orzota, Inc. Beyond Web Application Log Analysis using Apache TM Hadoop A Whitepaper by Orzota, Inc. 1 Web Applications As more and more software moves to a Software as a Service (SaaS) model, the web application has

More information

Hortonworks and ODP: Realizing the Future of Big Data, Now Manila, May 13, 2015

Hortonworks and ODP: Realizing the Future of Big Data, Now Manila, May 13, 2015 Hortonworks and ODP: Realizing the Future of Big Data, Now Manila, May 13, 2015 We Do Hadoop Fall 2014 Page 1 HDP delivers a comprehensive data management platform GOVERNANCE Hortonworks Data Platform

More information

Maximize MicroStrategy Speed and Throughput with High Performance Tuning

Maximize MicroStrategy Speed and Throughput with High Performance Tuning Maximize MicroStrategy Speed and Throughput with High Performance Tuning Jochen Demuth, Director Partner Engineering Maximize MicroStrategy Speed and Throughput with High Performance Tuning Agenda 1. Introduction

More information

Big Data in Cloud. Round table

Big Data in Cloud. Round table Big Data in Cloud Round table Big Data Management Introduction Traditional DI Environments Start simple. Build as you grow Big Data World! Sentry Kerberos Knox Ranger Map Reduce Spark Stream Exec Engines

More information

HiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group

HiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group HiBench Introduction Carson Wang (carson.wang@intel.com) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is

More information

Enable your Modern Data Architecture by delivering Enterprise Apache Hadoop

Enable your Modern Data Architecture by delivering Enterprise Apache Hadoop Modern Data Architecture with Enterprise Apache Hadoop Hortonworks. We do Hadoop. Jeff Markham Technical Director, APAC jmarkham@hortonworks.com Page 1 Our Mission: Enable your Modern Data Architecture

More information

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Important Notice 2010-2016 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, Impala, and

More information

hmetrix Revolutionizing Healthcare Analytics with Vertica & Tableau

hmetrix Revolutionizing Healthcare Analytics with Vertica & Tableau Powered by Vertica Solution Series in conjunction with: hmetrix Revolutionizing Healthcare Analytics with Vertica & Tableau The cost of healthcare in the US continues to escalate. Consumers, employers,

More information

Non-Stop Hadoop Paul Scott-Murphy VP Field Techincal Service, APJ. Cloudera World Japan November 2014

Non-Stop Hadoop Paul Scott-Murphy VP Field Techincal Service, APJ. Cloudera World Japan November 2014 Non-Stop Hadoop Paul Scott-Murphy VP Field Techincal Service, APJ Cloudera World Japan November 2014 WANdisco Background WANdisco: Wide Area Network Distributed Computing Enterprise ready, high availability

More information

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances INSIGHT Oracle's All- Out Assault on the Big Data Market: Offering Hadoop, R, Cubes, and Scalable IMDB in Familiar Packages Carl W. Olofson IDC OPINION Global Headquarters: 5 Speen Street Framingham, MA

More information

Einsatzfelder von IBM PureData Systems und Ihre Vorteile.

Einsatzfelder von IBM PureData Systems und Ihre Vorteile. Einsatzfelder von IBM PureData Systems und Ihre Vorteile demirkaya@de.ibm.com Agenda Information technology challenges PureSystems and PureData introduction PureData for Transactions PureData for Analytics

More information

Best Practices for Hadoop Data Analysis with Tableau

Best Practices for Hadoop Data Analysis with Tableau Best Practices for Hadoop Data Analysis with Tableau September 2013 2013 Hortonworks Inc. http:// Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Apache Hadoop with Hortonworks

More information

Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia

Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing

More information

Enterprise Performance Tuning: Best Practices with SQL Server 2008 Analysis Services. By Ajay Goyal Consultant Scalability Experts, Inc.

Enterprise Performance Tuning: Best Practices with SQL Server 2008 Analysis Services. By Ajay Goyal Consultant Scalability Experts, Inc. Enterprise Performance Tuning: Best Practices with SQL Server 2008 Analysis Services By Ajay Goyal Consultant Scalability Experts, Inc. June 2009 Recommendations presented in this document should be thoroughly

More information

Running Analytics on SAP HANA and BW with MicroStrategy

Running Analytics on SAP HANA and BW with MicroStrategy Running Analytics on SAP HANA and BW with MicroStrategy Presented by: Trishla Maru Agenda Overview Relationship and Certification with SAP Integration to SAP BW Overview with SAP BW Import process and

More information

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica

More information

Why DBMSs Matter More than Ever in the Big Data Era

Why DBMSs Matter More than Ever in the Big Data Era E-PAPER FEBRUARY 2014 Why DBMSs Matter More than Ever in the Big Data Era Having the right database infrastructure can make or break big data analytics projects. TW_1401138 Big data has become big news

More information

Introducing Oracle Exalytics In-Memory Machine

Introducing Oracle Exalytics In-Memory Machine Introducing Oracle Exalytics In-Memory Machine Jon Ainsworth Director of Business Development Oracle EMEA Business Analytics 1 Copyright 2011, Oracle and/or its affiliates. All rights Agenda Topics Oracle

More information

Analytics With Hadoop. SAS and Cloudera Starter Services: Visual Analytics and Visual Statistics

Analytics With Hadoop. SAS and Cloudera Starter Services: Visual Analytics and Visual Statistics Analytics With Hadoop SAS and Cloudera Starter Services: Visual Analytics and Visual Statistics Everything You Need to Get Started on Your First Hadoop Project SAS and Cloudera have identified the essential

More information

PLATFORA INTERACTIVE, IN-MEMORY BUSINESS INTELLIGENCE FOR HADOOP

PLATFORA INTERACTIVE, IN-MEMORY BUSINESS INTELLIGENCE FOR HADOOP PLATFORA INTERACTIVE, IN-MEMORY BUSINESS INTELLIGENCE FOR HADOOP Your business is swimming in data, and your business analysts want to use it to answer the questions of today and tomorrow. YOU LOOK TO

More information

Scalability and Performance Report - Analyzer 2007

Scalability and Performance Report - Analyzer 2007 - Analyzer 2007 Executive Summary Strategy Companion s Analyzer 2007 is enterprise Business Intelligence (BI) software that is designed and engineered to scale to the requirements of large global deployments.

More information

An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct

More information

Integrate and Deliver Trusted Data and Enable Deep Insights

Integrate and Deliver Trusted Data and Enable Deep Insights SAP Technical Brief SAP s for Enterprise Information Management SAP Data Services Objectives Integrate and Deliver Trusted Data and Enable Deep Insights Provide a wide-ranging view of enterprise information

More information

Architecture & Experience

Architecture & Experience Architecture & Experience Data Mining - Combination from SAP HANA, R & Hadoop Markus Severin, Solution Principal Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein

More information

Hadoop Market - Global Industry Analysis, Size, Share, Growth, Trends, and Forecast, 2012 2018

Hadoop Market - Global Industry Analysis, Size, Share, Growth, Trends, and Forecast, 2012 2018 Transparency Market Research Hadoop Market - Global Industry Analysis, Size, Share, Growth, Trends, and Forecast, 2012 2018 Buy Now Request Sample Published Date: July 2013 Single User License: US $ 4595

More information

SQLstream Blaze and Apache Storm A BENCHMARK COMPARISON

SQLstream Blaze and Apache Storm A BENCHMARK COMPARISON SQLstream Blaze and Apache Storm A BENCHMARK COMPARISON 2 The V of Big Data Velocity means both how fast data is being produced and how fast the data must be processed to meet demand. Gartner The emergence

More information

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop

More information

Gain Contextual Awareness for a Smarter Digital Enterprise with SAP HANA Vora

Gain Contextual Awareness for a Smarter Digital Enterprise with SAP HANA Vora SAP Brief SAP Technology SAP HANA Vora Objectives Gain Contextual Awareness for a Smarter Digital Enterprise with SAP HANA Vora Bridge the divide between enterprise data and Big Data Bridge the divide

More information

High-Performance Business Analytics: SAS and IBM Netezza Data Warehouse Appliances

High-Performance Business Analytics: SAS and IBM Netezza Data Warehouse Appliances High-Performance Business Analytics: SAS and IBM Netezza Data Warehouse Appliances Highlights IBM Netezza and SAS together provide appliances and analytic software solutions that help organizations improve

More information

IBM Cognos 10: Enhancing query processing performance for IBM Netezza appliances

IBM Cognos 10: Enhancing query processing performance for IBM Netezza appliances IBM Software Business Analytics Cognos Business Intelligence IBM Cognos 10: Enhancing query processing performance for IBM Netezza appliances 2 IBM Cognos 10: Enhancing query processing performance for

More information

Big Data Visualization. Apache Spark and Zeppelin

Big Data Visualization. Apache Spark and Zeppelin Big Data Visualization using Apache Spark and Zeppelin Prajod Vettiyattil, Software Architect, Wipro Agenda Big Data and Ecosystem tools Apache Spark Apache Zeppelin Data Visualization Combining Spark

More information

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics In Organizations Mark Vervuurt Cluster Data Science & Analytics AGENDA 1. Yellow Elephant 2. Data Ingestion & Complex Event Processing 3. SQL on Hadoop 4. NoSQL 5. InMemory 6. Data Science & Machine Learning

More information

Creating Connection with Hive

Creating Connection with Hive Creating Connection with Hive Intellicus Enterprise Reporting and BI Platform Intellicus Technologies info@intellicus.com www.intellicus.com Creating Connection with Hive Copyright 2010 Intellicus Technologies

More information

Qlik Sense scalability

Qlik Sense scalability Qlik Sense scalability Visual analytics platform Qlik Sense is a visual analytics platform powered by an associative, in-memory data indexing engine. Based on users selections, calculations are computed

More information

Integrating Apache Spark with an Enterprise Data Warehouse

Integrating Apache Spark with an Enterprise Data Warehouse Integrating Apache Spark with an Enterprise Warehouse Dr. Michael Wurst, IBM Corporation Architect Spark/R/Python base Integration, In-base Analytics Dr. Toni Bollinger, IBM Corporation Senior Software

More information

An Oracle White Paper July 2011. Oracle Primavera Contract Management, Business Intelligence Publisher Edition-Sizing Guide

An Oracle White Paper July 2011. Oracle Primavera Contract Management, Business Intelligence Publisher Edition-Sizing Guide Oracle Primavera Contract Management, Business Intelligence Publisher Edition-Sizing Guide An Oracle White Paper July 2011 1 Disclaimer The following is intended to outline our general product direction.

More information

Accelerating Enterprise Big Data Success. Tim Stevens, VP of Business and Corporate Development Cloudera

Accelerating Enterprise Big Data Success. Tim Stevens, VP of Business and Corporate Development Cloudera Accelerating Enterprise Big Data Success Tim Stevens, VP of Business and Corporate Development Cloudera 1 Big Opportunity: Extract value from data Revenue Growth x = 50 Billion 35 ZB Cost Savings Margin

More information

How Transactional Analytics is Changing the Future of Business A look at the options, use cases, and anti-patterns

How Transactional Analytics is Changing the Future of Business A look at the options, use cases, and anti-patterns How Transactional Analytics is Changing the Future of Business A look at the options, use cases, and anti-patterns Table of Contents Abstract... 3 Introduction... 3 Definition... 3 The Expanding Digitization

More information

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet Ema Iancuta iorhian@gmail.com Radu Chilom radu.chilom@gmail.com Buzzwords Berlin - 2015 Big data analytics / machine

More information

Interactive data analytics drive insights

Interactive data analytics drive insights Big data Interactive data analytics drive insights Daniel Davis/Invodo/S&P. Screen images courtesy of Landmark Software and Services By Armando Acosta and Joey Jablonski The Apache Hadoop Big data has

More information