Big Data at Cloud Scale



Similar documents
The Power of Pentaho and Hadoop in Action. Demonstrating MapReduce Performance at Scale

Build a Streamlined Data Refinery. An enterprise solution for blended data that is governed, analytics-ready, and on-demand

Buying vs. Building Business Analytics. A decision resource for technology and product teams

Embedded Analytics Vendor Selection Guide. A holistic evaluation criteria for your OEM analytics project

The Ultimate Guide to Buying Business Analytics

The Ultimate Guide to Buying Business Analytics

Architected Blended Big Data with Pentaho

Blueprints for Big Data Success

Performance and Scalability Overview

Performance and Scalability Overview

Your Path to. Big Data A Visual Guide

The Future of Data Management

Native Connectivity to Big Data Sources in MSTR 10

Why Big Data in the Cloud?

5 Keys to Unlocking the Big Data Analytics Puzzle. Anurag Tandon Director, Product Marketing March 26, 2014

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Three Open Blueprints For Big Data Success

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

Informatica PowerCenter The Foundation of Enterprise Data Integration

Real Time Big Data Processing

A TECHNICAL WHITE PAPER ATTUNITY VISIBILITY

How to Enhance Traditional BI Architecture to Leverage Big Data

Introducing Oracle Exalytics In-Memory Machine

Big Data Use Cases. To Start Today. Paul Scholey Sales Director, EMEA. 2013, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866)

White. Paper. EMC Isilon: A Scalable Storage Platform for Big Data. April 2014

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM

Data Integration Checklist

Vectorwise 3.0 Fast Answers from Hadoop. Technical white paper

Tap into Big Data at the Speed of Business

SELLING PROJECTS ON THE MICROSOFT BUSINESS ANALYTICS PLATFORM

Extend your analytic capabilities with SAP Predictive Analysis

Big Analytics: A Next Generation Roadmap

Understanding Your Customer Journey by Extending Adobe Analytics with Big Data

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances

Tap into Hadoop and Other No SQL Sources

HDP Hadoop From concept to deployment.

PLATFORA INTERACTIVE, IN-MEMORY BUSINESS INTELLIGENCE FOR HADOOP

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

IT as a Service Emerges as a New Management Paradigm in the Software-Defined Datacenter Era

QLIKVIEW INTEGRATION TION WITH AMAZON REDSHIFT John Park Partner Engineering

Datenverwaltung im Wandel - Building an Enterprise Data Hub with

Traditional BI vs. Business Data Lake A comparison

BIG DATA-AS-A-SERVICE

Next-Generation Cloud Analytics with Amazon Redshift

Business Intelligence for Big Data

Actian SQL in Hadoop Buyer s Guide

An Integrated Big Data & Analytics Infrastructure June 14, 2012 Robert Stackowiak, VP Oracle ESG Data Systems Architecture

Cloud Integration and the Big Data Journey - Common Use-Case Patterns

Accelerate BI Initiatives With Self-Service Data Discovery And Integration

Big Data for the Rest of Us Technical White Paper

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

Building Your Big Data Team

Microsoft Big Data Solutions. Anar Taghiyev P-TSP

Building your Big Data Architecture on Amazon Web Services

Oracle Data Integrator 12c (ODI12c) - Powering Big Data and Real-Time Business Analytics. An Oracle White Paper October 2013

Information Architecture

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

BIG DATA TRENDS AND TECHNOLOGIES

Elevate your analytics with SAS in the cloud

Microsoft Big Data. Solution Brief

An Oracle White Paper June High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

Modern IT Operations Management. Why a New Approach is Required, and How Boundary Delivers

DATAMEER WHITE PAPER. Beyond BI. Big Data Analytic Use Cases

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

I D C M A R K E T S P O T L I G H T. P r i va t e a n d H yb r i d C l o u d s E n a b l e New L e ve l s o f B u s i n e s s and IT Collaboration

Big Data Are You Ready? Jorge Plascencia Solution Architect Manager

Big Data Integration: A Buyer's Guide

Ganzheitliches Datenmanagement

Affordable, Scalable, Reliable OLTP in a Cloud and Big Data World: IBM DB2 purescale

Dell Cloudera Syncsort Data Warehouse Optimization ETL Offload

How To Handle Big Data With A Data Scientist

Oracle Big Data SQL Technical Update

Big Data and Natural Language: Extracting Insight From Text

Azure Data Lake Analytics

Offload Enterprise Data Warehouse (EDW) to Big Data Lake. Ample White Paper

Harnessing the Power of Big Data for Real-Time IT: Sumo Logic Log Management and Analytics Service

Data Refinery with Big Data Aspects

Necto on Azure The Ultimate Cloud Solution for BI

Testing Big data is one of the biggest

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

Interactive data analytics drive insights

SQL Server 2012 Parallel Data Warehouse. Solution Brief

Hadoop Market - Global Industry Analysis, Size, Share, Growth, Trends, and Forecast,

Automated Data Ingestion. Bernhard Disselhoff Enterprise Sales Engineer

Transcription:

Big Data at Cloud Scale Pushing the limits of flexible & powerful analytics Copyright 2015 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For the latest information, please visit our web site at pentaho.com.

Cloud Computing and Big Data Two of the most disruptive technology trends over the last 10 years have been the growth of cloud computing and the emergence of Big Data systems. These developments have changed the way technology organizations operate and deliver value to their stakeholders. At a basic level, cloud computing has allowed enterprises to optimize IT operations by significantly reducing the need to invest in on-premise hardware and software, not to mention the staff required maintain these systems. The cloud affords businesses a new level of flexibility, as they can acquire applications, infrastructure, and computing power in a way that is much more closely matched with the timing and duration of their project needs. Further, by pooling infrastructure across many customers, cloud vendors are able to provide services that are highly elastic and scalable. This means it is much more financially and operationally manageable for enterprises to address unanticipated peaks and troughs in infrastructure needs. Overall, cloud adoption continues to show momentum, as the public IT cloud services market is expected to grow five times faster than the IT industry as a whole. 1 At the same time, Big Data technologies have enabled organizations to generate value from data assets like never before. Historically, data that was high in volume, diverse in structure, and rapidly changing posed difficult challenges for enterprises that were used to working with traditional relational database technology. However, new technical paradigms such as schema-on-read, massively parallel processing, and MapReduce have provided ways to reduce the overhead required to get raw data into a data store and drastically increased the speed and efficiency of processing large amounts of data. They have also made unstructured and semistructured data much more accessible for businesses. These innovations have begun to unleash actionable analysis on a variety of previously challenging data sources, including web logs, documents and text, and machine sensors. Even, dark data (data locked in corporate warehouses with little analytic access) has been given new life through these new technologies. As open source Big Data technologies have matured into commercially supported products, we have seen several platform categories start to gain rapid adoption: > Hadoop distributions: Frameworks for large scale data storage and high performance processing across a distributed file system with the MapReduce paradigm as well as the more recently released MapReduce 2, also known as the YARN data operating system for cluster resource management; ideal for high volume unstructured data > NoSQL stores: Non-relational databases with a very flexible structure; ideal for extremely rapid data ingestion and large numbers of reads based on key values > Analytic databases: Databases designed for high performance analytics, leveraging techniques like compression, column-based storage, and high-speed bulk inserts of structured data; ideal for complex queries and OLAP analysis 1 IDC press release, IDC Forecasts Worldwide Public IT Cloud Services Spending to Reach Nearly $108 Billion by 2017 as Focus Shifts from Savings to Innovation, 9/3/2013 1

Two Worlds Converging Big Data systems help organizations solve hard problems, but they normally require a significant up-front and ongoing IT investment. This includes a potentially large number of server machines as well as employees with skills that may be hard to come by, such as Java MapReduce programming. At the same time, the sheer amount of data in more ambitious projects (at times greater than 1 Petabyte or 1,000,000 Gigabytes) may lead teams to re-think whether keeping everything in-house is the best strategy. Finally, the time element is also important procuring, installing, configuring, and testing the required technology doesn t happen over night. On some level, it makes sense that enterprises would turn to cloud providers who have expertise in managing and maintaining extremely scalable and flexible computing and storage infrastructure. While on-premise data systems are by no means going away, research indicates that cloud platforms are ideal deployment options for elastic and transient workloads built in modern application architectures. This suggests that organizations can effectively push the limits of analytics at scale by tapping into Big Data systems hosted on cloud infrastructure. 2 In fact, a recent survey of enterprise decision makers reported that over a quarter of organizations have already started utilizing public cloud resources for Big Data analytics projects and another quarter plan to do so going forward. 3 While many of these early cloud projects involve high volumes of structured data, there are several key technology components that are already enabling extraction of value from massive, diverse data on cloud infrastructure: > Cloud Analytical Databases: These cloud-based services, such as Amazon RedShift, are elastic data warehouses optimized for analytics with existing BI tools. In addition to leveraging enhancements like massively parallel processing and columnar storage to boost performance, this type of analytical database also includes management and monitoring of the solution by the provider. Users are able to avoid many of the costs related to setting up and managing a traditional data warehouse. > Hosted Hadoop Services: Hadoop clusters can also be hosted in the cloud, which avoids the need for on-premise infrastructure and reduces reliance on in-house Hadoop-specific staffing to support Big Data use cases. Given on-premise start-up costs and cluster hardware expansion over time, it s easy to see where the cloud can provide value. Some Hadoop cloud offerings also include managed services, like job troubleshooting, software installation, testing, and more. > Data Integration and Analytics: While adoption of cloud BI tools has increased, Pentaho is unique in that it provides a cloud-deployable platform that supports end-to-end data integration and business analytics for Big Data stores, including the cloud analytical databases and hosted Hadoop services discussed above. This data can be blended with a variety of other cloud-based data for further insight. The next section discusses a sample solution architecture, illustrating how these different technologies can be leveraged to drive business results in practice. 2 Forrester, The Public Cloud Market Is Now In Hypergrowth, 4/24/2014 3 GigaOm Research, How enterprises will use the cloud for big data analytics, 11/10/2014 2

Sample Solution Architecture In this example, a regulatory organization has implemented a cloud-based data refinery solution in order to facilitate claims analyst access to up to 50 billion daily records of diverse structure. The goal is to enable more granular identification of potential violations and access to individual transactions to support claims. Amazon Web Services elastic computing and storage resources have been leveraged to control the cost of supporting these activities from an IT perspective. Raw record data is first staged in a hosted Hadoop distribution, and a summary of that data is made available to Amazon Redshift via the Hive relational data warehouse layer, using Pentaho Data integration for orchestration. End users have the ability to drill down into detailed underlying data by selecting a set of parameters, such as dates, from a simple Pentaho form interface. Cloud Infrastructure Raw transactions in hosted Hadoop + Hive Data Flow Pentaho Data Integration User Self-Service Request Data is automatically modeled, published, and governed for business users User defines cube parameters and measures via internal form Pentaho Business Analytics Audit and Security Analyst Analysis and Data Mining Business Analyst Data Discovery and Auto- Modeling Upon submission of the selection, Pentaho Data integration triggers in-cluster Hadoop transformations to pull the desired data set and stage it in Redshift. PDI also automatically creates and publishes a multi-dimensional analysis model for this data set, logging the whole process for audit and administrative purposes. From here, the claims investigator uses Pentaho Business Analytics for ad hoc analysis and visualization of the extracted data set in order to better identify regulatory risks. The end-to-end solution facilitates navigation of up to 500 TB of daily transaction data for precise drill-down and business insight. At the same time, infrastructure characteristics, like Hadoop processing power or Redshift storage needed at a given time, are elastic. This of course translates to minimized fixed project costs which would be substantially higher in a case where the entire solution architecture was hosted on premise. 3

Conclusion Early adopters are already illustrating how the cloud can expand on the value proposition of Big Data, delivering elastic and cost-effective solutions for integrating and analyzing data at unprecedented scale. However, taking Big Data to the cloud doesn t eliminate the challenges of blending and orchestrating multiple complex data sources to drive value-added analysis. Only the right end-to-end data integration and analytics platform can translate these visionary architectures into proven solutions. Raw record data is first staged in a hosted Hadoop distribution, and a summary of that data is made available to Amazon Redshift via the Hive relational data warehouse layer, using Pentaho Data integration for orchestration. End users have the ability to drill down into detailed underlying data by selecting a set of parameters, such as dates, from a simple Pentaho form interface. Pentaho helps deliver on the promise of Big Data in the Cloud with the following unique capabilities: > Flexible data transformation and orchestration for cloud-based Big Data stores, including hosted Hadoop distributions and Amazon Redshift > Drag & drop ETL design for Big Data, including MapReduce workflow > Auto-modeling and auto-publishing of analysis models for Amazon Redshift and other analytical databases > The full spectrum of cloud-friendly end-user analytics, including visualization, ad hoc analysis, reporting, and dashboards Case Study - Nasdaq BUSINESS CHALLENGE Nasdaq manages several billion rows of financial information each business day, and needed a modern, cost-effective way to make this information readily useful for several lines of business. SOLUTION Nasdaq leverages Pentaho s end-to-end platform to transform large complex data sets, integrate with Amazon Redshift, and empower end users with automatically generated reports as well as self-service analytics and dashboards to effectively manage several lines of business. VALUE ADDED With Pentaho, NASDAQ OMX created a cloud-based solution that manages huge volumes of data efficiently and cost effectively so the business can derive more useful information. Now a single development team replaces work previously done by a mix of development, system admins, and database administrators. The new solution cost represents over 50% savings relative to previous solution. Our legacy systems were extremely slow, and lacked required data governance for data at scale. With today s big data solution in the cloud, we re not only able to scale beyond previous capabilities, but do it in a much more cost-effective way with flexible deployment options and higher data confidence. Michael Weiss Senior Software Engineer, Nasdaq OMX 4

Global Headquarters Citadel International - Suite 340 5950 Hazeltine National Dr. Orlando, FL 32822, USA tel +1 407 812 6736 fax +1 407 517 4575 US & Worldwide Sales Office 353 Sacramento Street, Suite 1500 San Francisco, CA 94111, USA tel +1 415 525 5540 toll free +1 866 660 7555 United Kingdom, Rest of Europe, Middle East, Africa London, United Kingdom tel +44 (0) 20 3574 4790 toll free (UK) 0 800 680 0693 Learn more about Pentaho Business Analytics pentaho.com/contact +1 (866) 660-7555 FRANCE Offices - Paris, France tel +33 97 51 82 296 toll free (France) 0800 915343 GERMANY, AUSTRIA, SWITZERLAND Offices - Munich, Germany tel +49 (0) 322 2109 4279 toll free (Germany) 0800 186 0332 BELGIUM, NETHERLANDS, LUXEMBOURG Offices - Antwerp, Belgium tel +31 8 58 880 585 toll free (Belgium) 0800 773 83 ITALY, SPAIN, PORTUGAL Offices - Valencia, Spain toll free (Italy) 800-798-217 toll free (Spain) 900-868522 toll free (Portugal) 800-180-060 Copyright 2015 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For the latest information, please visit our web site at pentaho.com. 020915