Discovering Business Insights in Big Data Using SQL-MapReduce



Similar documents
Data Warehouse Optimization

INVESTOR PRESENTATION. First Quarter 2014

BIG DATA: FROM HYPE TO REALITY. Leandro Ruiz Presales Partner for C&LA Teradata

INVESTOR PRESENTATION. Third Quarter 2014

Implement Hadoop jobs to extract business value from large and varied data sets

UNIFY YOUR (BIG) DATA

Key Data Replication Criteria for Enabling Operational Reporting and Analytics

Data Governance in the Hadoop Data Lake. Michael Lang May 2015

UNLEASHING THE VALUE OF THE TERADATA UNIFIED DATA ARCHITECTURE WITH ALTERYX

Teradata Unified Big Data Architecture

What is Data Virtualization? Rick F. van der Lans, R20/Consultancy

IBM Netezza High Capacity Appliance

Investor Presentation. Second Quarter 2015

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics

The New Rules for Integration

What is Data Virtualization?

QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM

90% of your Big Data problem isn t Big Data.

In-Memory Analytics for Big Data

VIEWPOINT. High Performance Analytics. Industry Context and Trends

Splice Machine: SQL-on-Hadoop Evaluation Guide

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Information Architecture

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

How to Enhance Traditional BI Architecture to Leverage Big Data

Data Services: The Marriage of Data Integration and Application Integration

Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software. SC13, November, 2013

ADVANCED ANALYTICS AND FRAUD DETECTION THE RIGHT TECHNOLOGY FOR NOW AND THE FUTURE

Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D.

Proact whitepaper on Big Data

Unleash your intuition

Actian SQL in Hadoop Buyer s Guide

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Data Modeling for Big Data

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect

SAS and Teradata Partnership

Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth

Transparently Offloading Data Warehouse Data to Hadoop using Data Virtualization

How To Handle Big Data With A Data Scientist

Welcome. Host: Eric Kavanagh. The Briefing Room. Twitter Tag: #briefr

Deploying an Operational Data Store Designed for Big Data

Is Actian PSQL a Relational Database Server?

Parallel Data Warehouse

Big Data Integration: A Buyer's Guide

Data Virtualization for Agile Business Intelligence Systems and Virtual MDM. To View This Presentation as a Video Click Here

IBM System x reference architecture solutions for big data

IBM BigInsights for Apache Hadoop

Big Data Analytics Nokia

Using an In-Memory Data Grid for Near Real-Time Data Analysis

BIG DATA: FIVE TACTICS TO MODERNIZE YOUR DATA WAREHOUSE

Reference Architecture, Requirements, Gaps, Roles

An Oracle White Paper October Oracle: Big Data for the Enterprise

What Is In-Memory Computing and What Does It Mean to U.S. Leaders? EXECUTIVE WHITE PAPER

An Oracle White Paper June Integration Technologies for Primavera Solutions

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Native Connectivity to Big Data Sources in MSTR 10

Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010

Achieving Business Value through Big Data Analytics Philip Russom

Manifest for Big Data Pig, Hive & Jaql

IBM InfoSphere Guardium Data Activity Monitor for Hadoop-based systems

Using In-Memory Computing to Simplify Big Data Analytics

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

IBM InfoSphere BigInsights Enterprise Edition

Solution White Paper Connect Hadoop to the Enterprise

BIG DATA-AS-A-SERVICE

ORACLE BUSINESS INTELLIGENCE, ORACLE DATABASE, AND EXADATA INTEGRATION

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances

How To Learn To Use Big Data

Data Virtualization A Potential Antidote for Big Data Growing Pains

High-Performance Business Analytics: SAS and IBM Netezza Data Warehouse Appliances

A Whole New World. Big Data Technologies Big Discovery Big Insights Endless Possibilities

Integrated Big Data: Hadoop + DBMS + Discovery for SAS High Performance Analytics

Oracle Big Data SQL Technical Update

Big Data Explained. An introduction to Big Data Science.

Oracle Big Data Building A Big Data Management System

Introducing Oracle Exalytics In-Memory Machine

Qlik Sense Enterprise

Trafodion Operational SQL-on-Hadoop

The 3 questions to ask yourself about BIG DATA

Big Data at Cloud Scale

Elastic Application Platform for Market Data Real-Time Analytics. for E-Commerce

Apache Hadoop: The Big Data Refinery

An Oracle White Paper September Oracle Database and the Oracle Database Cloud

Executive Summary... 2 Introduction Defining Big Data The Importance of Big Data... 4 Building a Big Data Platform...

Spring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE

Transcription:

Discovering Business Insights in Big Data Using SQL-MapReduce A Technical Whitepaper Rick F. van der Lans Independent Business Intelligence Analyst R20/Consultancy July 2013 Sponsored by

Copyright 2013 R20/Consultancy. All rights reserved. Teradata, the Teradata logo, Aster, SQL-MapReduce, Applications-Within, SQL-H are trademarks of registered trademarks of Teradata Corporation and/or its affiliates in the U.S. and worldwide. Trademarks of companies referenced in this document are the sole property of their respective owners.

Table of Contents 1 Summary 1 2 Discovery and the Data Scientist 3 3 The Discovery Process 5 4 Requirements for a Discovery Platform 7 5 5.1 5.2 5.3 5.4 5.5 5.6 6 6.1 6.2 6.3 6.4 6.5 Technologies for Implementing a Discovery Platform The World of Big Data, NoSQL, and Schemas Using SQL for Discovery Functions in SQL Parallelization of SQL Queries and SQL Functions Hadoop and MapReduce in a Nutshell The Marriage of SQL and MapReduce: SQL-MapReduce Implementing a Discovery Platform Solution 1: Classic SQL System Solution 2: Advanced Reporting and Analytical Platform Solution 3: SQL-MapReduce System Solution 4: Hadoop with MapReduce Solution 5: Hadoop with a SQL Interface 7 Comparison of Five Solutions for Implementing a Discovery Platform 35 8 Technical Advantages of SQL-MapReduce 37 About the Author Rick F. van der Lans 39 About Teradata and the Teradata Aster Discovery Platform 39 Appendix A The Built-in Functions of Teradata Aster Database 41 9 10 13 17 19 24 27 30 31 31 32 33 34

Discovering Business Insights in Big Data Using SQL-MapReduce 1 1 Summary This whitepaper describes the advantages of merging the openness and productivity of SQL with the scalability of MapReduce to create a discovery platform that supports today s complex and data-intensive analytical workload generated by data scientists. It focuses on the SQL- MapReduce implementation offered by Teradata through the Teradata Aster Discovery Platform which includes the Aster Database and Aster Discovery Portfolio. The whitepaper also discusses some of the alternative technologies available for developing a discovery platform, such as Hadoop, MapReduce, NoSQL, schema-on-read, and SQL-fication. Business intelligence users are traditionally classified based on the tools they use: users of reporting tools and users of analytical tools. But there is a third group of users, one that uses anything they can find to Discovery is about searching discover new insights that can lead to business benefits. and analyzing data to find They can benefit from reporting and analytical tools, but they new business insights. need more, they need discovery capabilities. Discovery is about searching and analyzing data to find new business insights that can lead to business opportunities. Nowadays, analysts responsible for discovery are called data scientists. Data scientists commonly use all the data and all the tools they can lay their hands on. They use analytics and reporting to study data, but they won t stop there. In other words, discovery is not a fancy new term for analytics. Analytics is just one of the many technologies used by a data scientist to discover new insights. The discovery process commonly followed by data scientists consists of four steps: data acquisition, data preparation, data analysis, and data interpretation. To support data scientists in this discovery process, a reporting tool or analytical tool is not sufficient, they need a featurerich, fast, and flexible discovery platform. Such a platform should minimally support the following features: Data scalability Heterogeneous data access Complex value analysis Data preparation techniques Multiple analysis techniques Multiple analysis tools Interactive analysis High-speed analysis High-level development language Data scientists can select from different solutions for implementing a discovery platform: 1. Classic SQL system 2. Advanced Reporting Platform 3. SQL-MapReduce System 4. Hadoop with MapReduce 5. Hadoop with SQL interface

Discovering Business Insights in Big Data Using SQL-MapReduce 2 This whitepaper describes all five solutions in detail and focuses on the third one, the SQL- MapReduce solution. SQL-MapReduce is a framework based on a combination of SQL, which is the most popular database language, and a programming model created by Google called MapReduce. The goal of MapReduce is to distribute as much processing over as many processors as possible. This whitepaper describes the SQL-MapReduce implementation offered by the Teradata Aster Database (formerly called Aster Data ncluster). Aster Database is part of the Teradata Aster Discovery Platform which also includes the Teradata Aster Discovery Portfolio. On the outside, the Teradata Aster Database looks like any other SQL database server. It supports standard SQL and all the common APIs, such as ODBC and JDBC, so that it can be accessed by all the popular analytical and reporting tools. What s inside makes Aster Database special. The product has been designed specifically for discovery and exploration of big data with the intention to uncover business insights. Its unique Applications-Within architecture runs analytic application logic inside the database, leveraging its massively-parallel architecture and SQL-MapReduce to fully parallelize the processing of complex analytical queries. Besides being a powerful platform for data scientists, the support for standard SQL makes Aster Database also suitable for more traditional query workloads such as reporting and analytics, thus making it a platform for business analysts as well. To summarize, extending a SQL database server with MapReduce creates a discovery platform that combines the expressive query power, openness, and productivity of SQL with the parallelizability and scalability of MapReduce. The combination has the potential to improve the performance of complex analytical queries Teradata s Aster Database is a mature and robust implementation of SQL- MapReduce and has proven itself as a discovery platform. running on large to extremely large datasets. Teradata Aster Database is a mature and robust implementation of SQL-MapReduce and has proven itself as a discovery platform. Note: This whitepaper is a rewrite of an older whitepaper entitled Using SQL-MapReduce for Advanced Analytical Queries 1 and was published in September 2011. Since then, much has changed: the Hadoop stack has grown with several new modules, new SQL interfaces for HDFS have been released, new versions of Aster Database have become available, and the interest for big data has grown drastically. Therefore, it was decided to drastically rewrite this whitepaper. Some pieces of text have been reused, but major sections are new or have been completely revised. 1 R.F. van der Lans, Using SQL-MapReduce for Advanced Analytical Queries, September 2011.

Discovering Business Insights in Big Data Using SQL-MapReduce 3