Architectures for Big Data Analytics A database perspective

Similar documents
Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

How To Handle Big Data With A Data Scientist

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014

In-Memory Analytics for Big Data

Navigating the Big Data infrastructure layer Helena Schwenk

BIG DATA What it is and how to use?

SAP HANA SAP s In-Memory Database. Dr. Martin Kittel, SAP HANA Development January 16, 2013

CSE-E5430 Scalable Cloud Computing Lecture 2

Hadoop Ecosystem B Y R A H I M A.

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

How To Scale Out Of A Nosql Database

What Does Big Data Mean and Who Will Win? Michael Stonebraker

SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Hadoop. Sunday, November 25, 12

Large scale processing using Hadoop. Ján Vaňo

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Splice Machine: SQL-on-Hadoop Evaluation Guide

NoSQL for SQL Professionals William McKnight

BIG DATA TRENDS AND TECHNOLOGIES

Hadoop and Map-Reduce. Swati Gore

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Big Data and Big Analytics

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

SAP HANA PLATFORM Top Ten Questions for Choosing In-Memory Databases. Start Here

CIO Guide How to Use Hadoop with Your SAP Software Landscape

Cloud Computing at Google. Architecture

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

A Next-Generation Analytics Ecosystem for Big Data. Colin White, BI Research September 2012 Sponsored by ParAccel

Big Data With Hadoop

Advanced Big Data Analytics with R and Hadoop

A Comparison of Approaches to Large-Scale Data Analysis

Graph Mining on Big Data System. Presented by Hefu Chai, Rui Zhang, Jian Fang

Open source Google-style large scale data analysis with Hadoop

NoSQL and Hadoop Technologies On Oracle Cloud

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

Big Data and Apache Hadoop s MapReduce

INDUS / AXIOMINE. Adopting Hadoop In the Enterprise Typical Enterprise Use Cases

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

I/O Considerations in Big Data Analytics

Big Data and Data Science: Behind the Buzz Words

Big Data Challenges in Bioinformatics

Big Fast Data Hadoop acceleration with Flash. June 2013

In-memory computing with SAP HANA

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

A Brief Outline on Bigdata Hadoop

Parallel Computing. Benson Muite. benson.

Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D.

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

Big Data Analysis and HADOOP

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Evaluating NoSQL for Enterprise Applications. Dirk Bartels VP Strategy & Marketing

BIG DATA TECHNOLOGY. Hadoop Ecosystem

Oracle Database In-Memory The Next Big Thing

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

How Companies are! Using Spark

Cost-Effective Business Intelligence with Red Hat and Open Source

Reference Architecture, Requirements, Gaps, Roles

Data Mining in the Swamp

Bringing Big Data Modelling into the Hands of Domain Experts

RevoScaleR Speed and Scalability

2009 Oracle Corporation 1

A Brief Introduction to Apache Tez

Open source large scale distributed data management with Google s MapReduce and Bigtable

Big Data Introduction

Scaling Your Data to the Cloud

Integrating Big Data into the Computing Curricula

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Integrated Big Data: Hadoop + DBMS + Discovery for SAS High Performance Analytics

Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012

Hadoop IST 734 SS CHUNG

Accelerating and Simplifying Apache

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

HPC ABDS: The Case for an Integrating Apache Big Data Stack

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

BIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

SQL Server 2005 Features Comparison

Big Data Can Drive the Business and IT to Evolve and Adapt

Constructing a Data Lake: Hadoop and Oracle Database United!

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

Big Data: A Storage Systems Perspective Muthukumar Murugan Ph.D. HP Storage Division

Hadoop-BAM and SeqPig

Hadoop implementation of MapReduce computational model. Ján Vaňo

Integrating Apache Spark with an Enterprise Data Warehouse

Big-data Analytics: Challenges and Opportunities

Can the Elephants Handle the NoSQL Onslaught?

Transcription:

Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013

Outline Big Data Analytics Requirements Spectrum of Analytic Computations Solution Options Summary 2013 SAP AG. All rights reserved. 2

Big Data Analytics The process of examining large amounts of data of a variety of types to uncover hidden patterns and unknown correlations to improve business performance or mitigate risk Ebay Example: detect correlation between increased postal rates and decreased web traffic in time to launch promotion countermeasures Requirements Scalability on terabyte range (1 1000) volumes Deep analytical computation Relational and non-relational data Response-time allowing interactivity Data management (querying, fault tolerance, backup, archiving, security, data quality ) 2013 SAP AG. All rights reserved. 3

Analytic computations, from simple to deep SQL (92) For each employee, give the covariance of salary and bonus over 3 years SQL 99 Analytic extensions For each employee, give the count of employees earning half less than them Packaged algorithms (clustering, linear regression, ) Tell me which customers are likely to be early adopters of my new product Predict and fit to past data next month s sales by business unit and per geography Linear algebra packages (ScaLAPACK) Discover communities among users of products sold to my enterprise customers Analytic programming language (R, MatLab, SAS ) Complete environment for data analysis including thousands of add-in packages All require regularity in the data at odds with Big Variety 2013 SAP AG. All rights reserved. 4

Solution options Analytic PL Relational DBMS Map-Reduce Analytic PL + (Relational DBMS or Map-Reduce) Redesign New DBMS 2013 SAP AG. All rights reserved. 5

Analytic PL: R (and Matlab et. al.) State of the art deep analytics Open source, high-level analytic & stats functions, matrix algebra, graphics Standard file formats for data sharing across PL platforms HDF5 Limitations Load data from file (or ODBC), operate entirely in main memory Little or no parallelism Little or no data management R community working to address these needs R community very active in parallelizing R across multiple CPUs or nodes Revolution offers external memory implementation on proprietary file format 2013 SAP AG. All rights reserved. 6

Evaluating against requirements Analytic PL Scalability Deep analytics Relational & nonrelational data Response time Data Management 2013 SAP AG. All rights reserved. 7

Relational DBMSs The ultimate data management engines Scalability and response time predicated mostly on The architecture for parallel processing (incl. quality of data placement, quality of parallel algebra implementation) Row storage versus column storage Analytics Simple (SQL level) : good support Deep analytics : much harder Linear algebra on large matrices via SQL UDFs is hard to parallelize well But packaged algorithms becoming available as stored procedures Big Variety is an issue Typical architecture is to ETL data from MR systems (e.g., Hadoop) 2013 SAP AG. All rights reserved. 8

Architectures for parallel processing 1 Shared Memory P P P P M Shared Memory The bus is the bottleneck beyond 32 processors in practice used for 4-8 proc. machines 2013 SAP AG. All rights reserved. 9

Architectures for parallel processing 2 Shared Memory Shared Disk P P P P M M M M Shared Memory The bus is the bottleneck beyond 32 processors in practice used for 4-8 proc. machines Shared Disk Slower inter-cpu communication, better fault-tolerance, bottleneck pushed to about 100 proc. 2013 SAP AG. All rights reserved. 10

Architectures for parallel processing 3 Shared Memory Shared Disk Shared Nothing P P P P M M M M Shared Memory The bus is the bottleneck beyond 32 processors in practice used for 4-8 proc. machines Shared Disk Slower inter-cpu communication, better fault-tolerance, bottleneck pushed to about 100 proc. Shared Nothing Preferred parallel architecture for decision support, works on 1000 s of commodity hardware processors. Costlier architecture for transactional workloads 2013 SAP AG. All rights reserved. 11

Architectures for parallel processing 4 Shared Memory Shared Disk Shared Nothing P P P P M M Hybrid (NUMA) Shared Memory The bus is the bottleneck beyond 32 processors in practice used for 4-8 proc. machines Shared Disk Slower inter-cpu communication, better fault-tolerance, bottleneck pushed to about 100 proc. Shared Nothing Preferred parallel architecture for decision support, works on 1000 s of commodity hardware processors. Costlier architecture for transactional workloads 2013 SAP AG. All rights reserved. 12

Row vs. column storage and future trends Row Storage Read / Write fields of a tuple ALL fields is FAST a SINGLE field is SLOW Limited compression Slower aggregations Fit for transactional workloads Column Storage Read / Write fields of a tuple ALL fields is SLOW a SINGLE field is FAST High compression rates (5 to 10x) Fast aggregations Fit for analytic workloads CPU clock speed - GHz (+50% each year) Memory bandwidth GB/s (+50% each year) Memory latency - ns (1% each year) Years 2013 SAP AG. All rights reserved. 13

Evaluating against requirements Analytic PL Relational DBMS Scalability Deep analytics Relational & nonrelational data Response time Data Management 2013 SAP AG. All rights reserved. 14

Map-Reduce Shared nothing cluster computing Thousands of inter-connected commodity (low-cost) computer hardware Distributed File System Architecture for storing redundantly chunks of very large files in the cluster Map-Reduce A parallel computing model built on a DFS (Google, 2004) Programmer writes two functions: map and reduce Implementation coordinates parallel execution of map and reduce tasks Higher level languages layered on top of M-R HIVE (a SQL variant), HBase (a No-SQL column oriented DBMS) Mahout (a machine learning package) Hadoop is an open-source implementation of DFS and M-R 2013 SAP AG. All rights reserved. 15

Map-Reduce example 0699194305.. 0699195005.. 0699105005.. 0699194905.. 0699194805.. Input map Shuffle reduce > Output (0, 0699194305...) (106,0699195005...) (212,0699195005...) (0, 0699194905...) (106,0699194805...) (1950, -11) (1950, 22) (1949, 111) (1950, 0) (1949, 78) (1949, [111, 78]) (1950, [0, 22-11]) (1949, 111) (1950, 22) 2013 SAP AG. All rights reserved. 16

Map-Reduce for Big Data Analytics No data model Typically solved by higher level layers (e.g., HIVE) Hadoop MR slow on simple analytics SQL-level analytics slower than parallel R-DBMSs [Pavlo 2009] MR has no indexing MR can only retain states between intermediate steps by writing to DFS MR can t bind specific processing to nodes in the cluster Selection with full table scan Same on large files, else MR is worse Highly selective query (with index) R-DBMS about 10 times faster Simple aggregation Selection, join and aggregation R-DBMS about 2 times faster R-DBMS about 50 times faster Deep analytics experimentation on MR not yet conclusive, but Apache HAMA (matrix & graph computations) preferred the BSP model 2013 SAP AG. All rights reserved. 17

Evaluating against requirements Scalability Deep analytics Relational & nonrelational data Response time Data Management Analytic PL Relational DBMS Map-Reduce Hadoop 2013 SAP AG. All rights reserved. 18

Analytic PL + (Relational DBMS or Map-Reduce) 1. Enhance a DBMS with deep analytics capabilities MADLib: Open source library of SQL-based algorithms for statistics & machine learning, designed for shared-nothing parallel DBMSs Laudable effort, but portability is an issue, and performance today is not very good. Besides, will the analyst favor SQL over an Analytic PL? 2. Improve cooperation between DBMS and Analytic PL Ricardo: Gets analysts to decompose which parts of their program remain in (e.g.,) R, and which by the DBMS (e.g., JAQL on M-R) Need sophisticated user to learn two systems and wanting to refactor code 3. Embedding Analytic PL scripts in stored procedures SAP HANA: R scripts in stored procedure execution plan triggers processing in R, transfers intermediate tables As DBMS plans execute in parallel, multiple R runtimes run in parallel 2013 SAP AG. All rights reserved. 19

Redesign New DBMSs Example Is SciDB New data model merging relations and arrays New query language (Array SQL) Shared nothing architecture Scalable data management Scalable deep analytics No-overwrite, fault tolerance Open source 2013 SAP AG. All rights reserved. 20

SAP HANA a contender in big data analytics In-memory, parallel multi-core database Column store A predictive analytics library for statistics and data mining Server-side stored procedures An integration with R Allowing to embed R script in stored procedure ETL tool including Hadoop connector into HANA Parallel ETL through MR task generation The most advanced BI toolset in the market Including BOBJ tools and new UX for predictive analysis 2013 SAP AG. All rights reserved. 21

Summary Lots of interest Deep analytics are becoming mainstream Lots of players Industrial, research Lots of challenges Common scalable execution model for linear algebra, statistic algorithms and query languages may prove elusive for a while This area is ripe for disruption! 2013 SAP AG. All rights reserved. 22