What Does Big Data Mean and Who Will Win? Michael Stonebraker




The Meaning of Big Data: the 3 Vs
  Big Volume
    Business stuff with simple (SQL) analytics
    Business stuff with complex (non-SQL) analytics
    Science stuff with complex analytics
  Big Velocity
    Drink from the fire hose
  Big Variety
    Large number of diverse data sources to integrate

Big Volume - Little Analytics
  Well addressed by the data warehouse crowd
  Who are pretty good at SQL analytics on
    Hundreds of nodes
    Petabytes of data

Table Stakes
  Scalability
  Shared-nothing architecture
  On commodity parts
  Don't care if packaged as an appliance or not
    As long as it's not proprietary iron

The Participants
  Row storage and row executor
    Microsoft Madison, DB2, Netezza, Oracle(!)
  Column store grafted onto a row executor (wannabees)
    Teradata/Aster Data, EMC/Greenplum
  Column store and column executor
    HP/Vertica, Sybase/IQ, Paraccel
  Oracle Exadata is not:
    a column store
    a scalable shared-nothing architecture

Performance
  Row stores: x1
  Column stores: x50
  Wannabees: somewhere in between

My Prediction
  The only successful big data - small analytics architecture will be a column executor on shared-nothing hardware
  All the successful vendors will have to get there sooner or later
  The elephants are in a serious Innovator's Dilemma
  P.S. I started Vertica but I have no current relationship with HP/Vertica

Big Data - Big Analytics
  Complex math operations (machine learning, clustering, trend detection, ...)
  The world of the quants and the rocket scientists
  Mostly specified as linear algebra on array data
  A dozen or so common inner loops
    Matrix multiply
    QR decomposition
    SVD
    Linear regression

Big Data - Big Analytics: An Example
  Consider the closing price on all trading days for the last 5 years for two stocks A and B
  What is the covariance between the two time series?
    (1/N) * sum_i (A_i - mean(A)) * (B_i - mean(B))
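The covariance formula on this slide can be sketched directly in a few lines. This is a minimal illustration, not code from the talk: the function name `covariance` and the price values are made up for the example, and it uses the population form (divide by N) to match the slide.

```python
from statistics import fmean


def covariance(a, b):
    """Population covariance: (1/N) * sum((a_i - mean(a)) * (b_i - mean(b)))."""
    if len(a) != len(b) or not a:
        raise ValueError("series must be non-empty and of equal length")
    mean_a, mean_b = fmean(a), fmean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


# Hypothetical closing prices for stocks A and B (illustrative values only).
prices_a = [10.0, 11.0, 12.0, 13.0]
prices_b = [20.0, 22.0, 24.0, 26.0]
cov_ab = covariance(prices_a, prices_b)
```

For two series this is trivial in any tool; the point of the following slides is what happens when the same computation is needed for every pair of thousands of series.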

Now Make It Interesting
  Do this for all pairs of 4000 stocks
  The data is the following 4000 x 1000 matrix:

    Stock   t1   t2   t3   t4   t5   t6   t7   ...   t1000
    S1
    S2
    ...
    S4000

  Hourly data? All securities?

Array Answer
  Ignoring the (1/N) and subtracting off the means:
    Stock * Stock^T
  Try this in SQL with some relational simulation of the stock array!
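As a rough sketch of the array answer, the all-pairs covariance is the mean-centered stock matrix multiplied by its own transpose, then scaled by 1/N. The helper names and the tiny 3 x 4 matrix below are assumptions for illustration; a real system would use an optimized linear algebra library rather than pure Python loops.

```python
def center_rows(matrix):
    """Subtract each row's mean from its entries."""
    centered = []
    for row in matrix:
        mean = sum(row) / len(row)
        centered.append([value - mean for value in row])
    return centered


def times_transpose(matrix):
    """Return matrix * matrix^T; entry (i, j) is the dot product of rows i and j."""
    return [[sum(x * y for x, y in zip(row_i, row_j)) for row_j in matrix]
            for row_i in matrix]


def covariance_matrix(stock):
    """All-pairs population covariance of the rows of `stock`."""
    n_times = len(stock[0])
    product = times_transpose(center_rows(stock))
    return [[entry / n_times for entry in row] for row in product]


# Hypothetical 3 stocks x 4 time points (the slide's matrix is 4000 x 1000).
stock = [
    [10.0, 11.0, 12.0, 13.0],
    [20.0, 22.0, 24.0, 26.0],
    [5.0, 4.0, 3.0, 2.0],
]
cov = covariance_matrix(stock)
```

Entry `cov[i][j]` is the covariance between stock i and stock j, so the whole problem reduces to one matrix multiply, one of the common inner loops listed earlier.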

Solution Options: R, SAS, Matlab, et al.
  Weak or non-existent data management
    Do the correlation only for companies with revenue > $1B?
  File system storage
  R doesn't scale and is not a parallel system
    Revolution does a bit better

Solution Options: RDBMS alone
  SQL simulator (MADlib) is slooooow
    And only does some of the required operations
  Coding operations as UDFs still requires you to simulate arrays on top of tables, which is also sloooow
  And the current UDF model is not powerful enough to support iteration

Solution Options: R + RDBMS
  Have to extract and transform the data from RDBMS tables to the math package's data format (e.g. data frames)
    A "move the world" nightmare
  Need to learn 2 systems
  And R still doesn't scale and is not a parallel system

Solution Options: New Array DBMS
  Designed with this market in mind

Array Databases Beat Relational Database Tables on Storage Efficiency & Array Computations
  Math functions run directly on native storage format
  Dramatic storage efficiencies as # of dimensions & attributes grows
  High performance on both sparse and dense data
  [Figure: two storage layouts, labeled "16 cells" and "48 cells"]
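One plausible reading of the "16 cells" and "48 cells" labels, which the transcript does not confirm, is a dense 4 x 4 array stored natively versus the same data simulated as a relational table of (row index, column index, value) tuples. The sketch below only counts stored cells under that assumption; both function names are hypothetical.

```python
def array_cells(shape):
    """Cells needed to store a dense array natively: one value per position."""
    count = 1
    for extent in shape:
        count *= extent
    return count


def table_cells(shape, n_attributes=1):
    """Cells for a relational simulation of the same array: one row per
    position, with one column per dimension index plus one per attribute."""
    return array_cells(shape) * (len(shape) + n_attributes)


dense_2d = array_cells((4, 4))
table_2d = table_cells((4, 4))
dense_3d = array_cells((4, 4, 4))
table_3d = table_cells((4, 4, 4))
```

Under this reading the table needs 3 times the cells in two dimensions and 4 times in three, because every added dimension adds an index column to every row.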

An Example Array Engine DB: Paradigm4/SciDB
  All-in-one: data management with massively scalable advanced analytics
  Data is updated, not overwritten
    Supports reproducibility for research and compliance
    Time-series data
    Scenario testing
  Supports uncertain data, provenance
  Open source
  Runs in cloud or private grid of commodity HW

Solution Options: Hadoop
  Simple analytics (Hive queries)
    100 times slower than a parallel DBMS
  Complex analytics (Mahout or roll-your-own)
    100 times slower than ScaLAPACK
  Parallel programming
    Parallel grep (great)
    Everything else (awful)
  Hadoop lacks
    Stateful computations
    Point-to-point communication

Solution Options: Hadoop
  Lots and lots and lots of people are piloting Hadoop
  Many will hit a scalability wall when they get to production
    Unless they are doing parallel grep
  My prediction: the bloom will come off the rose

Science Data
  Lots of arrays
    With complex UDFs
  Some graphs
    Model as sparse arrays to get a common environment
  RDBMSs suck on both
  Under pain of death, do not use the file system!
    Metadata
    Provenance
    Sharing

Big Velocity
  Trading volumes going through the roof
    Breaking all your infrastructure
    And it will just get worse

Big Velocity
  Sensor tagging everything of value sends velocity through the roof
    E.g. car insurance
  Smart phones as a mobile platform send velocity through the roof
  State of multi-player internet games must be recorded, which sends velocity through the roof

Two Different Solutions
  Big pattern - little state (electronic trading)
    Find me a strawberry followed within 100 msec by a banana
  Complex event processing (CEP) is focused on this problem
    Patterns in a firehose
  P.S. I started StreamBase but I have no current relationship with the company
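The "strawberry followed within 100 msec by a banana" query can be sketched as a single pass over a time-ordered stream. This is a toy illustration of the big pattern - little state idea, not how StreamBase or any CEP engine is implemented; the function name and the sample stream are assumptions.

```python
def find_followed_by(events, first, second, window_ms):
    """Yield (t_first, t_second) for each `second` event that arrives within
    `window_ms` after the most recent `first` event.

    `events` is an iterable of (timestamp_ms, symbol) pairs in time order.
    The only state kept is the timestamp of the latest `first` event.
    """
    last_first = None
    for timestamp, symbol in events:
        if symbol == first:
            last_first = timestamp
        elif symbol == second and last_first is not None:
            if timestamp - last_first <= window_ms:
                yield (last_first, timestamp)


# Hypothetical event stream: (timestamp in ms, symbol).
stream = [
    (0, "strawberry"),
    (50, "banana"),       # 50 ms after a strawberry: match
    (400, "strawberry"),
    (600, "banana"),      # 200 ms after the last strawberry: no match
    (650, "apple"),
]
matches = list(find_followed_by(stream, "strawberry", "banana", window_ms=100))
```

The matcher holds one timestamp no matter how long the stream runs, which is what "little state" means here: the work is in the pattern, not in the data retained.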

Two Different Solutions
  Big state - little pattern
    For every security, assemble my real-time global position
    And alert me if my exposure is greater than X
  Looks like high-performance OLTP
    Want to update a database at very high speed
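By contrast, the big state - little pattern case is mostly bookkeeping: keep a position per security, update it on every trade, and check one simple condition. The sketch below is a single-process toy under assumed names (`PositionTracker`, `apply_trade`) and an assumed definition of exposure as absolute position times latest price; a production system would be a durable, high-throughput database.

```python
from collections import defaultdict


class PositionTracker:
    """Keeps a net position per security and flags exposure above a limit."""

    def __init__(self, exposure_limit):
        self.exposure_limit = exposure_limit
        self.positions = defaultdict(float)  # security -> net quantity held

    def apply_trade(self, security, quantity, price):
        """Update the position for one trade; return an alert string or None."""
        self.positions[security] += quantity
        # Assumption: exposure = |net position| * latest trade price.
        exposure = abs(self.positions[security]) * price
        if exposure > self.exposure_limit:
            return f"ALERT {security}: exposure {exposure:.2f} exceeds limit"
        return None


tracker = PositionTracker(exposure_limit=10_000.0)
alerts = [
    tracker.apply_trade("IBM", 50, 100.0),    # net 50, exposure 5,000
    tracker.apply_trade("IBM", 80, 100.0),    # net 130, exposure 13,000
    tracker.apply_trade("IBM", -100, 100.0),  # net 30, exposure 3,000
]
```

The pattern is a single comparison; the hard part at scale is the state, since every trade is an update to a database that must keep up with the feed.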

My Suspicion
  There are 3-4 big state - little pattern problems for every one big pattern - little state problem

Solution Choices for New OLTP
  Old SQL
    The elephants
  No SQL
    75 or so vendors giving up both SQL and ACID
  New SQL
    Retain SQL and ACID but go fast with a new architecture

Why Not Use Old SQL?
  Sloooow
    By a couple of orders of magnitude
  Because of
    Disk
    Heavy-weight transactions
    Multi-threading
  See "Through the OLTP Looking Glass", VLDB 2007

No SQL
  Give up SQL
    Interesting to note that Cassandra and Mongo are moving to (yup) SQL
  Give up ACID
    If you need ACID, this is a decision to tear your hair out by doing it in user code
    Can you guarantee you won't need ACID tomorrow?

VoltDB: An Example of New SQL
  A main-memory SQL engine
  Open source
  Shared nothing, Linux, TCP/IP on jelly beans
  Light-weight transactions
    Run-to-completion with no locking
  Single-threaded
    Multi-core by splitting main memory
  About 100x RDBMS on TPC-C
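The combination of "run-to-completion with no locking" and "multi-core by splitting main memory" can be illustrated with a toy partitioned store. This is not VoltDB's API or implementation; `Partition`, `PartitionedStore`, and `deposit` are hypothetical names, and the partitions here are driven sequentially by the caller, whereas a real engine would give each partition its own core.

```python
class Partition:
    """One slice of main memory, owned by exactly one worker."""

    def __init__(self):
        self.rows = {}

    def run(self, transaction):
        # Run-to-completion: each transaction finishes before the next one
        # starts, so no locks or latches are needed inside a partition.
        return transaction(self.rows)


class PartitionedStore:
    """Routes each single-key transaction to the partition that owns the key."""

    def __init__(self, n_partitions):
        self.partitions = [Partition() for _ in range(n_partitions)]

    def partition_for(self, key):
        return self.partitions[hash(key) % len(self.partitions)]

    def execute(self, key, transaction):
        return self.partition_for(key).run(transaction)


def deposit(account, amount):
    """Build a transaction that adds `amount` to `account`'s balance."""
    def transaction(rows):
        rows[account] = rows.get(account, 0) + amount
        return rows[account]
    return transaction


store = PartitionedStore(n_partitions=4)
store.execute("alice", deposit("alice", 100))
balance = store.execute("alice", deposit("alice", 25))
```

Because a key always maps to the same partition and each partition runs one transaction at a time, single-key transactions need no concurrency control. Transactions that span partitions need extra coordination, which this sketch leaves out.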

Big Velocity
  CEP
  Next-generation OLTP
    So-called New SQL
  Very different architecture than the elephants
  One size does not fit all

Big Variety
  Typical enterprise has 5000 operational systems
    Only a few get into the data warehouse
    What about the rest?
  And what about all the rest of your data?
    Spreadsheets
    Access databases
    Web pages
  And public data from the web?

The World of Data Integration
  [Figure: regions labeled "the rest of your data", "enterprise data warehouse", and "text"]

Summary
  The rest of your data (public and private)
    Is a treasure trove of incredibly valuable information
    Largely untapped

Data Tamer
  Integrate the rest of your data
  Has to
    Be scalable to 1000s of sites
    Deal with incomplete, conflicting, and incorrect data
    Be incremental
      Task is never done

Data Tamer in a Nutshell
  Apply machine learning and statistics to perform automatic:
    Discovery of structure
    Entity resolution
    Transformation
  With a human assist if necessary
    Crowd sourcing
    WYSIWYG tool (Wrangler)
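To make "entity resolution" concrete, here is a deliberately simple sketch that groups near-duplicate names by string similarity. It is not Data Tamer's method, which the slide describes only as machine learning and statistics; the greedy clustering, the 0.8 threshold, and the sample records are all assumptions for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1] after light normalization."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def resolve_entities(names, threshold=0.8):
    """Greedy clustering: put each name in the first cluster whose
    representative (its first member) is similar enough, else start a new one."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Hypothetical records drawn from different source systems.
records = ["Acme Corp", "ACME Corp.", "Globex Inc", "acme corp", "Globex Inc."]
clusters = resolve_entities(records)
```

In a setup like the one the slide describes, pairs whose score falls near the threshold are natural candidates for the human assist, while clear matches and clear non-matches are handled automatically.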

Data Tamer
  MIT research project
  Looking for more integration problems
    Wanna partner?

Take Away
  One size does not fit all
  Plan on (say) 6 DBMS architectures
    Use the right tool for the job
  Elephants are not competitive
    At anything
    Have a bad innovator's dilemma problem