MAD Skills: New Analysis Practices for Big Data





MAD Skills: New Analysis Practices for Big Data
Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, and Caleb Welton
VLDB 2009
Presented by: Kristian Torp

Overview
  - Enterprise Data Warehouse (EDW) vs. MAD
  - Why MAD now
  - MAD database design
  - Overview of the stack of statistical functions
  - MAD DBMS
  - Conclusion: comparison of EDW vs. MAD
  - Critique
Database Specialization Course 2010

Data Warehouse Architecture
[Figure: existing databases and systems (OLTP; several application DBs) feed, through a transformation step, a global data warehouse (DW). The new databases and systems (OLAP) are data marts (DM) built from the warehouse and used for OLAP, data mining, and visualization.]
Thanks to TBP for the figure

MAD Architecture
[Figure labels: db1, db2, db3, File 1, integrator, Analysis, me]
Model less, integrate more

MAD Acronym
  - Magnetic
      - Sucks data in (not always carefully cleaned)
      - Multiple formats
  - Agile
      - Mock-up based
      - Rapid evolution
      - Shoot-and-forget
  - Deep
      - Advanced statistical methods

Why MAD now?
  - Storage is cheap
      - Terabytes for a few hundred bucks
      - Cannot be found in the budget
  - Many new data sources
      - Click-streams, emails, discussion forums, etc.
  - Many understand the value of data analysis
      - Previously mostly for top-level management
  - Copy-out-and-use scenario
      - Not as efficient as putting the query to the data
      - Typically fits into main memory
      - Security (Excel hell)

BI Query
  Fairly simple statistics
    1. What is the sale of milk in Aalborg vs. Copenhagen compared to last year?
    2. What is the average drive time on Boulevarden, weekdays between 7.00-7.15 in the north direction on non-rain days, in the summer half-year?
  Multi-dimensional statistical analysis
    1. How many female WWF enthusiasts under the age of 30 visited the Toyota community over the last four days and saw a medium rectangle?
    2. How are the people similar to those that visited Nissan?

MAD Database Design
  - Agility to the developer
      - Not necessarily fully integrated (against the EDW idea)
  - Analysts are an early warning system
      - Dirty data
      - New interesting data (and non-interesting data)
      - Have a deeper understanding than business EDW users
[Figure labels: Analyst, New insight, Developer, New data]

MAD Database Design, cont.
  - Staging schema layer
      - Data: Raw data
      - Users: Engineers and some analysts
  - Production data warehouse layer
      - Data: Aggregated, semi-cleaned, integrated data
      - Users: Analysts and sophisticated users
  - Reporting schema layer
      - Data: Aggregated, cleaned, integrated data
      - Users: Reporting tools and casual users
  - Sandbox layer
      - Data: Whatever (avoid Excel copies)
      - Users: Analysts
  - Not a strictly layered architecture
      - Cross-layer joins possible for some users
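
A cross-layer join can be sketched as follows. This is only an illustration, not the setup from the paper: SQLite attached databases stand in for the staging and sandbox schemas, and the table names (raw_clicks, segments) and all rows are hypothetical.

```python
import sqlite3

# Assumption: attached in-memory SQLite databases stand in for schema layers.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS staging")
conn.execute("ATTACH DATABASE ':memory:' AS sandbox")

# Staging layer: raw, uncleaned data (hypothetical table and rows).
conn.execute("CREATE TABLE staging.raw_clicks (user_id INTEGER, page TEXT)")
conn.executemany(
    "INSERT INTO staging.raw_clicks VALUES (?, ?)",
    [(1, "home"), (1, "cars"), (2, "home")],
)

# Sandbox layer: an analyst's private working table (hypothetical).
conn.execute("CREATE TABLE sandbox.segments (user_id INTEGER, segment TEXT)")
conn.executemany(
    "INSERT INTO sandbox.segments VALUES (?, ?)",
    [(1, "enthusiast"), (2, "casual")],
)

# Cross-layer join: raw staging data combined with the analyst's sandbox table.
clicks_per_segment = conn.execute(
    """
    SELECT s.segment, COUNT(*) AS clicks
    FROM staging.raw_clicks AS c
    JOIN sandbox.segments AS s ON s.user_id = c.user_id
    GROUP BY s.segment
    ORDER BY s.segment
    """
).fetchall()
print(clicks_per_segment)
```

The analyst keeps the working table inside the database instead of copying data out to a spreadsheet, which is the point of the sandbox layer.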

Statistics
  - General approach: mathematical concepts in SQL
      - Via extensible DBMS technology
  - Vector arithmetic and higher levels
      - Not supported in relational DBMSs
      - Implemented as stored procedures/new operators
[Figure: stack ordered by level of abstraction, with layers labelled SQL functions, vector arithmetic, linear algebra, and probability density functions, on a scale from existing to new]
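
As a minimal sketch of what "mathematical concepts in SQL" can look like, the example below computes a dot product with a plain join and SUM. It is not the paper's implementation: the (name, idx, val) table layout, the helper names load_vector and dot_product, and the use of SQLite from Python are all assumptions made for illustration.

```python
import sqlite3

def load_vector(conn, name, values):
    """Store a vector as one row per component: (name, idx, val)."""
    conn.executemany(
        "INSERT INTO vec (name, idx, val) VALUES (?, ?, ?)",
        [(name, i, v) for i, v in enumerate(values)],
    )

def dot_product(conn, a, b):
    """Dot product of two stored vectors, computed entirely in SQL."""
    row = conn.execute(
        """
        SELECT COALESCE(SUM(x.val * y.val), 0.0)
        FROM vec AS x
        JOIN vec AS y ON x.idx = y.idx
        WHERE x.name = ? AND y.name = ?
        """,
        (a, b),
    ).fetchone()
    return row[0]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vec (name TEXT, idx INTEGER, val REAL, PRIMARY KEY (name, idx))"
)
load_vector(conn, "u", [1.0, 2.0, 3.0])  # hypothetical data
load_vector(conn, "v", [4.0, 5.0, 6.0])  # hypothetical data
result = dot_product(conn, "u", "v")
print(result)
```

Higher levels of the stack (linear algebra, density functions) would build on primitives of this kind, typically as stored procedures or new operators rather than ad-hoc queries.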

MAD DBMS
  - Getting data in and out (loading/unloading)
      - ETL
          - Bulk load a necessity (core and basic functionality)
      - External tables
          - Under OS control and not DBMS control
          - Simple wrapper of, for example, a CSV file
          - Problem is query optimization
      - Parallel access to all data
          - Must be fast (called ELT instead)
      - Fast prototyping with LIMIT clause
  - Storage and partitioning
      - Partitioning for speed-up (standard technique)
      - Storage hierarchies
          - Often-used data on SSD disk drives/RAM drives
          - Less-used data on SATA disks
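
The external-table idea combined with LIMIT-style prototyping can be sketched in a few lines. This is an analogy in plain Python, not how a DBMS implements it: scan_external is a hypothetical helper, and the file name and rows are made up.

```python
import csv
import itertools
import os
import tempfile

def scan_external(path, limit=None):
    """Lazily yield rows of a CSV file as dicts; stop after `limit` rows if given.

    The file stays under OS control; nothing is loaded up front.
    """
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        yield from itertools.islice(reader, limit)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical click-stream file living outside any database.
    path = os.path.join(tmp, "clicks.csv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["user_id", "page"])
        writer.writerows([[1, "home"], [2, "cars"], [3, "home"]])

    sample = list(scan_external(path, limit=2))  # prototype on a small sample
    everything = list(scan_external(path))       # full scan once the logic works

print(sample)
```

Because the wrapper knows nothing about the file's contents beyond its header, there are no statistics to plan with, which is one way to read the slide's remark that query optimization is the problem.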

MAD DBMS, cont.
  - Storage engines
      - Heap
      - Append-only
      - Column-store
      - External tables
  - Programming model
      - Short iterations (agile)
          - Prototyping with small data sets
      - Many different programming languages
          - SQL, Java, Matlab, Perl, Python, R
          - Runs in the DBMS (in stored procedures)
      - Map-Reduce
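
For readers unfamiliar with the Map-Reduce programming model, here is a toy single-process sketch of its three phases. It is not Greenplum's interface; the function names and the word-count input are assumptions chosen for illustration.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record and stream out (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer once per key."""
    return {key: reducer(key, values) for key, values in groups.items()}

def count_words(line):
    # Mapper: emit (word, 1) for each word in a line.
    for word in line.lower().split():
        yield word, 1

def total(key, values):
    # Reducer: sum the counts collected for one word.
    return sum(values)

lines = ["mad skills", "MAD data", "big data"]  # hypothetical input
counts = reduce_phase(shuffle(map_phase(lines, count_words)), total)
print(counts)
```

In a parallel DBMS the map and reduce steps would run on many segments at once; the sketch only shows the data flow.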

Conclusion: EDW vs. MAD

  EDW                              MAD
  -------------------------------  -------------------------------
  One repository                   One repository
  Waterfall (slow)                 Agile (fast)
  Fixed                            Evolving
  Owner: Company                   Owner: Department/person
  Disciplined data integration     Ad-hoc data integration
  SQL                              SQL or MapReduce
  Basic agg. functions             Advanced agg. functions
  Expensive hardware               Whatever you can find
  Top-down (management)            Grass roots
  Click-click-click (Excel)        R, SAS, Python, Java, Matlab
  Expensive ETL                    Human dirty data
  Primary goal                     Secondary goal

Good
  - Nice case study
  - Okay Greenplum feature discussion (sec. 6.1, 6.2 and 6.3)
      - Not a big commercial for their system
  - Useful in practice
      - Good explanation of how it is used at Fox network
  - Nice to see Perl, Python, R used with PostgreSQL
      - Pushes the extensibility of a relational DBMS to the limit
  - Nice support for map-reduce and SQL in the same software stack
      - Pick the best tool for the job (what you have used the most)

Could be improved
  - MPI, SVM acronyms not introduced
  - Slang: feeding frenzies, vanilla SQL, MAD
  - Better comparison of EDW vs. MAD
  - Section 5: Data-parallel statistics quite hard to follow in several cases
      - All their figures are nice
  - Missing some kind of conclusion
  - Better description of how agile they are in the Fox case study
  - No performance graphs showing that the parallel functions scale
      - This is an unproven claim in the paper