Big Data and Big Analytics




Introducing SciDB
  Open source, massively parallel DBMS and analytic platform
  Array data model (rather than SQL, unstructured, XML, or triple-store)
  Extensible micro-kernel architecture
  Reliable storage platform
Analytic Example
  Problem description: Big Data, Big Analytics
  Compare and contrast SciDB with SQL and Map/Reduce
  Data, queries and scalability graphs
  What does SciDB look like from a user's perspective?
  Cloud deployment (EC2) on up to 64 nodes

Introducing SciDB
  Open source
    Released under GPL-3
    http://www.scidb.org/
    Sponsored by Paradigm4 Inc.
    Contributors from the US, Russia, India, and Europe
  Array data model

Why Arrays?
  Formal mathematical foundations
    Closed algebra of Array -> Array operations
    Formal proofs of correctness for re-writes
    Mathematical model has broad applications
  Computer science pedigree
    APL (A Programming Language)
    Lots of work on algorithms to compute Array -> Array ops
    Lots of work on efficient storage and data manipulation
  Ongoing area of research and development
    BLAS, LAPACK, ScaLAPACK: open-source tools
    NAG (Numerical Algorithms Group): proprietary platforms
    R, SAS, etc. each have an efficient linear algebra library inside

Introducing SciDB: data definition

    CREATE ARRAY Simple_Array <
        v1 : double,
        v2 : int64,
        v3 : string >
    [ I = 0:*, 5, 0, J = 0:9, 5, 0 ];

  Attributes: v1, v2, v3
  Dimensions: I, J
  Dimension size: * indicates unbounded
  Chunk size: the second value of each dimension (5 for both I and J)
  Chunk overlap: the third value of each dimension (0 for both I and J)
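To make the chunking parameters concrete, here is a minimal Python sketch (not SciDB code) that maps a cell's coordinates to the chunk that would hold it. It assumes chunks are aligned to each dimension's lower bound; the names Dimension, chunk_of and SIMPLE_ARRAY_DIMS are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Dimension:
    """One dimension of an array schema: name = low:high, chunk_size, overlap."""
    name: str
    low: int
    high: Optional[int]  # None stands for '*', i.e. unbounded
    chunk_size: int
    overlap: int


def chunk_of(dims: Tuple[Dimension, ...], coords: Tuple[int, ...]) -> Tuple[int, ...]:
    """Return the index of the chunk that owns the cell at `coords`.

    Assumption (illustrative only): chunks start at the dimension's lower
    bound and are `chunk_size` cells long; overlap is ignored here.
    """
    index = []
    for dim, c in zip(dims, coords):
        if c < dim.low or (dim.high is not None and c > dim.high):
            raise ValueError(f"{dim.name}={c} is outside {dim.low}:{dim.high}")
        index.append((c - dim.low) // dim.chunk_size)
    return tuple(index)


# Mirrors the slide: [ I = 0:*, 5, 0, J = 0:9, 5, 0 ]
SIMPLE_ARRAY_DIMS = (
    Dimension("I", 0, None, 5, 0),
    Dimension("J", 0, 9, 5, 0),
)
```

Under these assumptions, chunk_of(SIMPLE_ARRAY_DIMS, (7, 9)) yields (1, 1): both coordinates fall in the second chunk along their dimension.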

Two Programming Interfaces

  Array Functional Language (AFL)

    aggregate(
        filter ( Simple_Array,
                 v3 = 'Odd' ),
        I,
        avg ( Simple_Array.v1 )
    );

  Array Query Language (AQL)

    SELECT avg ( S.v1 )
    FROM Simple_Array S
    WHERE S.v3 = 'Odd'
    GROUP BY S.I;
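Both queries above express the same computation: keep the cells whose v3 is 'Odd', then average v1 within each value of dimension I. A rough Python equivalent over an in-memory dictionary of cells may help readers unfamiliar with either language. It is a sketch of the semantics only; the function name avg_v1_by_i and the sample data are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Tuple

Cell = Tuple[float, int, str]  # (v1, v2, v3), as in Simple_Array


def avg_v1_by_i(cells: Dict[Tuple[int, int], Cell], v3_value: str) -> Dict[int, float]:
    """Filter cells on v3, then average v1 within each value of dimension I."""
    sums: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for (i, _j), (v1, _v2, v3) in cells.items():
        if v3 == v3_value:          # the filter / WHERE step
            sums[i] += v1           # the aggregate / GROUP BY I step
            counts[i] += 1
    return {i: sums[i] / counts[i] for i in sums}


# Hypothetical sample data keyed by (I, J); not taken from the slides.
cells = {
    (0, 0): (1.0, 10, "Odd"),
    (0, 1): (3.0, 11, "Odd"),
    (0, 2): (100.0, 12, "Even"),
    (1, 0): (5.0, 13, "Even"),
    (1, 1): (7.0, 14, "Odd"),
}
result = avg_v1_by_i(cells, "Odd")
```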

SciDB: Analytic Features
  Versioned (no-overwrite) storage system
    UPDATE operations are append only
    The state of the system can be reconstructed at a specified time
  Provenance
    Log of all queries to reconstruct how results were derived
  Uncertainty and statistical reasoning
    Types with error bars
    Functions to perform statistical tests and analytics
  Goal is to meet the requirements for scientific data management identified in
  Stonebraker, M., et al., Requirements for Science Data Bases and SciDB, CIDR 2009
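The no-overwrite idea can be sketched in a few lines of Python: every update is appended to a log, and the state at any earlier time is recovered by replaying the log up to that time. This is a toy model under that assumption, not a description of SciDB's storage engine; the class VersionedStore and its methods are hypothetical.

```python
from typing import Any, Dict, Hashable, List, Tuple


class VersionedStore:
    """Toy append-only store: updates add a version and never overwrite."""

    def __init__(self) -> None:
        # Each entry is (timestamp, key, value); entries are never modified.
        self._log: List[Tuple[int, Hashable, Any]] = []

    def update(self, timestamp: int, key: Hashable, value: Any) -> None:
        """Append a new version of `key`; timestamps must not go backwards."""
        if self._log and timestamp < self._log[-1][0]:
            raise ValueError("timestamps must be non-decreasing")
        self._log.append((timestamp, key, value))

    def state_at(self, timestamp: int) -> Dict[Hashable, Any]:
        """Reconstruct the state by replaying the log up to `timestamp`."""
        state: Dict[Hashable, Any] = {}
        for ts, key, value in self._log:
            if ts > timestamp:
                break
            state[key] = value
        return state
```

A real system would index versions rather than replay the whole log; the replay loop is kept here because it shows most directly why append-only updates make time travel possible.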

SciDB vs. RDBMSs on storage efficiency & complex computations

  Relational database (48 cells)

    I  J  value
    0  0  32.5
    1  0  90.9
    2  0  42.1
    3  0  96.7
    0  1  46.3
    1  1  35.4
    2  1  35.7
    3  1  41.3
    0  2  81.7
    1  2  35.9
    2  2  35.3
    3  2  89.9
    0  3  53.6
    1  3  86.3
    2  3  45.9
    3  3  27.6

  SciDB (16 cells)

    32.5  46.3  81.7  53.6
    90.9  35.4  35.9  86.3
    42.1  35.7  35.3  45.9
    96.7  41.3  89.9  27.6

  Speeds up data access in a distributed database
  Dramatic storage efficiencies as the number of dimensions & attributes grows
  Facilitates drill-down & clustering by like groups
  Math functions like linear algebra run directly on the native storage format
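The 48-versus-16 cell count can be checked directly. The Python sketch below loads the (I, J, value) rows from the slide, places them into a dense 4 x 4 grid indexed by position, and counts the stored values in each representation. The helper name to_dense is hypothetical.

```python
from typing import List, Optional, Tuple

# (I, J, value) rows, copied from the relational table on the slide.
rows: List[Tuple[int, int, float]] = [
    (0, 0, 32.5), (1, 0, 90.9), (2, 0, 42.1), (3, 0, 96.7),
    (0, 1, 46.3), (1, 1, 35.4), (2, 1, 35.7), (3, 1, 41.3),
    (0, 2, 81.7), (1, 2, 35.9), (2, 2, 35.3), (3, 2, 89.9),
    (0, 3, 53.6), (1, 3, 86.3), (2, 3, 45.9), (3, 3, 27.6),
]


def to_dense(rows: List[Tuple[int, int, float]], n_i: int, n_j: int) -> List[List[Optional[float]]]:
    """Place each value at position [I][J]; the coordinates become implicit."""
    dense: List[List[Optional[float]]] = [[None] * n_j for _ in range(n_i)]
    for i, j, value in rows:
        dense[i][j] = value
    return dense


dense = to_dense(rows, 4, 4)

# The relational form stores the coordinates explicitly: 16 rows x 3 columns.
relational_cells = sum(len(row) for row in rows)
# The array form stores only the values: 4 x 4.
array_cells = sum(len(row) for row in dense)
```

Each extra dimension adds another explicit column to every relational row, while the array form only gains another implicit axis, which is the growth effect the slide points to.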

SQL vs. SciDB: Data Definition

  SQL

    CREATE TABLE USNOB (
        Locn  point not null,              -- RA double, DECL double
        B_mag double precision not null,
        R_mag double precision not null
    );

  SciDB

    CREATE ARRAY USNOB <
        B_Mag : double,
        R_Mag : double >
    [ RA(double)=*,72000,720, DECL(double)=*,36000,360 ];

SQL vs. SciDB: Data Manipulation

  SQL

    SELECT U1.Locn, COUNT(*)
    FROM USNOB AS U1, USNOB AS U2
    WHERE box(U2.Locn, U2.Locn) &&        -- U2's position as a degenerate box
          box(point(U1.Locn[0] - 0.001,
                    U1.Locn[1] + 0.001),
              point(U1.Locn[0] + 0.001,
                    U1.Locn[1] - 0.001))
    GROUP BY U1.Locn;

  SciDB

    window (
        USNOB,
        0.001, 0.001,
        count(*)
    );
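The intent of the neighbor-count query on this slide is: for every object, count the objects that fall inside a small box centered on it. A brute-force Python sketch of that intent follows. It is written for clarity rather than speed, the function name count_neighbors is hypothetical, and it assumes the box reaches 0.001 in each direction along both RA and DECL.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (RA, DECL)


def count_neighbors(points: List[Point], half_width: float = 0.001) -> List[int]:
    """For each point, count the points (itself included) that lie inside
    the axis-aligned box of +/- half_width around it.

    Illustrative O(n^2) scan; not how a database would evaluate a window.
    """
    counts = []
    for ra1, decl1 in points:
        n = sum(
            1
            for ra2, decl2 in points
            if abs(ra2 - ra1) <= half_width and abs(decl2 - decl1) <= half_width
        )
        counts.append(n)
    return counts
```

The quadratic scan mirrors the self-join in the SQL version. An array engine that keeps neighboring cells physically close can answer the same question by looking only at nearby chunks.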

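SciDB Storage: Arrays and Chunks
  [Figure: grid with axis I = 0..9 and axis J = 0..6, divided into chunks for the dimensions [ I=0:9,5,2, J=0:6,2,1 ]: chunk size 5 with overlap 2 along I, chunk size 2 with overlap 1 along J]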
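Since the figure itself is not reproduced here, the following Python sketch computes which cells each chunk would cover for the dimension specification [ I=0:9,5,2, J=0:6,2,1 ]. It assumes that overlap extends a chunk by the given number of cells on both sides, clipped to the array bounds; the function chunk_ranges is hypothetical and is not part of SciDB.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # inclusive (first, last)


def chunk_ranges(low: int, high: int, chunk_size: int, overlap: int) -> List[Tuple[Range, Range]]:
    """Return (core, stored) inclusive ranges for every chunk along one dimension.

    core   = cells the chunk owns
    stored = core plus `overlap` extra cells on each side, clipped to low..high
             (assumed overlap semantics, for illustration only)
    """
    ranges = []
    start = low
    while start <= high:
        end = min(start + chunk_size - 1, high)
        stored = (max(start - overlap, low), min(end + overlap, high))
        ranges.append(((start, end), stored))
        start += chunk_size
    return ranges


i_chunks = chunk_ranges(0, 9, 5, 2)  # I = 0:9, chunk size 5, overlap 2
j_chunks = chunk_ranges(0, 6, 2, 1)  # J = 0:6, chunk size 2, overlap 1
```

Under these assumptions the array splits into 2 chunks along I and 4 along J. The overlap cells are stored redundantly so that window-style operations near a chunk border can run without fetching the neighboring chunk.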


Shared Nothing MPP Architecture
  [Diagram: a SciDB client (iquery, Python) in front of a cluster of SciDB nodes]
  SciDB coordinator node: SciDB engine, local store, and PostgreSQL persistent system catalog service
  SciDB worker nodes (four shown): each a SciDB node with its own SciDB engine and local store
  Links shown: PostgreSQL connections and SciDB inter-node communication

SciDB is not Map/Reduce
  Builds on the best of Map/Reduce's many good ideas
    Job scheduling over large numbers of nodes
    Uses k-copy replication of chunks to implement reliable storage
  SciDB's specialized storage model conveys advantages
    Block distribution and inter-node operations
    Local parallelism for individual block-to-block operations

SciDB is not Map/Reduce
  Usage model supports ad hoc queries
    Use case lessons: analysts like R and dislike programmers
    Interactive, conversational, exploratory model
  Linear algebra: hard to implement as a Map and a Reduce
    Linear algebraic operations are the foundations of statistical analysis
    Linear algebra operators are not embarrassingly parallelizable
    multiply ( transpose ( A ), A ) is an O ( N ^ 3 ) algorithm
    inverse ( A ) is best implemented as an iterative algorithm
    Cumulative sum, multi-dimensional WINDOW
  Deeply integrated data management & analytics has many advantages
    Data is updatable
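The O(N^3) remark can be illustrated with the textbook algorithm. The Python sketch below computes transpose(A) times A with three nested loops and counts the scalar multiplications; the function names are hypothetical and the code stands in for no particular library.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def transpose(a: Matrix) -> Matrix:
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]


def multiply_counting(a: Matrix, b: Matrix) -> Tuple[Matrix, int]:
    """Textbook matrix product; also returns the number of scalar multiplies."""
    n, k, m = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    out = [[0.0] * m for _ in range(n)]
    ops = 0
    for i in range(n):
        for j in range(m):
            for p in range(k):  # each output cell needs a full row and column
                out[i][j] += a[i][p] * b[p][j]
                ops += 1
    return out, ops


A = [[1.0, 2.0], [3.0, 4.0]]
gram, ops = multiply_counting(transpose(A), A)
```

For an N x N input the counter reaches N^3. Each output cell depends on an entire row of one operand and an entire column of the other, so the work does not split into independent per-record tasks, which is the sense in which the slide calls these operators not embarrassingly parallelizable.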

Brief Performance Study
  Three queries in our AFL query language

    multiply ( transpose ( Simple_Array ), Simple_Array );

    regrid ( Simple_Array, 10, 10, avg (v2) );

    cumsum (
        filter ( Simple_Array, v3 = 'Odd' ),
        I, v1 );

  Amazon EC2 cluster of 8 through 64 nodes
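For readers unfamiliar with the regrid and cumsum operators, the Python sketch below shows one plausible reading of what they compute on a small in-memory grid. It assumes regrid averages non-overlapping tiles of the given size and that the cumulative sum runs along dimension I; the function names are hypothetical and the code is not SciDB's implementation.

```python
from typing import List

Grid = List[List[float]]  # grid[i][j], with I as rows and J as columns


def regrid_avg(grid: Grid, block_i: int, block_j: int) -> Grid:
    """Average each block_i x block_j tile of `grid` into one output cell."""
    n_i, n_j = len(grid), len(grid[0])
    out = []
    for i0 in range(0, n_i, block_i):
        row = []
        for j0 in range(0, n_j, block_j):
            tile = [
                grid[i][j]
                for i in range(i0, min(i0 + block_i, n_i))
                for j in range(j0, min(j0 + block_j, n_j))
            ]
            row.append(sum(tile) / len(tile))
        out.append(row)
    return out


def cumsum_along_i(grid: Grid) -> Grid:
    """Running total along dimension I, kept separately for each J."""
    out = []
    running = [0.0] * len(grid[0])
    for row in grid:
        running = [r + v for r, v in zip(running, row)]
        out.append(list(running))
    return out
```

The two operators stress a system differently: regrid only needs the cells of one tile at a time, while a cumulative sum carries state across the whole dimension and therefore across chunk and node boundaries.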

Mathematical Operations

The "It's Beer-o-Clock" Slide
  SciDB
    Open-source, from-the-ground-up array DBMS and analytic platform
    http://www.scidb.org
  Paradigm4
    Contributed all the open-source development to date
    Will distribute and support enterprise-quality software along with P4 add-ons
    http://www.paradigm4.com
  Why SciDB?
    Query-centric usage model suits scientific / analytic users
    Arrays support linear algebra and statistical processing
    Shared-nothing architecture for scalability
    Building on the shoulders of giants: Map/Reduce, RDBMSs, R
    Scalable data manipulation and linear algebra