INCREMENTAL, APPROXIMATE DATABASE QUERIES AND UNCERTAINTY FOR EXPLORATORY VISUALIZATION. Danyel Fisher Microso0 Research

Similar documents

An Open Dynamic Big Data Driven Applica3on System Toolkit

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Data Stream Algorithms in Storm and R. Radek Maciaszek

Graphing Parabolas With Microsoft Excel

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

INTRODUCING DRUID: FAST AD-HOC QUERIES ON BIG DATA MICHAEL DRISCOLL - CEO ERIC TSCHETTER - LEAD METAMARKETS

High Frequency Trading and NoSQL. Peter Lawrey CEO, Principal Consultant Higher Frequency Trading

Hadoop & SAS Data Loader for Hadoop

2015 The MathWorks, Inc. 1

<no narration for this slide>

Overview. Data Warehousing and Decision Support. Introduction. Three Complementary Trends. Data Warehousing. An Example: The Store (e.g.

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data

Four Orders of Magnitude: Running Large Scale Accumulo Clusters. Aaron Cordova Accumulo Summit, June 2014

Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools

Data Management in the Cloud: Limitations and Opportunities. Annies Ductan

LITERATURE SURVEY ON DATA WAREHOUSE AND ITS TECHNIQUES

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

5 Signs You Might Be Outgrowing Your MySQL Data Warehouse*

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

CS 5150 So(ware Engineering Project Management

Implementing Data Models and Reports with Microsoft SQL Server

Mark Bennett. Search and the Virtual Machine

UAB Cyber Security Ini1a1ve

Foundations of Business Intelligence: Databases and Information Management

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

CS 378 Big Data Programming. Lecture 5 Summariza9on Pa:erns

TIBCO Spotfire Business Author Essentials Quick Reference Guide. Table of contents:

Guideline for stresstest Page 1 of 6. Stress test

ROLAP with Column Store Index Deep Dive. Alexei Khalyako SQL CAT Program Manager

ANALYTICS CENTER LEARNING PROGRAM

Testing Big data is one of the biggest

Data Exploration Data Visualization

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

BI Requirements Checklist

Some Security Challenges of Cloud Compu6ng. Kui Ren Associate Professor Department of Computer Science and Engineering SUNY at Buffalo

Cloud Scale Distributed Data Storage. Jürmo Mehine

Secondary Storage. Any modern computer system will incorporate (at least) two levels of storage: magnetic disk/optical devices/tape systems

Pastel Evolution BIC. Getting Started Guide

How To Use Splunk For Android (Windows) With A Mobile App On A Microsoft Tablet (Windows 8) For Free (Windows 7) For A Limited Time (Windows 10) For $99.99) For Two Years (Windows 9

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Performance Benchmark for Cloud Block Storage

Creating Connection with Hive

The 3 questions to ask yourself about BIG DATA

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

CHARTrunner Data Management System

Using distributed technologies to analyze Big Data

Microsoft SQL Server is great for storing departmental or company data. It. A Quick Guide to Report Builder In association with

Using RDBMS, NoSQL or Hadoop?

Inge Os Sales Consulting Manager Oracle Norway

Report Builder. Microsoft SQL Server is great for storing departmental or company data. It is. A Quick Guide to. In association with

Chapter 6 8/12/2015. Foundations of Business Intelligence: Databases and Information Management. Problem:

How To Scale Out Of A Nosql Database

BIG DATA VISUALIZATION. Team Impossible Peter Vilim, Sruthi Mayuram Krithivasan, Matt Burrough, and Ismini Lourentzou

XpoLog Center Suite Log Management & Analysis platform

Understanding the Value of In-Memory in the IT Landscape

How to Ingest Data into Google BigQuery using Talend for Big Data. A Technical Solution Paper from Saama Technologies, Inc.

Business Intelligence in SharePoint 2013

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Week 13: Data Warehousing. Warehousing

Business Intelligence and Healthcare

PEPPERDATA IN MULTI-TENANT ENVIRONMENTS

Infrastructures for big data

Sourcing Strategy: Oracle ULA. ITALCONSOR Procurement Consultants

Hunk & Elas=c MapReduce: Big Data Analy=cs on AWS

Information Technology Services will be updating the mark sense test scoring hardware and software on Monday, May 18, We will continue to score

Bringing Big Data Modelling into the Hands of Domain Experts

Using In-Memory Data Fabric Architecture from SAP to Create Your Data Advantage

Data Visualization Techniques

Getting Started Guide

Finding Anomalies in Time- Series using Visual Correla/on for Interac/ve Root Cause Analysis

SELLING PROJECTS ON THE MICROSOFT BUSINESS ANALYTICS PLATFORM

Tableau Your Data! Wiley. with Tableau Software. the InterWorks Bl Team. Fast and Easy Visual Analysis. Daniel G. Murray and

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

Chapter 6. Foundations of Business Intelligence: Databases and Information Management

Session 3119 Mobilizing Your Dashboards Best Tips for the Mobile Market. Scott Leaver, SAP and Matt Lloyd, SAP

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

Data Mining, Predictive Analytics with Microsoft Analysis Services and Excel PowerPivot

COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3

Cloud Computing at Google. Architecture

Oracle Database In- Memory Op4on in Ac4on

Capacity Planning Process Estimating the load Initial configuration

Transcription:

INCREMENTAL, APPROXIMATE DATABASE QUERIES AND UNCERTAINTY FOR EXPLORATORY VISUALIZATION Danyel Fisher Microso0 Research

Exploratory Visualiza9on Ini9al Query Process query Get a response Change parameters

Handling Big Data for Infovis Megabytes: More data than there are pixels on screen Need to summarize, zoom Gigabytes: More bits than there are in memory Need to think about how to access Terabytes: More bits than there are on a single disk You will never look at every bit

Big Data Visualiza9on Ini9al Query Process query for a while Get a response Change parameters

Exploratory Big Data Visualiza9on Ini9al Query Process query Get a good enough, quick response Change parameters

Happiness over 9me 100% Online Traditional Time

What is good enough? I can act on this query I realize that this query is incorrect create a new query I want a detailed response Wait for the full query to complete

What is quick? Milliseconds: Feels real- 9me Seconds: Laggy but possible Minutes: Forget context Hours: Forget ques9on

Part I: INCREMENTAL DATABASES Part II: VISUALIZATIONS Part III: A PROTOTYPE

Part I INCREMENTAL DATABASES

Techniques for Speeding Big Data Pre- aggregate (e.g. OLAP) Fast but inflexible Parallel ComputaNon Hadoop Pig, Sawzall, DryadLinq: use exis9ng data structures and add visualiza9ons Dremel: Use novel data structures Sampling & approximate queries

Control (1999)

Sampling and Approximate Queries Joe Hellerstein (et al) s CONTROL project General concept: Grab a liale bit (more) of the database quickly Es9mate value & size of confidence interval Repeat What can we do? Aggregate. Some aggregates are very good. AVERAGE. COUNT. SUM. Some aggregates are really bad. MAX (or Top- K). MIN. Some aggregates have loose approxima9ons. PERCENTILE. COUNT DISTINCT.

Is that powerful enough? Some things are just histograms: Bar chart (sum) Tag Clouds Treemap (mul9level sum) Don t do a scaaerplot, do a 2D histogram Even some machine learning: K- Means: Locate average of group, find centroids, repeat

Compu9ng Confidence Interval Es9mator Total elements Mean seen so far Number of elements that cross the filter so far Intui9on: if you know about how the data you ve seen so far behaves, you can guess the rest. Based on: Standard devia9on seen so far Also nice: data min/max Some theorems: std dev overall

Why Not Just Do a Straight Sample? Don t know how good you are without confidence intervals May need larger sample (over memory) to get 9ght intervals. How big is big enough?

Why Isn t Everyone Doing This? A good sample is random but a random sample requires accessing (poten9ally) all rows Need to maintain some data for bounds E.g. column min, max Databases don t support incremental callbacks Joins can be tricky (but NoSQL?)

Part II THE VISUALIZATION CHALLENGES

Uncertainty Visualiza9on Confidence is something like uncertainty Lots of sources of uncertainty have been studied Credibility of sources Model uncertainty Simula9on uncertainty Incompleteness Sta)s)cal and Quan)ta)ve Uncertainty

Olston & Mackinlay (2002)

Streit & Pham (2008)

User Study of Uncertainty Sanyal Zhang et al (2009) No single technique was clear winner

Part III PROTOTYPING IT

Why Prototype a Front End? Back- End Technical implementa9on Adapt to NoSQL Ways to guide sampling Front- End What visual codings work beside error bars? How do we extend to mul9ple dimensions? How good is good enough? How fast is fast enough? What sorts of problems work well? User experience of these systems Reducing communica9on overhead

A Full System View Data query Query Terabyte Server or Distributed System Compute Front End Aggregate Value Metadata Es9mator Confidence interval

The Desktop Edi9on Query Liale Tiny Database + Compute Front End Es9mator Error range

Call to Ac9on There s work to be done here and a very compelling source of approximate data. Let s build these!

Thank you! (this work is not sponsored by DOE) DANYELF@MICROSOFT.COM RESEARCH.MICROSOFT.COM/~DANYELF

BONUS SLIDES

Joins Lots of database research dedicated to joins Hash ripple join 10% sample, twice, is a 1% sample assuming independence Sen9nel joins: Some joins are impossible to do incrementally E.g. (select count(direct reports) where manager=president) Or live without them NoSQL has very limited (and expensive) joining Denormalized tables for distributed computa9on