BIG DATA VISUALIZATION. Team Impossible Peter Vilim, Sruthi Mayuram Krithivasan, Matt Burrough, and Ismini Lourentzou





Let's begin with a story. Let's explore Yahoo's data! Dora the Data Explorer has a new job!

Dora's new job: explore Yahoo's data. Yahoo has a ton of data, though, and they keep getting more. It is hard to make sense of all the data. Dora runs into some challenges when working with this data.

Screen resolution has its limitations! 40,000 data points from Yahoo data. Challenge 1: Many data points

Tico the Squirrel has an idea! Increase the number of pixels! Get a Powerwall! 53.7-million-pixel Powerwall at the University of Leeds.

Boots is laughing at how ineffective this is. From the beginning of recorded time until 2003, humans had created 5 exabytes (5 billion gigabytes) of data. In 2011, the same amount was created every two days. Besides...

Dora is too adventurous to stay in the office. But look at these tiny screens!

How do you gain insight from your data? Even if you manage to fit all your data pixels on a screen, humans don't think in terms of hundreds of numbers, let alone tens of thousands. You can't draw insight from a large collection of points.

We need to find ways to squeeze our data points into the resolution we have and, more importantly, to make sense of our data as well!

Boots is thinking: how about zooming in to the points we are interested in? This is similar to how you navigate online maps.

Boots has an idea: how can Yahoo Maps show continents and also small islands? We need to be able to see the big picture and small outliers. Challenge 2: Showing outliers

Boots is thinking: but we have so many dimensions! We can't color everything differently! That won't work for all the dimensions we have. Challenge 3: Many dimensions

Hierarchies are a powerful way to explore. Yahoo Maps is so powerful! It can zoom in and out MULTIPLE times and does it FAST! Can we do the same? Challenge 4: Interactive visualization

Can hierarchies help to better visualize big data?

Hierarchical Visualization Challenges
1. Many data points
2. Showing outliers
3. Many dimensions
4. Interactive visualization
5. Integrating into working software

First Challenge
1. Many data points. Technique: Tree Maps
2. Showing outliers
3. Many dimensions
4. Interactive visualization
5. Integrating into working software

How Tree Maps Work (Challenge 1). Treemaps map hierarchical structures onto a rectangular region in a space-filling manner. A node's weight determines the display size of its bounding box; the weight is a measure of importance or degree of interest. [1] Brian Johnson and Ben Shneiderman, Treemaps: a space-filling approach to the visualization of hierarchical information structures, 1991
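The space-filling idea can be sketched in a few lines. This is a minimal slice-and-dice layout, not the deck's own code: the dict-based node format (name, weight, children) and the function name slice_and_dice are assumptions made for illustration. Each child receives a share of its parent's rectangle proportional to its weight, and the cut direction alternates at every level.

```python
def slice_and_dice(node, x, y, w, h, depth=0, out=None):
    """Lay out a weighted tree inside the rectangle (x, y, w, h).

    `node` is a hypothetical dict: {"name": str, "weight": float, "children": [...]}.
    Returns a list of (name, x, y, width, height) tuples, parents before children.
    """
    if out is None:
        out = []
    out.append((node["name"], x, y, w, h))
    children = node.get("children", [])
    total = sum(child["weight"] for child in children)
    if total == 0:
        return out  # leaf node, or children without any weight
    offset = 0.0
    for child in children:
        frac = child["weight"] / total
        if depth % 2 == 0:
            # Even depth: cut the rectangle along the x axis.
            slice_and_dice(child, x + offset * w, y, frac * w, h, depth + 1, out)
        else:
            # Odd depth: cut the rectangle along the y axis.
            slice_and_dice(child, x, y + offset * h, w, frac * h, depth + 1, out)
        offset += frac
    return out


tree = {"name": "root", "weight": 10, "children": [
    {"name": "a", "weight": 6, "children": [
        {"name": "a1", "weight": 2},
        {"name": "a2", "weight": 4},
    ]},
    {"name": "b", "weight": 4},
]}
rects = slice_and_dice(tree, 0.0, 0.0, 100.0, 50.0)
for rect in rects:
    print(rect)
```

The leaf rectangles tile the root rectangle completely, which is the "100% use of the available display space" property.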

Advantages and Disadvantages of Tree Maps (Challenge 1). Advantages: 100% use of the available display space; allows users to set display properties (colors, borders, etc.). Disadvantages: X The number and variety of domain properties visualized is limited. X Cluttered. [1] Brian Johnson and Ben Shneiderman, Treemaps: a space-filling approach to the visualization of hierarchical information structures, 1991

And Dora is crying (Challenge 1). "I can't focus on one set of points only; now I am stuck with the whole dataset! And everything looks the same! Now I have to do this again for a smaller set!" [1] Brian Johnson and Ben Shneiderman, Treemaps: a space-filling approach to the visualization of hierarchical information structures, 1991

Moving beyond just displaying data (Challenge 1). Data exploration should maximize insight into a data set, interactively: retrieve meaningful relations, extract inferences from the data, uncover underlying structure, detect patterns or anomalies, and test underlying assumptions. It is an active process of discovery, NOT passive display.

Second Challenge
1. Many data points. Technique: Tree Maps
2. Showing outliers. Technique: Zoom clustering
3. Many dimensions
4. Interactive visualization
5. Integrating into working software

How Zoom Clustering Works (Challenge 2). Cluster the data points. Store the clusters in a tree structure. Allow branching and zooming into different areas of the tree. Support backtracking in the zoom tree. (Figure: user zoom tree operation.) [7] Baoyuan Wang, Gang Chen, Jiajun Bu & Yizhou Yu, Multiscale Visualization of Relational Databases Using Layered Zoom Trees and Partial Data Cubes, 2010
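A rough sketch of the navigation described above, under loud assumptions: the classes ZoomNode and ZoomSession are hypothetical names, and the "clustering" is only a median split of sorted one-dimensional values standing in for a real clustering step. It is not the layered zoom tree of [7]. What it does show is the tree of clusters, zooming into a branch, and backtracking through a history stack.

```python
class ZoomNode:
    """A cluster of points plus its sub-clusters (stand-in: split at the median)."""

    def __init__(self, points, leaf_size=4):
        self.points = sorted(points)
        self.children = []
        if len(self.points) > leaf_size:
            mid = len(self.points) // 2
            self.children = [
                ZoomNode(self.points[:mid], leaf_size),
                ZoomNode(self.points[mid:], leaf_size),
            ]

    def summary(self):
        # What a view would draw for this cluster: (count, minimum, maximum).
        return (len(self.points), self.points[0], self.points[-1])


class ZoomSession:
    """Tracks the user's position in the zoom tree and supports backtracking."""

    def __init__(self, root):
        self.current = root
        self.history = []

    def zoom_in(self, index):
        self.history.append(self.current)
        self.current = self.current.children[index]
        return self.current.summary()

    def back(self):
        if self.history:
            self.current = self.history.pop()
        return self.current.summary()


root = ZoomNode(list(range(16)))
session = ZoomSession(root)
print(session.zoom_in(1))  # upper half of the data
print(session.zoom_in(0))  # lower quarter of that half
print(session.back())      # one step back up the tree
```

Only the summary of the current node is needed to draw a view, so the user never faces all the points at once.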

Zoom Tree Examples (Challenge 2). Seems easy to use! [12] Mike Bostock, Zoomable Sunburst, http://bl.ocks.org/mbostock/4348373, 2012; Mike Bostock, Zoomable Area, http://bl.ocks.org/mbostock/4015254, 2012.

Demo video: Zooming Plot Charts (Challenge 2). http://youtu.be/8dflke95xcm?t=3m52s [7] Baoyuan Wang, Gang Chen, Jiajun Bu & Yizhou Yu, Multiscale Visualization of Relational Databases Using Layered Zoom Trees and Partial Data Cubes, 2010

Advantages and Disadvantages of Zoom Clustering (Challenge 2). Advantages: not overwhelmed with data points; can process data in parallel; ability to focus on interesting areas. Disadvantages: many-dimensional data. "You've shown me 3 columns' worth of data. What am I supposed to do with my other 1,800 columns?" [7] Baoyuan Wang, Gang Chen, Jiajun Bu & Yizhou Yu, Multiscale Visualization of Relational Databases Using Layered Zoom Trees and Partial Data Cubes, 2010

Third Challenge
1. Many data points. Technique: Tree Maps
2. Showing outliers. Technique: Zoom clustering
3. Many dimensions. Technique: Parallel Coordinates
4. Interactive visualization
5. Integrating into working software

How Parallel Coordinates Work (Challenge 3). Handles a large number of dimensions. Highlights relations between dimensions. It is difficult to distinguish the overall structure when the number of tuples becomes very large. [9] Jimmy Johansson, Robert Treloar, and Mikael Jern, Integration of Unsupervised Clustering, Interaction and Parallel Coordinates for the Exploration of Large Multivariate Data, 2004
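To make the idea concrete, here is a small sketch of the data preparation behind a parallel coordinates plot. The function name to_polylines and the sample rows are invented for illustration. Each dimension becomes one vertical axis, each value is min-max scaled to a height in [0, 1], and each tuple becomes a polyline through one vertex per axis.

```python
def to_polylines(rows):
    """Map each row to a list of (axis_index, height) vertices, heights in [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    lines = []
    for row in rows:
        line = []
        for axis, value in enumerate(row):
            span = highs[axis] - lows[axis]
            # A constant column has no spread; place it mid-axis (an arbitrary choice).
            height = (value - lows[axis]) / span if span else 0.5
            line.append((axis, height))
        lines.append(line)
    return lines


rows = [(1.0, 200.0, 5.0), (2.0, 100.0, 5.0), (3.0, 150.0, 5.0)]
lines = to_polylines(rows)
for line in lines:
    print(line)
```

Every tuple adds one more polyline across the same axes, which is why the overall structure gets hard to read once the number of tuples becomes very large.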

Self Organizing Map & Parallel Coordinates (Challenge 3). SOM algorithm + Parallel Coordinates. Self Organizing Map: a nonlinear projection from m-dimensional space onto the two-dimensional display space. Relies on distance, similarity and average. [9] Jimmy Johansson, Robert Treloar, and Mikael Jern, Integration of Unsupervised Clustering, Interaction and Parallel Coordinates for the Exploration of Large Multivariate Data, 2004
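As a hedged illustration of "distance, similarity and average", the sketch below performs one simplified SOM update. The function som_step, the fixed learning rate, and the square (Chebyshev) neighbourhood are assumptions; practical SOMs usually decay the rate and the radius over time and often use a Gaussian neighbourhood. The step finds the neuron closest to the sample, then moves that neuron and its grid neighbours part of the way toward the sample.

```python
import math


def som_step(weights, sample, rate=0.5, radius=1):
    """One simplified SOM update on a 2-D grid of weight vectors.

    weights[r][c] is the weight vector of the neuron at grid cell (r, c).
    Returns the grid position of the best matching unit (BMU).
    """
    cells = [(r, c) for r in range(len(weights)) for c in range(len(weights[0]))]
    # Distance: the BMU is the neuron whose weights are closest to the sample.
    best = min(cells, key=lambda rc: math.dist(weights[rc[0]][rc[1]], sample))
    # Averaging: pull the BMU and its grid neighbours toward the sample.
    for r, c in cells:
        if max(abs(r - best[0]), abs(c - best[1])) <= radius:
            weights[r][c] = [w + rate * (s - w) for w, s in zip(weights[r][c], sample)]
    return best


grid = [[[0.0, 0.0, 0.0], [5.0, 5.0, 5.0], [10.0, 10.0, 10.0]]]  # a 1 x 3 map
bmu = som_step(grid, [9.0, 9.0, 9.0])
print(bmu, grid)
```

Repeating this step over many samples is what lets nearby neurons end up representing similar tuples, so each neuron can later be drawn as one cluster band.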

Self Organizing Maps Explained (Challenge 3). http://www.tubechop.com/watch/2698630 [9] Jimmy Johansson, Robert Treloar, and Mikael Jern, Integration of Unsupervised Clustering, Interaction and Parallel Coordinates for the Exploration of Large Multivariate Data, 2004

Zooming in Parallel Coordinates (Challenge 3). Interactive data analysis: a visual user interface with drill-down, filtering and zooming. Zoom in by simply drawing a rectangle across the selected cluster bands. [9] Jimmy Johansson, Robert Treloar, and Mikael Jern, Integration of Unsupervised Clustering, Interaction and Parallel Coordinates for the Exploration of Large Multivariate Data, 2004

Advantages and Disadvantages of Parallel Coordinates (Challenge 3)
Cool machine learning method! BUT...
X Parallel Coordinates does not provide a good overview: it becomes hard to see the structure in the data when the dataset gets large.
X Runs out of encoding possibilities as the number of dimensions increases.
X Preprocessing or filtering the data is required.
X Not efficient for visualizing datasets with non-numerical data.
X The number of clusters/neurons is predefined.
[9] Jimmy Johansson, Robert Treloar, and Mikael Jern, Integration of Unsupervised Clustering, Interaction and Parallel Coordinates for the Exploration of Large Multivariate Data, 2004

Fourth Challenge
1. Many data points. Technique: Tree Maps
2. Showing outliers. Technique: Zoom clustering
3. Many dimensions. Technique: Parallel Coordinates
4. Interactive visualization. Technique: Parallel sampling
5. Integrating into working software

Dora is frustrated! (Challenge 4) "I can't wait this long! I want to access the remote server faster." How do I quickly visualize hierarchical data? Which visualization approach is cost-effective?

Boots has some ideas about speed. (Challenge 4)
1. Subsampling and clustering
a) Can be very fast
b) Need to account for errors
c) Grid and multiresolution
2. Parallel Servers
a) Distributed computation (e.g. MapReduce, Spark, etc.)
b) Don't have to account for errors
c) Aren't restricted to certain data types
d) Technically challenging to implement
[4] Lori A. Freitag and Raymond M. Loy, Comparison of Remote Visualization Strategies for Interactive Exploration of Large Data Sets, 2001

Boots thinks sampling is a good idea (Challenge 4)
1. Uniform Grid
Distributes points from the original dataset into equal-sized grid cells
Single-level representation
Can be constructed quickly
2. Hierarchical Multiresolution
Multi-level representation
Fewer approximation errors by showing more points where the data is changing rapidly
Quadratic complexity makes large problems difficult
(Figure panels: Uniform, Hierarchical.)
[4] Lori A. Freitag and Raymond M. Loy, Comparison of Remote Visualization Strategies for Interactive Exploration of Large Data Sets, 2001
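The uniform grid option can be sketched as follows. This is an illustration under assumptions, not the method of [4]: the function name uniform_grid, the two-dimensional points, and the choice of count plus centroid as the per-cell summary are all made up for the example. Points are binned into equal-sized square cells in a single pass, and each occupied cell is reduced to one representative.

```python
from collections import defaultdict


def uniform_grid(points, cell):
    """Bin 2-D points into square cells of side `cell`.

    Returns {(col, row): (count, (centroid_x, centroid_y))} for occupied cells.
    """
    bins = defaultdict(list)
    for x, y in points:
        bins[(int(x // cell), int(y // cell))].append((x, y))
    summary = {}
    for key, members in bins.items():
        n = len(members)
        cx = sum(p[0] for p in members) / n
        cy = sum(p[1] for p in members) / n
        summary[key] = (n, (cx, cy))
    return summary


points = [(0.5, 0.5), (1.5, 0.5), (2.5, 2.5), (3.0, 3.0), (9.0, 9.0)]
grid = uniform_grid(points, cell=2.0)
for key in sorted(grid):
    print(key, grid[key])
```

One pass over the data explains why the grid "can be constructed quickly". The fixed cell size is also why it is a single-level representation: dense and sparse regions get the same resolution.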

Isa comes to rescue Dora and Boots with another idea (Challenge 4). Why don't we combine hierarchical clustering with parallel servers? [4] Lori A. Freitag and Raymond M. Loy, Comparison of Remote Visualization Strategies for Interactive Exploration of Large Data Sets, 2001

Data Explorers are excited about PINK (Challenge 4). Why PINK (Parallel Single Linkage)? It is a scalable parallel algorithm for single-linkage hierarchical clustering. The structure of the single-linkage problem can be exploited for parallelism. The single-linkage hierarchical clustering dendrogram for a dataset and the MST of the corresponding complete graph produce identical clusters. [4] Lori A. Freitag and Raymond M. Loy, Comparison of Remote Visualization Strategies for Interactive Exploration of Large Data Sets, 2001

Connection to Minimum Spanning Tree (Challenge 4). (Figure panels: Single Linkage Hierarchical Clustering; Minimum Spanning Tree.) [4] Lori A. Freitag and Raymond M. Loy, Comparison of Remote Visualization Strategies for Interactive Exploration of Large Data Sets, 2001
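The link between the two structures can be checked on a toy example. The sketch below is sequential and illustrative only; the function names and sample points are assumptions. It builds the MST of the complete distance graph with Kruskal's algorithm, then keeps only the n - k lightest MST edges: the connected components that remain are the k single-linkage clusters. It assumes edge weights are distinct enough that ties do not matter.

```python
import itertools
import math


def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def mst_edges(points):
    """Kruskal's algorithm on the complete graph; edges come out lightest first."""
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(len(points)), 2)
    )
    parent = list(range(len(points)))
    tree = []
    for w, i, j in edges:
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:  # the edge joins two different components
            parent[ri] = rj
            tree.append((w, i, j))
    return tree


def single_linkage_clusters(points, k):
    """Keep the n - k lightest MST edges; the components are the k clusters."""
    tree = mst_edges(points)
    parent = list(range(len(points)))
    for _, i, j in tree[: len(points) - k]:
        parent[find(parent, i)] = find(parent, j)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(parent, i), []).append(i)
    return sorted(groups.values())


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
clusters = single_linkage_clusters(points, 3)
print(clusters)
```

The order in which Kruskal's algorithm accepts edges mirrors the merge order of the single-linkage dendrogram, so the MST carries the whole hierarchy, not just one cut of it.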

How does PINK work? (Challenge 4). Split the data evenly. (Figure, panels (a) to (d): problem domain decomposition with k partitions. Two processes are each assigned two complete subgraphs; six processes are each assigned one bipartite subgraph, one for each of the six pairs of partitions.) This requires k^2/2 processors. [12] William Hendrix, Diana Palsetia, Mostofa Patwary, and Ankit Agrawal, A Scalable Algorithm for Single-Linkage Hierarchical Clustering on Distributed-Memory Architectures, 2013
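The processor count can be reproduced with a short enumeration. This is a sketch under the assumption that the pattern in the figure (two complete subgraphs per process, one bipartite subgraph per process) generalises to any even k; the function name decompose and the tuple layout are hypothetical, not taken from [12].

```python
import itertools


def decompose(k):
    """Enumerate one task per process for k partitions (k assumed even).

    ("complete", a, b): one process handles the complete subgraphs of partitions a and b.
    ("bipartite", a, b): one process handles the bipartite subgraph between a and b.
    """
    complete = [("complete", a, a + 1) for a in range(0, k, 2)]
    bipartite = [("bipartite", a, b) for a, b in itertools.combinations(range(k), 2)]
    return complete + bipartite


tasks = decompose(4)
for task in tasks:
    print(task)
print(len(tasks))
```

For k = 4 this gives 2 + 6 = 8 tasks. In general k/2 + k(k-1)/2 = k^2/2, which matches the processor count stated on the slide.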

Generating the Minimum Spanning Tree (Challenge 4). Solve the subproblems using Prim's algorithm. Combine the partial solutions: subproblems may contain edges that are not in the MST, so treat the partial solutions as candidate edges and apply Kruskal's algorithm to the candidate edges. [12] William Hendrix, Diana Palsetia, Mostofa Patwary, and Ankit Agrawal, A Scalable Algorithm for Single-Linkage Hierarchical Clustering on Distributed-Memory Architectures, 2013

Binary Merging of Partial Solutions (Challenge 4). Combine two MSTs at a time from consecutive processors. Add an edge only if it does not join vertices that are already in the same component.

Explorers are happy with PINK's performance (Challenge 4). (Figure: memory usage.) [12] William Hendrix, Diana Palsetia, Mostofa Patwary, and Ankit Agrawal, A Scalable Algorithm for Single-Linkage Hierarchical Clustering on Distributed-Memory Architectures, 2013

Why does PINK combine the dendrograms from consecutive processes? Overlapping data partitions; edges can be detected and eliminated sooner; this cuts down memory and communication cost. What are the limitations of the PINK algorithm? Minimum processor requirement (k^2/2). Binary merge: the entire dendrogram must fit in one processor.
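The two steps can be sketched sequentially on a toy dataset. Everything here is illustrative rather than the implementation in [12]: the function names, the four sample points, the split into k = 2 partitions, and the simple O(n^3) form of Prim's algorithm are all assumptions. Prim's algorithm solves each complete and bipartite subproblem, and Kruskal's algorithm then filters the pooled candidate edges down to the global MST.

```python
import itertools
import math


def prim(points, vertices, allowed=lambda u, v: True):
    """Prim's algorithm over `vertices`, using only edges where allowed(u, v) holds."""
    vertices = list(vertices)
    in_tree = {vertices[0]}
    edges = []
    while len(in_tree) < len(vertices):
        w, u, v = min(
            (math.dist(points[u], points[v]), u, v)
            for u in in_tree
            for v in vertices
            if v not in in_tree and allowed(u, v)
        )
        in_tree.add(v)
        edges.append((w, u, v))
    return edges


def kruskal(n, candidates):
    """Kruskal's algorithm restricted to the candidate edges."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    tree = []
    for w, u, v in sorted(candidates):
        ru, rv = find(u), find(v)
        if ru != rv:  # skip edges whose endpoints are already connected
            parent[ru] = rv
            tree.append((w, u, v))
    return tree


points = [(0, 0), (1, 0), (5, 0), (6, 0)]
parts = [[0, 2], [1, 3]]  # hypothetical split into k = 2 partitions

candidates = []
for part in parts:  # complete subgraph inside each partition
    candidates += prim(points, part)
for a, b in itertools.combinations(parts, 2):  # bipartite subgraph per pair
    in_a = set(a)
    candidates += prim(points, a + b, lambda u, v, s=in_a: (u in s) != (v in s))

mst = kruskal(len(points), candidates)
print(mst)
```

In this example the two within-partition edges (weight 5 each) are candidates that do not survive the Kruskal pass, which illustrates "subproblems may have edges not in MST".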

Fifth Challenge
1. Many data points. Technique: Tree Maps
2. Showing outliers. Technique: Zoom clustering
3. Many dimensions. Technique: Parallel Coordinates
4. Interactive visualization. Technique: Parallel sampling
5. Integrating into working software. Technique: ????

Open Challenges (Challenge 5)
1. We have the back end of big data but no front end.
2. Most big data tools are focused on batch processing, which isn't good for visualization. This is changing.

Open Challenges (Challenge 5)
3. Most front end tools don't integrate with the back end tools. This is improving.
4. Most front end tools don't handle this type of data well (high dimensionality and many data points).

Comparison of Hierarchical Techniques. Techniques: Tree Maps, Zoom Clustering, Parallel Coordinates, Parallel Sampling. Criteria: 1. Many data points; 2. Showing outliers; 3. Many dimensions; 4. Interactive visualization; 5. Working software. (Table: each technique rated against each criterion; four entries are N/A.) How can we combine these?

Can Dora use hierarchies to visualize big data? Yes! With the right combination of techniques

Our Hybrid Opinion. We reviewed several techniques with different advantages and disadvantages. Use hierarchical clustering to tackle many data points. Use parallelization for performance. Use sampling because simple parallelization isn't enough. This also happens to be the approach we picked for our project!