Visualising big data in R




April 2013 Birmingham R User Meeting
Alastair Sanderson, www.alastairsanderson.com
23rd April 2013

The challenge of visualising big data

- A screen has only a few million pixels, but big datasets have many more data points.
- A suitable summary therefore has to be generated and plotted instead.
- Directly visualising raw big data is probably pointless, at least for static graphics (real-time manipulation of big data, e.g. fly-throughs, is another matter...).
- A typical 1D/2D plot of big data will have lots of overlapping, and therefore obscured, points: these different values will be visually indistinguishable.
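The pixel limit can be illustrated with a minimal base R sketch. The 500 x 500 pixel plotting area and the simulated normal data are assumptions chosen for illustration; they are not taken from the talk.

```r
# Illustrative sketch: how many distinct pixels do a million points occupy?
set.seed(1)
n <- 1e6
x <- rnorm(n)
y <- rnorm(n)

# Hypothetical plotting area of npix x npix pixels spanning the data range
npix <- 500
to_pixel <- function(v, npix) {
  r <- range(v)
  # Map each value to an integer pixel index in 1..npix
  pmin(npix, floor((v - r[1]) / (r[2] - r[1]) * npix) + 1)
}
px <- to_pixel(x, npix)
py <- to_pixel(y, npix)

# Encode each (px, py) pair as a single pixel id and count the distinct ones
pixel_id <- (py - 1) * npix + px
occupied <- length(unique(pixel_id))

occupied       # number of distinct pixels that would be drawn
occupied / n   # upper bound on the fraction of points that stay distinguishable
```

However the data are distributed, `occupied` can never exceed `npix^2`, so most of the million points necessarily share a pixel with others.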

Hidden plot points

Single vs. multiple overlapping points:

    # One point, written to a vector (PDF) and a raster (BMP) device
    pdf(file = "N1.pdf", compress = FALSE); plot(1, 1, pch = 19)
    bmp(file = "N1.bmp", antialias = "none"); plot(1, 1, pch = 19)

    x <- rep(1, 1e3)  # 1000 identical points
    pdf(file = "N1000.pdf", compress = FALSE); plot(x, x, pch = 19)
    bmp(file = "N1000.bmp", antialias = "none"); plot(x, x, pch = 19)
    graphics.off()    # close all open devices so the files are written

[Figure: plot(1, 1) and plot(x, x) side by side, both axes spanning 0.6 to 1.4]

The 2 plots look identical (apart from subtle anti-aliasing effects).

Image file size comparison

List the size of each image file (in bytes):

    sapply(list.files(pattern = "N1.*"), function(f) file.info(f)$size)

    N1000.bmp N1000.pdf    N1.bmp    N1.pdf
       231478     75390    231478     12452

- The bitmap (raster) images are identical in size, but the (uncompressed) vector graphics PDFs differ in size by a factor of about 6.
- Raster graphics resolve overlaps, so 1000 overlapping points become 1 point, whereas vector graphics retain each point as a separate entity.
- A similar principle is required for Big Data, but with more control...
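The arithmetic behind the size comparison can be worked through directly. This sketch only re-uses the four byte counts quoted above; the per-point cost is a derived estimate, not a figure from the talk.

```r
# File sizes in bytes, copied from the listing above
size <- c(N1.pdf = 12452, N1000.pdf = 75390, N1.bmp = 231478, N1000.bmp = 231478)

# Ratio of the two PDF sizes (the "factor 6" quoted on the slide)
pdf_ratio <- size[["N1000.pdf"]] / size[["N1.pdf"]]

# Extra bytes the uncompressed PDF spends on each of the 999 additional points
bytes_per_extra_point <- (size[["N1000.pdf"]] - size[["N1.pdf"]]) / 999

# The raster files do not grow at all
bmp_diff <- size[["N1000.bmp"]] - size[["N1.bmp"]]

round(pdf_ratio, 2)           # about 6.05
round(bytes_per_extra_point)  # about 63 bytes per extra point
bmp_diff                      # 0
```

At roughly 63 bytes per point, an uncompressed vector plot of 100 million points would run to several gigabytes, which is why a raster-like summary is needed.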

The bigvis concept

Download the paper by Hadley Wickham (vita.had.co.nz/papers/bigvis.html): "Bin-summarise-smooth: a framework for visualising large data".

- bigvis is a scheme for pre-processing big datasets:
  - the output can then be handled by conventional (R) plot tools
  - processing is done in (fast) compiled C++ code, using the Rcpp package
  - bigvis goal: be able to plot 100 million points in under 5 seconds
- It also provides outlier removal and smoothing:
  - big data means very rare cases can occur, so outliers may be more of a problem
  - smoothing is very important to highlight trends and suppress noise
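The bin and summarise steps can be sketched in base R for a single variable. This is a simplified illustration of the idea and not the bigvis implementation (which runs in compiled C++); the simulated data, the bin width of 5 and the column names `.count` and `.mean` are assumptions made here.

```r
set.seed(42)
x <- rnorm(1e5, mean = 50, sd = 10)   # variable to bin
z <- 2 * x + rnorm(1e5)               # variable to summarise within each bin

width  <- 5                               # bin width (arbitrary choice)
origin <- floor(min(x) / width) * width   # left edge of the first bin

# Bin: fixed-width integer bin index for every observation
bin_id <- floor((x - origin) / width) + 1
nbins  <- max(bin_id)

# Summarise: count and mean of z per bin (empty bins give NA for the mean)
count  <- tabulate(bin_id, nbins = nbins)
mean_z <- as.vector(tapply(z, factor(bin_id, levels = seq_len(nbins)), mean))

summary_tbl <- data.frame(
  x      = origin + (seq_len(nbins) - 0.5) * width,  # bin centres
  .count = count,
  .mean  = mean_z
)
summary_tbl
```

The 100,000 raw values collapse to a table with one row per bin, which any conventional plotting function can handle.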

Installing bigvis

Project website: github.com/hadley/bigvis

    install.packages("devtools")
    library(devtools)
    install_github("hadley/bigvis")  # current devtools expects the "user/repo" form

Recent blog article about bigvis: blog.revolutionanalytics.com/2013/04/visualize-large-data-sets-with-the-bigvis-package.html

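bigvis applied to a small dataset

    library(bigvis); library(ggplot2)  # load packages
    # Bin Ozone (width 20) and Temp (width 5), then count observations per bin
    bindata <- with(airquality, condense(bin(Ozone, 20), bin(Temp, 5)))
    p <- ggplot(data = bindata, aes(Temp, Ozone, fill = .count)) + geom_tile()
    # Overlay the raw points; fill = NULL drops the inherited fill mapping
    p + geom_point(data = airquality, aes(fill = NULL), colour = "orange")

[Figure: tiles of Ozone (0 to 150) against Temp (60 to 100), filled by .count, with the raw points overlaid in orange]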

Movie length vs. IMDB rating: big-ish data, with outliers!

130,000 row data frame (bigvis::movies, from imdb.com)

    Nbin <- 1e4  # number of bins
    bindata <- with(movies,
                    condense(bin(length, find_width(length, Nbin)),
                             bin(rating, find_width(rating, Nbin))))
    ggplot(data = bindata, aes(length, rating, fill = .count)) + geom_tile()

[Figure: tiles of rating (up to 10) against length (0 to 5000), filled by .count]

... in case you were wondering

Which films are longer than... 1000 minutes?! (about 17 hours!)

    longest <- subset(movies, length > 1e3)
    longest[c("title", "length", "rating")]

    title                                              length  rating
    Cure for Insomnia, The                               5220     3.8
    Four Stars                                           1100     3.0
    Longest Most Meaningless Movie in the World, The     2880     6.4

...aptly named!

Bigvis plot with outliers removed

Outliers are a problem with Big Data: extreme events do occur. Update the previous plot to use the bigvis peel function to strip off the outermost (1%, by default) extreme values:

    last_plot() %+% peel(bindata)  # same plot, different dataset

[Figure: tiles of rating (up to 10) against length (0 to 400), filled by .count]
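One plausible way to strip the sparsest 1% from already-binned data is sketched below in base R. The function name `peel_sketch`, the toy counts and the rule (keep the densest bins until 99% of the observations are covered) are assumptions for illustration; bigvis's own peel function may use a different criterion.

```r
# Hypothetical condensed data: bin centres with counts
binned <- data.frame(
  x      = 1:10,
  .count = c(1, 2, 40, 300, 500, 450, 200, 5, 1, 1)
)

peel_sketch <- function(d, keep = 0.99) {
  ord    <- order(d$.count, decreasing = TRUE)      # densest bins first
  cum    <- cumsum(d$.count[ord]) / sum(d$.count)   # cumulative share of the data
  n_keep <- which(cum >= keep)[1]                   # fewest bins covering `keep`
  d[sort(ord[seq_len(n_keep)]), , drop = FALSE]     # restore the original bin order
}

peeled <- peel_sketch(binned)
peeled
```

Here the five central bins (x = 3 to 7) hold 1490 of the 1500 observations, so the sparse bins at both ends are dropped and the plotted range shrinks accordingly.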

Smoothing in bigvis

Also use the autoplot function from bigvis; peel off outliers first, then smooth with different bandwidths for length & rating:

    smoothbindata <- smooth(peel(bindata), h = c(20, 1))
    autoplot(smoothbindata)

[Figure: smoothed .count over rating (up to 10) against length (0 to 400)]
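The effect of the bandwidth can be sketched in one dimension with a Gaussian kernel-weighted mean of binned counts. This is a generic kernel smoother written for illustration, with simulated counts and a hypothetical helper `smooth_counts`; it is not claimed to match what bigvis's smooth function does internally.

```r
set.seed(7)
x     <- seq(0, 400, by = 10)   # bin centres
count <- rpois(length(x), lambda = 50 * exp(-((x - 100) / 40)^2))  # noisy counts

# Kernel-weighted mean of the counts around each bin centre, bandwidth h
smooth_counts <- function(x, count, h) {
  vapply(x, function(x0) {
    w <- dnorm(x - x0, sd = h)   # Gaussian weights centred on x0
    sum(w * count) / sum(w)      # weighted average of neighbouring counts
  }, numeric(1))
}

narrow <- smooth_counts(x, count, h = 5)    # follows the noise closely
wide   <- smooth_counts(x, count, h = 20)   # flatter, trend-only curve

cbind(x, count, narrow = round(narrow, 1), wide = round(wide, 1))
```

A larger h averages over more neighbouring bins, which suppresses noise at the cost of blurring sharp features; that trade-off is why the slide uses a separate bandwidth for each variable.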

Live Demo

    N <- 1e7
    raw <- data.frame(x = rt(N, 2), y = rt(N, 2))  # 10 million rows

    ## ~3s to run:
    system.time( binned <- with(raw, condense(bin(x), bin(y))) )

    ## Plot condensed (i.e. pre-processed) data:
    ggplot(data = binned, aes(x, y, fill = .count)) + geom_tile()

    ## Peel outliers & replot:
    system.time( peeled <- peel(binned) )
    ggplot(data = peeled, aes(x, y, fill = .count)) + geom_tile()
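Why peeling matters for this demo can be seen from the heavy tails of the t distribution with 2 degrees of freedom. The sketch below uses base R only and a smaller sample (1e5 rather than 1e7) so that it runs quickly; the 0.5% and 99.5% cut-offs are an assumption standing in for "the central 99%".

```r
set.seed(2013)
x <- rt(1e5, 2)   # same distribution as the demo, smaller sample

full_range    <- diff(range(x))                               # span of all the data
central_range <- unname(diff(quantile(x, c(0.005, 0.995))))   # span of the central 99%

full_range
central_range
central_range / full_range   # share of the axis used by 99% of the points
```

A handful of extreme values stretches the axis far beyond the region where almost all of the data lie, so an unpeeled plot squeezes the bulk of the points into a few tiles.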

Summary

- Pre-processing to generate statistical summaries is the key to plotting Big Data.
- The R bigvis package is a very powerful tool for plotting large datasets and is still under active development.
  - It includes features to strip outliers, smooth & summarise data.
- v3.0.0 of R (released Apr 2013) represents a solid platform for extending the outstanding data analysis & visualisation capabilities of R to meet the challenge of Big Data, with excellent prospects for future releases.

Part of Big Data Week 2013 bigdataweek.com/birmingham

Session information

These slides were created with (Emacs) org mode.

- R version 3.0.0 (2013-04-03), i686-pc-linux-gnu
- locale: LC_CTYPE=en_GB.UTF-8, LC_NUMERIC=C, LC_TIME=en_GB.UTF-8, LC_COLLATE=en_GB.UTF-8, LC_MONETARY=en_GB.UTF-8, LC_MESSAGES=en_GB.UTF-8, LC_PAPER=C, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_GB.UTF-8, LC_IDENTIFICATION=C
- attached base packages: stats, graphics, grDevices, utils, datasets, methods, base
- other attached packages: ascii_2.1, ggplot2_0.9.3.1, bigvis_0.1, Rcpp_0.10.3
- loaded via a namespace (and not attached): codetools_0.2-8, colorspace_1.2-2, dichromat_2.0-0, digest_0.6.3, grid_3.0.0, gtable_0.1.2, labeling_0.1, MASS_7.3-26, munsell_0.4, plyr_1.8, proto_0.3-10, RColorBrewer_1.0-5, reshape2_1.2.2, scales_0.2.3, stringr_0.6.2, tools_3.0.0

Birmingham R User Meeting (BRUM): www.birminghamr.org
Alastair Sanderson Ph.D.: www.alastairsanderson.com