Accessing bigger datasets in R using SQLite and dplyr




Accessing bigger datasets in R using SQLite and dplyr
  Amherst College, Amherst, MA, USA
  March 24, 2015
  nhorton@amherst.edu

Thanks
  to Revolution Analytics for their financial support
  to the Five College Statistics Program and the Biostatistics Program at UMass/Amherst
  to Nick Reich, Emily Ramos, Kostis Gourgoulias, and Andrew Bray (co-organizers of the Meetup)
  suggestions of meeting topics, speakers and locations always welcomed

Upcoming meetups
  working with geographic data in R: 3/26 at 7:00pm
  DataFest: 3/27-3/29
  learn about plotly: 4/2 at 4:00pm
  updates from RStudio: J.J. Allaire, 7/15 (save the date!)
  statistical analysis of big genetics and genomics data: Xihong Lin, 11/16 (save the date!)
  using docker for fun and profit: anyone willing?

Acknowledgements
  joint work with Ben Baumer (Smith College) and Hadley Wickham (Rice/RStudio)
  supported by NSF grant 0920350 (building a community around modeling, statistics, computation and calculus)
  more information at http://www.amherst.edu/~nhorton/precursors

Prelude
  New Frontier in statistical thinking (Chamandy, Google @ JSM 2013)
  teaching precursors to big data/foundations of data science as an intellectually coherent theme (Pruim, Calvin College)
  growing importance of computing and ability to think with data (Lambert, Google)
  key capacities in statistical computing (Nolan and Temple Lang, TAS 2010)
  Statistics and the modern student (Gould, ISR 2010)
  Building precursors for data science in the first and second courses in statistics (Horton, Baumer, and Wickham, 2015)

Prelude (cont.)
  How to accomplish this?
  start in the first course
  build on capacities in the second course
  develop more opportunities for students to apply their knowledge in practice (internships, collaborative research, teaching assistants)
  new courses focused on Data Science
  today's goal: helping DataFesters grapple with larger datasets

Data Expo 2009
  Ask students: have you ever been stuck in an airport because your flight was delayed or cancelled and wondered if you could have predicted it if you'd had more data? (Wickham, JCGS, 2011)

Data Expo 2009
  dataset of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008 (but we now have data through the beginning of 2015)
  large dataset: more than 180 million records
  aim: provide a graphical summary of important features of the data set
  winners presented at the JSM in 2009; details at http://stat-computing.org/dataexpo/2009

Airline Delays Codebook (abridged)
  Year            1987-2015
  Month
  DayOfMonth
  DayOfWeek       1=Monday
  DepTime         departure time
  UniqueCarrier   OH = Comair, DL = Delta, etc.
  TailNum         plane tail number
  ArrDelay        arrival delay, in minutes
  Origin          BDL, BOS, SFO, etc.
  Dest
  Full details at http://www.transtats.bts.gov/fields.asp?table_id=236

Data Expo 2009 winners


Goal
  facilitate use of database technology for instructors and students without training in this area
  demonstrate how to use SQLite in conjunction with dplyr in R to achieve this goal
  help enable the next generation of data scientists

Background on databases and SQL
  relational databases (invented in 1970)
  like electronic filing cabinets to organize masses of data (terabytes)
  fast and efficient
  useful reference: Learning MySQL, O'Reilly 2007
  Horton, Baumer, and Wickham (2015, in press), http://www.amherst.edu/~nhorton/precursors

Client and server model
  server: manages data
  client: asks the server to do things
  use R as both the client and the server

SQL
  Structured Query Language
  special purpose programming language for managing data
  developed in the early 1970s
  standardized (multiple times)
  most common operation is the query (using SELECT)
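To make the SELECT query concrete, here is a minimal sketch in R. It assumes the DBI and RSQLite packages are installed; the three-row `toy_flights` table is made up for illustration and only borrows column names from the codebook.

```r
library(DBI)

# In-memory SQLite database holding a tiny, made-up flights table (hypothetical data)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
toy_flights <- data.frame(
  Year = c(2014L, 2014L, 2014L),
  Dest = c("BDL", "ALB", "BDL"),
  ArrDelay = c(12L, -3L, 45L)
)
dbWriteTable(con, "flights", toy_flights)

# A query chooses columns (SELECT), rows (WHERE), and an ordering (ORDER BY)
late_bdl <- dbGetQuery(con, "
  SELECT Dest, ArrDelay
  FROM flights
  WHERE Dest = 'BDL' AND ArrDelay > 0
  ORDER BY ArrDelay DESC
")
print(late_bdl)

dbDisconnect(con)
```

The result comes back as an ordinary R data frame, which is what lets R act as the client.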

Why SQLite?
  advantage: free, quick, dirty, simple (runs locally); the database can be copied from one system (Mac OS X) to Windows or Linux
  disadvantage: not as robust, fast, or flexible as other free alternatives such as MySQL (which runs remotely)
  For personal use, SQLite is ideal (and is what I will demonstrate today). For a class, I'd recommend MySQL.

Creating the airline delays database
  1. download the data (30 GB uncompressed)
  2. load the data
  3. add indices (to speed up access to the data; takes some time)
  4. establish a connection (using src_sqlite())
  5. start to make selections (which will be returned as R objects) using the dplyr package
  6. features lazy evaluation (data only accessed when needed)
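The steps above can be sketched end to end on a toy table. This is not the talk's load-sqlite.r script: the data frame, the file path, and the index name `idx_flights_dest` are all hypothetical, and the connection goes through `DBI::dbConnect()` rather than `src_sqlite()`, on the assumption that DBI, RSQLite, dplyr, and dbplyr are installed.

```r
library(DBI)
library(dplyr)  # the database backend additionally needs the dbplyr package installed

# Step 2: load data (a small made-up data frame stands in for the real files)
db_path <- tempfile(fileext = ".sqlite3")
con <- dbConnect(RSQLite::SQLite(), db_path)
toy_flights <- data.frame(
  Year = rep(2014L, 4),
  Month = c(1L, 1L, 2L, 2L),
  Dest = c("BDL", "ALB", "BDL", "BTV"),
  ArrDelay = c(10L, 0L, 25L, -5L)
)
dbWriteTable(con, "flights", toy_flights)

# Step 3: add an index on a column that will be filtered often
dbExecute(con, "CREATE INDEX idx_flights_dest ON flights (Dest)")

# Steps 4-5: reference the table and describe a selection with dplyr
flights <- tbl(con, "flights")
bdl <- flights %>%
  filter(Dest == "BDL") %>%
  select(Month, ArrDelay)

# Step 6: nothing has been pulled into R yet; bdl is only a description of a query
show_query(bdl)            # prints the SQL that dplyr generated
bdl_local <- collect(bdl)  # now the query runs and rows arrive as a local data frame

dbDisconnect(con)
unlink(db_path)
```

Deferring the work until `collect()` is what keeps a 30 GB table from being read into memory just to look at a few airports.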

Key idioms for dealing with big(ger) data
  select: subset variables
  filter: subset rows
  mutate: add new columns
  summarise: reduce to a single row
  arrange: re-order the rows
  Hadley Wickham, bit.ly/bigrdata4
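The five idioms can be seen chained together on a small local data frame. The values below are invented for illustration and `MadeUpTime` is a hypothetical column name; only dplyr is assumed.

```r
library(dplyr)

# Made-up rows using column names from the codebook
toy <- data.frame(
  Month = c(1L, 1L, 2L, 2L),
  Dest = c("BDL", "ALB", "BDL", "BDL"),
  DepDelay = c(5, 0, 20, 10),
  ArrDelay = c(10, -2, 30, 14)
)

result <- toy %>%
  select(Month, Dest, DepDelay, ArrDelay) %>%    # select: subset variables
  filter(Dest == "BDL") %>%                      # filter: subset rows
  mutate(MadeUpTime = DepDelay - ArrDelay) %>%   # mutate: add a new column
  group_by(Month) %>%
  summarise(AvgArrDelay = mean(ArrDelay),        # summarise: one row per group
            AvgMadeUp = mean(MadeUpTime)) %>%
  arrange(desc(AvgArrDelay))                     # arrange: re-order the rows

print(result)
```

The same pipeline can be pointed at a database table instead of a data frame, which is the approach the following slides take.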

Accessing the database (see test-sqlite.rmd)
    my_db <- src_sqlite(path="ontime.sqlite3")
    # link to some useful tables
    flights <- tbl(my_db, "flights")
    airports <- tbl(my_db, "airports")
    airlines <- tbl(my_db, "airlines")
    airplanes <- tbl(my_db, "airplanes")

Accessing the database (see test-sqlite.rmd)
    > airports
    Source: sqlite 3.8.6 [ontime.sqlite3]
    From: airports [3,376 x 7]

       IATA Airport              City             State
    1  00M  Thigpen              Bay Springs      MS
    2  00R  Livingston Municipal Livingston       TX
    3  00V  Meadow Lake          Colorado Springs CO
    4  01G  Perry-Warsaw         Perry            NY
    5  01J  Hilliard Airpark     Hilliard         FL
    6  01M  Tishomingo County    Belmont          MS
    .. ...  ...                  ...              ...
    Variables not shown: Country (chr), Latitude (dbl), Longitude (dbl)

Accessing the database (see test-sqlite.rmd)
    library(dplyr)
    library(lubridate)  # provides ymd()
    airportcounts <- flights %>%
      filter(Dest %in% c("ALB", "BDL", "BTV")) %>%
      group_by(Year, DayOfMonth, DayOfWeek, Month, Dest) %>%
      summarise(count = n()) %>%
      collect()  # this forces the evaluation
    airportcounts <- airportcounts %>%
      mutate(Date = ymd(paste(Year, "-", Month, "-", DayOfMonth, sep="")))
    head(airportcounts)

GROUP BY
    ## Source: local data frame [6 x 7]
    ## Groups: Year, DayOfMonth, DayOfWeek, Month
    ##
    ##   Year DayOfMonth Day Month Dest count       Date
    ## 1 2014          1   3     1  ALB    17 2014-01-01
    ## 2 2014          1   3     1  BDL    52 2014-01-01
    ## 3 2014          1   3     1  BTV     6 2014-01-01
    ## 4 2014          2   4     1  ALB    24 2014-01-02
    ## 5 2014          2   4     1  BDL    60 2014-01-02
    ## 6 2014          2   4     1  BTV    11 2014-01-02

SQL commands
    tbl(my_db, sql("select * FROM flights LIMIT 4"))
    # or tbl(my_db, "flights") %>% head()
    ## Source: sqlite 3.8.6 [ontime.sqlite3]
    ## From: <derived table> [?? x 29]
    ##
    ##   Year Month DayOfMonth Day DepTime CRSDepTime ArrTime CRS
    ## 1 2014     1          1   3     914        900    1238 1225
    ## 2 2014     1          2   4     857        900    1226 1225
    ## 3 2014     1          3   5       0        900       0 1225
    ## 4 2014     1          4   6    1005        900    1324 1225
    ## Variables not shown: UniqueCarrier (chr), FlightNum (int
    ## ActualElapsedTime (int), CRSElapsedTime (int), AirTime (
    ## (int), DepDelay (int), Origin (chr), Dest (chr), Distanc
    ## (int), TaxiOut (int), Cancelled (int), CancellationCode
    ## (chr), CarrierDelay (chr), WeatherDelay (chr), NASDelay
    ## SecurityDelay (chr), LateAircraftDelay (chr)

Which month is it best to travel (airline averages from BDL)?
  [Figure: AvgArrivalDelay (vertical axis, roughly 10 to 30) by month (horizontal axis, 1 to 12)]

Which day is it best to travel (airline averages from BDL)?
  [Figure: Avg Delay by Airline (in minutes) (vertical axis, roughly 10 to 30) by day of week (horizontal axis, 1 to 7)]
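The slides do not show how the summaries behind these two plots were computed. One plausible way, sketched here on made-up rows, is to group flights leaving BDL by month and carrier and average the arrival delay inside the database. The table contents are hypothetical, and DBI, RSQLite, dplyr, and dbplyr are assumed to be installed.

```r
library(DBI)
library(dplyr)  # the database backend additionally needs the dbplyr package installed

# Toy stand-in for the flights table (hypothetical values)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
toy_flights <- data.frame(
  Month = c(1L, 1L, 1L, 2L, 2L),
  UniqueCarrier = c("DL", "DL", "OH", "DL", "OH"),
  Origin = c("BDL", "BDL", "BDL", "BDL", "BOS"),
  ArrDelay = c(10L, 20L, 4L, 30L, 50L)
)
dbWriteTable(con, "flights", toy_flights)

# Average arrival delay per month and carrier, for flights leaving BDL.
# The grouping and averaging run in SQLite; only the summary rows reach R.
monthly <- tbl(con, "flights") %>%
  filter(Origin == "BDL") %>%
  group_by(Month, UniqueCarrier) %>%
  summarise(AvgArrivalDelay = mean(ArrDelay, na.rm = TRUE)) %>%
  collect() %>%
  ungroup() %>%
  arrange(Month, UniqueCarrier)

print(monthly)
dbDisconnect(con)
```

Swapping `Month` for `DayOfWeek` in `group_by()` would give the day-of-week version of the same summary.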

Setup on your own machine (data from 2014)
  Update existing packages: update.packages()
  Install necessary packages: RSQLite, dplyr, tidyr, mosaic, knitr, nycflights13, lubridate, igraph, markdown
  Create a new project in RStudio and a directory to store the data and database
  Download files: see list at http://www.amherst.edu/~nhorton/precursors
  Set up the database: run load-sqlite.r
  Take it for a spin: run test-sqlite.rmd

Next steps
  Want more data? See the additional instructions to access more than just the data from 2014
  See also: the dplyr package vignette on databases
  See also: the precursors paper (in press) at http://www.amherst.edu/~nhorton/precursors
