Introduction to the data.table package in R




Revised: October 2, 2014 (A later revision may be available on the homepage)

Introduction

This vignette is aimed at those who are already familiar with creating and subsetting data.frame in R. We aim for this quick introduction to be readable in 10 minutes, briefly covering a few features: 1. Keys; 2. Fast grouping; and 3. Fast ordered joins.

Creation

Recall that we create a data.frame using the function data.frame():

> DF = data.frame(x=c("b","b","b","a","a"), v=rnorm(5))
> DF
  x          v
1 b  0.1913554
2 b  1.2556559
3 b -0.3871878
4 a  0.7918896
5 a -0.8527794

A data.table is created in exactly the same way:

> DT = data.table(x=c("b","b","b","a","a"), v=rnorm(5))
> DT
   x          v
1: b  1.1279835
2: b -0.9044895
3: b -0.8075297
4: a  1.3814964
5: a -0.3773382

Observe that a data.table prints the row numbers with a colon so as to visually separate the row numbers from the first column. We can easily convert existing data.frame objects to data.table.

> CARS = data.table(cars)
> head(CARS)
   speed dist
1:     4    2
2:     4   10
3:     7    4
4:     7   22
5:     8   16
6:     9   10
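A point worth checking at this stage (a minimal sketch, assuming only that the data.table package is installed): a data.table inherits from data.frame, so existing data.frame code continues to work on it.

```r
# Sketch: a data.table is also a data.frame.
# Uses the built-in cars dataset, as above.
library(data.table)

CARS = data.table(cars)
class(CARS)            # "data.table" "data.frame"
is.data.frame(CARS)    # TRUE
nrow(CARS)             # 50, same as nrow(cars)
```

This is why functions written for data.frame, such as nrow() and head(), work unchanged on a data.table.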

We have just created two data.tables: DT and CARS. It is often useful to see a list of all data.tables in memory:

> tables()
     NAME NROW NCOL MB COLS       KEY
[1,] CARS   50    2  1 speed,dist
[2,] DT      5    2  1 x,v
Total: 2MB

The MB column is useful to quickly assess memory use and to spot if any redundant tables can be removed to free up memory. Just like data.frames, data.tables must fit inside RAM. Some users regularly work with 20 or more tables in memory, rather like a database. The result of tables() is itself a data.table, returned silently, so that tables() can be used in programs. tables() is unrelated to the base function table().

To see the column types:

> sapply(DT, class)
          x           v
"character"   "numeric"

You may have noticed the empty column KEY in the result of tables() above. This is the subject of the next section.

1. Keys

Let's start by considering data.frame, specifically rownames. We know that each row has exactly one row name. However, a person (for example) has at least two names, a first name and a second name. It is useful to organise a telephone directory sorted by surname then first name.

In data.table, a key consists of one or more columns. These columns may be integer, factor or numeric as well as character. Furthermore, the rows are sorted by the key. Therefore, a data.table can have at most one key because it cannot be sorted in more than one way. We can think of a key as like super-charged row names; i.e., multi-column and multi-type. Uniqueness is not enforced; i.e., duplicate key values are allowed. Since the rows are sorted by the key, any duplicates in the key will appear consecutively.

Let's remind ourselves of our tables:

> tables()
     NAME NROW NCOL MB COLS       KEY
[1,] CARS   50    2  1 speed,dist
[2,] DT      5    2  1 x,v
Total: 2MB
> DT
   x          v
1: b  1.1279835
2: b -0.9044895
3: b -0.8075297
4: a  1.3814964
5: a -0.3773382

No keys have been set yet.

> DT[2,]   # select row 2

   x          v
1: b -0.9044895

> DT[x=="b",]   # select rows where column x == "b"
   x          v
1: b  1.1279835
2: b -0.9044895
3: b -0.8075297

Aside: notice that we did not need to prefix x with DT$. In data.table queries, we can use column names as if they are variables directly.

But since there are no rownames, the following does not work:

> cat(try(DT["b",], silent=TRUE))
Error in `[.data.table`(DT, "b", ) : When i is a data.table (or character vector), x must be keyed (i.e. sorted, and, marked as sorted)

The error message tells us we need to use setkey():

> setkey(DT, x)
> DT
   x          v
1: a  1.3814964
2: a -0.3773382
3: b  1.1279835
4: b -0.9044895
5: b -0.8075297

Notice that the rows in DT have now been re-ordered according to the values of x. The two "a" rows have moved to the top. We can confirm that DT does indeed have a key using haskey(), key(), attributes(), or just running tables().

> tables()
     NAME NROW NCOL MB COLS       KEY
[1,] CARS   50    2  1 speed,dist
[2,] DT      5    2  1 x,v        x
Total: 2MB

Now that we are sure DT has a key, let's try again:

> DT["b",]
   x          v
1: b  1.1279835
2: b -0.9044895
3: b -0.8075297

By default all the rows in the group are returned [1]. The mult argument (short for multiple) allows the first or last row of the group to be returned instead.

> DT["b", mult="first"]
   x        v
1: b 1.127983

[1] In contrast to a data.frame, where only the first rowname is returned when the rownames contain duplicates.

> DT["b", mult="last"]
   x          v
1: b -0.8075297

Also, the comma is optional.

> DT["b"]
   x          v
1: b  1.1279835
2: b -0.9044895
3: b -0.8075297

Let's now create a new data.frame. We will make it large enough to demonstrate the difference between a vector scan and a binary search.

> grpsize = ceiling(1e7/26^2)   # 10 million rows, 676 groups
> grpsize
[1] 14793
> tt = system.time( DF <- data.frame(
+     x=rep(LETTERS, each=26*grpsize),
+     y=rep(letters, each=grpsize),
+     v=runif(grpsize*26^2),
+     stringsAsFactors=FALSE)
+ ); tt
   user  system elapsed
  5.696   0.272   5.981
> head(DF, 3)
  x y          v
1 A a 0.59188461
2 A a 0.09419788
3 A a 0.35631785
> tail(DF, 3)
         x y         v
10000066 Z z 0.6131090
10000067 Z z 0.8320838
10000068 Z z 0.6368332
> dim(DF)
[1] 10000068        3

We might say that R has created a 3 column table and inserted 10,000,068 rows. It took 5.981 secs, so it inserted 1,671,972 rows per second. This is normal in base R. Notice that we set stringsAsFactors=FALSE. This makes it a little faster for a fairer comparison, but feel free to experiment.

Let's extract an arbitrary group from DF:

> tt = system.time(ans1 <- DF[DF$x=="R" & DF$y=="h",]); tt   # 'vector scan'
   user  system elapsed
 13.436   0.376  13.845
> head(ans1, 3)

        x y         v
6642058 R h 0.7503649
6642059 R h 0.9192344
6642060 R h 0.6846727
> dim(ans1)
[1] 14793     3

Now convert to a data.table and extract the same group:

> DT = as.data.table(DF)   # but normally use fread() or data.table() directly
> system.time(setkey(DT, x, y))   # one-off cost, usually
   user  system elapsed
  0.432   0.040   0.475
> ss = system.time(ans2 <- DT[list("R","h")]); ss   # binary search
   user  system elapsed
  0.012   0.000   0.010
> head(ans2, 3)
   x y         v
1: R h 0.7503649
2: R h 0.9192344
3: R h 0.6846727
> dim(ans2)
[1] 14793     3
> identical(ans1$v, ans2$v)
[1] TRUE

At 0.010 seconds, this was 1384 times faster than 13.845 seconds, and produced precisely the same result. If you are thinking that a few seconds is not much to save, it's the relative speedup that's important. The vector scan is linear, but the binary search is O(log n). It scales. If a task taking 10 hours is sped up by 100 times to 6 minutes, that is significant [2].

We can do vector scans in data.table, too. In other words, we can use data.table badly.

> system.time(ans1 <- DT[x=="R" & y=="h",])   # works but is using data.table badly
   user  system elapsed
  6.428   0.132   6.572
> system.time(ans2 <- DF[DF$x=="R" & DF$y=="h",])   # the data.frame way
   user  system elapsed
 12.592   0.340  13.024
> mapply(identical, ans1, ans2)
   x    y    v
TRUE TRUE TRUE

[2] We wonder how many people are deploying parallel techniques to code that is vector scanning.

If the phone book analogy helped, the 1384 times speedup should not be surprising. We use the key to take advantage of the fact that the table is sorted, and use binary search to find the matching rows. We didn't vector scan; we didn't use ==.

When we used x=="R" we scanned the entire column x, testing each and every value to see if it equalled "R". We did it again in the y column, testing for "h". Then & combined the two logical results to create a single logical vector which was passed to the [ method, which in turn searched it for TRUE and returned those rows. These were vectorized operations. They occurred internally in R and were very fast, but they were scans. We did those scans because we wrote that R code.

When i is a list (and data.table is a list too), we say that we are joining. In this case, we are joining DT to the 1 row, 2 column table returned by list("R","h"). Since we do this a lot, there is an alias for list: .().

> identical( DT[list("R","h"),],
+            DT[.("R","h"),])
[1] TRUE

Both vector scanning and binary search are available in data.table, but one way of using data.table is much better than the other. The join syntax is short: fast to write and easy to maintain. Passing a data.table into a data.table subset is analogous to the A[B] syntax in base R, where A is a matrix and B is a 2-column matrix [3]. In fact, the A[B] syntax in base R inspired the data.table package. There are other types of ordered joins and further arguments which are beyond the scope of this quick introduction.

The merge method of data.table is very similar to X[Y], but there are some differences. See FAQ 1.12.

This first section has been about the first argument inside DT[...], namely i. The next section is about the 2nd and 3rd arguments: j and by.

2. Fast grouping

The second argument to DT[...] is j and may consist of one or more expressions whose arguments are (unquoted) column names, as if the column names were variables, just as we saw earlier in i.
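Before returning to the large DT, here is a minimal sketch of that idea on a tiny toy table (DT2 and its values are invented for illustration): j can be any R expression of the column names.

```r
# Sketch: j may be any expression of (unquoted) column names.
# DT2 is a toy table invented for illustration; it is not the
# 10-million-row DT used elsewhere in this vignette.
library(data.table)

DT2 = data.table(x=c("a","a","b"), v=c(1,2,3))
DT2[, sum(v)]                      # a single expression: 6
DT2[, mean(v) + 1]                 # any R expression of columns works: 3
DT2[, list(total=sum(v), n=.N)]    # several named expressions at once
                                   # (.N is the number of rows in the group)
```

Because j is evaluated within the scope of the data.table, no DT2$ prefix is needed, exactly as we saw for i earlier.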
> DT[, sum(v)]
[1] 5000407

When we supply a j expression and a by expression, the j expression is repeated for each by group.

> DT[, sum(v), by=x]
    x       V1
 1: A 192394.9
 2: B 192440.5
 3: C 192119.2
 4: D 192122.4
 5: E 192349.2
 6: F 192197.9
 7: G 192277.7
 8: H 192574.9
 9: I 192246.2
10: J 192648.9
11: K 192287.5

[3] Subsetting a keyed data.table by an n-column data.table is consistent with subsetting an n-dimension array by an n-column matrix in base R.

12: L 192049.7
13: M 192269.9
14: N 192338.9
15: O 192740.3
16: P 192241.7
17: Q 192269.7
18: R 192373.5
19: S 192115.2
20: T 192625.8
21: U 192179.6
22: V 192301.1
23: W 192043.1
24: X 192340.7
25: Y 192453.1
26: Z 192405.4
    x       V1

The by in data.table is fast. Let's compare it to tapply.

> ttt = system.time(tt <- tapply(DT$v, DT$x, sum)); ttt
   user  system elapsed
  7.360   0.976   8.529
> sss = system.time(ss <- DT[, sum(v), by=x]); sss
   user  system elapsed
  0.656   0.048   0.706
> head(tt)
       A        B        C        D        E        F
192394.9 192440.5 192119.2 192122.4 192349.2 192197.9
> head(ss)
   x       V1
1: A 192394.9
2: B 192440.5
3: C 192119.2
4: D 192122.4
5: E 192349.2
6: F 192197.9
> identical(as.vector(tt), ss$V1)
[1] TRUE

At 0.706 sec, this was 12 times faster than 8.529 sec, and produced precisely the same result.

Next, let's group by two columns:

> ttt = system.time(tt <- tapply(DT$v, list(DT$x, DT$y), sum)); ttt
   user  system elapsed
 10.156   1.444  11.661
> sss = system.time(ss <- DT[, sum(v), by="x,y"]); sss
   user  system elapsed
  0.776   0.052   0.826

> tt[1:5, 1:5]
         a        b        c        d        e
A 7340.570 7370.085 7466.794 7350.710 7351.832
B 7383.704 7337.640 7345.794 7458.266 7413.636
C 7387.638 7363.944 7378.682 7366.018 7484.329
D 7413.228 7364.502 7388.166 7432.694 7467.051
E 7397.637 7402.158 7382.014 7336.074 7396.296
> head(ss)
   x y       V1
1: A a 7340.570
2: A b 7370.085
3: A c 7466.794
4: A d 7350.710
5: A e 7351.832
6: A f 7408.410
> identical(as.vector(t(tt)), ss$V1)
[1] TRUE

This was 14 times faster, and the syntax is a little simpler and easier to read.

3. Fast ordered joins

This is also known as last observation carried forward (LOCF) or a rolling join.

Recall that X[Y] is a join between data.table X and data.table Y. If Y has 2 columns, the first column is matched to the first column of the key of X and the 2nd column to the 2nd. An equi-join is performed by default, meaning that the values must be equal.

Instead of an equi-join, a rolling join is:

> X[Y, roll=TRUE]

As before, the first column of Y is matched to X where the values are equal. The last join column in Y, though (the 2nd one in this example), is treated specially: if no match is found, then the row before is returned, provided the first column still matches. Further controls are rolling forwards, backwards, nearest and limited staleness.

For examples type example(data.table) and follow the output at the prompt.
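To make the "row before is returned" behaviour concrete, here is a minimal sketch of a rolling join. The tables (prices and queries) and their contents are invented for illustration: prices holds observations at times 1, 5 and 9, and we query times 6 and 8, where nothing was observed.

```r
# Sketch of a rolling join: carry the last observation forward.
# Tables and values are invented for illustration.
library(data.table)

prices  = data.table(id=c("A","A","A"), time=c(1L,5L,9L), price=c(10,11,12),
                     key=c("id","time"))
queries = data.table(id=c("A","A"), time=c(6L,8L))

prices[queries]             # equi-join: no observation at times 6 or 8,
                            # so price is NA in both rows
prices[queries, roll=TRUE]  # rolling join: the last join column (time) rolls,
                            # so the time 5 observation (price 11) is carried
                            # forward to both query times
```

The first (id) column must still match exactly; only the last key column rolls, which is what makes this a fast LOCF on ordered data.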