The saves Package: an approximate benchmark of performance issues while loading datasets

Gergely Daróczi <daroczig@rapporter.net>

December 27, 2013

1 Introduction

The purpose of this package is to save and load only the needed variables/columns of a data frame. This is done with special binary files (tar archives), which seems to be a lot faster than loading the whole binary object (Rdata file) via the basic load() function, or than loading columns from SQLite/MySQL databases via SQL commands. The performance gain over the basic load() function is even more noticeable on SSD drives.

The improvement gained by loading only the chosen variables in binary format can be useful in some special cases (e.g. where merging data tables is not an option and very different datasets are needed for reporting), but make sure before using this package that you really need it, as it relies on non-standard file formats!

2 Requirements

The package has no special requirements: it does not depend on any non-standard package.

3 Usage

Basic usage requires the user to save R objects (lists and data frames) to a special binary format via the saves() function. For example, saving the diamonds data frame from the ggplot2 package is done as follows:

> saves(diamonds)
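(The call above assumes the package and the example data set are already available in the session; a minimal setup sketch, assuming installation from CRAN, would be: install.packages("saves"); library(saves); library(ggplot2); data(diamonds).)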

This command will create a file named diamonds.Rdatas in the current working directory. This file is a compressed tar archive which contains the binary objects of all variables of the diamonds data frame (created by issuing a simple save() on each of them). A custom filename can be specified via the file parameter, see:

> saves(diamonds, file = "foo.bar")

The saved binary format can be loaded via the loads() function:

> df <- loads("diamonds", variables = c("color", "cut"))

Of course, the vector of variables to be loaded has to be specified. If you need to load all variables, you should use load() instead of loads().

4 Advanced (crazy) usage

The ultra.fast option of saves() and loads() makes the commands run a lot faster, but without any checks on user input, so be careful and use it only if you really know what you are up to! This option eliminates all internal control over the given parameters: checks are left out for greater performance, no compression is done while saving the variables, and all files are put in a new directory. To save a data frame for later ultra.fast loading:

> saves(diamonds, ultra.fast = TRUE)

Loading the data frame (or list) is done via the loads() function just like above, with an extra parameter:

> df <- loads("diamonds", variables = c("color", "cut"), ultra.fast = TRUE)
> str(df)
List of 2
 $ color: Factor w/ 7 levels "D","E","F","G",..: 2 2 2 6 7 7 6 5 2 5 ...
 $ cut  : Factor w/ 5 levels "Fair","Good",..: 5 4 2 4 2 3 3 3 1 3 ...

For performance reasons it is advised not to convert the loaded R object from list to data frame, so specifying the to.data.frame = FALSE parameter might be a good choice (though this is the default setting, it might change in the future):

> df <- loads("diamonds", variables = c("color", "cut"), to.data.frame = FALSE, ultra.fast = TRUE)
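To check that the column-wise load indeed returns the same data, the result can be compared against directly subsetting the original object. A minimal sketch (assuming the diamonds archive created above; on recent ggplot2 versions diamonds is a tibble, hence the coercion to a plain data frame first):

> df <- loads("diamonds", variables = c("color", "cut"), to.data.frame = TRUE)
> all.equal(df, as.data.frame(diamonds)[, c("color", "cut")])   # should be TRUE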

5 Benchmarks

The benchmark was run on a HP 6715b laptop (AMD64 X2 TL-60 2 GHz dual-core, 4 GB RAM) with a standard SSD drive. The most important parts of the procedure were also repeated in a server environment (2x Intel QuadCore 5410 at 2.3 GHz, 8 GB RAM, 2x 500 GB WD RE3 HDDs in RAID1). The results shown in this paper are based on the experiments run on the laptop; the server's results are marked/annotated accordingly.

To be able to compare the different procedures of loading data, the following environment was set up. Loading the required packages:

> library(microbenchmark)   # for benchmarking (instead of system.time)
> library(foreign)          # different loading algorithms and formats
> library(sqldf)
> library(RMySQL)
> con <- dbConnect(MySQL(), user = "foo", dbname = "bar", host = "localhost", password = "")
> library(saves)
> library(ggplot2)          # for plotting and also for the diamonds dataset

Getting the required data:

1. the diamonds data frame from ggplot2 via data(diamonds) [53,940 obs. of 10 variables],
2. the European Social Survey (Hungary, 4th wave), available at http://www.esshu.hu/letolthetoadatbazisok [1,544 obs. of 508 variables].

Transforming each data frame to the required formats:

> save(diamonds, file = "diamonds.Rdata")
> save(diamonds, file = "diamonds_nocompress.Rdata", compress = FALSE, precheck = FALSE)
> write.table(diamonds, file = "diamonds.csv", sep = ",", quote = FALSE, row.names = FALSE)
> data <- diamonds            # work on a copy before writing to MySQL
> names(data)[6] <- "tablee"  # as the default name ("table") is internal in MySQL
> dbWriteTable(con, "diamonds", data, overwrite = TRUE)
> saves(diamonds)
> saves(diamonds, file = "diamonds_uf.Rdatas", ultra.fast = TRUE)
> rm(data)

The SPSS sav files were created from the above csv files with external software. All the same was done for the ESS dataset (ESS_HUN_4).

Functions to test:

> sav     <- function() read.spss("diamonds.sav")
> csv     <- function() read.csv("diamonds.csv")
> classes <- sapply(diamonds, class)
> csv2    <- function() read.table("diamonds.csv", header = TRUE, sep = ",",
      comment.char = "", stringsAsFactors = FALSE, colClasses = classes,
      nrows = 53940)
> sqldf   <- function() {
      f <- file("diamonds.csv")
      sqldf::sqldf("select carat, clarity from f", dbname = tempfile(),
          file.format = list(header = TRUE, row.names = FALSE, sep = ","),
          drv = sqldf.driver)
  }
> sql     <- function() {
      data  <- dbReadTable(con, "diamonds")
      query <- "SELECT carat, clarity FROM diamonds"
      dbGetQuery(con, query)
  }
> binary  <- function() load("diamonds.Rdata")
> binary2 <- function() load("diamonds_nocompress.Rdata")
> loads   <- function() saves::loads("diamonds.Rdatas", c("carat", "clarity"))
> loads2  <- function() saves::loads("diamonds", c("carat", "clarity"), ultra.fast = TRUE)

And the appropriate form of the above for the ESS_HUN_4 data frame also:

> sav     <- function() read.spss("ESS_HUN_4.sav")
> csv     <- function() read.csv("ESS_HUN_4.csv")
> classes <- sapply(data, class)   # column classes of the ESS data frame
> csv2    <- function() read.table("ESS_HUN_4.csv", header = TRUE, sep = ",",
      comment.char = "", stringsAsFactors = FALSE, colClasses = classes,
      nrows = 1544)
> sqldf   <- function() {
      f <- file("ESS_HUN_4.csv")
      sqldf::sqldf("select length, a1 from f", dbname = tempfile(),
          file.format = list(header = TRUE, row.names = FALSE, sep = ","),
          drv = sqldf.driver)
  }
> sql     <- function() {
      data  <- dbReadTable(con, "data")
      query <- "SELECT length, a1 FROM data"
      dbGetQuery(con, query)
  }
> binary  <- function() load("ESS_HUN_4.Rdata")
> binary2 <- function() load("ESS_HUN_4_nocompress.Rdata")
> loads   <- function() saves::loads("ESS_HUN_4.Rdatas", c("length", "a1"))
> loads2  <- function() saves::loads("ESS_HUN_4", c("length", "a1"), ultra.fast = TRUE)

Here sav() stands for reading an SPSS sav file, csv() for reading comma separated values, and csv2() for the same but in a much more optimized way, while sqldf() and sql() read the data frames from a virtual (temporary) database and a real MySQL database, respectively. binary() loads the object from an Rdata file, binary2() from an uncompressed one. Reading data frames with loads() and loads2() is implemented in this package; the latter uses the ultra.fast = TRUE parameter.

The differences in the time required to save data frames in binary format with the various parameters of save() (compress, precheck, ascii) can be simulated easily as follows:

> x  <- rnorm(1000)
> s1 <- function() save(x, file = "temp", compress = FALSE, precheck = FALSE)
> s2 <- function() save(x, file = "temp", compress = FALSE)
> s3 <- function() save(x, file = "temp")
> s4 <- function() save(x, file = "temp", precheck = FALSE)
> s5 <- function() save(x, file = "temp", precheck = FALSE, ascii = TRUE)
> res_saves <- microbenchmark(s1(), s2(), s3(), s4(), s5(), times = 100)
> boxplot(res_saves)

[Figure: boxplot of the s1()-s5() save() timings per expression, log(time) [ns], roughly 5e+05 to 5e+06 ns]

For the sake of the Solid State Drive's longer life-span, the benchmarking processes (writing to and loading from disk) were run only a hundred times. Previous experiments showed that longer benchmarking runs did not reveal more surprising details, but I welcome any donation of hardware for these purposes! :) Of course, only reading the different types of data frames from disk (without benchmarking write speeds) would not lower the life-span of the drives, but the results of running the processes 10,000 times will only be attached to the next version of this paper.

Benchmarking of the above functions was done via:

> microbenchmark(sav(), csv(), csv2(), sqldf(), sql(), binary(), binary2(),
      loads(), loads2(), times = 100)

for both datasets (diamonds and ESS_HUN_4).
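The tables in the following sections are essentially the default print of the returned microbenchmark objects. As a sketch (the object name res is illustrative only), the same results can be printed in a friendlier unit and plotted:

> res <- microbenchmark(sav(), csv(), csv2(), sqldf(), sql(), binary(),
      binary2(), loads(), loads2(), times = 100)
> print(res, unit = "ms")   # medians etc. in milliseconds instead of nanoseconds
> boxplot(res)              # log-scaled boxplot, as shown below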

6 Results - diamonds

Run on the laptop with SSD attached (high IO capability):

Unit: nanoseconds
                  min          lq      median          uq         max
sav()       247768186   269080764   277842078   298155661   539805845
csv()       558015701   575679546   592369539   681940860  1005660411
csv2()      412340206   432280842   447627948   468533077   684155492
sqldf()     547616215   561715835   571669320   586092861   781810538
sql()       894087049  1116075540  1146487458  1181165774  1280771316
binary()    123704605   124799280   129533188   137940692   330225725
binary2()    84195442    88034138    88451503    99073221   241455191
loads()      96727569   108917898   118919434   123527491   170300594
loads2()     19032016    19889932    19952370    20595877   269095433

[Figure: boxplot of the above timings per expression, log(time) [ns]]

Run in the server environment (high CPU, but low IO capability):

Unit: nanoseconds
                  min          lq      median          uq         max
sav()       169058869   241284026   246358267   248229742   261133064
csv()       447431996   512482638   517086442   522108758   555228387
csv2()      383554684   388878624   392469560   448931604   458952437
sqldf()     477170533   485206866   493820346   514099318   867471078
sql()       576023018   597863746   650239598   667263688   851414226
binary()     82564640    83481720    84699074    86497266   155718942
binary2()    80594466    81493388    82258268    84792060   149867083
loads()      67956000    71764023    74152600    76218841   101806185
loads2()     16121788    16525269    16748340    18332214    28500877

[Figure: boxplot of the above timings per expression, log(time) [ns]]

7 Results - ESS

Run on the laptop with SSD attached (high IO capability):

Unit: nanoseconds
                  min          lq      median          uq         max
sav()      1546345401  1546345401  1555045257  1563745113  1563745113
csv()       608659349   608659349   614847051   621034753   621034753
csv2()      432886076   432886076   446772544   460659013   460659013
sql()        22967372    22967372    23307353    23647334    23647334
binary()    195957930   195957930   292467874   388977819   388977819
binary2()   173168231   173168231   180481340   187794448   187794448
loads()      91112136    91112136    91452398    91792659    91792659
loads2()      1922003     1922003     2025506     2129010     2129010

[Figure: boxplot of the above timings per expression, log(time) [ns]]

Run in the server environment (high CPU, but low IO capability):

Unit: nanoseconds
                  min          lq      median          uq         max
sav()      1002706199  1069822150  1072309107  1074794738  1092789416
csv()       429448685   437668012   456707466   489854026   550664349
csv2()      370109696   372899655   382606256   400250634   461689820
sql()       454294835   532463950   537078038   542445448   676130528
sqldf()     458514868   475001036   483471416   498829774   722040292
binary()    115832235   117273083   120065867   124422444   182590704
binary2()   126332841   127640386   130243322   133516098   191742881
loads()      71019188    78846217    80092247    82805980   134582247
loads2()      1360472     1678182     2059142     2542211     6710979

[Figure: boxplot of the above timings per expression, log(time) [ns]]
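As a quick sanity check on the magnitudes above, the relative speedups can be computed directly from the reported medians. A minimal sketch using the server results for the ESS dataset (values hard-coded from the table above, in nanoseconds):

> medians <- c(binary = 120065867, loads = 80092247, loads2 = 2059142)
> round(medians["binary"] / medians, 1)   # speedup relative to load()
binary  loads loads2
   1.0    1.5   58.3

So loading just the two needed columns with ultra.fast = TRUE is roughly 58 times faster than loading the whole compressed Rdata file in this setting.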

Any feedback is welcome!