The ff package: Handling Large Data Sets in R with Memory Mapped Pages of Binary Flat Files

Similar documents

High-Performance Processing of Large Data Sets via Memory Mapping A Case Study in R and C++

Your Best Next Business Solution Big Data In R 24/3/2010

Deal with big data in R using bigmemory package

Outline. hardware components programming environments. installing Python executing Python code. decimal and binary notations running Sage

Filesystems Performance in GNU/Linux Multi-Disk Data Storage

How To Write Portable Programs In C

The programming language C. sws1 1

Chapter 11 I/O Management and Disk Scheduling

Operating Systems. Design and Implementation. Andrew S. Tanenbaum Melanie Rieback Arno Bakker. Vrije Universiteit Amsterdam

White paper. QNAP Turbo NAS with SSD Cache

Outline. Operating Systems Design and Implementation. Chap 1 - Overview. What is an OS? 28/10/2014. Introduction

Sawmill Log Analyzer Best Practices!! Page 1 of 6. Sawmill Log Analyzer Best Practices

The Classical Architecture. Storage 1 / 36

Uses for Virtual Machines. Virtual Machines. There are several uses for virtual machines:

Challenges in Optimizing Job Scheduling on Mesos. Alex Gaudio

System Requirements Table of contents

Hardware Configuration Guide

IN STA LLIN G A VA LA N C HE REMOTE C O N TROL 4. 1

Lecture 17: Virtual Memory II. Goals of virtual memory

CLC Server Command Line Tools USER MANUAL

Computer Architecture. Secure communication and encryption.

Cloud n Service Presentation. NTT Communications Corporation Cloud Services

Digital Forensics Lecture 3. Hard Disk Drive (HDD) Media Forensics

CS 377: Operating Systems. Outline. A review of what you ve learned, and how it applies to a real operating system. Lecture 25 - Linux Case Study

CIS570 Modern Programming Language Implementation. Office hours: TDB 605 Levine

Chapter 11: File System Implementation. Operating System Concepts with Java 8 th Edition

Topology Aware Analytics for Elastic Cloud Services

Characteristics of Java (Optional) Y. Daniel Liang Supplement for Introduction to Java Programming

PERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0

RevoScaleR Speed and Scalability

Project: Simulated Encrypted File System (SEFS)

INSTALLATION MINIMUM REQUIREMENTS. Visit us on the Web

Cisco Prime Home 5.0 Minimum System Requirements (Standalone and High Availability)

Parallels Server 4 Bare Metal

Obj: Sec 1.0, to describe the relationship between hardware and software HW: Read p.2 9. Do Now: Name 3 parts of the computer.

Computer Layers. Hardware BOOT. Operating System. Applications

10CS35: Data Structures Using C

Objectives. Python Programming: An Introduction to Computer Science. Lab 01. What we ll learn in this class

Lesson 06: Basics of Software Development (W02D2

OPTOFORCE DATA VISUALIZATION 3D

High Availability of the Polarion Server

Encrypt-FS: A Versatile Cryptographic File System for Linux

Solid State Storage in Massive Data Environments Erik Eyberg

Casper Suite Release Notes. Version 9.1

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

Indexing on Solid State Drives based on Flash Memory

Google File System. Web and scalability

Azure VM Performance Considerations Running SQL Server

Chapter 2: Computer-System Structures. Computer System Operation Storage Structure Storage Hierarchy Hardware Protection General System Architecture

Optional custom API wrapper. C/C++ program. M program

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Taking Linux File and Storage Systems into the Future. Ric Wheeler Director Kernel File and Storage Team Red Hat, Incorporated

8Gb Fibre Channel Adapter of Choice in Microsoft Hyper-V Environments

File Management. Chapter 12

Benchmarking Hadoop & HBase on Violin

Visual Basic Programming. An Introduction

Windows Server Performance Monitoring

openlca 1.4 overview and first steps

Machine Architecture and Number Systems. Major Computer Components. Schematic Diagram of a Computer. The CPU. The Bus. Main Memory.

Lecture 10: Dynamic Memory Allocation 1: Into the jaws of malloc()

CS 51 Intro to CS. Art Lee. September 2, 2014

Experience with Server Self Service Center (S3C)

Two Parts. Filesystem Interface. Filesystem design. Interface the user sees. Implementing the interface

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Physical Data Organization

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

How do Users and Processes interact with the Operating System? Services for Processes. OS Structure with Services. Services for the OS Itself

XTM Web 2.0 Enterprise Architecture Hardware Implementation Guidelines. A.Zydroń 18 April Page 1 of 12

System Calls and Standard I/O

SECURE, ENTERPRISE FILE SYNC AND SHARE WITH EMC SYNCPLICITY UTILIZING EMC ISILON, EMC ATMOS, AND EMC VNX

Government Girls Polytechnic, Bilaspur

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-17: Memory organisation, and types of memory

Main Memory Data Warehouses

Memory Management CS 217. Two programs can t control all of memory simultaneously

Systemverwaltung 2009 AIX / LPAR

Data Centers and Cloud Computing

DESKTOP VIRTUALIZATION SIZING AND COST MODELING PROCESSOR MEMORY STORAGE NETWORK

DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION

Virtuoso and Database Scalability

CS 40 Computing for the Web

COS 318: Operating Systems

Guide to SATA Hard Disks Installation and RAID Configuration

CHAPTER 7: The CPU and Memory

Virtualization. Explain how today s virtualization movement is actually a reinvention

Xen Live Migration. Networks and Distributed Systems Seminar, 24 April Matúš Harvan Xen Live Migration 1

Speeding Up Cloud/Server Applications Using Flash Memory

Sources: On the Web: Slides will be available on:

Eastern Washington University Department of Computer Science. Questionnaire for Prospective Masters in Computer Science Students

Enterprise Architectures for Large Tiled Basemap Projects. Tommy Fauvell

Rackspace Cloud Databases and Container-based Virtualization

Virtualization for Cloud Computing

Cloud Computing. Up until now

SAP Solutions on VMware Infrastructure 3: Customer Implementation - Technical Case Study

Mimer SQL Real-Time Edition White Paper

Cloud Computing CS

PARALLELS SERVER 4 BARE METAL README

Chapter 12 File Management

Transcription:

UseR! 2007, Iowa State University, Ames, August 8-108 2007 The ff package: Handling Large Data Sets in R with Memory Mapped Pages of Binary Flat Files D. Adler, O. Nenadić, W. Zucchini, C. Gläser Institute for Statistics and Econometrics, Georg-August-Universität Göttingen, Germany <dadler,onenadi,wzucchi,cglaese>@uni-goettingen.de 1

0. Overview Introduction The ff package Selected examples Architecture Summary and conclusion 2

1. Introduction Two issues when dealing with large data sets in R: Memory limitations On most computer systems it is not possible to use more than 2 GB of memory, i.e. it is not possible to use the data (in the usual way) Addressing limitations The range of indexes that can be used is limited, i.e. computers don t understand arbitrary large numbers 3

1. Introduction Memory limitations On 32-bit OSes the maximum amount of memory (virtual memory space) is limited to 2-4 GB; one cannot store larger data into memory In general, it is impracticable to handle data that is larger than the available RAM (resorting to virtual memory drastically slows down things) Another issue is given by the question whether all data need to be present in memory at the same time (e.g. when only a random sample of a large data set is considered) 4

1. Introduction Memory limitations, cont. A solution to the memory limitation problem is given by considering only parts of the data at a time, i.e. instead of loading the entire data set into memory only chunks thereof are loaded upon request R Memory Data on persistant storage The ff package was designed to provide convenient access to large data from persistant storage Only one small section of the data (typically 4-64KB) is mirrored into main memory at a time 5

1. Introduction Addressing limitations Specific issue for 32-bit machines: The maximum addressable range goes up to 2 31-1 ; this is the biggest representable (signed) integer further entries not accessable 2 31-1 entries > as.integer(2^31-1) [1] 2147483647 > as.integer(2^31) [1] NA Warning message: NAs introduced by coercion In other words, the addressing issue limits the size of the data that can be analyzed to 16 GB (for double) The memory limitation usually kicks in before the addressing limitation > x <- rep(0, 2^31-1) Error: cannot allocate vector of length 2147483647 6

1. Introduction Addressing limitations, cont. On 32-bit R systems things get complicated: R uses 32-bit integer arithmetics, while the hard disk is addressed with 64 bits (on most filing systems). Also, C++ provides 64-bit integer arithmetics on 32-bit systems. A simple trick to extend the addressable range on 32-bit machines is to introduce multi-indices, 32-bit index 32-bit index 64-bit index On the R side multiple 32-bit indices are used; these are converted into one 64- bit index on the C++ side 7

2. The ff package An overview of the ff package The ff package introduces a new R object type acting as a container. It operates on large binary flat files (double numeric vector). Changes to the R object are immediately written on the file. The ff package comprises the following two parts - a low-level layer written in C++ - a high-level layer in R The package was designed for convenient access to large data sets: - large data sets (i.e. ff objects) are accessed in the same way as ordinary R objects 8

2. The ff package The R Programming Interface ( high-level layer) The R layer comprises the following sections: Opening / Creating flat files Controlled by the two core functions ff and ffm. When a length or dim argument is specified, a new file is created, otherwise an existing file is opened I/O operations These are controlled by the [ operator (for reading) and the [<- operator for writing Generic functions and methods for ff and ffm objects Methods for dim and length are provided and the function sample is converted to a generic function Auxillary functions include e.g. seqpack for optimization purposes 9

3. Selected Examples Selected examples of usage Creating a (one-dimensional) flat file: > library("ff") > foo1 <- ff("foo1", length = 10) > foo1 $ff.attributes class file pagesize readonly "ff" "foo1" "65536" "FALSE" $first.values [1] 0 0 0 0 0 0 0 0 0 0 > foo1[1:10] [1] 0 0 0 0 0 0 0 0 0 0 Modifying data: > data("rivers") > foo1[1:10] <- rivers[1:10] > foo1[1:10] [1] 735 320 325 392 524 450 1459 135 465 600 10

3. Selected Examples Selected examples of usage, cont. Creating a (multi-dimensional) flat file: > m <- ffm("foom", dim = c(31, 3)) > data("trees") > m[1:31, 1:3] <- trees[1:31, 1:3] In order to interact with the biglm package the wrapper function ffm.data.frame is provided: > require(biglm) > ffmdf <- ffm.data.frame(m, c("girth", "Height", "Volume")) 11

3. Selected Examples Selected examples of usage, cont. Using biglm with ffm objects: > fg <- log(volume) ~ log(girth) + log(height) > m0 <- bigglm(fg, data = ffmdf, chunksize = 10, + sandwich = TRUE) > summary(m0) Large data regression model: bigglm(formula = formula, data = datafun,...) Sample size = 31 Coef (95% CI) SE p (Intercept) -6.632-8.087-5.176 0.728 0 log(girth) 1.983 1.871 2.094 0.056 0 log(height) 1.117 0.733 1.501 0.192 0 Sandwich (model-robust) standard errors 12

3. Selected Examples Selected examples of usage, cont. Loading a 14 GB flat file (US Census data from 2000 for Texas), taking a random sample of selected variables and plotting the sample: > # loading the flat file: > txdata <- ffm("g:/texas_p") > # drawing a sample of indices > set.seed(1337) > ind <- runique(10000, total = 750624) > agm <- txdata[ind, 394] > agf <- txdata[ind, 395] > # removing missing values (coded as '0') > in.c1 <- agm!= 0 & agf!= 0 > agm1 <- agm[in.c1] > agf1 <- agf[in.c1] > require(rgl) > plot3d(agm1, agf1, 0, size = 2, col = "red") > view3d(0, 0, fov = 1, zoom = 0.7) 13

4. Architecture The low-level layer Structure of the low-level C++ layer Array<T> Fi lem apper (dynamic) flat file MultiArray<T> FileSection MultiIndex (pre-set) virtual memory space The C++ layer consists of two parts: - abstractions to platform-specific services and - a collection of template container classes 14

4. Architecture The low-level layer Abstractions to platform-specific system services contain a FileMapping and a FileSection class (both are platform specific) Array<T> FileMapper (dynamic) flat file MultiArray<T> FileSection MultiIndex (pre-set) virtual memory space FileMapping class: Implementation of memory mapped file facilities; exposes a factory method to create FileSection objects FileSection class: Implementation of memory mapped file regions that exposes the pointer to the corresponding file region that is mapped to main memory 15

4. Architecture The low-level layer The template container classes implement a caching strategy on top of memory mapped pages of large files Array<T> FileMapper (dynamic) flat file MultiArray<T> FileSection MultiIndex (pre-set) virtual memory space Array<T> template class manages one FileSection object at a time Multiarray<T> template class implements a multi-dimensional array using a multiple integer index MultiIndex utility class translates between multiple integer indices an 64-bit indices 16

5. Summary and conclusion We have presented the ff package for handling large data sets in R; it was developed for the UseR! 2007 programming competition The package comprises two components, a low-level layer written in C++ and a high-level layer in R The package uses platform-specific features and has been ported to Windows, Linux, Mac OS X and FreeBSD With this approach it is possible to work on multiple large data sets simultaneously 64-bit systems also benefit from this approach The package is available from http://wsopuppenkiste.wiso.uni-goettingen.de/ff 17

5. Summary and conclusion Future work Support for further data types - besides doubles - is in progress The architecture of the package is modular - various storage and caching policies can be evaluated in the future Further I/O optimizations (performance gains) Re-implementing algorithms based on chunks (like the biglm package) Feedback and suggestions for improvement are welcome 18