Streaming Data, Concurrency And R



Similar documents
Big Data and Parallel Work with R

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

Presto/Blockus: Towards Scalable R Data Analysis

Your Best Next Business Solution Big Data In R 24/3/2010

Why Threads Are A Bad Idea (for most purposes)

Multi-core Programming System Overview

General Introduction

Parallel Computing: Strategies and Implications. Dori Exterman CTO IncrediBuild.

The Not Quite R (NQR) Project: Explorations Using the Parrot Virtual Machine

Programming in C# with Microsoft Visual Studio 2010

USE OF PYTHON AS A SATELLITE OPERATIONS AND TESTING AUTOMATION LANGUAGE

A Comparative Study on Vega-HTTP & Popular Open-source Web-servers

Real Time Market Data and Trade Execution with R

CSC 2405: Computer Systems II

Real-Time Analytics on Large Datasets: Predictive Models for Online Targeted Advertising

Enterprise Applications

Syllabus for CS 134 Java Programming

Benchmark Study on Distributed XML Filtering Using Hadoop Distribution Environment. Sanjay Kulhari, Jian Wen UC Riverside

EUROPEAN ORGANIZATION FOR NUCLEAR RESEARCH CERN ACCELERATORS AND TECHNOLOGY SECTOR A REMOTE TRACING FACILITY FOR DISTRIBUTED SYSTEMS

Enterprise Data Integration for Microsoft Dynamics CRM

MapReduce on GPUs. Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu

SWARM: A Parallel Programming Framework for Multicore Processors. David A. Bader, Varun N. Kanade and Kamesh Madduri

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

Chapter 10 Case Study 1: LINUX

A Comparison Of Shared Memory Parallel Programming Models. Jace A Mogill David Haglin

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design

Filtering Mail with Milter. David F. Skoll Roaring Penguin Software Inc.

Effective Java Programming. efficient software development

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

Design Document. Offline Charging Server (Offline CS ) Version i -

Bringing Big Data Modelling into the Hands of Domain Experts

Chapter 6 Concurrent Programming

Eloquence Training What s new in Eloquence B.08.00

Engr. M. Fahad Khan Lecturer Software Engineering Department University Of Engineering & Technology Taxila

EWeb: Highly Scalable Client Transparent Fault Tolerant System for Cloud based Web Applications

Stream Processing on GPUs Using Distributed Multimedia Middleware

CRM Magic with Data Migration & Integration

Elemental functions: Writing data-parallel code in C/C++ using Intel Cilk Plus

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Event-based middleware services

Project Convergence: Integrating Data Grids and Compute Grids. Eugene Steinberg, CTO Grid Dynamics May, 2008

Understanding the Value of In-Memory in the IT Landscape

BSC vision on Big Data and extreme scale computing

Deal with big data in R using bigmemory package

Titolo del paragrafo. Titolo del documento - Sottotitolo documento The Benefits of Pushing Real-Time Market Data via a Web Infrastructure

rm(list=ls()) library(sqldf) system.time({large = read.csv.sql("large.csv")}) # seconds, 4.23GB of memory used by R

System Software and TinyAUTOSAR

CSCI E 98: Managed Environments for the Execution of Programs

A distributed system is defined as

Real Time Programming: Concepts

New Design and Layout Tips For Processing Multiple Tasks

Harmonizing policy management with Murphy in GENIVI, AGL and TIZEN IVI

Cloud Computing. Up until now

Technical Properties. Mobile Operating Systems. Overview Concepts of Mobile. Functions Processes. Lecture 11. Memory Management.

MPI and Hybrid Programming Models. William Gropp

Between Mutual Trust and Mutual Distrust: Practical Fine-grained Privilege Separation in Multithreaded Applications

Parallel Computing 37 (2011) Contents lists available at ScienceDirect. Parallel Computing. journal homepage:

Fachbereich Informatik und Elektrotechnik SunSPOT. Ubiquitous Computing. Ubiquitous Computing, Helmut Dispert

IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr Teruzzi Roberto matr IBM CELL. Politecnico di Milano Como Campus

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Java the UML Way: Integrating Object-Oriented Design and Programming

JMulTi/JStatCom - A Data Analysis Toolkit for End-users and Developers

Direct NFS - Design considerations for next-gen NAS appliances optimized for database workloads Akshay Shah Gurmeet Goindi Oracle

Linux Process Scheduling Policy


FIVE SIGNS YOU NEED HTML5 WEBSOCKETS

Characteristics of Java (Optional) Y. Daniel Liang Supplement for Introduction to Java Programming

White Paper. Optimizing the Performance Of MySQL Cluster

Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework

Distributed R for Big Data

IT of SPIM Data Storage and Compression. EMBO Course - August 27th! Jeff Oegema, Peter Steinbach, Oscar Gonzalez

System Structures. Services Interface Structure

Adobe LiveCycle Data Services 3 Performance Brief

A Look through the Android Stack

Japan Communication India Skill Development Center

CS 3530 Operating Systems. L02 OS Intro Part 1 Dr. Ken Hoganson

Parallel Computing for Data Science

PERFORMANCE COMPARISON OF COMMON OBJECT REQUEST BROKER ARCHITECTURE(CORBA) VS JAVA MESSAGING SERVICE(JMS) BY TEAM SCALABLE

Big-data Analytics: Challenges and Opportunities

ProTrack: A Simple Provenance-tracking Filesystem

Real-time Java Processor for Monitoring and Test

Globus Striped GridFTP Framework and Server. Raj Kettimuthu, ANL and U. Chicago

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics

How To Handle Big Data With A Data Scientist

OPTIMIZING YOUR PI SDK APPLICATIONS BUILDERS' CAFÉ WEBINAR SERIES

Java Environment for Parallel Realtime Development Platform Independent Software Development for Multicore Systems

SOFT 437. Software Performance Analysis. Ch 5:Web Applications and Other Distributed Systems

Operating Systems 4 th Class

Cloud Based Application Architectures using Smart Computing

Review from last time. CS 537 Lecture 3 OS Structure. OS structure. What you should learn from this lecture

Kafka & Redis for Big Data Solutions

Transcription:

Streaming Data, Concurrency And R Rory Winston rory@theresearchkitchen.com

About Me Independent Software Consultant M.Sc. Applied Computing, 2000 M.Sc. Finance, 2008 Apache Committer Working in the financial sector for the last 7 years or so Interested in practical applications of functional languages and machine learning Relatively recent convert to R ( 2 years)

R - Pros and Cons Pro Can be extremely elegant Comprehensive extension library Open-source Huge parallelization effort Fantastic reporting capabilities Con Can be clunky (S4) Bewildering array of overlapping extensions Inherently single-threaded

R - Pros and Cons Pro Can be extremely elegant Comprehensive extension library Open-source Huge parallelization effort Fantastic reporting capabilities Con Can be clunky (S4) Bewildering array of overlapping extensions Inherently single-threaded

R - Pros and Cons Pro Can be extremely elegant Comprehensive extension library Open-source Huge parallelization effort Fantastic reporting capabilities Con Can be clunky (S4) Bewildering array of overlapping extensions Inherently single-threaded

R - Pros and Cons Pro Can be extremely elegant Comprehensive extension library Open-source Huge parallelization effort Fantastic reporting capabilities Con Can be clunky (S4) Bewildering array of overlapping extensions Inherently single-threaded

R - Pros and Cons Pro Can be extremely elegant Comprehensive extension library Open-source Huge parallelization effort Fantastic reporting capabilities Con Can be clunky (S4) Bewildering array of overlapping extensions Inherently single-threaded

R - Pros and Cons Pro Can be extremely elegant Comprehensive extension library Open-source Huge parallelization effort Fantastic reporting capabilities Con Can be clunky (S4) Bewildering array of overlapping extensions Inherently single-threaded

R - Pros and Cons Pro Can be extremely elegant Comprehensive extension library Open-source Huge parallelization effort Fantastic reporting capabilities Con Can be clunky (S4) Bewildering array of overlapping extensions Inherently single-threaded

R - Pros and Cons Pro Can be extremely elegant Comprehensive extension library Open-source Huge parallelization effort Fantastic reporting capabilities Con Can be clunky (S4) Bewildering array of overlapping extensions Inherently single-threaded

R - Pros and Cons Pro Can be extremely elegant Comprehensive extension library Open-source Huge parallelization effort Fantastic reporting capabilities Con Can be clunky (S4) Bewildering array of overlapping extensions Inherently single-threaded

R - Pros and Cons Pro Can be extremely elegant Comprehensive extension library Open-source Huge parallelization effort Fantastic reporting capabilities Con Can be clunky (S4) Bewildering array of overlapping extensions Inherently single-threaded

R - Pros and Cons Pro Can be extremely elegant Comprehensive extension library Open-source Huge parallelization effort Fantastic reporting capabilities Con Can be clunky (S4) Bewildering array of overlapping extensions Inherently single-threaded

R - Pros and Cons Pro Can be extremely elegant Comprehensive extension library Open-source Huge parallelization effort Fantastic reporting capabilities Con Can be clunky (S4) Bewildering array of overlapping extensions Inherently single-threaded

R - Pros and Cons Pro Can be extremely elegant Comprehensive extension library Open-source Huge parallelization effort Fantastic reporting capabilities Con Can be clunky (S4) Bewildering array of overlapping extensions Inherently single-threaded

R - Pros and Cons Pro Can be extremely elegant Comprehensive extension library Open-source Huge parallelization effort Fantastic reporting capabilities Con Can be clunky (S4) Bewildering array of overlapping extensions Inherently single-threaded

R - Pros and Cons Pro Can be extremely elegant Comprehensive extension library Open-source Huge parallelization effort Fantastic reporting capabilities Con Can be clunky (S4) Bewildering array of overlapping extensions Inherently single-threaded

Parallelization vs. Concurrency R interpreter is single threaded Some historical context for this (BLAS implementations) Not necessarily a limitation in the general context Multithreading can be complex and problematic Instead a focus on parallelization: Distributed computation: gridr, nws, snow Multicore/multi-cpu scaling: Rmpi, Romp, pnmath/pnmath0 Interfaces to Pthreads/PBLAS/OpenMP/MPI/Globus/etc. Parallelization suits cpu-bound large data processing applications

Other Scalability and Performance Work JIT/bytecode compilation (Ra) Implicit vectorization a la Matlab (code analysis) Large ( RAM) dataset handling (bigmemory,ff) Many incremental performance improvements (e.g. less internal copying) Next: GPU/massive multicore...?

What Benefit Concurrency? Real-time (streaming to be more precise) data analysis Growing Interest in using R for streaming data, not just offline analyis GUI toolkit integration Fine-grained control over independent task execution "I believe that explicit concurrency management tools (i.e. a threads toolkit) are what we really need in R at this point." - Luke Tierney, 2001

Will There Be A Multithreaded R? Short answer is: probably not At least not in its current incarnation Internal workings of the interpreter not particularly amenable to concurrency: Functions can manipulate caller state («- vs. <-) Lazy evaluation machinery (promises) Dynamic State, garbage collection, etc. Scoping: global environments Management of resources: streams, I/O, connections, sinks Implications for current code Possibly in the next language evolution (cf. Ihaka?) Large amount of work (but potentially do-able)

Will There Be A Multithreaded R? Short answer is: probably not At least not in its current incarnation Internal workings of the interpreter not particularly amenable to concurrency: Functions can manipulate caller state («- vs. <-) Lazy evaluation machinery (promises) Dynamic State, garbage collection, etc. Scoping: global environments Management of resources: streams, I/O, connections, sinks Implications for current code Possibly in the next language evolution (cf. Ihaka?) Large amount of work (but potentially do-able)

Will There Be A Multithreaded R? Short answer is: probably not At least not in its current incarnation Internal workings of the interpreter not particularly amenable to concurrency: Functions can manipulate caller state («- vs. <-) Lazy evaluation machinery (promises) Dynamic State, garbage collection, etc. Scoping: global environments Management of resources: streams, I/O, connections, sinks Implications for current code Possibly in the next language evolution (cf. Ihaka?) Large amount of work (but potentially do-able)

Will There Be A Multithreaded R? Short answer is: probably not At least not in its current incarnation Internal workings of the interpreter not particularly amenable to concurrency: Functions can manipulate caller state («- vs. <-) Lazy evaluation machinery (promises) Dynamic State, garbage collection, etc. Scoping: global environments Management of resources: streams, I/O, connections, sinks Implications for current code Possibly in the next language evolution (cf. Ihaka?) Large amount of work (but potentially do-able)

Will There Be A Multithreaded R? Short answer is: probably not At least not in its current incarnation Internal workings of the interpreter not particularly amenable to concurrency: Functions can manipulate caller state («- vs. <-) Lazy evaluation machinery (promises) Dynamic State, garbage collection, etc. Scoping: global environments Management of resources: streams, I/O, connections, sinks Implications for current code Possibly in the next language evolution (cf. Ihaka?) Large amount of work (but potentially do-able)

Will There Be A Multithreaded R? Short answer is: probably not At least not in its current incarnation Internal workings of the interpreter not particularly amenable to concurrency: Functions can manipulate caller state («- vs. <-) Lazy evaluation machinery (promises) Dynamic State, garbage collection, etc. Scoping: global environments Management of resources: streams, I/O, connections, sinks Implications for current code Possibly in the next language evolution (cf. Ihaka?) Large amount of work (but potentially do-able)

Will There Be A Multithreaded R? Short answer is: probably not At least not in its current incarnation Internal workings of the interpreter not particularly amenable to concurrency: Functions can manipulate caller state («- vs. <-) Lazy evaluation machinery (promises) Dynamic State, garbage collection, etc. Scoping: global environments Management of resources: streams, I/O, connections, sinks Implications for current code Possibly in the next language evolution (cf. Ihaka?) Large amount of work (but potentially do-able)

Will There Be A Multithreaded R? Short answer is: probably not At least not in its current incarnation Internal workings of the interpreter not particularly amenable to concurrency: Functions can manipulate caller state («- vs. <-) Lazy evaluation machinery (promises) Dynamic State, garbage collection, etc. Scoping: global environments Management of resources: streams, I/O, connections, sinks Implications for current code Possibly in the next language evolution (cf. Ihaka?) Large amount of work (but potentially do-able)

Will There Be A Multithreaded R? Short answer is: probably not At least not in its current incarnation Internal workings of the interpreter not particularly amenable to concurrency: Functions can manipulate caller state («- vs. <-) Lazy evaluation machinery (promises) Dynamic State, garbage collection, etc. Scoping: global environments Management of resources: streams, I/O, connections, sinks Implications for current code Possibly in the next language evolution (cf. Ihaka?) Large amount of work (but potentially do-able)

Will There Be A Multithreaded R? Short answer is: probably not At least not in its current incarnation Internal workings of the interpreter not particularly amenable to concurrency: Functions can manipulate caller state («- vs. <-) Lazy evaluation machinery (promises) Dynamic State, garbage collection, etc. Scoping: global environments Management of resources: streams, I/O, connections, sinks Implications for current code Possibly in the next language evolution (cf. Ihaka?) Large amount of work (but potentially do-able)

Will There Be A Multithreaded R? Short answer is: probably not At least not in its current incarnation Internal workings of the interpreter not particularly amenable to concurrency: Functions can manipulate caller state («- vs. <-) Lazy evaluation machinery (promises) Dynamic State, garbage collection, etc. Scoping: global environments Management of resources: streams, I/O, connections, sinks Implications for current code Possibly in the next language evolution (cf. Ihaka?) Large amount of work (but potentially do-able)

Will There Be A Multithreaded R? Short answer is: probably not At least not in its current incarnation Internal workings of the interpreter not particularly amenable to concurrency: Functions can manipulate caller state («- vs. <-) Lazy evaluation machinery (promises) Dynamic State, garbage collection, etc. Scoping: global environments Management of resources: streams, I/O, connections, sinks Implications for current code Possibly in the next language evolution (cf. Ihaka?) Large amount of work (but potentially do-able)

Example Application Based on work I did last year and presented at UseR! 2008 Wrote a real-time and historical market data service from Reuters/R The real-time interface used the Reuters C++ API R extension in C++ that spawned listening thread and handled updates

Simplified Architecture R extension (C++) realtime bus

Example Usage rsub <- function(duration, items, callback) The call rsub will subscribe to the specified rate(s) for the duration of time specified by duration (ms). When a tick arrives, the callback function callback is invoked, with a data frame containing the fields specified in items. Multiple market data items may be subscribed to, and any combination of fields may be be specified. Uses the underlying RFA API, which provides a C++ interface to real-time market updates.

Real-Time Example # Specify field names to retrieve fields <- c("bid","ask","timcor") # Subscribe to EUR/USD and GBP/USD ticks items <- list() items[[1]] <- c("idn_selectfeed", "EUR=", fields) items[[2]] <- c("idn_selectfeed", "GBP=", fields) # Simple Callback Function callback <- function(df) { print(paste("received",df)) } # Subscribe for 1 hour ONE_HOUR <- 1000*(60)^2 rsub(one_hour, items, callback)

Issues With This Approach As R interpreter is single threaded, cannot spawn thread for callbacks Thus, interpreter thread is locked for the duration of subscription Not a great user experience Need to find alternative mechanism

Alternative Approach If we cannot run subscriber threads in-process, need to decouple Standard approach: add an extra layer and use some form of IPC For instance, we could: Subscribe in a dedicated R process (A) Push incoming data onto a socket R process (B) reads from a listening socket Sockets could also be another IPC primitive, e.g. pipes Also note that R supports asynchronous I/O (?isincomplete) Look at the ibrokers package for examples of this

The bigmemory package From the description: "Use C++ to create, store, access, and manipulate massive matrices" Allows creation of large matrices These matrices can be mapped to files/shared memory It is the shared memory functionality that we will use The next version (3.0) will be unveiled at UseR! 2009 big.matrix(nrow, ncol, type = "integer",...) shared.big.matrix(nrow, ncol, type = "integer",...) filebacked.big.matrix(nrow, ncol, type = "integer",...)

Sample Usage > library(bigmemory) # Note: I'm using pre-release > X <- shared.big.matrix(type="double", ncol=1000, nrow=1000) > X An object of class big.matrix Slot "address": <pointer: 0x7378a0>

$rownames NULL Create Shared Memory Descriptor > desc <- describe(x) > desc $sharedtype [1] "SharedMemory" $sharedname [1] "53f14925-dca1-42a8-a547-e1bccae999ce" $nrow [1] 1000 $ncol [1] 1000

Export the Descriptor In R session 1: > dput(desc, file="~/matrix.desc") In R session 2: > library(bigmemory) > desc <- dget("~/matrix.desc") > X <- attach.big.matrix(desc) Now R sessions A and B share the same big.matrix instance

Share Data Between Sessions R session 1: > X[1,1] <- 1.2345 R session 2: > X[1,1] [1] 1.2345 Thus, streaming data can be continuously fed into session A And concurrently processed in session B

Summary Lack of threads not a barrier to concurrent analysis Packages like bigmemory, nws, etc. facilitate decoupling via IPC nws goes a step further, with a distributed workspace Many applications for streaming data: Data collection/monitoring Development of pricing/risk algorithms Low-frequency execution (??)...

References http://cran.r-project.org/web/packages/bigmemory/ http://www.cs.uiowa.edu/ luke/r/thrgui/ http://www.milbo.users.sonic.net/ra/index.html http://www.cs.kent.ac.uk/projects/cxxr/ http://www.theresearchkitchen.com/blog