Package RCassandra. R topics documented: February 19, 2015. Version 0.1-3 Title R/Cassandra interface



Similar documents
Package retrosheet. April 13, 2015

Package sjdbc. R topics documented: February 20, 2015

Package TSfame. February 15, 2013

Package PKI. July 28, 2015

Package TimeWarp. R topics documented: April 1, 2015

Package hive. January 10, 2011

Integrating VoltDB with Hadoop

Apache Cassandra Query Language (CQL)

Package uptimerobot. October 22, 2015

Package urltools. October 11, 2015

Simba ODBC Driver with SQL Connector for Apache Cassandra

Hypertable Architecture Overview

Package png. February 20, 2015

Package PKI. February 20, 2013

Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Package hive. July 3, 2015

Simba Apache Cassandra ODBC Driver

Raima Database Manager Version 14.0 In-memory Database Engine

Comparing SQL and NOSQL databases

Architecting the Future of Big Data

Configuring and Monitoring Hitachi SAN Servers

2. Basic Relational Data Model

Architecting the Future of Big Data

Package HadoopStreaming

Design Notes for an Efficient Password-Authenticated Key Exchange Implementation Using Human-Memorable Passwords

KMx Enterprise: Integration Overview for Member Account Synchronization and Single Signon

Services. Relational. Databases & JDBC. Today. Relational. Databases SQL JDBC. Next Time. Services. Relational. Databases & JDBC. Today.

Utility Software II lab 1 Jacek Wiślicki, jacenty@kis.p.lodz.pl original material by Hubert Kołodziejski

Semester Thesis Traffic Monitoring in Sensor Networks

A programming model in Cloud: MapReduce

Using Object Database db4o as Storage Provider in Voldemort

Managing Users and Identity Stores

A Brief Introduction to MySQL

ABSTRACT 1. INTRODUCTION. Kamil Bajda-Pawlikowski

Copyright 2013 Consona Corporation. All rights reserved

ENHANCEMENTS TO SQL SERVER COLUMN STORES. Anuhya Mallempati #

Package httprequest. R topics documented: February 20, 2015

Fleet Connectivity Extension

Architecting the Future of Big Data

The Power Loader GUI

TrendWorX32 SQL Query Engine V9.2 Beta III

DEPLOYMENT GUIDE Version 1.1. Deploying the BIG-IP LTM v10 with Citrix Presentation Server 4.5

IoT-Ticket.com. Your Ticket to the Internet of Things and beyond. IoT API

Online signature API. Terms used in this document. The API in brief. Version 0.20,

Pervasive Data Integrator. Oracle CRM On Demand Connector Guide

Package WikipediR. January 13, 2016

Commerce Services Documentation

Generalised Socket Addresses for Unix Squeak

Introduction to Wireshark Network Analysis

Easy-Cassandra User Guide

Symbol Tables. Introduction

Getting Started with SandStorm NoSQL Benchmark

Configuring and Monitoring Citrix Branch Repeater

Вовченко Алексей, к.т.н., с.н.с. ВМК МГУ ИПИ РАН

MOC 20461C: Querying Microsoft SQL Server. Course Overview

IBM SPSS Collaboration and Deployment Services Version 6 Release 0. Single Sign-On Services Developer's Guide

Axway API Gateway. Version 7.4.1

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Unified XML/relational storage March The IBM approach to unified XML/relational databases

Job Reference Guide. SLAMD Distributed Load Generation Engine. Version 1.8.2

HBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367

Instant SQL Programming

Automating SQL Injection Exploits

Apache Cassandra for Big Data Applications

The release notes provide details of enhancements and features in Cloudera ODBC Driver for Impala , as well as the version history.

Storing Measurement Data

Physical Design. Meeting the needs of the users is the gold standard against which we measure our success in creating a database.

Package searchconsoler

The English translation Of MBA Standard 0301

FAQs. This material is built based on. Lambda Architecture. Scaling with a queue. 8/27/2015 Sangmi Pallickara

Performance Tuning for the Teradata Database

Contents About the Contract Management Post Installation Administrator's Guide... 5 Viewing and Modifying Contract Management Settings...

PHP Integration Kit. Version User Guide

The HTTP Plug-in. Table of contents

Package dsstatsclient

Oracle Database 12c: Introduction to SQL Ed 1.1

Monitoring QNAP NAS system

Teradata Connector for Hadoop Tutorial

CA Nimsoft Service Desk

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

Use Enterprise SSO as the Credential Server for Protected Sites

HireDesk API V1.0 Developer s Guide

ANDROID APPS DEVELOPMENT FOR MOBILE GAME

Practical Cassandra. Vitalii

Fairsail REST API: Guide for Developers

(n)code Solutions CA A DIVISION OF GUJARAT NARMADA VALLEY FERTILIZERS COMPANY LIMITED P ROCEDURE F OR D OWNLOADING

Serious Threat. Targets for Attack. Characterization of Attack. SQL Injection 4/9/2010 COMP On August 17, 2009, the United States Justice

Package fimport. February 19, 2015

Hadoop Kelvin An Overview

Chapter 6: Physical Database Design and Performance. Database Development Process. Physical Design Process. Physical Database Design

Using DOTS as Apache Derby System Test

SnapLogic Salesforce Snap Reference

An Oracle White Paper June RESTful Web Services for the Oracle Database Cloud - Multitenant Edition

DEPLOYMENT GUIDE DEPLOYING THE BIG-IP LTM SYSTEM WITH CITRIX PRESENTATION SERVER 3.0 AND 4.5

Capturx for SharePoint 2.0: Notification Workflows

Administrator's Guide

Configuration Worksheets for Oracle WebCenter Ensemble 10.3

Still Aren't Doing. Frank Kim

Apache Shiro - Executive Summary

Hadoop and Map-reduce computing

Transcription:

Version 0.1-3 Title R/Cassandra interface Package RCassandra February 19, 2015 Author Simon Urbanek <simon.urbanek@r-project.org> Maintainer Simon Urbanek <simon.urbanek@r-project.org> This packages provides a direct interface (without the use of Java) to the most basic functionality of Apache Cassanda such as login, updates and queries. License GPL-2 URL http://www.rforge.net/rcassandra NeedsCompilation yes Repository CRAN Date/Publication 2013-12-03 22:24:59 R topics documented: RCassandra-package.................................... 1 RC.cluster.name....................................... 2 RC.ect......................................... 3 RC.get............................................ 5 RC.insert.......................................... 8 RC.read.table........................................ 9 Index 11 RCassandra-package R/Cassandra interface This packages provides a direct interface (without the use of Java) to the most basic functionality of Apache Cassanda ditributed NoSQL database such as login, updates and queries. The focus is on effciency and speed. 1

2 RC.cluster.name Details RC.ect is used to ect to a Cassandra instance. The obtained handle is then used for all operations until RC.close is used to close the ection. A set of RC.get functions can be used to query the database. Specialized high-level interface for fixed-column tables (not the most common in Cassandra, thgouh) is also available with RC.read.table. Updates and inserts can be performed either individually using the RC.insert function or batchmutations using RC.mutate. Auxiliary functions retrieving meta-information from the database are described on the RC.version help page. Currently, communication to Cassanra is performed directly on a blocking TCP/IP socket. This implies that transactions currently cannot be interrupted on the R side and there is no timeout. This may change in future versions. The code does not use R ections to avoid extra overhead. Package: License: URL: RCassandra GPL-2 http://www.rforge.net/rcassandra Index: RC.cluster.name RC.ect RC.get RC.insert RC.read.table Functions retrieving meta-information from a Cassandra ection Connect, login, close ection to Cassandra Functions for querying Cassandra database Update function to insert data into Cassandra Read and write tables into column families in Cassandra Simon Urbanek <simon.urbanek@r-project.org> RC.cluster.name Functions retrieving meta-information from a Cassandra ection RC.cluster.name returns the name of the cluster. RC.version returns the protocol version. RC.describe.keyspace returns a keyspace definition. RC.describe.keyspaces returns a list of definitions for all keyspaces.

RC.ect 3 Usage RC.cluster.name() RC.version() RC.describe.keyspaces() RC.describe.keyspace(, keyspace) Arguments keyspace ection handle are returned by RC.ect string, name of the keyspace to describe Value For RC.cluster.name and RC.version a string. For RC.describe.keyspace a structure describing the keyspace - see KsDef structure in Cassandra. For RC.describe.keyspaces a list of the KsDef strctures. Simon Urbanek See Also RC.ect, RC.get RC.ect Connect, login, close ection to Cassandra RC.ect ects a to a host running Cassandra. All subsequent operations are performed on the handle returned by this function. RC.close closes a Cassandra ection. RC.login perform an authentication request. Usage RC.ect(host = NULL, port = 9160L) RC.close() RC.login(, username = "default", password = "")

4 RC.ect Arguments host port username password host name or IP address to ect to using TCP/IP port to ect to on the above host ection handle as returned by RC.ect username for the authentication dictionary password for the authentication dictionary Details RC.ect return an opaque ection handle that has to be used for all subsequent calls on the same ection. RCassandra uses low-level system calls to communicate with Cassandra, this handle is not an R ection. RC.close closes an existing Cassandra ection. RC.login sends an authenticataion request with the given credentials. How this is processed depend on the authentication module use in the Cassandra instance this ection ects to. Value RC.ect returns a Cassandra ection handle. RC.close return NULL RC.login returns. Simon Urbanek Examples ## Not run: c <- RC.ect("cassandra-host") RC.login(c, "foo", "bar") RC.cluster.name(c) RC.describe.keyspaces(c) RC.close(c) ## End(Not run)

RC.get 5 RC.get Functions for querying Cassandra database Usage RC.use selects the keyspace (aka database) to use for all subsequent operations. All functions described below require keyspace to be set using this function. RC.get queries one key and a fixed list of columns RC.get.range queries one key and multiple columns RC.mget.range queries multiple keys and multiple columns RC.get.range.slices queries a range of keys (or tokens) and a range of columns RC.consistency sets the desired consistency level for all query operations RC.use(, keyspace, cache.def = TRUE) RC.get(, c.family, key, c.names, comparator = NULL, validator = NULL) RC.get.range(, c.family, key, first = "", last = "", reverse = FALSE, limit = 1e+07, comparator = NULL, validator = NULL) RC.mget.range(, c.family, keys, first = "", last = "", reverse = FALSE, limit = 1e+07, comparator = NULL, validator = NULL) RC.get.range.slices(, c.family, k.start = "", k.end = "", first = "", last = "", reverse = FALSE, limit = 1e+07, k.limit = 1e+07, tokens = FALSE, fixed = FALSE, comparator = NULL, validator = NULL) RC.consistency(, level = c("one", "quorum", "local.quorum", "each.quorum", "all", "any", "two", "three")) Arguments keyspace cache.def c.family key c.names comparator ection handle as returned by RC.ect name of the keyspace to use if TRUE then in addition to setting the keyspace a query on the keyspace definition is sent and the result cached. This allows automatic detection of comparators and validators, see details section for more information. column family (aka table) name row key vector of column names string, type of the column keys (comparator in Cassandra speak) or NULL to rely on cached schema definitions

6 RC.get validator first last reverse limit keys k.start k.end k.limit tokens fixed level string, type of the values (validator in Cassandra speak) or NULL to rely on cached schema definitions starting column name ending column name if TRUE the resutl is returned in reverse order return at most as many columns per key row keys (character vector) start key (or token) end key (or token) return at most as many keys (rows) if TRUE then keys are interpreted as tokens (i.e. values after hashing) if TRUE then the result if be a single data frame consisting of rows and keys and all columns ever encountered - essentially assuming fixed column structure the desired consistency level for query operations on this ection. "one" is the default if not explicitly set. Details The nomenclature can be a bit confusing and it comes from the literature and the Cassandra API. Put in simple terms, keyspace is comparable to a database, and column family is somewhat comparable to a table. However, a table may have different number of columns for each row, so it can be used to create a flexible two-dimensional query structure. A row is defined by a (row) key. A query is performed by first finding out which row(s) will be fetched according to the key (RC.get, RC.get.range), keys (RC.mget.range) or key range (RC.get.range.slices), then selecting the columns of interest. Empty string ("") can be used to denote an unspecified range (so the default is to fetch all columns). comparator and validator specify the types of column keys and values respectively. Every key or value in Cassandra is simply a byte string, so it can deal with arbitrary values, but sometimes it is convenient to impose some structure on that content by declaring what is represented by that byte string. Unfortunately Cassandra does not include that information in the results, so the user has to define how column names and values are to be interpreted. The default interpretation is simply as a UTF-8 encoded string, but RCassandra also supports following conversions: "UTF8Type", "AsciiType" (stored as character vectors), "BytesType" (opaque stream of bytes, stored as raw vector), "LongType" (8-bytes integer, stored as real vector in R), "DateType" (8-bytes integer, stored as POSIXct in R), "BooleanType" (one byte, logical vector in R), "FloatType" (4-bytes float, real vector in R), "DoubleType" (8-bytes float, real vector in R) and "UUIDType" (16-bytes, stored as UUID-formatted string). No other conversions are supported at this point. If the value is NULL then RCassandra attempts to guess the proper value by taking into account the schema definition obtained by RC.use(..., cache.def=true), otherwise it falls back to "UTF8Type". You can always get the raw form using "BytesType" and decode the values in R. The comparator also determines how the values of first and last will be interpreted. Regardless of the comparator, it is always possible to pass either NULL, "" (both denoting 0-length value) or a raw vector. Other supported types must match the comparator. Most users will be happy with the default settings, but if you want to save every nanosecond you can, call RC.use(..., cache.def = FALSE) (which saves one extra RC.describe.keyspace

RC.get 7 Value request to the Cassandra instance) and always specify both comparator and validator (even if it is just "UTF8String"). Cassandra collects results in memory so key (k.limit) and column (limit) limits are mandatory. Future versions of RCassandra may abstract this limitation out (by using a limit and repeating queries with new start key/column based on the last result row), but not at this point. Note that in Cassandra keys are typically hashed, so key range may be counter-intuitive as it is based on the hash and not on the actual value. Columns are always sorted by their name (=key). The result of queries may be also counter-intuitive, especially when querying fixed column tables as it is not returned in the form that would be expected from a relational database. See RC.read.table and RC.write.table for retrieving and storing relational structures in rectangular tables (column families with fixed columns). But you have to keep in mind that Cassandra is essentailly key/key/value storage (row key, column key, value) with partitioning on row keys and sorting of column keys, so designing the correct schema for a task needs some thought. Dynamic columns are what makes it so powerful. RC.use and RC.consistency returns RC.get and RC.get.range return a data frame with columns key (column name), value (value in that column) and ts (timestamp). RC.mget.range and RC.get.range.slices return a named list of data frames as described in RC.get.range with names being the row keys, except if fixed=true in which case the result is a data frame with row names as keys and values as elements (timestamps are not retrieved in that case). See Also Simon Urbanek RC.ect, RC.read.table, RC.write.table Examples ## Not run: c <- RC.ect("cassandra-host") RC.use(c, "testdb") ## you will have to use cassandra-cli to create the schema for the "iris" CF RC.write.table(c, "iris", iris) RC.get(c, "iris", "1", c("sepal.length", "Species")) RC.get.range(c, "iris", "1") ## list of 150 data frames r <- RC.get.range.slices(c, "iris") ## use limit=0 to obtain all row keys without pulling any data rk <- RC.get.range.slices(c, "iris", limit=0) y <- RC.read.table(c, "iris") y <- y[order(as.integer(row.names(y))),] RC.close(c)

8 RC.insert ## End(Not run) RC.insert Update functions to insert data into Cassandra RC.insert updates or inserts new column/value pairs RC.mutate batchwise updates or inserts a list of keys, column families and columns/values. Usage RC.insert(, c.family, key, column, value = NULL, comparator = NULL, validator = NULL) RC.mutate(, mutation) Arguments c.family key column value comparator validator mutation ection handle obtained from RC.ect name of the column family (string) row key name (string) or a vector of (preferably contiguous) keys to use with the column names vector column name - any vector supported by the comparator optinally values to add into the columns - if specified, must be the same length as column. If NULL only the column is created comparator (column name type) to be used - see RC.get for details validator (value type) to be used - see RC.get for details a structure describing the desired mutation (see Cassandra documentation). In its simplest form it is a nested list: list(row.key1=list(c.family1=list(col1=val1,...),...),. so to add column "foo" with value "bar" to column family "table" and row "key" the mutation would be list(key=list(table=list(foo="bar"))). The innermost list can optionally be a character vector (if unnamed it specifies the column names, otherwise names are column names and elements are values). Only string keys and column names are supproted. Value

RC.read.table 9 Note RC.insert supports multi-column insertions where column and value are vectors. For the scalar case insert message is used, for vector case batch_mutate. If key is a scalar, all column/value pairs are added to that row key. Alternatively, key can be a vector of the same length as column in which case the mutation will consist of key/column/value triplets. Note that key should be contiguous as the mutation will only group contiguous sequences (see coalesce from the fastmatch package for a fast way of obtaining contiguous sequences). RC.insert honors both the validator and comparator (the latter is taken from the cache if not specified). RC.mutate currently only uses "UTF8Type" validator and comparator as there is no way to specify either in the mutation object. Cassandra requires timestamps on all objects that specify columns/values for conflict resolution. All functions above generate such timestamps from the system time as POSIX time in milliseconds. Simon Urbanek See Also RC.ect, RC.use, RC.get RC.read.table Read and write tables into column families in Cassandra Usage RC.read.table reads the contents of a column family into a data frame RC.write.table writes the contents of a data frame into a column familly RC.read.table(, c.family, convert = TRUE, na.strings = "NA", as.is = FALSE, dec = ".") RC.write.table(, c.family, df) Arguments c.family convert na.strings as.is dec df ection handle as obtained form RC.ect column family name (string) logical, if TRUE the resulting data frame is processed using type.convert, otherwise all columns will be character vectors passed to type.convert passed to type.convert passed to type.convert data frame - it must have both row and column names

10 RC.read.table Details Value Note Cassandra is a key/value store with dynamic columns, so tables are not the native format. Row names are used as keys and columns are treated as fixed. RC.read.table is really jsut a wrapper for RC.get.range.slices(, c.family, fixed=true). RC.write.table uses the same facility as RC.mutate but without actually creating the mutation object on the R side. Note that all updates in Cassandra are "upserts", i.e., RC.write.table updates any existing row key/coumn name combinations or creates new ones where not present (insert). Additonal columns (or even keys) may still exist in the column family and they will not be touched. RC.read.table creates a data frame from all columns that are ever encountered in at least one key. All other values are filled with NAs. RC.read.table returns the resulting data frame RC.write.table returns IMPORTANT: Cassandra does NOT preserve order of keys and columns. Internally, keys are ordered by their hash value and columns are ordered lexicographically (treated as bytes). However, due to the fact that columns are dynamic the order of columns will vary if keys have different columns, because columns are added to the data frame in the sequence they are encountered as the keys are loaded. You may want to use df <- df[order(as.integer(row.names(df))),] on the result of RC.read.table for tables with automatic row names to obtain the original order of rows. RC.read.table is more effcient than RC.get.range.slices because it can store columns into vectors and can pre-allocate the whole structure in advance. Note that the current implementation of tables (RC.read.table and RC.write.table) supports only string-based representation of columns and values ("UTF8Type", "AsciiType" or similar). See Also Simon Urbanek RC.ect, RC.use, RC.get

Index Topic interface RC.cluster.name, 2 RC.ect, 3 RC.get, 5 RC.insert, 8 RC.read.table, 9 Topic package RCassandra-package, 1 RC.close, 2 RC.close (RC.ect), 3 RC.cluster.name, 2 RC.ect, 2, 3, 3, 5, 7 10 RC.consistency (RC.get), 5 RC.describe.keyspace, 6 RC.describe.keyspace (RC.cluster.name), 2 RC.describe.keyspaces (RC.cluster.name), 2 RC.get, 2, 3, 5, 8 10 RC.get.range.slices, 10 RC.insert, 2, 8 RC.login (RC.ect), 3 RC.mget.range (RC.get), 5 RC.mutate, 2, 10 RC.mutate (RC.insert), 8 RC.read.table, 2, 7, 9 RC.use, 9, 10 RC.use (RC.get), 5 RC.version, 2 RC.version (RC.cluster.name), 2 RC.write.table, 7 RC.write.table (RC.read.table), 9 RCassandra (RCassandra-package), 1 RCassandra-package, 1 type.convert, 9 11