Computing with large data sets




Richard Bonneau, spring 2009
mini-lecture 1 (week 7): big data, R databases, RSQLite, DBI, clara

Different reasons for using databases

There are many reasons to use transactional databases when programming with data:
1. The word data is in the word database.
2. Reduce the active memory needed to carry out an operation or search over a large dataset (e.g. look for the best correlation over a matrix with 1,000,000 rows, given a single row).
3. Organize multiple interlinked data types and use SQL syntax to conveniently construct queries, relying on SQL to organize the data under the hood.
4. Share complex data with another program, another language, or multiple threads.

v.0480: computing with data, Richard Bonneau Lecture 1

Splitting up large memory operations

What system memory would be required to run lars( y ~ X ) if we had 1,000,000 observations and ,000 predictors?

What system memory do we need to cluster 500,000 genetic changes measured for 0,000 individuals?

Assuming there are many fewer classes / clusters / model components than observations, we can use a block approach, where only a small fraction of the data is needed at any given time.

Several methods exist for dividing operations on large matrices or datasets. Databases help us optimise and structure our code's access to arbitrary subsets of the data, flushing what we aren't currently looking at from active memory.
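A rough way to approach these questions is to count 8 bytes per stored double. The sketch below does that arithmetic for a dense numeric matrix and for a single block of it. The sizes `n_obs`, `n_pred` and `block_rows` are illustrative assumptions, not the numbers from the questions above, and `matrix_gib` is a helper name made up for this example.

```r
# Back-of-envelope memory estimate for a dense numeric matrix in R.
# R stores each double in 8 bytes; all sizes below are illustrative assumptions.
matrix_gib <- function(n_rows, n_cols, bytes_per_value = 8) {
  n_rows * n_cols * bytes_per_value / 2^30
}

n_obs  <- 1e6   # hypothetical number of observations
n_pred <- 1e3   # hypothetical number of predictors

# Memory needed to hold the whole matrix at once.
full_gib <- matrix_gib(n_obs, n_pred)

# Block approach: only block_rows rows are held in memory at a time.
block_rows <- 1e4
block_gib  <- matrix_gib(block_rows, n_pred)

cat(sprintf("full matrix: %.2f GiB, one block: %.3f GiB\n", full_gib, block_gib))
```

This only counts the raw storage of the data. A fitting or clustering routine typically makes working copies, so the real requirement is some multiple of the full-matrix figure.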

What are transactional databases

We will use SQL-type DBMSs in this class (a subset of transactional databases).

Transactional databases are ACID:
- Atomic: all of a transaction is completed OR none of it is
- Consistent: all completed transactions leave the DB in a state compliant with its rules
- Isolated: you can't see the results until the transaction is complete
- Durable: if a transaction is complete, then the change persists even if the program crashes afterwards, the system goes down, etc. (within reason, i.e. no lightning-strike clause)
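Atomicity is easiest to see with a transaction that fails halfway. The sketch below is a minimal illustration using an in-memory SQLite database through DBI. The `accounts` table, its CHECK rule and the `transfer` helper are all made up for this example. It relies on `dbWithTransaction()` from DBI, which rolls the transaction back when an error occurs inside it.

```r
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table with a rule: balances may never go negative.
dbExecute(con, "CREATE TABLE accounts (
                  name    TEXT PRIMARY KEY,
                  balance REAL CHECK (balance >= 0))")
dbExecute(con, "INSERT INTO accounts VALUES ('a', 100), ('b', 50)")

# Move money between two accounts as ONE transaction.
# If either update fails, dbWithTransaction() rolls back both.
transfer <- function(con, from, to, amount) {
  tryCatch(
    dbWithTransaction(con, {
      dbExecute(con, "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                params = list(amount, to))
      dbExecute(con, "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                params = list(amount, from))
      TRUE
    }),
    error = function(e) FALSE
  )
}

ok_small <- transfer(con, "a", "b", 30)    # both updates succeed
ok_big   <- transfer(con, "a", "b", 500)   # second update violates the CHECK rule

balances <- dbGetQuery(con, "SELECT name, balance FROM accounts ORDER BY name")
print(balances)
dbDisconnect(con)
```

After the failed transfer, account `b` has not kept the 500 it was credited in the first update: the half-finished transaction left no trace, which is the "all OR none" behaviour described above.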

Relational databases

SQL-based systems are the most common type of relational database management system. MySQL, Oracle, SQLite, etc. are all SQL relational databases.

Here, relational refers to the grouping of entries in the DB by conditional statements on their attributes:
- return all rows with attribute.x = x
- return parts of rows with attribute.y > y.thresh

Examples are given below and in the exercise.

Relational databases

Advantages of SQL-like DBMSs:
- Code and style of code are roughly compatible across many systems.
- Methods for porting and converting from one system to another exist, or could easily be built.
- Libraries exist in nearly all languages.

A single-page SQL tutorial and links

We'll use R's functions to create tables. Many good tutorials and brief docs exist:
http://www.sqlcourse.com/
http://www.mysql.com/
http://www.sqlite.org/

    select * from USArrests

    select Murder from USArrests

    select row_names, Murder from USArrests where Murder < 10.0

    insert into employee (first, last, age, address, city, state)
      values ('Rich', 'Bonneau',, ' washington sq.', 'New York', 'NY')

    delete from employee where last = 'Gentlemen'

    SELECT id, firstn, lastn, title, salary
      FROM employee_info
      WHERE salary >= 55000.00 AND title = 'Saucier'

    SELECT g.id, g.mirror, g.diam, e.voltage
      FROM geom_table as g, elec_measures as e
      WHERE g.id = e.id and g.mirrortype = 'inside'
      ORDER BY g.diam
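To see what the last query (the two-table join) returns, it can be run from R against a throwaway in-memory SQLite database. This is only a sketch: the table and column names come from the query, but every row of data below is made up for illustration.

```r
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Made-up rows; only the table and column names come from the query itself.
geom <- data.frame(id         = 1:4,
                   mirror     = c("m1", "m2", "m3", "m4"),
                   diam       = c(2.5, 1.0, 3.0, 1.5),
                   mirrortype = c("inside", "outside", "inside", "inside"),
                   stringsAsFactors = FALSE)
elec <- data.frame(id = c(1L, 2L, 4L), voltage = c(5.0, 3.3, 12.0))

dbWriteTable(con, "geom_table", geom)
dbWriteTable(con, "elec_measures", elec)

# Keep rows whose id appears in BOTH tables and whose mirrortype is 'inside',
# then sort the result by diameter.
joined <- dbGetQuery(con, "
  SELECT g.id, g.mirror, g.diam, e.voltage
  FROM geom_table AS g, elec_measures AS e
  WHERE g.id = e.id AND g.mirrortype = 'inside'
  ORDER BY g.diam")

print(joined)
dbDisconnect(con)
```

With this toy data, id 3 drops out because it has no electrical measurement and id 2 drops out because its mirror type is not 'inside', so two rows remain, ordered by `diam`.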

Databases in R

Core connection to the DB: DBI.
http://cran.r-project.org/web/packages/DBI/vignettes/DBI.pdf

DBI requires a driver for the specific DBMS used:
http://cran.r-project.org/web/packages/RMySQL/index.html
http://cran.r-project.org/web/packages/RSQLite/index.html

SQLite

- SQLite is (according to its website) the most widely used DBMS... I'm not sure I buy that... but it is certainly very handy.
- SQLite is a self-contained DBMS that lives in a single C library (a single file that can be compiled and linked as part of nearly any program).
- It uses local files instead of remote connections.
- It has many disadvantages when databases are very large or need to support many connections or threads, and it is not beefy enough for lots of tasks (thus the Lite).
- It has dynamic typing (values in a table are typed element-wise rather than per column)... this drives lots of people nuts.
- We are using it because our main interest is breaking up operations and organizing data within a single thread, and because the principles translate to MySQL, etc.
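The dynamic typing point can be checked directly with SQLite's built-in `typeof()` function. The sketch below is a minimal illustration on a throwaway in-memory database; the table `t` and its column `x` are made up for this example.

```r
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# The column is declared INTEGER, but that is only a preference, not a constraint.
dbExecute(con, "CREATE TABLE t (x INTEGER)")
dbExecute(con, "INSERT INTO t VALUES (1)")       # a plain integer
dbExecute(con, "INSERT INTO t VALUES ('2')")     # text that looks like an integer
dbExecute(con, "INSERT INTO t VALUES ('two')")   # text that does not
dbExecute(con, "INSERT INTO t VALUES (2.5)")     # a real number

# Ask SQLite how each individual value is actually stored.
types <- dbGetQuery(con, "SELECT typeof(x) AS storage FROM t ORDER BY rowid")
print(types)
dbDisconnect(con)
```

The text `'2'` is converted to an integer because the column prefers integers, while `'two'` and `2.5` are stored as text and real in the same column. Most other SQL systems would reject or coerce such inserts.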

SQLite in R

    require( DBI )
    ## specific DBMS
    require( RSQLite )
    ## could be: Berkeley DB, MySQL, Oracle, ODBC, PostgreSQL
    ## we choose: SQLite because we're slackers!

    # create a SQLite instance and create one connection.
    m <- dbDriver("SQLite")

    # initialize a new database in a tempfile and copy some data.frame
    # from the base package into it
    tfile <- tempfile()
    con <- dbConnect(m, dbname = tfile)
    data(USArrests)
    # row.names = TRUE stores the state names in a column called row_names
    dbWriteTable(con, "USArrests", USArrests, row.names = TRUE)

    require( lattice )
    data( barley )
    dbWriteTable(con, "barley", barley)

DBI -> RSQLite -> SQLite

The rest of the commands will be DBI calls, and DBI calls wrapping SQL. The SQLite-specific work is handled mostly silently along the chain from the DBI connection, to RSQLite, to the SQLite database (which is just a file).
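The "just a file" point can be checked by writing a table, closing the connection, and reopening the same file. This is a minimal sketch; the file name `db_file` is a temporary path chosen for this example.

```r
library(DBI)
library(RSQLite)

# The whole database will live in this one (temporary) file.
db_file <- tempfile(fileext = ".sqlite")

con <- dbConnect(RSQLite::SQLite(), dbname = db_file)
dbWriteTable(con, "USArrests", USArrests, row.names = TRUE)
dbDisconnect(con)

# The connection is gone, but the data is still on disk.
file_exists <- file.exists(db_file)

# A fresh connection to the same file sees the same table.
con2   <- dbConnect(RSQLite::SQLite(), dbname = db_file)
tables <- dbListTables(con2)
n_rows <- dbGetQuery(con2, "SELECT COUNT(*) AS n FROM USArrests")$n
dbDisconnect(con2)

unlink(db_file)   # clean up the temporary file
```

Because the database is an ordinary file, the same file could be opened by any other program or language with an SQLite library.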

Making a dataframe into a table

    require( lattice )
    data( barley )
    dbWriteTable(con, "barley", barley)

    rs <- dbSendQuery(con, "select * from USArrests")
    d1 <- fetch(rs, n = 10)    # extract data in chunks of 10 rows
    fetch( rs, n = 1)
    d <- fetch(rs, n = -1)     # extract all remaining data
    dbClearResult(rs)
    dbListTables(con)

    rs <- dbSendQuery(con, "select Murder from USArrests")
    fetch( rs )
    dbClearResult(rs)

    rs <- dbSendQuery(con, paste("select row_names, ",
                                 " Murder from USArrests where Murder < 10.0" ))
    fetch( rs, n = 10)     ## get first 10
    fetch( rs, n = -1)     ## get rest
    dbClearResult(rs)

    dbListTables(con)
    dbDisconnect(con)

R slices in SQL

    rs <- dbSendQuery(con, "select * from USArrests")
    d1 <- fetch(rs, n = 10)    # extract data in chunks of 10 rows
    ## returns if rs has un-fetched records left
    fetch( rs, n = 10)[1:, :]
    fetch( rs, n = 1)
    d <- fetch(rs, n = -1)     # extract all remaining data
    dbClearResult(rs)
    dbListTables(con)

    # clean up
    rs <- dbSendQuery(con, "select Murder from USArrests")
    fetch( rs )
    dbClearResult(rs)

    rs <- dbSendQuery(con, paste("select row_names, ",
                                 " Murder from USArrests where Murder < 10.0" ))
    fetch( rs, n = 10)
    dbClearResult(rs)

    dbListTables(con)
    dbDisconnect(con)
    file.info(tfile)
    file.remove(tfile)
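Fetching in chunks becomes useful once it is wrapped in a loop that keeps only a running summary. The sketch below computes the mean of one column while never holding more than 10 rows at a time. It is an illustration rather than the lecture's own code: it uses `dbFetch()` and `dbHasCompleted()` from DBI, and the chunk size and variable names are choices made for this example.

```r
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "USArrests", USArrests, row.names = TRUE)

rs <- dbSendQuery(con, "SELECT Murder FROM USArrests")

total         <- 0   # running sum of the column
n_seen        <- 0   # number of rows processed so far
max_rows_held <- 0   # largest chunk ever held in memory

# Pull 10 rows at a time until the result set is exhausted.
while (!dbHasCompleted(rs)) {
  chunk         <- dbFetch(rs, n = 10)
  total         <- total + sum(chunk$Murder)
  n_seen        <- n_seen + nrow(chunk)
  max_rows_held <- max(max_rows_held, nrow(chunk))
}

dbClearResult(rs)
dbDisconnect(con)

chunked_mean <- total / n_seen
cat("mean Murder rate:", chunked_mean, "\n")
```

The same pattern works for any statistic that can be updated one block at a time, such as sums, counts, minima, maxima, or cross-products.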

R slices in SQL

    require( DBI )
    require( RSQLite )
    load("baa.ratios.rda")

    ## stay away from dots when using SQL!!!
    rownames( ratios ) <- gsub( "\\.", "\\_", rownames( ratios ) )
    colnames( ratios ) <- gsub( "\\.", "\\_", colnames( ratios ) )

    mm <- dbDriver("SQLite")
    con <- dbConnect(mm, dbname = tfile)
    sql.file <- "ba.ratios.sqlite"   ## not legal name

    ## dbWriteTable(con, "ba_ratios", as.data.frame(ratios) )
    ## row.names = TRUE stores the gene names in a column called row_names
    dbWriteTable(con, "ba_ratios", data.frame( ratios ), row.names = TRUE)

    ## colnames of dataframe are col names of table
    rs <- dbSendQuery(con, "select * from ba_ratios")
    d1 <- fetch(rs, n = 10)
    dbClearResult(rs)
    col.names <- colnames( d1 )
    rm( d1 )

    ## getting gene names in table
    rs <- dbSendQuery( con, "select row_names from ba_ratios")
    row.names <- fetch( rs, n = -1)
    dbClearResult(rs)   ## don't need if fetched all

R slices in SQL

    ### slicing out rows
    par( mfrow = c(2, 1) )
    genes.selected <- c(,4,45)
    matplot( t(ratios[ genes.selected, ]), type = "b", main = "using R matrix")

    rs <- dbSendQuery( con, paste("select * from ba_ratios where row_names in ( \'",
                                  paste( rownames( ratios )[genes.selected], collapse = "\',\'"),
                                  "\' )", sep = "" ) )
    d1 <- fetch( rs, n = -1)
    matplot( t(d1[, -1]), type = "b", main = "sliced from SQLite db" )
    ## -1 gets rid of row_names col

    dbClearResult(rs)
    dbListTables(con)
    dbDisconnect(con)
    file.info(tfile)
    file.remove(tfile)

[Figure: two stacked matplot panels. Top: "using R matrix", y axis t(ratios[genes.selected, ]). Bottom: "sliced from SQLite db", y axis t(d1[, -1]).]
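Pasting names into the SQL string works, but it breaks as soon as a name contains a quote. An alternative is to let the driver bind the values to `?` placeholders. The sketch below shows that variant on the built-in USArrests data rather than the ratios matrix; the chosen state names and the variable names are illustrative assumptions.

```r
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "USArrests", USArrests, row.names = TRUE)

wanted <- c("Alaska", "New York", "Texas")   # illustrative row names

# One "?" per requested name; the values are bound by the driver,
# not pasted into the SQL text.
placeholders <- paste(rep("?", length(wanted)), collapse = ", ")
sql <- paste0("SELECT * FROM USArrests WHERE row_names IN (", placeholders, ")")
sliced <- dbGetQuery(con, sql, params = as.list(wanted))

dbDisconnect(con)

# The same slice taken directly from the in-memory data frame, for comparison.
in_memory <- USArrests[wanted, ]
print(sliced)
```

The row order of an `IN` query is not guaranteed, so when comparing against the in-memory slice it is safest to match rows on `row_names` rather than on position.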

Reading and assignment

Reading: SQL, SQLite, RSQLite, MySQL docs and tutorials.

Non-graded assignment:
1. Create a SQLite database and make a new table holding ratios (from baa.ratios.rda).
2. rm( ratios ) ; gc()
3. Redo the cor.explore function using your SQLite db... never have more than 0 rows in active memory.
4. Make and fill a new table that stores, for each gene, the names of the genes with correlation > 0.75.

I need a volunteer to:
1. Create a SQLite database and make a new table holding ratios (from baa.ratios.rda).
2. Quit out of R, keeping the SQLite database on disk.
3. Write a Python program that then accesses the saved SQLite DB and, given a gene name, outputs the row of ratios for that gene.