Computing with large data sets
Richard Bonneau, spring 2009
mini-lecture 1 (week 7): big data, R databases, RSQLite, DBI, clara
different reasons for using databases
There are a multitude of reasons for using transactional databases when programming with data:
1. the word "data" is in the word "database".
2. Reducing the active memory needed to carry out an operation or search over a large dataset (e.g. look for the best correlation over a matrix with 1,000,000 rows, given a single row).
3. Organizing multiple interlinked datatypes, using SQL syntax to conveniently construct queries and relying on SQL to organize the data under the hood.
4. Sharing complex data with another program / language / multiple threads.
v.0480: computing with data, Richard Bonneau Lecture 1
splitting up large memory operations
What system memory would be required to run lars( y ~ X ) if we had 1,000,000 observations and thousands of predictors? What system memory do we need to cluster 500,000 genetic changes measured for tens of thousands of individuals? Assuming there are many fewer classes / clusters / model-components than observations, we can use a block approach, where only a small fraction of the data is needed at any given time. Several methods exist for dividing operations on large matrices or datasets, and databases help us optimize and structure our code's access to arbitrary subsets of the data, flushing what we aren't currently looking at from active memory.
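A minimal sketch of the block approach, in Python with the standard-library sqlite3 module (the table name, column layout, and toy data here are all made up for illustration): scan a table in fixed-size chunks with fetchmany, so only one chunk is ever in memory while we search for the stored row best correlated with a query row.

```python
import math
import random
import sqlite3

def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

# toy database: 1,000 rows of 20 "expression" values each
con = sqlite3.connect(":memory:")
cols = ", ".join("c%d REAL" % i for i in range(20))
con.execute("CREATE TABLE ratios (id INTEGER, %s)" % cols)
random.seed(0)
rows = [(i,) + tuple(random.gauss(0, 1) for _ in range(20)) for i in range(1000)]
con.executemany("INSERT INTO ratios VALUES (%s)" % ",".join("?" * 21), rows)

query = rows[42][1:]            # the single row we compare everything to
best_id, best_r = None, -2.0
cur = con.execute("SELECT * FROM ratios")
while True:
    chunk = cur.fetchmany(100)  # only 100 rows in active memory at a time
    if not chunk:
        break
    for row in chunk:
        r = pearson(query, row[1:])
        if r > best_r:
            best_id, best_r = row[0], r
print(best_id)  # row 42 matches itself with r = 1
```

The same fetch-in-chunks pattern is what fetch(rs, n = 10) gives you in R's DBI, below.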
what are transactional databases
We will use SQL-type DBMS in this class (a subset of transactional databases). Transactional databases are ACID:
- Atomic: all of a transaction is completed OR none of it.
- Consistent: all completed transactions leave the DB in a state compliant with its rules.
- Isolated: you can't see results until the transaction is complete.
- Durable: once a transaction completes, the change persists even if the program crashes afterward, the system goes down, etc. (within reason, i.e. no lightning-strike clause).
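Atomicity can be seen directly with Python's sqlite3 (a toy two-account transfer, names invented for the example): if the program dies mid-transaction, the half-finished work is rolled back rather than left in the database.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT, balance REAL)")
con.execute("INSERT INTO accounts VALUES ('alice', 100.0), ('bob', 0.0)")
con.commit()

try:
    with con:  # one transaction: commits on success, rolls back on error
        con.execute("UPDATE accounts SET balance = balance - 50 "
                    "WHERE name = 'alice'")
        # simulate a crash before the second half of the transfer runs
        raise RuntimeError("crash mid-transfer")
        con.execute("UPDATE accounts SET balance = balance + 50 "
                    "WHERE name = 'bob'")
except RuntimeError:
    pass

# atomic: the debit from alice was rolled back along with everything else
balances = dict(con.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100.0, 'bob': 0.0}
```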
relational databases
SQL-based systems are the most common kind of relational database management system: MySQL, Oracle, SQLite, etc. are all SQL relational databases. "Relational" refers to the grouping of entries in the DB by conditional statements on their attributes:
- return all rows with attribute.x = x
- return parts of rows with attribute.y > y.thresh
Examples below and in the exercise.
relational databases
Advantages of SQL-like DBMS:
- Code and style of code are roughly compatible across many systems.
- Methods for porting and converting from one system to another exist, or could easily be built.
- Libraries exist in nearly all languages.
a single page SQL tutorial and links
We'll use R's functions to create tables. Many good tutorials and brief docs exist:
http://www.sqlcourse.com/
http://www.mysql.com/
http://www.sqlite.org/

select * from USArrests

select Murder from USArrests

select row_names, Murder from USArrests where Murder < 10.0

insert into employee (first, last, age, address, city, state)
  values ('Rich', 'Bonneau', , ' washington sq.', 'New York', 'NY')

delete from employee where last = 'Gentlemen'

SELECT id, firstn, lastn, title, salary FROM employee_info
  WHERE salary >= 55000.00 AND title = 'Saucier'

SELECT g.id, g.mirror, g.diam, e.voltage
  FROM geom_table as g, elec_measures as e
  WHERE g.id = e.id AND g.mirrortype = 'inside'
  ORDER BY g.diam
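The insert / delete / select statements above can be tried end-to-end against a throwaway SQLite database, e.g. from Python's sqlite3 (only the name columns are filled in, since the slide's other values are elided):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employee (first TEXT, last TEXT, age INTEGER, "
            "address TEXT, city TEXT, state TEXT)")
# same insert shape as the tutorial statement above
con.execute("INSERT INTO employee (first, last) VALUES ('Rich', 'Bonneau')")
con.execute("INSERT INTO employee (first, last) VALUES ('Robert', 'Gentlemen')")
# same delete as above: remove one row by a conditional on an attribute
con.execute("DELETE FROM employee WHERE last = 'Gentlemen'")
remaining = [r[0] for r in con.execute("SELECT first FROM employee")]
print(remaining)  # ['Rich']
```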
databases in R
Core connection to a DB: DBI.
http://cran.r-project.org/web/packages/DBI/vignettes/DBI.pdf
DBI requires a driver for the specific DBMS used:
http://cran.r-project.org/web/packages/RMySQL/index.html
http://cran.r-project.org/web/packages/RSQLite/index.html
SQLite
SQLite is (according to its website) the most widely used DBMS... I'm not sure I buy that... but it is certainly very handy. SQLite is a self-contained DBMS implemented in a single C library (a single file that can be compiled and linked as part of nearly any program). It uses local files instead of remote connections. It has disadvantages when databases are very large or need to support many connections or threads, and it is not beefy enough for lots of tasks (thus the "Lite"). It has dynamic typing (values in a column are typed element-wise)... this drives lots of people nuts. We are using it because our main interest is breaking up operations and organizing data within a single thread, and because the principles translate to MySQL, etc.
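The element-wise typing is easy to demonstrate (a toy table, again via Python's sqlite3): a declared column type in SQLite is only an "affinity", so a single column can end up holding integers, text, and reals.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")  # INTEGER is only an affinity
con.executemany("INSERT INTO t VALUES (?)",
                [(1,), ("not a number",), (3.5,)])
# each stored value keeps its own type, regardless of the declared column type
types = [type(v).__name__ for (v,) in con.execute("SELECT x FROM t")]
print(types)  # ['int', 'str', 'float']
```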
SQLite in R

require( DBI )      ## core database interface
require( RSQLite )  ## driver for the specific DBMS; could be: Berkeley DB, MySQL, Oracle, ODBC, PostgreSQL
## we choose SQLite because we're slackers!

# create a SQLite driver instance and one connection
m <- dbDriver("SQLite")

# initialize a new database in a tempfile and copy some data.frames
# from the base package into it
tfile <- tempfile()
con <- dbConnect(m, dbname = tfile)
data( USArrests )
dbWriteTable(con, "USArrests", USArrests)
require( lattice )
data( barley )
dbWriteTable(con, "barley", barley)
DBI -> RSQLite -> SQLite
The rest of the commands will be DBI, and DBI wrapping SQL. The SQLite-specific work is handled mostly silently along the chain: DBI connection -> RSQLite -> SQLite database (just a file).
making a dataframe into a table

require( lattice )
data( barley )
dbWriteTable(con, "barley", barley)

rs <- dbSendQuery(con, "select * from USArrests")
d1 <- fetch(rs, n = 10)  # extract data in chunks of 10 rows
fetch( rs, n = 1 )
d <- fetch(rs, n = -1)   # extract all remaining data
dbClearResult(rs)
dbListTables(con)

rs <- dbSendQuery(con, "select Murder from USArrests")
fetch( rs )
dbClearResult(rs)

rs <- dbSendQuery(con, paste("select row_names, ",
    " Murder from USArrests where Murder < 10.0"))
fetch( rs, n = 10 )  ## get first 10
fetch( rs, n = -1 )  ## get rest
dbClearResult(rs)

dbListTables(con)
dbDisconnect(con)
R slices in SQL

rs <- dbSendQuery(con, "select * from USArrests")
d1 <- fetch(rs, n = 10)  # extract data in chunks of 10 rows
## fetch keeps returning rows while rs has un-fetched records left
fetch( rs, n = 10 )
fetch( rs, n = 1 )
d <- fetch(rs, n = -1)   # extract all remaining data
dbClearResult(rs)
dbListTables(con)  # clean up

rs <- dbSendQuery(con, "select Murder from USArrests")
fetch( rs )
dbClearResult(rs)

rs <- dbSendQuery(con, paste("select row_names, ",
    " Murder from USArrests where Murder < 10.0"))
fetch( rs, n = 10 )
dbClearResult(rs)

dbListTables(con)
dbDisconnect(con)
file.info(tfile)
file.remove(tfile)
R slices in SQL

require( DBI )
require( RSQLite )
load("baa.ratios.rda")

## stay away from dots when using SQL!!!
rownames( ratios ) <- gsub( "\\.", "_", rownames( ratios ) )
colnames( ratios ) <- gsub( "\\.", "_", colnames( ratios ) )

mm <- dbDriver("SQLite")
sql.file <- "ba.ratios.sqlite"
con <- dbConnect(mm, dbname = sql.file)

## "ba.ratios" would not be a legal table name:
## dbWriteTable(con, "ba_ratios", as.data.frame(ratios) )
dbWriteTable(con, "ba_ratios", data.frame( ratios ) )
## colnames of the dataframe become the column names of the table

rs <- dbSendQuery(con, "select * from ba_ratios")
d1 <- fetch(rs, n = 10)
dbClearResult(rs)
col.names <- colnames( d1 )
rm( d1 )

## getting gene names in the table
rs <- dbSendQuery( con, "select row_names from ba_ratios")
row.names <- fetch( rs, n = -1 )
dbClearResult(rs)  ## don't need if we fetched all
R slices in SQL

### slicing out rows
par( mfrow = c(2, 1) )
genes.selected <- c(2, 4, 45)
matplot( t(ratios[ genes.selected, ]), type = "b", main = "using R matrix")

rs <- dbSendQuery( con, paste("select * from ba_ratios where row_names in ( '",
    paste( rownames( ratios )[genes.selected], collapse = "','" ),
    "' )", sep = "" ) )
d1 <- fetch( rs, n = -1 )
matplot( t(d1[ , -1]), type = "b", main = "sliced from SQLite db" )
## -1 drops the row_names column
dbClearResult(rs)
dbListTables(con)
dbDisconnect(con)
file.info(tfile)
file.remove(tfile)

[figure: the two matplots of the selected rows — top panel "using R matrix", bottom panel "sliced from SQLite db"]
reading and assignment
Reading: SQL, SQLite, RSQLite, MySQL docs and tutorials.

Non-graded assignment:
1. Create a SQLite database and make a new table holding ratios (from baa.ratios.rda).
2. rm( ratios ) ; gc()
3. Redo the cor.explore function using your SQLite db... never hold more than a small block of rows in active memory.
4. Make and fill a new table that stores, for each gene, the names of the genes with correlation > 0.75.

I need a volunteer to:
1. Create a SQLite database and make a new table holding ratios (from baa.ratios.rda), then quit out of R, keeping the SQLite database on disk.
2. Write a python program that accesses the saved SQLite DB and, given a gene name, outputs the row of ratios for that gene.
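A sketch of what the volunteer's Python program might look like, using the standard-library sqlite3 module. The table name ba_ratios and the row_names column follow the R example earlier; the demo database contents here are made up, and you would point db_path at the file saved from R instead.

```python
import os
import sqlite3
import tempfile

def gene_row(db_path, gene):
    """Given a gene name, return that gene's row of ratios from the
    ba_ratios table (names as in the lecture's R code; adjust to your schema)."""
    con = sqlite3.connect(db_path)
    row = con.execute(
        "SELECT * FROM ba_ratios WHERE row_names = ?", (gene,)).fetchone()
    con.close()
    return row

# demo: build a tiny on-disk database standing in for the one saved from R
fd, db_path = tempfile.mkstemp(suffix=".sqlite")
os.close(fd)
con = sqlite3.connect(db_path)
con.execute("CREATE TABLE ba_ratios (row_names TEXT, cond1 REAL, cond2 REAL)")
con.execute("INSERT INTO ba_ratios "
            "VALUES ('gene_A', 0.5, -1.2), ('gene_B', 2.0, 0.1)")
con.commit()
con.close()

result = gene_row(db_path, "gene_B")
print(result)  # ('gene_B', 2.0, 0.1)
os.remove(db_path)
```

Because SQLite is just a file, the database written by R's dbWriteTable and the one read here by Python are the same thing — no server in between.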