Analyzing moderately large data sets


1 Analyzing moderately large data sets Thomas Lumley R Core Development Team useR! 2009, Rennes

2 Problem Analyzing a data set that contains N observations on P variables, where N × P is inconveniently large (from 100Mb on small computers up to many Gb). I will mostly assume that any task can be broken down into pieces that use only p variables, where p is small. Since N is assumed large, only tasks with complexity (nearly) linear in N are feasible. A mixture of concrete implementations for linear models and general ideas.

3 Outline Technologies: connections, netcdf, SQL, memory profiling. Strategies: load only the necessary variables; incremental bounded-memory algorithms; randomized subset algorithms; outsource the computing to SQL.

4 Examples Linear regression and subset selection on large data sets (large N, small P). Analysis of large surveys (large N, P ≈ 500, p ≈ 10). Whole-genome association studies (N ≈ 5000, P ≈ 10^6, p = 1). Whole-genome data cleaning (P ≈ 5000, N ≈ 10^6, p = 1).

5 File connections file() opens a connection to a file. read.table() and friends can read incrementally from the connection: read 1000 records, process them, lather, rinse, repeat. Examples: summaryRprof(), read.fwf(). Chunk size should be small enough not to cause disk thrashing, large enough that the fixed costs of read.table() don't matter. Empirically, performance is constant over a fairly wide range.
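A minimal sketch of this read-process-repeat loop for a simple comma-separated file (the file name data.csv and the process() function are placeholders, not part of the original example):

con <- file("data.csv", open = "r")
cols <- strsplit(readLines(con, n = 1), ",")[[1]]  # consume and split the header line
repeat {
    lines <- readLines(con, n = 1000)              # next chunk of raw records
    if (length(lines) == 0) break                  # connection exhausted
    chunk <- read.csv(text = lines, header = FALSE, col.names = cols)
    process(chunk)   # placeholder: accumulate summaries, update a fit, ...
}
close(con)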

6 File connections Since read.table() is the bottleneck, make sure to use it efficiently. Specify correct types in the colClasses= argument to get faster conversions. Specify "NULL" for columns that you don't need to read.
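For example, a sketch of skipping most of a wide file (the column positions, types, and file name are made up for illustration):

## suppose big.csv has 20 columns and we need only columns 1, 5, and 7
classes <- rep("NULL", 20)          # "NULL" columns are skipped entirely
classes[c(1, 5, 7)] <- c("integer", "numeric", "numeric")
small <- read.csv("big.csv", colClasses = classes)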

7 netcdf netcdf is an array-structured storage format developed for atmospheric science research (HDF5 from NCSA is similar, but R support is not as good). Allows efficient reading of rows, columns, layers, general hyperrectangles of data. We have used this for whole-genome genetic data. Martin Morgan (FHCRC) reports that this approach to whole-genome data scales to a cluster with hundreds of nodes.

8 Large data netcdf was designed by the NSF-funded UCAR consortium, who also manage the National Center for Atmospheric Research. Atmospheric data are often array-oriented: eg temperature, humidity, wind speed on a regular grid of (x, y, z, t). Need to be able to select rectangles of data eg range of (x, y, z) on a particular day t. Because the data are on a regular grid, the software can work out where to look on disk without reading the whole file: efficient data access.

9 WGA Array-oriented data (position on genome, sample number) for genotypes, probe intensities. Potentially very large data sets: 2,000 people × 300,000 SNPs = tens of Gb; 16,000 people × 1,000,000 SNPs = hundreds of Gb. Even worse after imputation to 2,500,000 SNPs. R can't handle a matrix with more than 2^31 - 1 (about two billion) entries even if your computer has memory for it. Even data for one chromosome may be too big.

10 Using netcdf data With the ncdf package: open.ncdf() opens a netcdf file and returns a connection to the file (rather than loading the data). get.var.ncdf() retrieves all or part of a variable. close.ncdf() closes the connection to the file.

11 Dimensions Variables can use one or more of the array dimensions of a file. [Slide diagram: an array with a SNP dimension and a Sample dimension; the Genotypes variable uses both dimensions, the Chromosome variable uses only the SNP dimension.]

12 Example Finding long homozygous runs (possible deletions)
nc <- open.ncdf("hapmap.nc")
## read all of the chromosome variable
chromosome <- get.var.ncdf(nc, "chr", start=1, count=-1)
## set up a list for results
runs <- vector("list", nsamples)
for(i in 1:nsamples){
    ## read all genotypes for one person
    genotypes <- get.var.ncdf(nc, "geno", start=c(1,i), count=c(-1,1))
    ## zero for heterozygous, chromosome number for homozygous
    hmzygous <- genotypes != 1
    hmzygous <- as.vector(hmzygous*chromosome)

13 Example (continued)
    ## consecutive runs of the same value
    r <- rle(hmzygous)
    begin <- cumsum(c(1, r$lengths))
    end <- cumsum(r$lengths)
    long <- which(r$lengths > 250 & r$values != 0)
    runs[[i]] <- cbind(begin[long], end[long], r$lengths[long])
}
close.ncdf(nc)
Notes: chr uses only the SNP dimension, so start and count are single numbers; geno uses both SNP and sample dimensions, so start and count have two entries; rle() compresses runs of the same value to a single entry.

14 Creating netcdf files Creating files is more complicated: define dimensions; define variables and specify which dimensions they use; create an empty file; write data to the file.

15 Dimensions Specify the name of the dimension, the units, and the allowed values in the dim.def.ncdf() function. One dimension can be unlimited, allowing expansion of the file in the future. An unlimited dimension is important; otherwise the maximum variable size is 2Gb.
snpdim <- dim.def.ncdf("position","bases", positions)
sampledim <- dim.def.ncdf("seqnum","count", 1:10, unlim=TRUE)

16 Variables Variables are defined by name, units, and dimensions.
varchrm <- var.def.ncdf("chr","count", dim=snpdim, missval=-1, prec="byte")
varsnp <- var.def.ncdf("snp","rs", dim=snpdim, missval=-1, prec="integer")
vargeno <- var.def.ncdf("geno","base", dim=list(snpdim, sampledim), missval=-1, prec="byte")
vartheta <- var.def.ncdf("theta","deg", dim=list(snpdim, sampledim), missval=-1, prec="double")
varr <- var.def.ncdf("r","copies", dim=list(snpdim, sampledim), missval=-1, prec="double")

17 Creating the file The file is created by specifying the file name and a list of variables.
genofile <- create.ncdf("hapmap.nc", list(varchrm, varsnp, vargeno, vartheta, varr))
The file is empty when it is created. Data can be written using put.var.ncdf(). Because the whole data set is too large to read, we might read raw data and save to netcdf for one person at a time.
for(i in 1:4000){
    geno <- readRawData(i) ## somehow
    put.var.ncdf(genofile, "geno", geno, start=c(1,i), count=c(-1,1))
}

18 SQL Relational databases are the natural habitat of many large data sets. They are good at data management, merging data sets, selecting subsets, and simple data summaries. Relational databases all speak Structured Query Language (SQL), which is more or less standardized. Most of the complications of SQL deal with transaction processing, which we don't have. Initially we care about the SELECT statement.

19 SQL
SELECT var1, var2 FROM table WHERE condition
SELECT sum(var1) as svar1, sum(var2) as svar2 FROM table GROUP BY category
SELECT * FROM table limit 1
SELECT count(*) FROM table
SELECT count(*) FROM table GROUP BY category

20 SQL The DBI package defines an interface to databases, implemented by eg RSQLite. RODBC defines a similar but simpler interface to ODBC (predominantly a Windows standard interface). SQLite (http://www.sqlite.org), via RSQLite or sqldf, is the easiest database to use. It isn't suitable for very large databases or for large numbers of simultaneous users, but it is designed to be used as embedded storage for other programs. Gabor Grothendieck's sqldf package makes it easier to load data into SQLite and manipulate it in R.

21 DBI/RSQLite
library(RSQLite)
sqlite <- dbDriver("SQLite")
conn <- dbConnect(sqlite, "~/CHIS/brfss07.db")
dbGetQuery(conn, "select count(*) from brfss")
results <- dbSendQuery(conn, "select HSAGEIR from brfss")
fetch(results, 10)
fetch(results, 10)
dbClearResult(results)
Change the dbDriver() call to connect to a different database program (eg Postgres).
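A sketch of processing a whole result set in bounded memory with this interface (the table and column are the same made-up BRFSS example as above; process() is a placeholder):

library(RSQLite)
conn <- dbConnect(dbDriver("SQLite"), "~/CHIS/brfss07.db")
results <- dbSendQuery(conn, "select HSAGEIR from brfss")
while (!dbHasCompleted(results)) {
    chunk <- fetch(results, n = 1000)   # next 1000 rows
    process(chunk)                      # placeholder: accumulate whatever summary is needed
}
dbClearResult(results)
dbDisconnect(conn)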

22 Memory profiling When memory profiling is compiled into R: tracemem(object) reports all copying of object. Rprof(memory.profiling=TRUE) writes information on duplications, large memory allocations, and new heap pages at each tick of the time-sampling profiler. Rprofmem() is a pure allocation profiler: it writes a stack trace for any large memory allocation or any new heap page. Bioconductor programmers have used tracemem() to cut down copying for single large data objects. Rprofmem() showed that foreign::read.dta() had a bottleneck in as.data.frame.list(), which copies each column 10 times. (!)
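A small sketch of the two profilers in use, assuming an R build with memory profiling enabled (the objects and expressions are arbitrary):

x <- rnorm(1e6)
tracemem(x)      # prints a message every time x is duplicated
y <- x
y[1] <- 0        # modifying the copy forces a duplication, which tracemem reports

Rprof("prof.out", memory.profiling = TRUE)
d <- as.data.frame(matrix(rnorm(5e6), ncol = 20))
Rprof(NULL)
summaryRprof("prof.out", memory = "both")   # time and memory use by function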

23 Load-on-demand Use case: large surveys. The total data set size is large (eg N ≈ 10^5, P ≈ 10^3), but any given analysis uses only p ≈ 10 variables. Loading the whole data set is feasible on 64-bit systems, but not on my laptop. We want an R object that represents the data set and can load the necessary variables when they are needed. Difficult in general: data frames work because they are special-cased by the internal code, especially model.frame(). Formula + data.frame interface is tractable, as is formula + expression.

24 Basic idea
doSomething <- function(formula, database){
    varlist <- paste(all.vars(formula), collapse=", ")
    query <- paste("select", varlist, "from", database$tablename)
    dataframe <- dbGetQuery(database$connection, query)
    ## now actually do Something
    fitModel(formula, dataframe)
}
First construct a query to load all the variables you need, then submit it to the database to get the data frame, then proceed as usual. Refinements: some variables may be in memory, not in the database; we may need to define new variables; we may want to wrap an existing set of code.

25 Wrapping existing code Define a generic function to dispatch on the second (data) argument
doSomething <- function(formula, data, ...){
    UseMethod("doSomething", data)
}
and set the existing function as the default method. The method for database objects loads the variables and then calls the generic again on the data frame:
doSomething.database <- function(formula, database, ...){
    varlist <- paste(all.vars(formula), collapse=", ")
    query <- paste("select", varlist, "from", database$tablename)
    dataframe <- dbGetQuery(database$connection, query)
    ## now actually do Something
    doSomething(formula, dataframe, ...)
}

26 Allowing variables in memory To allow the function to pick up variables from memory, just restrict the database query to variables that are in the database:
dbvars <- names(dbGetQuery(conn, "select * from table limit 1"))
formulavars <- all.vars(formula)
varlist <- paste(setdiff(formulavars, dbvars), collapse=", ")
[In practice we would find the list of names in the database first and cache it in an R object.] Now model.frame() will automatically pick up variables in memory, unless they are masked by variables in the database table: the same situation as for data frames.

27 Allowing updates Three approaches: write new variables into the database with SQL code (needs permission, reference semantics, restricted to SQL syntax); create new variables in memory and save them to the database (needs permission, reference semantics, high network traffic); store the expressions and create new variables on data load (wasted effort). The third strategy wins when data loading is the bottleneck and the variable will eventually have to be loaded into R.

28 Design A database object stores the connection, table name, and new-variable information. New variables are created with the update method:
mydata <- update(mydata,
    avgchol = (chol1 + chol2)/2,
    hibp = (systolic > 140) | (diastolic > 90))
An expression can use variables in the database or previously defined ones, but not simultaneously defined ones. Multiple update()s give a stack of lists of expressions. Use all.vars going down the stack to find which variables to query; return up the stack, making the necessary transformations with eval(). Implemented in the survey and mitools packages.
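A bare-bones sketch of the expression-stack idea (not the survey package's actual implementation): each update() appends a named list of quoted expressions, all.vars() over the stack gives the variables to request from the database, and the stored expressions are then evaluated in order within the fetched data frame.

## stack of updates, oldest first; each element is a named list of expressions
updates <- list(
    list(avgchol = quote((chol1 + chol2)/2)),
    list(hibp    = quote((systolic > 140) | (diastolic > 90)))
)
## variables that must come from the database table
needed <- unique(unlist(lapply(updates, function(u) lapply(u, all.vars))))
## after fetching 'dataframe' with those columns, replay the updates in order
for (u in updates)
    for (nm in names(u))
        dataframe[[nm]] <- eval(u[[nm]], dataframe)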

29 Object structure survey.design objects contain a data frame of variables and a couple of matrices of survey design metadata. DBIsvydesign objects contain the survey metadata and database information: connection, table name, driver name, database name. All analysis functions are already generic, dispatching on the type of survey metadata (replicate weights vs strata and clusters). Define methods for DBIsvydesign that load the necessary variables into the $variables slot and call NextMethod() to do the work. Ideally we would just override model.frame(), but it is already generic on its first argument, which in hindsight was probably a mistake.

30 Object structure
> svymean
function (x, design, na.rm = FALSE, ...)
{
    .svycheck(design)
    UseMethod("svymean", design)
}
> survey:::svymean.DBIsvydesign
function (x, design, ...)
{
    design$variables <- getvars(x, design$db$connection, design$db$tablename,
        updates = design$updates)
    NextMethod("svymean", design)
}

31 Incremental algorithms: biglm Linear regression takes O(np^2 + p^3) time, which can't be reduced easily (for large p you can replace p^3 by p^{log2 7}, but not usefully). The R implementation takes O(np + p^2) memory, but this can be reduced dramatically by constructing the model matrix in chunks. Either compute X^T X and X^T y in chunks and use β̂ = (X^T X)^{-1} X^T y, or compute the incremental QR decomposition of X to get R and Q^T y, then solve Rβ = Q^T y. The second approach is slightly slower but more accurate. Coding it would be a pain, but Alan Miller has done it (Applied Statistics algorithm AS 274): the Fortran subroutine INCLUD in biglm and in leaps updates R and Q^T y by adding one new observation.
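As a concrete illustration of the first (normal-equations) approach, a minimal sketch that accumulates X^T X and X^T y over chunks; getNextChunk() is a placeholder for whatever supplies the next block of rows, and the formula is arbitrary:

xtx <- NULL; xty <- NULL
repeat {
    dat <- getNextChunk()                          # placeholder: returns NULL when exhausted
    if (is.null(dat)) break
    X <- model.matrix(~ x1 + x2 + x3, data = dat)  # same formula for every chunk
    y <- dat$y
    if (is.null(xtx)) { xtx <- crossprod(X);       xty <- crossprod(X, y) }
    else              { xtx <- xtx + crossprod(X); xty <- xty + crossprod(X, y) }
}
betahat <- solve(xtx, xty)   # (X^T X)^{-1} X^T y; less accurate than the QR approach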

32 Incremental algorithms: biglm The only new methodology in biglm is an incremental computation for the Huber/White sandwich variance estimator. The middle term in the sandwich is Σ_{i=1}^n x_i (y_i - x_i^T β̂)^2 x_i^T, and since β̂ is not known in advance this appears to require a second pass through the data. In fact, we can accumulate a (p+1)^2 × (p+1)^2 matrix of products of x and y that is sufficient to compute the sandwich estimator without a second pass.

33 biglm: interface biglm() uses the traditional formula/data.frame interface to set up a model. update() adds a new chunk of data to the model. The model matrices have to be compatible, so data-dependent terms are not allowed: factors must have their full set of levels specified (not necessarily present in the data chunk), splines must use explicitly specified knots, etc. One subtle issue is checking for linear dependence. This can't be done until the end, because the next chunk of data could fix the problem. The checking is done by the coef() and vcov() methods.
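For example, a sketch of a chunked fit with biglm (the chunks() iterator, variable names, and factor levels are placeholders; the point is that the first chunk goes to biglm() and later chunks to update()):

library(biglm)
first <- TRUE
for (dat in chunks()) {   # placeholder: a list of data-frame chunks
    dat$group <- factor(dat$group, levels = c("a", "b", "c"))  # full level set in every chunk
    if (first) {
        fit <- biglm(y ~ x1 + x2 + group, data = dat)
        first <- FALSE
    } else {
        fit <- update(fit, dat)   # fold in the next chunk
    }
}
coef(fit)
vcov(fit)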

34 bigglm Generalized linear models require iteration, so the same data has to be used repeatedly. The bigglm() function needs a data source that can be reread during each iteration. The basic code is in bigglm.function(). It iterates over calls to biglm, and uses a data() function with one argument: data(reset=FALSE) supplies the next chunk of data, or NULL or a zero-row data frame if there is no more; data(reset=TRUE) resets to the start of the data set, for the next iteration.
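A minimal data() function of this form, serving chunks from a data frame that is already in memory (purely illustrative, since bigglm() can take an in-memory data frame directly; the reset behaviour mirrors the SQL version on the next slides):

make_data_fn <- function(df, chunksize = 5000) {
    pos <- 0
    function(reset = FALSE) {
        if (reset) { pos <<- 0; return(TRUE) }
        if (pos >= nrow(df)) return(NULL)   # no more data
        rows <- (pos + 1):min(pos + chunksize, nrow(df))
        pos <<- max(rows)
        df[rows, , drop = FALSE]
    }
}
## fit <- bigglm(y ~ x1 + x2, data = make_data_fn(mydf), family = binomial())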

35 bigglm and SQL One way to write the data() function is to load data from a SQL query. The query needs to be prepared once, in the same way as for load-on-demand databases.
modelvars <- all.vars(formula)
dots <- as.list(substitute(list(...)))[-1]
dotvars <- unlist(lapply(dots, all.vars))
vars <- unique(c(modelvars, dotvars))
tablerow <- dbGetQuery(data, paste("select * from ", tablename, " limit 1"))
tablevars <- vars[vars %in% names(tablerow)]
query <- paste("select ", paste(tablevars, collapse = ", "), " from ", tablename)

36 bigglm and SQL The query is re-sent when reset=TRUE; otherwise a chunk of the result set is loaded.
chunk <- function(reset = FALSE) {
    if (reset) {
        if (got > 0) {
            dbClearResult(result)
            result <<- dbSendQuery(data, query)
            got <<- 0
        }
        return(TRUE)
    }
    rval <- fetch(result, n = chunksize)
    got <<- got + nrow(rval)
    if (nrow(rval) == 0) return(NULL)
    return(rval)
}

37 bigglm iterations If p is not too large and the data are reasonably well-behaved so that the loglikelihood is well-approximated by a quadratic, three iterations should be sufficient and good starting values will cut this to two iterations or even to one. Good starting values may be available by fitting the model to a subsample. If p is large or the data are sparse it may be preferable to use a fixed small number of iterations as a regularization technique.

38 Subset selection and biglm Forward/backward/exhaustive search algorithms for linear models need only the fitted QR decomposition or the matrix of sums and products, so forward search is O(np^2 + p^3 + p^2) and exhaustive search is O(np^2 + p^p). In particular, the leaps package uses the same AS274 QR decomposition code and so can easily work with biglm. The subset selection code does not have to fit all the regression models, only work out the residual sum of squares for each model. The user has to fit the desired models later; the package has vcov() and coef() methods.
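A sketch of the combination, assuming that regsubsets() in the leaps package accepts a biglm fit directly (the data chunks and formula are made up):

library(biglm)
library(leaps)
fit <- biglm(y ~ x1 + x2 + x3 + x4 + x5, data = chunk1)   # build up the fit in chunks
fit <- update(fit, chunk2)
search <- regsubsets(fit)      # subset search on the stored QR decomposition, no refitting
summary(search)$rss            # residual sum of squares for each model size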

39 Subset selection and glm For a generalized linear model it is not possible to compute the loglikelihood without a full model fit, so the subset selection shortcuts do not work. They do work approximately, which is good enough for many purposes. Fit the maximal model with bigglm. Do subset selection on the QR decomposition object, which represents the last iteration of weighted least squares, returning several models of each size (eg nbest=10). Refit the best models with bigglm to compute the true loglikelihood, AIC, BIC, CIC, DIC, ..., MIC (key mouse). [joke (c) Scott Zeger]

40 Random subsets Even when analyzing a random subset of the data is not enough for final results, it can be a useful step. Fitting a glm to more than √N observations and then taking one iteration from these starting values for the full data is asymptotically equivalent to the MLE. Computing a 1 - ɛ confidence interval for the median on a subsample, followed up by one pass through the data, will be successful with probability 1 - ɛ; otherwise repeat until successful. The same trick works for quantile regression. Averaging an estimator over a lot of random subsamples will be valid and close to efficient for any √n-consistent estimator if the subsamples are large enough for the first-order term in the sampling error to dominate.

41 Median Take a subset at random of size n ≪ N. Construct a 99.99% confidence interval (l, u) for the median. The interval will cover about a 2/√n fraction of your data. Take one pass through the data, counting how many observations m_l are below l and how many m_u are above u, and loading any that are between l and u. This takes about 2N/√n memory. If m_l < N/2 and m_u < N/2 the median is the (N/2 - m_l)th element of those loaded into memory, and this happens with probability 99.99%. Otherwise repeat until it works (the average number of trials is just over one).

42 Median Total memory use is n + 2N/√n, optimized by n = N^{2/3}. With N = 10^6, n = 10^4, ie a 99% reduction. If the data are already in random order we can use the first n for the random subsample, so a total of only one pass is needed. Otherwise two passes are needed. Potentially a large improvement over sorting, which takes lg N - lg n passes to sort N items with working memory that holds n. For quantile regression the same idea works with n = (Np)^{2/3}. See Roger Koenker's book.
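A sketch of the median algorithm, assuming a random subsample subsamp of size n has already been collected and that nextChunk() is a placeholder returning successive blocks of the data (NULL when exhausted):

## 99.99% confidence interval for the median from the subsample,
## using the normal approximation to the order statistics
n <- length(subsamp)
ss <- sort(subsamp)
k <- ceiling(3.89 * sqrt(n)/2)           # z for 99.99% two-sided coverage is about 3.89
l <- ss[max(1, floor(n/2) - k)]
u <- ss[min(n, ceiling(n/2) + k)]

m_l <- 0; m_u <- 0; middle <- numeric(0)
repeat {
    x <- nextChunk()                     # placeholder: NULL when the data are exhausted
    if (is.null(x)) break
    m_l <- m_l + sum(x < l)
    m_u <- m_u + sum(x > u)
    middle <- c(middle, x[x >= l & x <= u])   # only about 2N/sqrt(n) values are kept
}
N <- m_l + m_u + length(middle)
if (m_l < N/2 && m_u < N/2) {
    median_est <- sort(middle)[ceiling(N/2) - m_l]   # the (N/2 - m_l)th retained value
} else {
    stop("the interval missed the median; take a new subsample and repeat")
}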

43 Two-phase sampling For survival analysis and logistic regression, can do much better than random sampling. When studying a rare event, most of the information comes from the small number of cases. The case cohort design uses the cases and a stratified random sample of non-cases, to achieve nearly full efficiency at much reduced computational effort. Used in analyzing large national health databases in Scandinavia. Analyse using cch() in the survival package.

44 Outsourcing to SQL SQL can do addition and multiplication within a record and sums across records. This means it can construct the design matrix X, and X^T X and X^T y, for a linear regression of y on X. R can then do the last step of matrix inversion and multiplication to compute β̂. This approach is bounded-memory and also bounded data transfer (if the database is on a remote computer). There is no upper limit on data set size except limits on the database server.
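A minimal sketch of the idea for a regression with an intercept and two predictors x1 and x2 stored in a table tab (all names made up): the sums making up X^T X and X^T y are computed inside the database, and only these small summaries come back to R.

library(DBI)
## assume 'conn' is an open DBI connection to the database holding table 'tab'
sums <- dbGetQuery(conn, "
    SELECT count(*)   AS n,
           sum(x1)    AS sx1,   sum(x2)    AS sx2,
           sum(x1*x1) AS sx1x1, sum(x1*x2) AS sx1x2, sum(x2*x2) AS sx2x2,
           sum(y)     AS sy,    sum(x1*y)  AS sx1y,  sum(x2*y)  AS sx2y
    FROM tab")
xtx <- with(sums, matrix(c(n,   sx1,   sx2,
                           sx1, sx1x1, sx1x2,
                           sx2, sx1x2, sx2x2), 3, 3))
xty <- with(sums, c(sy, sx1y, sx2y))
betahat <- solve(xtx, xty)   # intercept, x1, x2 coefficients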

45 Model frame and model matrix The structure of the design matrix does not depend on the data values and so can be computed by model.frame() and model.matrix() even from a zero-row version of the data. Current implementation allows only contr.treatment contrasts, and interactions are allowed only between factor variables. Model frame and model matrix are created as new tables in the database (views might be better), and a finalizer is registered to drop the tables when the objects are garbage-collected. Subsetting is handled by using zero weight for observations outside the subset. Not the most efficient implementation, but never modifies tables and has pass-by-value semantics.

46 Model frame and model matrix
> source("sqlm.R")
> sqlapi <- sqldataset("api.db", table.name="apistrat", weights="pw")
Loading required package: DBI
Warning message:
In is.na(rows) : is.na() applied to non-(list or vector) of type NULL
> s1 <- sqllm(api00~api99+emer+stype, sqlapi)
> s1
[coefficients for _api99, _emer, stypeH, stypeM, _Intercept_, with the variance matrix as attr(,"var"); the numeric values did not survive transcription]
> sqrt(diag(attr(s1,"var")))
[standard errors for _api99, _emer, stypeH, stypeM, _Intercept_; numeric values not preserved]

47 Model frame and model matrix
> dbListTables(sqlapi$conn)
[1] "_mf_10d63af1" "_mm_60b7acd9" "apiclus1"     "apiclus2"     "apistrat"
> gc()
[gc output elided; garbage collection runs the finalizer that drops the temporary tables]
> dbListTables(sqlapi$conn)
[1] "apiclus1" "apiclus2" "apistrat"

48 Other models Chen & Ripley (DSC 2003) described how to use this approach more generally: the iterative optimization for many models can be written as arithmetic on large data sets and parameters to create a small update vector, followed by a small amount of computation to get new parameter values. For example, fitting a glm by IWLS involves repeated linear model fits, with the working weights and working response recomputed after each fit. In a dialect of SQL that has EXP() the working weights and working response can be computed by a select statement.
